Re: Origin of Ellipsis (was: RE: Empty set)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 16 Sep 2013 21:39:48 +0200

Nah!!! STRICTLY NOBODY counts "scalar values".

Everyone counts either
- (a) code units (most often 8-bit bytes, more rarely 16-bit units, e.g.
in basic JavaScript code), or
- (b) code points (independently of the code units used in the storage or
communication message format).
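To make the distinction concrete, here is a minimal sketch in Python (my choice of language, not from the original mail); the sample string is hypothetical:

```python
# One string, three different "lengths" depending on what you count.
s = "héllo\U0001F600"  # hypothetical sample: five letters plus one emoji

code_points = len(s)                           # Python str is a sequence of code points
utf8_units = len(s.encode("utf-8"))            # 8-bit code units (bytes)
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units (as in JavaScript)

print(code_points, utf8_units, utf16_units)    # 6 10 7
```

The emoji alone accounts for one code point, four UTF-8 code units, and two UTF-16 code units (a surrogate pair).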

The application *may* enforce a normalization form prior to counting
(I'm not convinced that Twitter effectively forces NFC prior to
truncating messages; it just happens that most texts are already composed
in NFC form. This is caused by keyboard drivers, or by the IME on the
client device or browser, which strongly favor the NFC form, and also
because many devices still cannot properly display some character clusters
if they are not in NFC form).
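A short Python sketch (language choice mine) of how normalization changes a code point count; the two strings are canonically equivalent but count differently:

```python
import unicodedata

# "é" composed as one code point vs. "e" plus a combining acute accent
composed = "\u00e9"     # U+00E9, already in NFC form
decomposed = "e\u0301"  # U+0065 + U+0301, the NFD form

# Canonically equivalent, yet the code point counts differ:
print(len(composed), len(decomposed))  # 1 2

# Forcing NFC before counting makes the results agree:
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
```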

Stop speaking about scalar values: they are only meant for internal
arithmetic between distinct "abstract characters" (those that are given
unique code points) or as internal mappings necessary for converting
between UTFs. Arbitrary arithmetic on them is otherwise completely unsafe
and gives unpredictable results, not guaranteed or stabilized in the
standard.

Also, yes, the term "character" alone is ambiguous; the Unicode standard
uses "abstract character", even if on many occasions it is abbreviated to
just "character" *in this context* (where it may then contradict other
definitions of "character" used, for example, in programming languages).

If you want to be clear, only speak about counting

- "code points" (more or less the same as counting abstract characters,
except that you can also count code points which are not yet assigned to
abstract characters, count code points assigned to "non-characters", or
even count code points assigned to surrogates, which you may find in
non-conforming documents supposed to be encoded in UTF-32). Such a count
will be independent of the encoding. Code points are written U+nnnn.

- "code units" (but be more specific and explicitly give their size). Such
a count will be fully dependent on the encoding. Code units are usually
written as fixed-width hexadecimal values. Code units do NOT have a "scalar
value" in the same meaning as given in TUS. If you count code units, you
may also count some that have NO meaning in the standard UTF (or legacy
8-bit encoding), such as an 8-bit code unit equal to 0xFF found in a
non-conforming UTF-8 string.
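For instance (a Python sketch of mine): an 0xFF byte can still be counted as a code unit, even though no conforming UTF-8 decoder will accept it:

```python
data = b"A\xffB"  # 0xFF can never appear in well-formed UTF-8

print(len(data))  # 3 -- counting 8-bit code units needs no decoding

try:
    data.decode("utf-8")  # strict decoding rejects the string
except UnicodeDecodeError:
    print("not conforming UTF-8")
```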

In all cases, however, the normalization form may change the result of
your measurement. Even if two texts are "canonically equivalent" (because
they are not normalized, or are normalized to distinct forms), they are
still not "equal", and it is normal that your counts will give different
results. But note that it is not always possible to normalize input
documents (notably, you may be able to measure these documents in code
points or in code units even if they are not conforming to their supposed
UTF, but any prior normalization of these non-conforming documents will
likely fail).

This also means that just counting code points or code units in an encoded
text is not a conforming process, unless the count is performed after
first applying a (conforming) normalization. Such a conforming counting
process is allowed to fail (and in fact should fail, with an error
returned, if the document does not conform to its assumed UTF, just as it
would fail if you converted it from/to a legacy encoding other than a
standard UTF).
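A conforming counting process of this kind might be sketched as follows (Python, names mine); it validates the UTF, normalizes, then counts, and raises an error on non-conforming input:

```python
import unicodedata

def conforming_count(raw: bytes, encoding: str = "utf-8") -> int:
    """Count code points after validation and NFC normalization.

    Raises UnicodeDecodeError if `raw` does not conform to `encoding`.
    """
    text = raw.decode(encoding)  # strict by default: fails on ill-formed input
    return len(unicodedata.normalize("NFC", text))

print(conforming_count("e\u0301".encode("utf-8")))  # 1 (NFC composes to U+00E9)

try:
    conforming_count(b"\xff")
except UnicodeDecodeError:
    print("rejected non-conforming input")
```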

Normalization should be perceived like a transcoding. Some normalizations
are conforming and will (should!) fail on non-conforming input; others are
non-conforming and will never fail, but you know the risks when using
non-conforming processes, because they create ambiguities (the same kind
of ambiguities that also occur when you just say you'll measure some
"length" of a text without being very specific about: what you are
counting, in which dimensional space, through which surjective
projection(s), with which unit of measure, and sometimes with which
rounding mode if the returned measure has limited precision)...

2013/9/16 Phillips, Addison <addison_at_lab126.com>

> Actually, that's my bad: I meant to type scalar value.
>
>
> Stephan Stiller <stephan.stiller_at_gmail.com> wrote:
>
> On 9/15/2013 3:07 PM, Phillips, Addison wrote:
>
> Not if the limit is counted in characters and not in bytes. Twitter, for
> example, counts code points in the NFC representation of a tweet.
>
> "character", "code point" – these are confusing words :-)
>
> From the link it isn't entirely clear whether they
> (a) count scalar values of NFC *or*
> (b) count code points of NFC.
>
> That's why I think it's bad to write "code point" when "scalar value" is
> intended.
>
> Stephan
>
>
Received on Mon Sep 16 2013 - 14:39:48 CDT

This archive was generated by hypermail 2.2.0 : Mon Sep 16 2013 - 14:42:40 CDT