From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Nov 08 2005 - 16:11:46 CST
On this topic...
> From: "Dominikus Scherkl" <lyratelle@gmx.de>
speculated:
> > Maybe it was thought of as an offset from the unit (character null)
> > like in ETA+5 minutes (expected time of arrival was passed five minutes
> > ago - a euphemism for being 5 minutes late).
Perhaps, but it had nothing to do with the actual origin of the "+".
And Philippe responded:
> U-nnnn already exists (or I should say, it has existed).
U+nnnn, actually. The U- notation was introduced by Amd 9 to 10646
in 1997. It was never adopted for any use with Unicode, per se.
> It was referring to
> 16-bit code units,
Code *points*, not code units. These were known as "Unicode values"
in Unicode prior to the introduction of UTF-16.
> not really to characters and was a fixed-width notation
> (with 4 hexadecimal digits). The "U" meant "Unicode" (1.0 and before).
>
> U+[n...n]nnnn was created to avoid the confusion with the past 16-bit only
> Unicode 1.0 standard (which was not fully compatible with ISO/IEC 10646 code
> points).
Actually, it was not to avoid confusion with Unicode 1.0. Unicode 1.1
was also 16-bit only, and it was fully compatible with 10646-1:1993.
> It is a variable-width notation that refers to ISO/IEC 10646 code
> points. The "U" means "UCS" or "Universal Character Set". At that time, the
> UCS code point range was up to 31 bits wide.
>
> The U-nnnn notation is abandoned now,
It isn't in widespread usage, but is still a normative specification
in 10646:2003.
> except for references to Unicode 1.0.
This is false. The "-" of the U- notation has nothing to do with
Unicode 1.0.
> If one uses it, it will refer to one or more 16-bit code units needed to
> encode each codepoint (possibly with surrogate pairs). It does not
> designate abstract characters or codepoints unambiguously.
This is false. The U- notation is for the 8-digit short identifiers
of 10646:2003. Those short identifiers designate code positions
(10646 term for code points) unambiguously. From 10646:
"ISO/IEC 10646 defines short identifiers for each code position,
including code positions that are reserved. A short identifier
for any code position is distinct from a short identifier for
any other code position. ..."
I'd say that's a pretty explicit claim that 10646 is talking about
code points *and* that the short identifiers are unambiguous.
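As a rough illustration of why the eight-digit form is unambiguous, here is a minimal Python sketch. It is not taken from 10646 or from this thread: the function names to_u_minus and from_u_minus are made up, and the 31-bit upper bound is an assumption based on the UCS range mentioned above. Each code position maps to exactly one fixed-width string, and parsing that string gives back the same code position.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")

def to_u_minus(code_position: int) -> str:
    """Format a code position in the eight-digit U- form (hypothetical helper)."""
    # Assumption: the UCS code space is 31 bits wide (0 .. 7FFFFFFF).
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("code position outside the assumed 31-bit range")
    # Fixed width of eight hex digits, zero-padded on the left.
    return f"U-{code_position:08X}"

def from_u_minus(identifier: str) -> int:
    """Parse an eight-digit U- identifier back into a code position."""
    if len(identifier) != 10 or not identifier.startswith("U-"):
        raise ValueError("expected 'U-' followed by exactly eight hex digits")
    digits = identifier[2:]
    if any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("non-hexadecimal digit in identifier")
    return int(digits, 16)

if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10FFFF):
        ident = to_u_minus(cp)
        # Round trip: the identifier designates one code position only.
        print(ident, from_u_minus(ident) == cp)
```

Because the width is fixed and the mapping is a plain hexadecimal rendering of the number, two different code positions can never produce the same identifier, which is all the quoted text claims.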
> Later, the variable-width U+[n...n]nnnn notation was restricted to allow
> only codepoints in the first 17 planes of the joined ISO/IEC 10646-1 and
> Unicode standards (so the only standard codepoints are between U+0000 and
> U+10FFFF, some of them being permanently assigned to non-characters).
Correct. The current form of the specification is:
"The four-to-six-digit form of short identifier shall consist
of the last four to six digits of the eight-digit form. It is
not defined if the eight-digit form is greater than 0010FFFF.
Leading zeroes beyond four digits are suppressed."
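To make that rule concrete, here is a minimal Python sketch that derives the four-to-six-digit form from an eight-digit form. It is an illustration only, not text from 10646; the function name short_form is made up, and normalizing the output to uppercase is an assumption of this sketch.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")

def short_form(eight_digit: str) -> str:
    """Derive the four-to-six-digit short identifier (hypothetical helper)."""
    if len(eight_digit) != 8 or any(ch not in HEX_DIGITS for ch in eight_digit):
        raise ValueError("expected exactly eight hexadecimal digits")
    value = int(eight_digit, 16)
    # The short form is not defined above 0010FFFF.
    if value > 0x10FFFF:
        raise ValueError("four-to-six-digit form not defined above 0010FFFF")
    # At least four digits; leading zeroes beyond four are suppressed.
    # Assumption: output is normalized to uppercase hex.
    return f"{value:04X}"

if __name__ == "__main__":
    for eight in ("00000041", "0001F600", "0010FFFF"):
        # Prefixing "+" and "U" gives the familiar U+ notation.
        print(eight, "->", "U+" + short_form(eight))
```

Since the value is capped at 0010FFFF, formatting with a minimum width of four never yields more than six digits, so the result is always the last four to six digits of the eight-digit form.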
> There are '''no''' negative codepoints in either standard (U-0001 does not
> designate the 32-bit code unit that you could store in a signed wide-char
> datatype, but in the past standard it designated the same codepoint as U+0001
> now). Using "+" makes the statement about signs clear: standard code points
> all have positive values.
The "+" might connote that for some users, but its origin had nothing
to do with that.
--Ken