Re: Origin of the U+nnnn notation

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Nov 08 2005 - 16:11:46 CST

    On this topic...

    > From: "Dominikus Scherkl" <lyratelle@gmx.de>

    speculated:

    > > Maybe it was thought of as an offset from the unit (character null)
    > > like in ETA+5 minutes (the expected time of arrival passed five minutes
    > > ago - a euphemism for being 5 minutes late).

    Perhaps, but it had nothing to do with the actual origin of the "+".

    And Philippe responded:

    > U-nnnn already exists (or I should say, it has existed).

    U+nnnn, actually. The U- notation was introduced by Amd 9 to 10646
    in 1997. It was never adopted for any use with Unicode, per se.

    > It was referring to
    > 16-bit code units,

    Code *points*, not code units. These were known as "Unicode values"
    in Unicode prior to the introduction of UTF-16.

    > not really to characters and was a fixed-width notation
    > (with 4 hexadecimal digits). The "U" meant "Unicode" (1.0 and before).
    >
    > U+[n...n]nnnn was created to avoid confusion with the older 16-bit-only
    > Unicode 1.0 standard (which was not fully compatible with ISO/IEC 10646 code
    > points).

    Actually, it was not to avoid confusion with Unicode 1.0. Unicode 1.1
    was also 16-bit only, and it was fully compatible with 10646-1:1993.

    > It is a variable-width notation that refers to ISO/IEC 10646 code
    > points. The "U" means "UCS" or "Universal Character Set". At that time, the
    > UCS code point range was up to 31 bits wide.
    >
    > The U-nnnn notation is abandoned now,

    It isn't in widespread use, but it is still a normative specification
    in 10646:2003.

    > except for references to Unicode 1.0.

    This is false. The "-" of the U- notation has nothing to do with
    Unicode 1.0.

    > If one uses it, it will refer to one or more 16-bit code units needed to
    > encode each codepoint (possibly with surrogate pairs). It does not
    > designate abstract characters or codepoints unambiguously.

    This is false. The U- notation is for the 8-digit short identifiers
    of 10646:2003. Those short identifiers designate code positions
    (the 10646 term for code points) unambiguously. From 10646:

      "ISO/IEC 10646 defines short identifiers for each code position,
       including code positions that are reserved. A short identifier
       for any code position is distinct from a short identifier for
       any other code position. ..."
       
    I'd say that's a pretty explicit claim that 10646 is talking about
    code points *and* that the short identifiers are unambiguous.

    > Later, the variable-width U+[n...n]nnnn notation was restricted to allow
    > only codepoints in the first 17 planes of the merged ISO/IEC 10646-1 and
    > Unicode standards (so the only standard codepoints are between U+0000 and
    > U+10FFFF, some of them being permanently designated as noncharacters).

    Correct. The current form of the specification is:

      "The four-to-six-digit form of short identifier shall consist
       of the last four to six digits of the eight-digit form. It is
       not defined if the eight-digit form is greater than 0010FFFF.
       Leading zeroes beyond four digits are suppressed."
       
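    Read literally, the quoted rule is easy to mechanize. A minimal sketch
    in Python, given here only to illustrate the wording above (it is not
    text from 10646):

      def short_identifier(cp):
          # The four-to-six-digit form is not defined above 0010FFFF.
          if cp > 0x10FFFF:
              raise ValueError("no U+ form defined beyond 0010FFFF")
          # Last four to six digits of the eight-digit form; leading
          # zeroes beyond four digits are suppressed.
          return "U+%04X" % cp

      # short_identifier(0x0041)   -> "U+0041"
      # short_identifier(0xE0041)  -> "U+E0041"
      # short_identifier(0x10FFFF) -> "U+10FFFF"
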
    > There are *no* negative codepoints in either standard (U-0001 does not
    > designate the 32-bit code unit that you could store in a signed wide-char
    > datatype; in the past standard it designated the same codepoint that U+0001
    > designates now). Using "+" makes the statement about signs clear: standard
    > code points all have non-negative values.

    The "+" might connote that for some users, but its origin had nothing
    to do with that.

    --Ken


