RE: Origin of the U+nnnn notation

From: Peter Constable (petercon@microsoft.com)
Date: Wed Nov 09 2005 - 08:06:10 CST


    Philippe's response regarding U- notation, while well-meaning, is pretty much pure fiction.

    The U- notation is defined in ISO/IEC 10646. It always uses 8 hex digits, U-nnnnnnnn, and refers to a UCS-4 codepoint.
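
    As an illustration only, the two notations can be sketched as below. The function names u_plus and u_minus are hypothetical, and the range checks are assumptions (0..10FFFF for U+, the original 31-bit UCS-4 range for U-), not wording taken from either standard.

```python
def u_plus(cp: int) -> str:
    """Format as U+nnnn: at least 4 hex digits, more only when needed."""
    # Assumed range: the Unicode code space, 0 through 10FFFF.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    return f"U+{cp:04X}"


def u_minus(cp: int) -> str:
    """Format as U-nnnnnnnn: always exactly 8 hex digits (UCS-4)."""
    # Assumed range: the original 31-bit UCS-4 space, 0 through 7FFFFFFF.
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    return f"U-{cp:08X}"


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        print(u_plus(cp), u_minus(cp))
```

    The only point of the sketch is the width rule: U+ pads to four digits and grows as needed, while U- is always padded to eight.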

    Peter

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Philippe Verdy
    > Sent: Tuesday, November 08, 2005 6:05 AM
    > To: Dominikus Scherkl; 'Jukka K. Korpela'; unicode@unicode.org
    > Subject: Re: Origin of the U+nnnn notation
    >
    > From: "Dominikus Scherkl" <lyratelle@gmx.de>
    > >> I have been unable to hunt down the historical origin of the
    > >> notation U+nnnn (where nnnn are hexadecimal digits) that we
    > >> use to refer to characters (and code points).
    > >> Presumably "U" stands for "UCS" or for "Unicode", but where
    > >> does the plus sign come from?
    > > Maybe it was thought of as an offset from the unit (character null),
    > > as in ETA+5 minutes (expected time of arrival passed five minutes
    > > ago - a euphemism for being 5 minutes late).
    >
    > U-nnnn already exists (or I should say, it has existed). It referred to
    > 16-bit code units, not really to characters, and was a fixed-width notation
    > (with 4 hexadecimal digits). The "U" meant "Unicode" (1.0 and before).
    >
    > U+[n...n]nnnn was created to avoid the confusion with the past 16-bit-only
    > Unicode 1.0 standard (which was not fully compatible with ISO/IEC 10646
    > code points). It is a variable-width notation that refers to ISO/IEC 10646
    > code points. The "U" means "UCS" or "Universal Character Set". At that
    > time, the UCS code point range was up to 31 bits wide.
    >
    > The U-nnnn notation is abandoned now, except for references to Unicode 1.0.
    > If one uses it, it will refer to one or more 16-bit code units needed to
    > encode each codepoint (possibly with surrogate pairs). It does not
    > designate abstract characters or codepoints unambiguously.
    >
    > Later, the variable-width U+[n...n]nnnn notation was restricted to allow
    > only codepoints in the first 17 planes of the joined ISO/IEC 10646-1 and
    > Unicode standards (so the only standard codepoints are between U+0000 and
    > U+10FFFF, some of them being permanently assigned to non-characters).
    >
    > References to larger code points with U+[n...n]nnnn are discouraged, as
    > they no longer designate valid code points in both standards. Their
    > definition and use are then application-specific.
    >
    > There are '''no''' negative codepoints in either standard (U-0001 does not
    > designate the 32-bit code unit that you could store in a signed wide-char
    > datatype, but in the past standard it designated the same codepoint as
    > U+0001 does now). Using "+" makes the statement about signs clear: standard
    > code points all have positive values.
    >
    > So if you want a representation for negative code units, you need another
    > notation (for example N-0001 to represent the code unit with value -1):
    > this notation is application-specific.
    >
    >


