Re: Origin of the U+nnnn notation

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Nov 08 2005 - 14:40:41 CST


    Antoine Leca noted:

    > I also remember asking about the introduction of the U+xxxxx and U+10xxxx
    > notation, perhaps in year 2000, and to be so confirmed by Dr. Whistler;
    > unfortunately my file archives are pretty bad, and I cannot found the post
    > right now (well, the interessant one here is Ken's answer, not mine); I did
    > not even remember if it was on this list, silly me.

    This might be recalling a rather long thread from May 2000,
    regarding \uxxxx notation, in which Antoine participated
    and Markus Scherer concluded:

    > We have a winner: the new (draft) C _and_ C++ standards are introducing
    > \uhhhh (fixed-length, 4 hex digits) and
    > \Uhhhhhhhh (fixed-length, 8 hex digits)
    >
    > while Perl and Kermit are using
    > \x{hh...h} (variable-length, hex digits, I guess 1..8 of them)

    But I don't spot anything in that thread about the history of the U+xxxx
    notation per se.

    The use of the U+xxxx notation in publications goes back to
    Unicode 1.0 (1991), where it was explicitly used, and explained on
    p. xv:

      "An individual Unicode value is expressed as U+nnnn, where
       nnnn is a four digit number in hexadecimal notation, ..."
       
    The usage appears in draft documents from late 1989,
    so the convention itself dates back to then.
       
    The introduction of the short identifiers in ISO/IEC 10646 was
    in part an attempt to grandfather this usage into 10646 and make it
    recognized and valid for the Unicode Standard as an implementation
    of 10646. The initial edition of 10646-1:1993 did not have them,
    and simply used 4-digit hex or 8-digit hex for UCS-2 or
    UCS-4, respectively.

    10646-1:2000 (the 'second edition') added short identifiers
    in clause 6.5, defined as:

      "The full syntax of the notation of a short identifier, in
       Backus-Naur form, is:
       
        {U | u}[{+}xxxx | {-}xxxxxxxx] "

    The formal source for that was Amd 9 to 10646-1:1993. That
    amendment was initiated in response to a liaison report from
    SC22 to SC2, dated September 22, 1995, requesting that 10646
    add short unique identifiers for use by other standards.
    PDAM 9 was issued in April 1996, and Amd 9 was actually
    published in 1997.

    The specification has since been modified to:

      "The full syntax of the notation of a short identifier, in
       Backus-Naur form, is:
       
        {U | u}[{+}(xxxx | xxxxx | xxxxxx) | {-}xxxxxxxx] "

    This modification accounts for the practice of using
    5- and 6-digit forms for the supplementary characters
    (U+10000..U+10FFFF).
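
    For illustration only, here is a minimal Python sketch of how the
    fully prefixed forms of that syntax might be recognized. It rests
    on assumptions that are mine, not the standard's wording: it
    handles only "U+" followed by 4 to 6 hex digits and "U-" followed
    by 8 hex digits (with "U" or "u"), and it ignores the forms in
    which the prefix is omitted. The names parse_short_identifier and
    format_short_identifier are hypothetical helpers.

```python
import re

# Assumption: only the fully prefixed forms are handled here:
#   U+xxxx, U+xxxxx, U+xxxxxx  (4 to 6 hex digits)
#   U-xxxxxxxx                 (exactly 8 hex digits)
# The leading letter may be "U" or "u".
_SHORT_ID = re.compile(r"[Uu](?:\+([0-9A-Fa-f]{4,6})|-([0-9A-Fa-f]{8}))")


def parse_short_identifier(text):
    """Return the integer code point named by a prefixed short identifier."""
    match = _SHORT_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a prefixed short identifier: {text!r}")
    # Exactly one of the two groups matched; both hold plain hex digits.
    digits = match.group(1) or match.group(2)
    return int(digits, 16)


def format_short_identifier(code_point):
    """Format a code point in the U+ form, padded to at least 4 hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside U+0000..U+10FFFF: {code_point!r}")
    return f"U+{code_point:04X}"


if __name__ == "__main__":
    for sample in ("U+228E", "u+10FFFF", "U-0001F600"):
        value = parse_short_identifier(sample)
        print(sample, "->", format_short_identifier(value))
```

    The formatter pads to four digits and lets longer values grow to
    five or six, which matches the practice described above for the
    supplementary characters.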

    What is not generally known is that the "U+" convention itself
    was an ASCII-fied compromise for what the Unicode designers
    *really* wanted to use for the Unicode hexadecimal prefix,
    which was U+228E MULTISET UNION (whose glyph is a union sign
    with a plus sign in it). That symbol can actually be spotted in
    some of the early Unicode collateral (T-shirts, stationery,
    business cards, etc.), because it was used as part of the original
    Unicode logo design, before the switch to the now ubiquitous
    Uni design that has been used for more than a decade.

    The semantic appropriateness of MULTISET UNION as a designator
    for Unicode code points ought to be apparent, and the shape of
    the union symbol itself was iconic for the "U" of Unicode. But
    use of the symbol in data files and documentation in the
    early days was problematical, of course, and it soon gave way
    to the much more practical use of "U+" instead.

    --Ken

    P.S. This tale is part of the story to be written for U+228E.


