From section 2.2 of a final draft of ISO/IEC FDIS 14882, Programming
languages --- C++:
-------------------------------
-2- The universal-character-name construct provides a way to name other
characters. 
hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad
The character designated by the universal-character-name \UNNNNNNNN is that
character whose character short name in ISO/IEC 10646 is NNNNNNNN; the
character designated by the universal-character-name \uNNNN is that
character whose character short name in ISO/IEC 10646 is 0000NNNN. If the
hexadecimal value for a universal character name is less than 0x20 or in the
range 0x7F-0x9F (inclusive), or if the universal character name designates a
character in the basic source character set, then the program is ill-formed.
-------------------------------
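
As an illustration only, here is a minimal Python sketch of the rule in the
paragraph above. It is not taken from the standard or from any compiler. The
names parse_ucn and BASIC_SOURCE are hypothetical, and the list of basic
source characters is an assumption drawn from the same section of the
standard rather than from the excerpt quoted here. The excerpt states no
upper bound for \UNNNNNNNN, so the sketch does not check one.

```python
import re

# Graphic characters of the C++ basic source character set, plus space.
# (Assumption: list taken from the standard's section 2.2, not from the
# excerpt above. The control characters in that set are below 0x20 and
# are already rejected by the range rule.)
BASIC_SOURCE = set(
    " abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
)

# universal-character-name:  \u hex-quad  |  \U hex-quad hex-quad
UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def parse_ucn(text):
    """Return the value named by one universal-character-name.

    Raises ValueError in the cases the quoted paragraph calls ill-formed.
    """
    m = UCN.fullmatch(text)
    if m is None:
        raise ValueError("not a universal-character-name: %r" % text)
    value = int(m.group(1) or m.group(2), 16)
    if value < 0x20 or 0x7F <= value <= 0x9F:
        raise ValueError("control range is not allowed: 0x%X" % value)
    if value < 0x80 and chr(value) in BASIC_SOURCE:
        raise ValueError("names a basic source character: 0x%X" % value)
    return value
```

With this sketch, parse_ucn(r"\U000368BA") returns 0x368BA, while
parse_ucn(r"\u0041") is rejected because 'A' is a basic source character.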
--- Paul
> -----Original Message-----
> From: Markus Scherer [mailto:markus.scherer@jtcsv.com]
> Sent: Tuesday, May 23, 2000 11:37 AM
> To: Unicode List
> Subject: Question about \uxxxx etc. for 21-bit code points - need advice
> 
> 
> Hello,
> 
> we (ICU) are trying to figure out how best to specify non-BMP 
> (21-bit) code points with escape sequences or similar in strings.
> 
> Problem:
> The C language has \ooo with octal digits for bytes of 
> whatever encoding, and modern compilers also know \xhh with 
> hexadecimal digits (with variable numbers of digits).
> Java introduced \uhhhh with (always 4) hexadecimal digits for 
> Unicode code units.
> 
> But how does one write a non-BMP code point in this fashion?
> 
> I am trying to list some suggestions, make a proposal, and 
> ask what you are doing or what other 
> people/standards/organizations/languages are planning to do.
> 
> - One could use a pair of code units, UTF-16 style:
>   \ud89a\udcba
>   This is clumsy because
>   + it is long
>   + the code point needs to be factored into surrogates
>   + it works all right only if the underlying string encoding
>     is UTF-16; if UTF-32 or UTF-8 are used internally, then
>     the escape-sequence parser actually needs to detect two
>     consecutive \u's, make sure that they form a matched pair,
>     and combine them into a code point.
>     For UTF-8, it then has to be factored again into bytes.
> 
> - In UTR 18, Mark Davis suggests a syntax
>   \vhhhhhh
>   with exactly 6 hexadecimal digits.
>   Drawback: I am afraid of confusion with the ANSI C language
>   \v
>   for the vertical TAB.
> 
> - How about - and I propose this here -
>   \whhhhhh
>   with, again, 6 hexadecimal digits?
>   It is simple, and for English speakers it has the benefit of
>   being mnemonic because of connotations with "wide" and the
>   letter being called a "double u" - which is more than a "\u" :-)
>   It is not used in C.
> 
> - Should there be a delimited, variable-length form like
>   \whh...h;
>   or
>   \w{hh...h}
>   or similar, closer to HTML?
> 
> Of course, a longer form would coexist with the common 
> \uhhhh, so that in practice the longer one would be used only 
> for code points >0xffff. This seems to remove the motivation 
> for a variable-length form. For ICU resource bundles, the 
> 2-digit \xhh (for the Latin-1 subset) and the 4-digit \uhhhh 
> already coexist.
> 
> I don't know what Java is planning to do, or if C/C++ 
> standards actually deal with Unicode and related issues at 
> all (beyond what I read in the ANSI C standard from 1990).
> 
> What are Microsoft or Apple planning?
> 
> Markup languages for comparison:
> 
> The HTML and XML and related languages already have a 
> mechanism for referencing any Unicode code point, although 
> only the XML specification actually explicitly talks about 
> the range reaching up to 0x10ffff. The older HTML 
> specification only refers to "ISO 10646 character numbers", 
> but by referring to the ISO UCS, I assume that they actually 
> allow code points up to 0x7fffffff.
> 
> Syntactically, however, &#dd...d; and &#xhh...h; do not fit 
> in well with backslash-escapes.
> 
> 
> Please advise!
> 
> markus
> 
> 
> HTML and XML references:
> 
> HTML: http://www.w3.org/TR/html401/charset.html#entities 
> Chapter 5.3 "Character references" specifies the decimal and 
> hexadecimal numeric character references as "ISO 10646 
> character numbers" without explicitly mentioning the range of 
> those numbers.
> 
> XML: http://www.w3.org/TR/REC-xml Chapter 4.1 "Character and 
> Entity References" refers to _code points_ of ISO/IEC 10646.
> In the same document, Chapter 2.2 "Characters" specifies the 
> character range for XML to be that of UTF-16 (minus 
> characters that are not legal in XML).
> 
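
For the surrogate-pair option in the quoted message, the factoring and
re-combining work might look roughly like the Python sketch below. It is an
illustration under stated assumptions, not ICU code: the function names are
hypothetical, and the decoder only handles \uhhhh escapes, copying everything
else through unchanged.

```python
import re

_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def to_surrogates(cp):
    """Factor a non-BMP code point into (lead, trail) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: 0x%X" % cp)
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(lead, trail):
    """Combine a matched surrogate pair into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a matched surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def decode_u_escapes(text):
    """Replace backslash-u escapes with characters, combining pairs."""
    out = []
    pos = 0
    while pos < len(text):
        m = _U_ESCAPE.match(text, pos)
        if m is None:
            # Not an escape: copy the character through unchanged.
            out.append(text[pos])
            pos += 1
            continue
        unit = int(m.group(1), 16)
        pos = m.end()
        if 0xD800 <= unit <= 0xDBFF:
            # A lead surrogate must be followed directly by a second escape.
            m2 = _U_ESCAPE.match(text, pos)
            if m2 is None:
                raise ValueError("lead surrogate without a trail surrogate")
            unit = from_surrogates(unit, int(m2.group(1), 16))
            pos = m2.end()
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("trail surrogate without a lead surrogate")
        out.append(chr(unit))
    return "".join(out)
```

Using the pair from the message, from_surrogates(0xD89A, 0xDCBA) gives
0x368BA, and to_surrogates(0x368BA) gives the same two code units back. The
extra look-ahead in decode_u_escapes is the cost the message describes for
parsers whose internal encoding is not UTF-16.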