Markus Scherer wrote:
> 
> we (ICU) are trying to figure out how best to specify non-BMP (21-bit) code
> points with escape sequences or similar in strings.
> 
> Problem:
> The C language has \ooo with octal digits for bytes of whatever encoding, and
> modern compilers also know \xhh with hexadecimal digits (with variable numbers
> of digits).
The situation you describe is a bit dated. I do not believe that any compiler
that does *not* support \xhh is still in production use these days
(except if you are writing for the PDP-11 or similar cases ;-)).
The ISO C standard, formerly ANSI C, has had \xhh since the beginning (I
understand the first drafts were floating around in 1985; the standard was
formally accepted in 1989 by ANSI, and in 1990 by ISO). It standardized existing
practice in this area, so there were a lot of "non-modern" compilers (ones that
do not understand prototypes, for instance) that knew about \xhh even before 1985.
The new revision, nicknamed C99, as well as the C++ Standard (1998), add the
\uxxxx and \Uxxxxxxxx notations (x being any hexadecimal digit). New compilers
are now shipping with this support (I admit there are not many of them).
> Java introduced \uhhhh with (always 4) hexadecimal digits for Unicode code units.
> 
> But how does one write a non-BMP code point in this fashion?
Use \Uxxxxxxxx. BTW, if a project like ICU is going to use such a notation,
this will put some pressure on compiler providers (be it GNU/FSF or
traditional vendors) to sort out the issue of Unicode encoding (i.e. the
meaning of wchar_t), which in the end will result in greater Unicode use.
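
To make the \U form concrete, here is a minimal sketch. It assumes a compiler
that accepts universal character names and, to sidestep the wchar_t question,
the C11 u8 string prefix, which is newer than anything discussed above. The
function names are mine, purely for illustration.

```c
#include <stddef.h>

/* U+10000 is the first code point beyond the BMP, written here with the
   8-digit form. The u8 prefix (C11, an assumption of this sketch) requests
   UTF-8 bytes whatever the execution character set happens to be. */
const unsigned char *first_non_bmp_utf8(void)
{
    return (const unsigned char *)u8"\U00010000";
}

/* Count the bytes that precede the terminating zero. */
size_t utf8_length(const unsigned char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

With these, utf8_length(first_non_bmp_utf8()) should be 4, the bytes being
F0 90 80 80, since the compiler does the encoding and the source only names
the code point.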
 
> I am trying to list some suggestions, make a proposal, and ask you for what you
> are doing or other people/standards/organizations/languages are planning to do.
> 
> - One could use a pair of code units, UTF-16 style:
>   \ud89a\udcba
C99 explicitly forbids this (i.e., a conforming compiler is required to
issue a diagnostic).
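
If you have existing data in that paired form, converting it to a single code
point is plain arithmetic. The following is only a sketch; the function name
is hypothetical and not taken from ICU or from any standard.

```c
#include <assert.h>

/* Combine a UTF-16 surrogate pair into the code point it encodes.
   The name and signature are illustrative assumptions. */
unsigned long surrogates_to_code_point(unsigned int high, unsigned int low)
{
    /* high must be a leading surrogate, low a trailing one */
    assert(high >= 0xD800u && high <= 0xDBFFu);
    assert(low >= 0xDC00u && low <= 0xDFFFu);

    /* 10 bits come from each half, offset by the size of the BMP */
    return 0x10000ul
         + ((unsigned long)(high - 0xD800u) << 10)
         + (unsigned long)(low - 0xDC00u);
}
```

For the pair quoted above, surrogates_to_code_point(0xD89A, 0xDCBA) gives
0x368BA, so the same character would be written \U000368BA.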
> - In UTR 18, Mark Davis suggests a syntax
>   \vhhhhhh
>   with exactly 6 hexadecimal digits.
>   Drawback: I am afraid of confusion with the ANSI C language \v
>   for the vertical TAB.
You are correct, this is not an option.
 
> - How about - and I propose this here -
>   \whhhhhh
>   with, again, 6 hexadecimal digits?
>   It is simple, and for English speakers it has the benefit of
>   being mnemonic because of connotations with "wide" and the
>   letter being called a "double u" - which is more than a "\u" :-)
>   It is not used in C.
Correct, it is reserved for future extensions.
However, the benefit over \Uxxxxxxxx (which requires exactly 8 digits)
is only a small gain in length (usually 3 "000"), and it is not a standard...
Marco Cimarosti answered:
> Frank da Cruz wrote:
> > In the Kermit language, we use:
> >  \x{yyy...}
> 
> Nice. I wish C was like that. It's certainly more practical than changing C
> and C++ standards every time a character encoding standard adds the next bit.
> ('Cause we *will* see a 32-bit character set sooner or later, won't we?)
Sorry. The C standard *is* this way (but without the {}).
I mean, the \x notation in C is variable-length, and adjusts to
the underlying encoding (i.e., in a program targeting EBCDIC, space is \x40;
and in a (theoretical) program targeting UTF-16, Amacron is \x0100, and the first
code point outside the BMP is \xD800\xDC00).
On the other hand, the \u and \U notations are charset-independent; so \u0024
and \U00000024 are both dollar signs ($), whatever the underlying encoding
used (be it EBCDIC, UTF-8, etc.)
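
A small sketch of the difference, assuming a compiler that accepts \u in
string literals (the function names are hypothetical):

```c
#include <string.h>

/* \u0024 names the character DOLLAR SIGN. The compiler maps it into the
   execution character set, so this comparison holds on ASCII and EBCDIC
   targets alike. */
int ucn_is_dollar(void)
{
    return strcmp("\u0024", "$") == 0;
}

/* \x24 names the code value 0x24. That is '$' only where the execution
   character set is ASCII-based; on an EBCDIC target it is something else. */
int hex_is_dollar(void)
{
    return '\x24' == '$';
}
```

The first function should return 1 everywhere; the second returns 1 only
when '$' happens to have the value 0x24 in the target encoding.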
 
Hope it helps,
Antoine