Re: Displaying Plane 1 characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Nov 11 1998 - 22:20:46 EST


Keld opined:

>
> > >Java is also going to get problems: "\u10208" would be mistaken as
> > >U+1020 <undefined Mongolian character> U+0038 DIGIT EIGHT instead
> > >of U-00010208 ETRUSCAN LETTER TH.
> >
> > \uD800\uDE08 is an obvious answer for Java, since Java's 16-bit data
> > type implies its use of UTF-16.
>
> You should not use \uxxxx notation for surrogates,
> as surrogates are not characters in either Unicode or 10646,
> and thus the short identifiers cannot be used.
>

Technically, Keld is correct about the use of short identifiers
in UTF-16. Amendment 9 of 10646-1 specifies the use of the
long identifiers (U-00010208 ~ 00010208) or the short identifiers
(U+0208 ~ 0208) only for *characters*, and not for "RC-elements",
which is the 10646 term for surrogate code values. So Amendment
9 does not sanction "U+D800 U+DE08".

However, Amendment 1, which defines UTF-16, does in fact use
four digit hex notation (without the "U+") to refer to RC-element
values:

[0048][0069][D800][DC00][0021][0021]

This is so obvious, and so obviously required, that I doubt it
even crossed anybody's mind when reviewing the text for Amendment 9.
If you don't do something like this, there is no way to even talk
about the values of UTF-16, and since UTF-16 is a normative part
of the standard, people do have to talk about and represent the
values -- and they do so in four-digit hex.
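
To make the four-digit notation concrete, here is a minimal Java sketch that takes the six 16-bit values from the example above and recovers the code positions they stand for. The class and method names (Utf16UnitsDemo, combine, decode, bracketed) are made up for illustration and come from neither standard; the arithmetic is the usual UTF-16 mapping, and an unpaired value is simply passed through unchanged.

```java
import java.util.Arrays;

// Illustrative sketch only; all names here are hypothetical.
final class Utf16UnitsDemo {
    // The six 16-bit values from the bracketed example: "Hi", one pair, "!!".
    static final char[] UNITS = {0x0048, 0x0069, 0xD800, 0xDC00, 0x0021, 0x0021};

    // Combine a high and a low value into one code position (UTF-16 mapping).
    static int combine(int high, int low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Walk the 16-bit values, merging each high+low pair into one code position.
    static int[] decode(char[] units) {
        int[] out = new int[units.length];
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char u = units[i];
            boolean high = u >= 0xD800 && u <= 0xDBFF;
            boolean nextIsLow = i + 1 < units.length
                    && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
            if (high && nextIsLow) {
                out[n++] = combine(u, units[i + 1]);
                i++; // the low half has been consumed
            } else {
                out[n++] = u; // ordinary value, or an unpaired one passed through
            }
        }
        return Arrays.copyOf(out, n);
    }

    // Render the values in the bracketed four-digit hex style.
    static String bracketed(char[] units) {
        StringBuilder sb = new StringBuilder();
        for (char u : units) {
            sb.append(String.format("[%04X]", (int) u));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bracketed(UNITS));
        for (int cp : decode(UNITS)) {
            System.out.printf("%08X%n", cp); // eight-digit form, one per line
        }
    }
}
```

The six values decode to five code positions: 00000048, 00000069, 00010000, 00000021, 00000021.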

That said, it is also clear that the Unicode Standard sanctions
the use of "U+" with surrogate values, as an extension of the
identifier mechanism in 10646-1, with the caveat that an unpaired
surrogate value, e.g. "U+D800", is *not* a representation of
a character, whereas properly paired surrogates (a high surrogate
followed by a low surrogate), e.g. "U+D800 U+DC00",
*are* a representation of a character.
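
The paired/unpaired distinction is easy to check mechanically. The following is a rough Java sketch, not anything taken from either standard: the class and method names are hypothetical, and Character.isHighSurrogate and Character.isLowSurrogate are java.lang.Character methods that were added in Java 5, after this message was written.

```java
// Illustrative sketch only; SurrogatePairingDemo and wellFormed are hypothetical names.
final class SurrogatePairingDemo {
    // True only if every surrogate value in the sequence belongs to a
    // high (D800..DBFF) followed immediately by a low (DC00..DFFF).
    static boolean wellFormed(char[] units) {
        for (int i = 0; i < units.length; i++) {
            if (Character.isHighSurrogate(units[i])) {
                if (i + 1 >= units.length || !Character.isLowSurrogate(units[i + 1])) {
                    return false; // high value with no low value after it
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(units[i])) {
                return false; // low value with no high value before it
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(wellFormed(new char[] {0xD800}));         // unpaired
        System.out.println(wellFormed(new char[] {0xD800, 0xDC00})); // paired
    }
}
```

A lone D800 fails the check, while D800 followed by DC00 passes.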

It is pretty clear that this is what people want to be able to
do for consistent representation of coded values in UTF-16,
and I consider it grounds for a defect report against Clause
6.5 of ISO/IEC 10646-1, rather than the basis for a "thou shalt not"
broadcast to the Unicode list.

As for Java, the Java convention of "\u" prefixing is completely
outside the scope of 10646-1, and Java implementers should feel
free to do whatever is required for Java implementations to do
the right thing, heedless of chiding from SC2.
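
For what it is worth, the two spellings from the quoted exchange can be compared directly in Java. This is only a sketch: EscapeDemo and describe are hypothetical names, and the code-point methods it relies on (String.codePointAt, Character.charCount) arrived in Java 5, so they were not available when this thread was written.

```java
// Illustrative sketch only; EscapeDemo and describe are hypothetical names.
final class EscapeDemo {
    // The compiler takes exactly four hex digits after the escape prefix,
    // so this literal is the character 1020 followed by the digit '8'.
    static final String MISREAD = "\u10208";

    // The UTF-16 surrogate pair for U-00010208, written as two escapes.
    static final String PAIRED = "\uD800\uDE08";

    // List the code points in a string, reading a surrogate pair as one.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 1 for BMP values, 2 for a pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(MISREAD)); // U+1020 U+0038
        System.out.println(describe(PAIRED));  // U+10208
    }
}
```

The five-digit spelling yields two characters, while the two-escape spelling yields the single intended character.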

--Ken


