From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 27 2003 - 15:42:43 EST
Frank Tang asked:
> >> This discussion has been centered around UTF-8. But I hope the
> >>corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
> >>
> >>. for UTF-32: occurrences of 'surrogates' are ill-formed.
> >>
> >>
> >>
> How about a UTF-32 sequence whose 4 bytes represent a value > U+10FFFF?
> Are they considered ill-formed? Should they?
Yes, they are ill-formed.
Since all the encoding forms are based on the Unicode scalar values,
and since the Unicode scalar values are *defined* to be the
ranges 0x0000..0xD7FF and 0xE000..0x10FFFF, any attempt to represent
a code point higher than U+10FFFF in *any* encoding form is
ill-formed.
This will be called out explicitly in the Unicode 4.0 text, in
case anyone still has the question:
" * Any UTF-32 code unit greater than 0010FFFF₁₆ is
ill-formed."
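For anyone who wants to see the rule as a check, here is a minimal
sketch in Python. It assumes the input is a byte string of 4-byte
UTF-32 code units with a known byte order and no BOM handling. The
function names are made up for illustration and are not taken from
the standard or from any library.

```python
# Sketch only: helper names are hypothetical, not from the Unicode Standard.
MAX_SCALAR = 0x10FFFF       # highest Unicode scalar value
SURROGATE_FIRST = 0xD800    # first surrogate code point
SURROGATE_LAST = 0xDFFF     # last surrogate code point


def is_unicode_scalar_value(value: int) -> bool:
    """True if value lies in 0x0000..0xD7FF or 0xE000..0x10FFFF."""
    return (0 <= value < SURROGATE_FIRST) or (SURROGATE_LAST < value <= MAX_SCALAR)


def utf32_ill_formed_offsets(data: bytes, byteorder: str = "big") -> list:
    """Return the byte offsets of ill-formed UTF-32 code units in data."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes long")
    bad = []
    for offset in range(0, len(data), 4):
        unit = int.from_bytes(data[offset:offset + 4], byteorder)
        # Surrogates and anything above 0x10FFFF are not scalar values.
        if not is_unicode_scalar_value(unit):
            bad.append(offset)
    return bad
```

Under these assumptions, a code unit holding 0x0000D800 or 0x00110000
is reported as ill-formed, while 0x0010FFFF is accepted.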
I can keep answering these questions, but I can also assure
everyone that the UTC worked *very* hard this time around to
make the character encoding model much clearer in the Unicode 4.0
text, and to anticipate all these edge cases.
--Ken