From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Feb 25 2003 - 18:36:40 EST
Frank Tang asked:
> so the UTF-8 sequence which represent U+FFFE U+FFFF and U+{1-11}FFF{E,F}
> are consider legal in Unicode 4.0
Yes. Such sequences are also legal in Unicode 3.0, 3.1, and 3.2.
The Unicode Standard, Version 3.0 specified, on p. 46:
"To ensure that round-trip transcoding is possible, a UTF
mapping *must also* map invalid Unicode scalar values to
unique code value sequences. These invalid scalar values
include FFFE₁₆, FFFF₁₆, and unpaired
surrogates."
The Unicode Standard, Version 3.1 disallowed non-shortest
UTF-8 sequences, which it defined to be illegal. It disallowed
the *generation* of irregular UTF-8 sequences (which involve
the mapping of surrogate code points). Unicode 3.1 also
defined the term "noncharacter", which includes U+FFFE,
U+FFFF, the last two code points on each of the other planes,
and U+FDD0..U+FDEF, and all of *those* values were perfectly
valid in UTF-8, as shown by Table 3.1B, "Legal UTF-8 Byte
Sequences."
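
To make the noncharacter set concrete, here is a small Python sketch. The function name `is_noncharacter` is made up for illustration and is not taken from the standard. It enumerates the code points described above and shows that Python's built-in UTF-8 encoder accepts them.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a noncharacter code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF: 32 contiguous noncharacters in the BMP
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 34 plane-final code points + 32 in U+FDD0..U+FDEF

# Each of these encodes without error under Python's strict UTF-8 codec.
for cp in (0xFFFE, 0xFFFF, 0x1FFFE, 0x10FFFF, 0xFDD0):
    print(f"U+{cp:04X}", chr(cp).encode("utf-8").hex())
```

That the encoder accepts these code points is an observation about Python's codec, which happens to agree with the table cited above; it is not itself a statement of conformance.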
The Unicode Standard, Version 3.2, changed the term "illegal"
to "ill-formed", and disallowed all ill-formed UTF-8
sequences, including the CESU-8-style irregular sequences.
However, once again, noncharacters are perfectly valid in
Table 3.1B, Legal UTF-8 Byte Sequences.
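
As a rough illustration of the two kinds of sequence that were tightened up, the sketch below feeds a non-shortest form and a CESU-8-style irregular sequence to Python's strict UTF-8 decoder. The helper name `decodes` is hypothetical, and the decoder's behavior is used only as a convenient stand-in for the rules in the standard.

```python
def decodes(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


non_shortest = b"\xc0\xaf"               # two-byte (overlong) form of U+002F
irregular = b"\xed\xa0\x80\xed\xb0\x80"  # surrogates D800 DC00, three bytes each
well_formed = b"\xf0\x90\x80\x80"        # U+10000 as a single four-byte sequence
noncharacter = b"\xef\xbf\xbf"           # U+FFFF

for name, data in [("non-shortest", non_shortest), ("irregular", irregular),
                   ("well-formed", well_formed), ("noncharacter", noncharacter)]:
    print(name, decodes(data))
```

Under these assumptions the first two are rejected and the last two are accepted.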
The relevant text from Unicode 4.0, Chapter 3, is:
"D28 Unicode scalar value: any Unicode code point except
high-surrogate and low-surrogate code points."
"D36 UTF-8 encoding form: the Unicode encoding form which
assigns each Unicode scalar value to an unsigned byte sequence
of one to four bytes in length, as specified in Table 3-5.
* Any UTF-8 byte sequence that does not match the patterns
listed in Table 3-6 is ill-formed."
And "Table 3-5" is basically equivalent to Table 3-1 of
Unicode 3.0 (see p. 47), while "Table 3-6" is equivalent
to Table 3.1B "Legal UTF-8 Byte Sequences", published
in Unicode 3.2.
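
A table-driven check along the lines of D36 might look like the following sketch. The byte ranges are a transcription of the well-formed byte sequence patterns and should be checked against the published table before being relied on; the names `WELL_FORMED` and `is_well_formed_utf8` are illustrative only.

```python
# Each row: allowed range for the first byte, then for each trailing byte.
WELL_FORMED = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),  # excludes surrogates
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),  # includes U+FFFE, U+FFFF
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is a concatenation of sequences matching the table."""
    i = 0
    while i < len(data):
        for pattern in WELL_FORMED:
            chunk = data[i:i + len(pattern)]
            if len(chunk) == len(pattern) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, pattern)
            ):
                i += len(pattern)
                break
        else:
            return False  # no row matched at offset i
    return True


print(is_well_formed_utf8(b"\xef\xbf\xbe"))  # U+FFFE, a noncharacter
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # U+D800, a surrogate code point
```

Note that nothing in the table singles out noncharacters: the `EE..EF` row covers U+FFFE and U+FFFF like any other code point in that range, while the `ED` row stops its second byte at `9F` precisely to exclude the surrogates.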
If you read through those definitions from Unicode 4.0 carefully,
you will see that UTF-8 representing a noncharacter is perfectly
valid, but UTF-8 representing an unpaired surrogate code point
is ill-formed (and therefore disallowed).
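
The distinction falls straight out of D28, as this sketch tries to show on the encoding side. `is_scalar_value` and `try_encode` are hypothetical helper names, and Python's strict codec is again only a stand-in for the definitions quoted above.

```python
def is_scalar_value(cp: int) -> bool:
    """D28: any code point except high- and low-surrogate code points."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def try_encode(cp: int):
    """Return the UTF-8 bytes for cp, or None if the strict codec refuses."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None


for cp in (0xFFFF, 0xD800, 0xDFFF, 0x10FFFE):
    print(f"U+{cp:04X}", is_scalar_value(cp), try_encode(cp))
```

A noncharacter such as U+FFFF is a scalar value and so has a UTF-8 form; a surrogate code point is not a scalar value, so D36 assigns it no byte sequence at all.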
Through all of these tightenings of the wording regarding
UTF-8, it has continuously been true (for Unicode 3.0, 3.1,
3.2, and 4.0) that UTF-8 for noncharacter code points is valid.
ISO/IEC 10646-1:2000 had a flaw in it, in that Annex D
contained language in a note indicating that the UTF-8
for 0000FFFE and 0000FFFF was not defined (while allowing
0001FFFE, 0001FFFF, etc.). That flaw was corrected in
Amendment 1 to ISO/IEC 10646-1:2000, so at this point, the
definition in the Unicode Standard and the definition in
10646 are perfectly aligned.
So let me repeat the summary, for those who have gotten
this far:
UTF-8 for noncharacters is *valid*.
UTF-8 for surrogate code points is *ill-formed*. (Unicode-ese)
UTF-8 for RC-elements is *undefined*. (10646-ese)
--Ken