From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 14 2004 - 15:27:57 CST
Marcin Kowalczyk noted:
> Unicode has the following property. Consider sequences of valid
> Unicode characters: from the range U+0000..U+10FFFF, excluding
> non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
> U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
> in any UTF-n, and nothing else is expected from UTF-n.
Actually not quite correct. See Section 3.9 of the standard.
The character encoding forms (UTF-8, UTF-16, UTF-32) are defined
on the range of scalar values for Unicode: 0..D7FF, E000..10FFFF.
Each of the UTF's can represent all of those scalar values, and
can be converted accurately to either of the other UTF's for
each of those values. That *includes* all the code points used
for noncharacters.
U+FFFF is a noncharacter. It is not assigned to an encoded
abstract character. However, it has a well-formed representation
in each of the UTF-8, UTF-16, and UTF-32 encoding forms,
namely:
UTF-8: <EF BF BF>
UTF-16: <FFFF>
UTF-32: <0000FFFF>
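These three representations can be checked directly in Python, whose
codecs follow the standard on this point — a minimal sketch:

```python
# U+FFFF is a noncharacter, but it still has a well-formed,
# unique representation in each Unicode encoding form.
ch = "\uffff"

assert ch.encode("utf-8") == b"\xef\xbf\xbf"
assert ch.encode("utf-16-be") == b"\xff\xff"
assert ch.encode("utf-32-be") == b"\x00\x00\xff\xff"

# And each representation round-trips back to the same scalar value.
assert b"\xef\xbf\xbf".decode("utf-8") == ch
```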
> With the exception of the set of non-characters being irregular and
> IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top
> limit caused by UTF-16, this gives a precise and unambiguous set of
> values for which encoders and decoders are supposed to work.
Well, since conformant encoders and decoders must work for all
the noncharacter code points as well, and since U+10FFFF, however
odd numerologically, is itself precise and unambiguous, I don't
think you even need these qualifications.
> Well,
> except non-obvious treatment of a BOM (at which level it should be
> stripped? does this include UTF-8?).
The handling of BOM is relevant to the character encoding *schemes*,
where the issues are serialization into byte streams and interpretation
of those byte streams. Whether you include U+FEFF in text or not
depends on your interpretation of the encoding scheme for a Unicode
byte stream.
At the level of the character encoding forms (the UTF's), the
handling of BOM is just as for any other scalar value, and is
completely unambiguous:
UTF-8: <EF BB BF>
UTF-16: <FEFF>
UTF-32: <0000FEFF>
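The form/scheme distinction is visible in Python's codecs, if a concrete
illustration helps: the "-be"/"-le" codecs correspond to the encoding
forms (U+FEFF is just another scalar value), while codecs like "utf-16"
implement the encoding scheme, adding a BOM on encode and stripping it
on decode.

```python
import codecs

# Encoding *form* level: U+FEFF encodes like any other scalar value.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"
assert "\ufeff".encode("utf-32-be") == b"\x00\x00\xfe\xff"

# Encoding *scheme* level: the "utf-16" codec serializes with a leading
# BOM, and on decode uses it to select byte order, then strips it.
assert "abc".encode("utf-16")[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert (codecs.BOM_UTF16_BE + "abc".encode("utf-16-be")).decode("utf-16") == "abc"
```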
>
> A variant of UTF-8 which includes all byte sequences yields a much
> less regular set of abstract string values. Especially if we consider
> that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
> 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
> order for a BOM to fulfill its role).
This is incorrect. <EF BF BE> *is* valid UTF-8, just as <FFFE> is
valid UTF-16. In both cases these are valid representations of
a noncharacter, which should not be used in public interchange,
but that is a separate issue from the fact that the code unit
sequences themselves are "well-formed" by definition of the
Unicode encoding forms.
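A conformant decoder bears this out — <EF BF BE> decodes without error
to the noncharacter U+FFFE, as a quick check in Python shows:

```python
# <EF BF BE> is well-formed UTF-8: it decodes (strictly) to U+FFFE.
assert b"\xef\xbf\xbe".decode("utf-8") == "\ufffe"

# Likewise <FFFE> is a well-formed UTF-16 code unit sequence
# (here read explicitly as big-endian, i.e. at the form level).
assert b"\xff\xfe".decode("utf-16-be") == "\ufffe"
```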
>
> Question: should a new programming language which uses Unicode for
> string representation allow non-characters in strings?
Yes.
> Argument for
> allowing them: otherwise they are completely useless at all, except
> U+FFFE for BOM detection. Argument for disallowing them: they make
> UTF-n inappropriate for serialization of arbitrary strings, and thus
> non-standard extensions of UTF-n must be used for serialization.
Incorrect. See above. No extensions of any of the encoding forms
are needed to handle noncharacters correctly.
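The full claim can be verified by brute force: every noncharacter — the
32 in U+FDD0..U+FDEF plus the two at the end of each of the 17 planes —
round-trips through each standard encoding form with no extension needed.
A sketch:

```python
# Enumerate all 66 Unicode noncharacters.
noncharacters = (
    list(range(0xFDD0, 0xFDF0)) +
    [0x10000 * plane + offset
     for plane in range(0x11)          # planes 0..16
     for offset in (0xFFFE, 0xFFFF)]
)
assert len(noncharacters) == 66

# Each encodes and decodes losslessly in every encoding form.
for cp in noncharacters:
    s = chr(cp)
    for form in ("utf-8", "utf-16-be", "utf-32-be"):
        assert s.encode(form).decode(form) == s
```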
--Ken
This archive was generated by hypermail 2.1.5 : Tue Dec 14 2004 - 15:30:19 CST