From: Marcin ‘Qrczak’ Kowalczyk (qrczak@knm.org.pl)
Date: Wed May 28 2008 - 05:11:19 CDT
2008/5/28 Kenneth Whistler <kenw@sybase.com>:
>> UTF-16, after all, is stateful: if you lose the BOM,
>> things can look very different.
>
> That is true of the UTF-16 encoding *scheme*. (See TUS 5.0,
> D98, p. 106.) That is because in the UTF-16 encoding scheme,
> an initial BOM is itself a stateful switch for byte order.
> UTF-16BE and UTF-16LE, on the other hand, are not stateful.
It is a pity that UTF-8 is somewhat ambiguous about whether it is
stateful. UTF-8 with a BOM is stateful: the decoder must remember
whether it is still at the beginning of the stream, where a leading
U+FEFF is a BOM to be dropped, or already past it, and the encoder
must remember whether it is at the beginning, so that it knows to
emit U+FEFF twice when the data itself begins with U+FEFF. UTF-8
without any special treatment of a leading U+FEFF is stateless.
Both variants of UTF-8 are in use. It would be better to
distinguish them explicitly, just as UTF-16 is distinguished from
UTF-16BE and UTF-16LE.
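
To make the difference concrete, here is a minimal sketch in Python. It assumes that Python's "utf-8-sig" codec is a fair stand-in for the BOM-aware variant and that the plain "utf-8" codec stands in for the variant with no special treatment of U+FEFF. The helper decode_chunks and the sample byte strings are made up for illustration.

```python
import codecs

BOM = "\ufeff"

def decode_chunks(encoding, chunks):
    """Feed byte chunks to an incremental decoder and return the decoded pieces."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush any pending bytes
    return parts

# Two chunks that both start with the bytes EF BB BF (U+FEFF in UTF-8).
chunks = [b"\xef\xbb\xbfab", b"\xef\xbb\xbfcd"]

# BOM-aware variant: the same three bytes are dropped at the start of the
# stream but kept later on, so the decoder has to remember where it is.
bom_aware = decode_chunks("utf-8-sig", chunks)

# Plain variant: U+FEFF is an ordinary character at every position.
plain = decode_chunks("utf-8", chunks)

# Encoder side: data that itself begins with U+FEFF comes out with the
# byte sequence EF BB BF twice, once as the BOM and once as data.
doubled = (BOM + "abc").encode("utf-8-sig")
round_trip = doubled.decode("utf-8-sig")

print(bom_aware)
print(plain)
print(doubled)
```

Under these assumptions the first decoder treats identical bytes differently depending on position, which is exactly the state in question, while the second does not.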
-- Marcin Kowalczyk qrczak@knm.org.pl http://qrnik.knm.org.pl/~qrczak/