From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Jan 20 2005 - 05:41:46 CST
"Arcane Jill" <arcanejill@ramonsky.com> writes:
>> The main point is that BOM will not be specially treated in the UNIX world,
>> regardless of what Unicode says. So I guess MS does not want its text files
>> to be read in the UNIX world. Unicode has made the mistake of favoring a
>> special platform over all the others.
>
> It would be more accurate to say that Unicode Conformant Processes
> often do not care if non-Unicode-Conformant Processes can't read them.
> Unicode has therefore "made the mistake" of favoring processes that
> conform the Unicode Standard over those that don't.
Wrong. Nobody complains about that; it is a tautology.
We complain that a process must treat the BOM in UTF-8 specially in order
to be called a "Unicode conformant process". Please don't take that
requirement for granted, because *this* is what is being complained about.
> The "locale" notion, as Lars made plain to us last year, imposes a
> limitation that one cannot correctly interpret two different documents
> having different encodings in the same "locale". This, to me, sucks.
But that is a fact of life. Documents are usually not automatically
tagged with their encoding; it must be conveyed through metadata.
Even if the locale encoding is stateful, it must be possible to reset
the state to the initial state (conforming programs make the appropriate
conversion calls to emit the byte sequences that reset the state,
e.g. iconv with NULL as the input buffer).
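To illustrate the reset property (using Python's codecs module as a
stand-in for the iconv call mentioned above, purely as an analogy), a
stateful encoding such as ISO-2022-JP can always be flushed back to its
initial state, after which further streams can be appended safely:

```python
import codecs

# ISO-2022-JP is stateful: encoding a kanji switches the stream into a
# JIS shift state via an escape sequence.
enc = codecs.getincrementalencoder('iso2022_jp')()
body = enc.encode('\u65e5')        # U+65E5, a JIS X 0208 character
tail = enc.encode('', final=True)  # flush: return to the initial state

# After the flush the stream is in the initial state, so it decodes
# cleanly and plain ASCII text can simply be appended.
assert (body + tail).decode('iso2022_jp') == '\u65e5'

# The one-shot codec performs the same reset implicitly: the output ends
# with ESC ( B, the escape back to the initial (ASCII) state.
assert '\u65e5'.encode('iso2022_jp').endswith(b'\x1b(B')
```

This is the property the next paragraph contrasts with UTF-8 plus BOM:
for a merely stateful encoding, a finite reset sequence always exists.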
This is not possible for UTF-8 with a BOM: after you have emitted some
text, there is no byte sequence you can emit such that appending another
stream yields a valid stream whose content is the concatenation of the
two. It is worse than merely being stateful.
This makes UTF-8 with a BOM incompatible with the Unix encoding model.
UTF-8 without a BOM works fine.
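A minimal sketch of the concatenation problem (the sample strings are of
course made up): joining two BOM-prefixed UTF-8 streams at the byte
level, as cat(1) or any stream pipeline would, leaves the second BOM in
the middle of the data, where it decodes as a stray U+FEFF:

```python
BOM = b'\xef\xbb\xbf'  # UTF-8 encoding of U+FEFF

part1 = BOM + 'Hello, '.encode('utf-8')
part2 = BOM + 'world'.encode('utf-8')

# Byte-level concatenation, as done by cat(1) or any stream pipeline.
joined = part1 + part2

# The second BOM does not disappear; it becomes a ZERO WIDTH NO-BREAK
# SPACE embedded in the middle of the text.
assert joined.decode('utf-8') == '\ufeffHello, \ufeffworld'

# Without BOMs, concatenating valid streams gives a valid stream with
# the concatenated contents -- exactly the property the BOM breaks.
assert (b'Hello, ' + b'world').decode('utf-8') == 'Hello, world'
```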
That's why the BOM is being ignored. It cannot be handled transparently
by the recoding machinery, so each program would have to handle it
itself. And it is conceptually impossible in cases where "the beginning
of the text stream" is not a well-defined concept - programs won't
be "fixed", because there is no correct fix other than rejecting the
idea of a UTF-8 BOM.
There are also security implications of handling the BOM automatically.
This is the only place where UTF-8 does not yield a unique encoding
of a sequence of code points: it may include the BOM or not, and the
two results are expected to be treated the same. So depending on whether
a comparison is performed in terms of code points or in terms of UTF-8
byte strings, the outcome may differ. I think this is the reason
Unicode changed, or clarified, the interpretation of overlong UTF-8
sequences, declaring them simply invalid instead of allowing decoders
to process them: to ensure that all processes have a consistent view
of which strings are equal. But it forgot about the BOM, which has
similar implications and should meet the same fate.
-- 
   __("<     Marcin Kowalczyk
   \__/      qrczak@knm.org.pl
    ^^       http://qrnik.knm.org.pl/~qrczak/
This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 05:43:12 CST