Re: 32'nd bit & UTF-8

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Jan 20 2005 - 05:41:46 CST


    "Arcane Jill" <arcanejill@ramonsky.com> writes:

    >> The main point is that the BOM will not be treated specially in the
    >> UNIX world, regardless of what Unicode says. So I guess MS does not
    >> want its text files to be read in the UNIX world. Unicode has made
    >> the mistake of favoring a special platform over all the others.
    >
    > It would be more accurate to say that Unicode Conformant Processes
    > often do not care if non-Unicode-Conformant Processes can't read them.
    > Unicode has therefore "made the mistake" of favoring processes that
    > conform to the Unicode Standard over those that don't.

    Wrong. Nobody complains about that; it's a tautology.

    We complain that a process must treat the BOM in UTF-8 specially in
    order to be called a "Unicode conformant process". Please don't take
    that requirement for granted, because *this* is what is being
    complained about.

    > The "locale" notion, as Lars made plain to us last year, imposes a
    > limitation that one cannot correctly interpret two different documents
    > having different encodings in the same "locale". This, to me, sucks.

    But that's a fact of life. Documents are usually not automatically
    tagged with their encoding; the encoding has to be conveyed in
    metadata.

    Even if the locale encoding is stateful, it must be possible to reset
    the state to the initial state (conforming programs make the
    appropriate conversion calls to emit the byte sequence which resets
    the state, e.g. iconv() with NULL as the input buffer).
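
    A minimal C sketch of such a reset (assuming a POSIX iconv() and
    ISO-2022-JP as an example of a stateful target encoding; the input
    string is made up for illustration):

        #include <iconv.h>
        #include <stdio.h>

        /* Convert one UTF-8 chunk to ISO-2022-JP, then flush the shift
         * state so that independently produced chunks concatenate into
         * one valid stream. */
        int main(void)
        {
            iconv_t cd = iconv_open("ISO-2022-JP", "UTF-8");
            if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

            char in[] = "\xE6\x97\xA5";        /* U+65E5 in UTF-8 */
            char out[64];
            char *inp = in, *outp = out;
            size_t inleft = sizeof in - 1, outleft = sizeof out;

            /* Convert the text itself... */
            iconv(cd, &inp, &inleft, &outp, &outleft);
            /* ...then a NULL input buffer means "emit whatever byte
             * sequence returns the output to the initial state". */
            iconv(cd, NULL, NULL, &outp, &outleft);

            fwrite(out, 1, sizeof out - outleft, stdout);
            iconv_close(cd);
            return 0;
        }

    Because the reset appends the return-to-initial shift sequence, two
    chunks flushed this way can be catenated byte for byte and still
    form a valid stream.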

    This is not possible for UTF-8 with a BOM: after you have emitted some
    text, there is nothing you can emit such that appending another stream
    yields a valid stream whose contents are the concatenation of the two.
    It's worse than merely being stateful.
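
    A short C sketch of the problem (the strings are made up, just to
    show the bytes):

        #include <stdio.h>
        #include <string.h>

        /* Two UTF-8 chunks, each starting with its own BOM.  Their
         * byte-wise concatenation is not the UTF-8 encoding of the
         * concatenated texts: a stray U+FEFF sits in the middle. */
        int main(void)
        {
            const char a[] = "\xEF\xBB\xBF" "foo";
            const char b[] = "\xEF\xBB\xBF" "bar";

            char cat[16];
            snprintf(cat, sizeof cat, "%s%s", a, b);

            /* Prints EF BB BF 66 6F 6F EF BB BF 62 61 72; the second
             * EF BB BF now decodes to U+FEFF, not to a BOM. */
            for (const char *p = cat; *p; p++)
                printf("%02X ", (unsigned char)*p);
            putchar('\n');
            return 0;
        }

    Nothing the first writer can append after "foo" will undo the BOM
    that the second writer prepends; hence "worse than stateful".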

    This makes UTF-8 with a BOM incompatible with the Unix encoding model.
    UTF-8 without a BOM works fine.

    That's why the BOM is ignored. It cannot be handled transparently by
    the recoding machinery, so each program would have to handle it
    itself. And it's conceptually impossible in cases where "the beginning
    of the text stream" is not a well-defined concept (think of a filter
    reading from the middle of a pipeline, or tail -f on a growing file):
    programs won't be "fixed", because there is no correct fix other than
    rejecting the idea of a UTF-8 BOM.

    There are also security implications of handling the BOM
    automatically. This is the only place where UTF-8 does not yield a
    unique encoding of a sequence of code points: the BOM may be generated
    or not, and both results are expected to be treated the same. So
    depending on whether comparison is performed in terms of code points
    or in terms of UTF-8 byte strings, the outcome may differ. I think
    this is the reason Unicode changed, or clarified, the interpretation
    of overlong UTF-8 sequences (e.g. 0xC0 0xAF as an overlong encoding of
    "/"), declaring them simply invalid instead of allowing decoders to
    process them, so that all decoders have a consistent view of which
    strings are equal. But it forgot about the BOM, which has similar
    implications and should meet the same fate.
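
    A C sketch of the resulting equality mismatch (skip_bom() is a
    hypothetical helper standing in for a BOM-stripping decoder):

        #include <stdio.h>
        #include <string.h>

        /* Strip a leading UTF-8 BOM, as a lenient decoder would. */
        static const char *skip_bom(const char *s)
        {
            return strncmp(s, "\xEF\xBB\xBF", 3) == 0 ? s + 3 : s;
        }

        int main(void)
        {
            const char *a = "admin";
            const char *b = "\xEF\xBB\xBF" "admin";

            /* Byte-level comparison: the strings differ. */
            printf("bytes equal:       %d\n", strcmp(a, b) == 0);
            /* Code-point-level comparison, BOM stripped: equal. */
            printf("code points equal: %d\n",
                   strcmp(skip_bom(a), skip_bom(b)) == 0);
            return 0;
        }

    A check made on the bytes and a check made on the decoded text
    disagree about whether the two names are the same; that gap is
    exactly where such security holes live.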

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

