Re: Names for UTF-8 with and without BOM

From: John Cowan
Date: Sat Nov 02 2002 - 19:45:47 EST

    Tex Texin scripsit:

    > So when the parser gets JOECODE, I can understand ignoring the signature
    > and autodetection, but exactly how does it find the first "<"?

    Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might
    be UTF-32 big-endian, but we'll suppose the parser can't handle that).
    JOECODE is what's left. At worst it is in some other encoding and/or
    not well-formed, in which case you expect an error and you get one.
    Of course the processor knows that "<" is encoded as 0xFF in JOECODE....

    The point is that signatures don't decode to a character: processors in
    general, not just XML processors, are expected to skip them.
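
To illustrate the point about skipping signatures, here is a minimal Python sketch. It assumes the processor recognizes only the standard Unicode byte order marks exposed by Python's codecs module; the function name skip_signature and the SIGNATURES table are hypothetical and not taken from any particular parser.

```python
import codecs

# Known signatures, longest first: the UTF-32 LE mark (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE mark (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def skip_signature(data: bytes):
    """Return (encoding or None, data with any leading signature removed)."""
    for bom, encoding in SIGNATURES:
        if data.startswith(bom):
            # The signature is not a character of the document; drop it.
            return encoding, data[len(bom):]
    # No signature found: the caller has to determine the encoding otherwise.
    return None, data
```

Whatever comes back as the second element is what the processor actually decodes, so the signature never shows up as a character in the result.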

    > It must have to try all of the encodings known to it... ugh.

    In such a bad case, that's all you can do.
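
A rough Python sketch of "trying all of the encodings" might look like the following. The LESS_THAN table and the function candidate_encodings are hypothetical, and the table is illustrative rather than exhaustive. The only fact taken from this thread is that JOECODE, itself an imaginary encoding, represents "<" as 0xFF.

```python
# Bytes that encode "<" in each encoding this imaginary parser knows.
# JOECODE is the made-up encoding from this thread (0xFF for "<").
LESS_THAN = {
    "utf-8": b"\x3c",
    "utf-16-le": b"\x3c\x00",
    "JOECODE": b"\xff",
}


def candidate_encodings(data: bytes) -> list:
    """Return every known encoding under which data starts with '<'."""
    return [name for name, lt in LESS_THAN.items() if data.startswith(lt)]
```

More than one name can come back (a document that starts with 3C 00 matches both the UTF-8 and the UTF-16 little-endian entries), so the caller would still have to attempt a full decode with each candidate and report an error if none succeeds.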

