Re: Problem with SSI and BOM

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sun Sep 24 2006 - 23:45:59 CST

    On Sun, 24 Sep 2006, Doug Ewell wrote:

    > A process that claims to be able to "support Unicode"
    > should at least be able to follow the simple rule, "If the file or stream
    > starts with EF BB BF, throw them away and treat the remainder of the file or
    > stream as UTF-8."

    No, that would be incorrect if the character encoding of the data has been
    declared. It would be a mistake to start interpreting the octets of data
    in a manner other than the declared encoding, at least as long as the data
    is formally correct according to that encoding. If the declared encoding
    is, say, ISO-8859-1, then EF BB BF has a well-defined meaning (the three
    characters U+00EF, U+00BB, U+00BF) that has absolutely nothing to do with
    a BOM. Even if the data happens to violate a higher-level protocol, such
    as the HTML specification, it would be wrong to interpret it at the
    character level in a manner that violates fundamental protocols.
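
To make the point concrete, here is a small Python sketch of how the same
three octets read under different encodings. The sample data and the variable
names are made up for illustration; only the standard codecs are used.

```python
# Sample data (hypothetical): the octets EF BB BF followed by ASCII text.
data = bytes([0xEF, 0xBB, 0xBF]) + b"abc"

# Under a declared ISO-8859-1 encoding, each octet is one character:
# U+00EF, U+00BB, U+00BF, then "abc".
as_latin1 = data.decode("iso-8859-1")

# Under UTF-8, the three octets form the single character U+FEFF.
as_utf8 = data.decode("utf-8")

# Python's "utf-8-sig" codec decodes UTF-8 and drops a leading signature.
as_utf8_sig = data.decode("utf-8-sig")

print([hex(ord(c)) for c in as_latin1[:3]])
print([hex(ord(c)) for c in as_utf8[:1]])
print(as_utf8_sig)
```

Both readings are formally valid, which is why the declaration, when
present, has to decide between them.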

    > Even the W3C FAQ says: "In some browsers, the presence of a UTF-8 signature
    > will cause the browser to interpret the text as UTF-8 regardless of any
    > character encoding declarations to the contrary." That's exactly what it
    > should do.

    No, it's definitely something that browsers must not do when the character
    encoding has been declared, as it should be according to the protocols. In
    the absence of any declaration of encoding (HTTP header, meta tag, etc.),
    the browser may guess, and will, for obvious reasons. _Then_ the octets
    EF BB BF at the start of data may and should be treated as a good reason
    to make the heuristic guess that the data is UTF-8 encoded.
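
The order of precedence described here could be sketched roughly as
follows. This is only an illustration: the function name choose_encoding,
its parameters, and the ISO-8859-1 fallback are my assumptions, not any
browser's actual algorithm.

```python
import codecs
from typing import Optional


def choose_encoding(data: bytes,
                    declared: Optional[str] = None,
                    fallback: str = "iso-8859-1") -> str:
    """Pick an encoding name for data (hypothetical helper).

    A declared encoding always wins. Only when nothing has been declared
    is a leading EF BB BF taken as a hint that the data is UTF-8.
    """
    if declared:
        # Declared via HTTP header, meta tag, etc.: do not second-guess it.
        return declared
    if data.startswith(codecs.BOM_UTF8):
        # No declaration: the signature is a good reason to guess UTF-8.
        return "utf-8"
    # No declaration and no signature: fall back to an assumed default.
    return fallback
```

With data beginning with EF BB BF, choose_encoding(data, "iso-8859-1")
returns "iso-8859-1", whereas choose_encoding(data) returns "utf-8".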
    >
    > The argument about accidentally throwing away a U+FEFF that was intended as a
    > ZWNBSP is becoming increasingly irrelevant;

    I'm not sure exactly which argument you are referring to. When performing
    file insertion via SSI or otherwise, it is certainly safe and
    advisable to drop a U+FEFF if one appears at the start of an
    included file. There's hardly any argument about this, though there might
    be practical problems in implementing it (depending on how much control you
    have over the insertion mechanism).
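
If you do control the insertion mechanism, dropping the character is simple
once the included file has been decoded. A minimal sketch, assuming the
fragments are already decoded strings; strip_leading_bom and splice are
hypothetical names and do not correspond to any real SSI implementation.

```python
def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF at the very start of text, if there is one."""
    if text.startswith("\ufeff"):
        return text[1:]
    return text


def splice(before: str, included: str, after: str) -> str:
    """Insert included between before and after, without its leading U+FEFF.

    Only the first character of the included text is examined; a U+FEFF
    anywhere else in it is left untouched.
    """
    return before + strip_leading_bom(included) + after


page = splice("<p>Header</p>\n", "\ufeff<p>Included</p>\n", "<p>Footer</p>\n")
print(repr(page))
```

The practical difficulty mentioned above remains: a server module that
copies octets without decoding them would need the equivalent check at the
byte level.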

    > U+2060 has been recommended over
    > ZWNBSP for over 4 years now, and few applications used ZWNBSP anyway.

    I'm afraid U+2060 is not widely supported, to put it mildly.

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

