Re: BOM's at Beginning of Web Pages?

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Feb 18 2003 - 00:11:26 EST

    Tex Texin <tex at i18nguy dot com> wrote:

    > 2) Much of this discussion seems to take place without looking at the
    > timelines of the various docs.
    > The UTF-8 BOM is a relatively recent addition to Unicode. Further, it is
    > not necessary (i.e., it provides no information of value to the browser),
    > so modifying the specs to include it hardly seems worthwhile.

    The BOM-as-encoding-signature dates back to the publication of Unicode
    1.0, Volume 2 (p. 7) in 1992:

    "The code value of FEFF is assigned a 'signature' role in Informative
    Annex E to DIS 10646. Since UCS-2 is roughly equivalent to the Unicode
    encoding, this convention for discerning between forms UCS-2 and UCS-4
    is recommended to the attention of implementers of the Unicode standard.
    ...
    "Note that a character stream starting off with bytes FE and FF is
    unlikely to be ASCII text. Data streams (or files) that begin with
    16-bit NULL followed by ZERO-WIDTH NO-BREAK SPACE could be considered as
    likely to contain UCS-4 data; streams beginning with ZERO-WIDTH NO-BREAK
    SPACE alone could be considered as likely to contain Unicode values. An
    application receiving data streams of coded characters may either use
    these signatures to identify the coded representation form, or may
    ignore them and treat FEFF as the ZERO-WIDTH NO-BREAK SPACE character."

    Of course, the ambiguity of allowing applications to treat U+FEFF as
    either a BOM/signature *or* a ZWNBSP was later recognized to be a
    greater problem than originally anticipated, eventually leading to the
    creation of labels like "UTF-16LE." But there is at least a 10-year
    history of using U+FEFF as a signature to distinguish between encoding
    schemes, even though the original text only addressed the distinction
    between "the Unicode encoding" and UCS-4. (Remember that UTF-8, then
    called FSS-UTF, was not introduced until Unicode 1.1 in 1993.)
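
    As a rough illustration of the signature convention discussed above,
    here is a minimal Python sketch that inspects the first bytes of a
    stream for an encoded U+FEFF. The function name sniff_signature and
    the returned labels are assumptions made for this example (the
    modern UTF-16/UTF-32 names stand in for "the Unicode encoding" and
    UCS-4); it is not taken from any specification.

```python
import codecs

# Longer signatures are checked first, because the UTF-32LE signature
# (FF FE 00 00) begins with the UTF-16LE signature (FF FE).
# Note: FF FE 00 00 could also be UTF-16LE text whose first character
# is U+0000; this sketch simply prefers the UTF-32 reading.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_signature(data: bytes):
    """Return a hypothetical encoding label if data starts with an
    encoded U+FEFF, or None if no signature is present."""
    for bom, label in _SIGNATURES:
        if data.startswith(bom):
            return label
    return None
```

    A receiver that gets None back has learned nothing and must fall
    back on other information, such as an external declaration.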

    > 4) I don't see any real problems caused by the inconsistency of
    > supporting a UTF-16 BOM and not supporting a UTF-8 BOM.
    > Note that in HTML the BOM is only used to identify byte ordering. It
    > is not used to indicate the encoding (unlike XML).

    The HTML spec does say that, and that is a very good point. It is
    frequently pointed out that UTF-8 does not need a byte order mark, which
    is true but usually not relevant to discussions about using it as a
    signature. But in the case of HTML, byte order really is the issue.

    Hmm, and what about the case where a file begins with 0xEF 0xBB 0xBF but
    then goes on to include a meta-charset declaration of, say, Latin-1? I
    think I'm beginning to see a problem here.
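
    The conflict described above can be made concrete with a small
    Python sketch. It assumes the page is available as raw bytes and
    uses a deliberately crude regular expression to find a meta-charset
    declaration; the function name and the 1024-byte search window are
    assumptions for illustration only, not how any browser behaves.

```python
import codecs
import re

# Crude pattern for charset=... inside a <meta ...> tag (illustrative only).
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def signature_conflicts_with_meta(data: bytes) -> bool:
    """Return True if data starts with EF BB BF but its meta-charset
    declaration names something other than UTF-8."""
    if not data.startswith(codecs.BOM_UTF8):
        return False
    match = _META_CHARSET.search(data[:1024])
    if match is None:
        return False  # no declaration found, so nothing to conflict with
    declared = match.group(1).decode("ascii").lower()
    return declared not in ("utf-8", "utf8")
```

    What a browser should do when this returns True is exactly the open
    question; the sketch only detects the situation.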

    > There are already 2 legal ways to declare an encoding: HTTP, and the
    > META content-type statement (ignoring the generally unsupported ANCHOR
    > charset for links). We do not need a UTF-8 BOM which neither declares
    > an encoding nor identifies a serialization.

    It does declare an encoding, sort of, but not in the standard HTML way,
    and it could conflict with the standard declaration as mentioned above.

    Unfortunately, as someone said (sorry, I've already forgotten who), not
    everyone is his own Webmaster and has control over what HTTP headers are
    sent out. I certainly don't; that's up to Adelphia.

    > 5) References to RFC 2279 are depressing. It is overdue for an update
    > as it references 6 byte transformations.

    This is beside the point of why Roozbeh and I mentioned it. (BTW, I
    still prefer the RFC 2279 explanation of UTF-8 to anything I have seen
    in the Unicode book or Web site.)

    > 6) Doug you surprised me! I thought you were a supporter of
    > standards... How can we have standards while recommending people
    > modify their products to accommodate whatever characters or
    > innovations suits them. The mistakes of browser vendors in the past
    > is not a good justification for ad hoc changes today.

    Well, I am a supporter of standards, and I thought I was suggesting only
    a slight and relatively harmless bending of the HTML letter-of-the-law.
    (The old maxim, "Be conservative in what you send and liberal in what
    you accept.") I thought allowing an initial U+FEFF was far less
    cavalier than some other things browsers do, and Deborah confirms that
    browsers sometimes have to be liberal. But I concede that there is a
    potential problem if the file starts with a UTF-8 signature and the
    meta-charset declaration specifies something other than UTF-8.

    I do think something may need to be done at Microsoft (I don't know
    what) about the problem of Notepad writing UTF-8 files that contain a
    signature and IE displaying them in an unexpected way. I don't think
    Notepad is anybody's favorite editor in the world, but it's definitely
    "good enough" for many purposes (not to mention free and ubiquitous).

    The Notepad practice of automatically prepending a signature to UTF-8
    files does have the major advantage that naïve users don't have to worry
    about the file type when they load it back into Notepad later. The
    "NESTLÉ®" problem shows that UTF-8 autodetection can't be guaranteed to
    work 100% of the time. If a user saves a file and then reloads it and
    it ends up corrupted because the autodetection failed, she may blame
    Unicode rather than the editor. I'd say that by ensuring the success of
    UTF-8 loads and saves, without requiring any intervention on the part of
    the user, Notepad's UTF-8 signature convention may actually help spread
    the use of Unicode among Windows users.
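
    The autodetection failure mentioned above can be reproduced in a few
    lines of Python. The sketch assumes the text was saved as Latin-1
    without any signature; the helper name looks_like_utf8 is made up
    for this example and stands in for whatever heuristic an editor
    might actually use.

```python
import codecs

# "NESTLÉ®" written with escapes: U+00C9 is É, U+00AE is ®.
text = "NESTL\u00c9\u00ae"
latin1_bytes = text.encode("latin-1")  # ends in the bytes C9 AE


def looks_like_utf8(data: bytes) -> bool:
    """Naive autodetection: call it UTF-8 if it decodes without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# C9 AE happens to be a well-formed two-byte UTF-8 sequence (U+026E),
# so the naive check accepts the Latin-1 bytes and the text is misread.
misread = latin1_bytes.decode("utf-8")

# With a signature prepended there is nothing to guess on reload.
signed = codecs.BOM_UTF8 + text.encode("utf-8")
reloaded = signed.decode("utf-8-sig")  # this codec drops the leading signature
```

    Here misread differs from text even though no decoding error was
    raised, while reloaded is identical to text.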

    In summary,

    1. OK, you're right: HTML files in UTF-8 should not begin with a
    signature.
    2. But that's only because the HTML spec says so, not because UTF-8
    signatures are evil.
    3. Notepad writes UTF-8 files with signatures for a good reason.
    4. So for now at least, don't use Notepad for HTML. (Try SC UniPad
    instead. :-)
    5. None of this applies to Unix or Linux systems, which can't handle
    any type of file signature.
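
    For anyone who does end up with a signed UTF-8 file that has to be
    served as HTML, one possible workaround is to drop the three
    signature bytes before publishing. This is only a sketch under that
    assumption; the function name is hypothetical, and it operates on
    bytes already read from the file.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

    Only a leading signature is removed; any U+FEFF appearing later in
    the data is left alone.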

    > Please tell me it was just a case of your not having had
    > your morning coffee yet... ;-)

    In my case it would be tea; but yes, maybe that was the problem.

    -Doug Ewell
     Fullerton, California


