From: Doug Ewell (dewell@adelphia.net)
Date: Mon Feb 17 2003 - 01:05:37 EST
Roozbeh Pournader <roozbeh at sharif dot edu> wrote:
> Found it! It's forbidden to start a HTML 4.0 page with a UTF-8 BOM.
> Proof:
> ...
> That's all. So the only characters that are allowed in a HTML 4.0 web
> page before the HTML header, are U+0009, U+000A, U+000C, U+000D,
> U+0020, and U+200B. QED.
I can't argue with the excellent gumshoe work Roozbeh did. But it does
seem peculiar, as Michka observed, that ZWSP should be a legal white
space character for this purpose but ZWNBSP should not; and as James
noted, it may have been an oversight. (I would add to Michka's comment
that it seems equally bizarre to allow U+000C FORM FEED at the start of
an HTML file but not U+FEFF.)
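As a rough illustration of the rule Roozbeh derived, the Python sketch below checks the characters that precede the first "<" of a page against his list of six. The names ALLOWED_BEFORE_HEADER and disallowed_prefix_chars are made up for this example, and the set simply transcribes the list quoted above; it is not taken from any HTML parser.

```python
# Characters Roozbeh lists as permitted before the HTML header.
ALLOWED_BEFORE_HEADER = {
    "\u0009",  # CHARACTER TABULATION
    "\u000A",  # LINE FEED
    "\u000C",  # FORM FEED
    "\u000D",  # CARRIAGE RETURN
    "\u0020",  # SPACE
    "\u200B",  # ZERO WIDTH SPACE
}


def disallowed_prefix_chars(text):
    """Return the code points before the first '<' that are not in the list."""
    prefix = text.split("<", 1)[0]
    return [f"U+{ord(ch):04X}" for ch in prefix if ch not in ALLOWED_BEFORE_HEADER]


print(disallowed_prefix_chars("\u200b\n<!DOCTYPE html>"))  # []
print(disallowed_prefix_chars("\ufeff<!DOCTYPE html>"))    # ['U+FEFF']
```

Under this reading, a leading ZWSP passes while a leading U+FEFF is flagged, which is exactly the asymmetry noted above.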
> PS: UTF-16 is an exception to that, since the BOM is not part of the
> document and should be removed for processing.
If this is true -- that U+FEFF is a kind of meta-character that doesn't
really belong to the text per se -- then it should be equally true for
UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
and UTF-32 but not UTF-8) or as a signature (potentially useful in all
Unicode CES's). Only in its evil-twin role as a zero-width no-break
space is it truly part of the text, in which case the previous
comments about white-space characters apply.
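To make the "signature" role concrete, here is a minimal Python sketch that looks for an encoded U+FEFF at the start of a byte stream and reports which encoding form it suggests. The table SIGNATURES and the function sniff_signature are hypothetical names for this example; the BOM constants come from Python's standard codecs module. It is only a heuristic: a UTF-16LE stream whose first character after the BOM is U+0000 would be reported as UTF-32LE.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_signature(data):
    """Return (encoding form, signature length), or (None, 0) if no U+FEFF leads."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_signature(b"\xef\xbb\xbf<html>"))  # ('UTF-8', 3)
print(sniff_signature(b"\xff\xfe<\x00"))       # ('UTF-16LE', 2)
print(sniff_signature(b"<html>"))              # (None, 0)
```

In UTF-8 the three bytes carry no byte-order information at all; they serve purely as a signature.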
Michael (michka) Kaplan <michka at trigeminal dot com> wrote:
> Rather than treating HTML like the SQL standard (lofty goals that no
> one company completely supports because it would be insane to do it!)
> they can bend to the actual usage out there and just move on, right?
Michka is probably right that Notepad is one of the more popular HTML
editors out there, but even though I'm sure he didn't mean it this way,
I would prefer not to say anything that can be twisted into "the HTML
specification should be changed to match the way Microsoft does things."
That is bound to bring all the Microsoft haters out of the woodwork.
Rather, I would stress the inconsistency of allowing U+FEFF at the
beginning of an HTML file encoded in UTF-16 but not in one encoded in
the much more common UTF-8.
> Of course if I had a penny for every byte that has been used
> discussing these three bytes sometimes found at the beginning of a
> UTF-8 document, I would not be working this weekend; I'd be somewhere
> really warm and sunny.
There is so much disagreement, confusion, and misunderstanding
surrounding these three little bytes that I feel the discussion is
completely warranted. (At least nobody can ever claim it's off topic!)
Roozbeh responded:
> Well, that needs researching into what UTF-8 is in W3C and HTML 4.0
> terms:
> ...
> RFC 2279. A copy can be found at
> <http://www.ietf.org/rfc/rfc2279.txt>, or any other place you like and
> search for FEFF, BOM, ZERO WIDTH NO-BREAK SPACE, or the sequence "EF
> BB BF" there. Nothing can be found.
RFC 2279 defines and describes the technical structure of UTF-8. Usage
issues surrounding U+FEFF as either a signature or a ZWNBSP would have
been out of scope. Most Unicode and WG2 documents do not discuss the
BOM either.
Michka wrote back:
> If the problem was indeed due to a BOM then the answer *is* to fix the
> browser. Windows 2000 and XP have shipped onto a gazillion machines
> and a lot of people make quick spot changes to HTML pages in notepad.
> The BOM is here and any browser that cannot handle not showing either
> a BOM or a ZWNBSP can be classed as a dumb one.
Certainly, Microsoft is in a position to fix their own browser to make
it tolerant of the BOM. If they ship a quick and handy editor that
prepends a BOM to UTF-8 text files (which I think is a good idea, for
the reasons James cited), and if people are using that editor for HTML
files encoded in UTF-8, then their browser should behave sensibly when
handed an HTML file with a leading BOM. Messing up the layout at the
top of a page is not sensible, and displaying a Euro sign is just plain
weird.
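One way to picture "behaving sensibly" is a reader that drops a single leading signature before handing the text to the HTML parser. The Python sketch below is an assumption about what such tolerance could look like, not a description of how any actual browser works; decode_utf8_html is a hypothetical name.

```python
import codecs


def decode_utf8_html(raw):
    """Decode UTF-8 bytes, dropping one leading EF BB BF signature if present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


with_bom = b"\xef\xbb\xbf<html><title>Test</title></html>"
without_bom = b"<html><title>Test</title></html>"

# Both inputs yield the same text, so the signature cannot disturb the layout.
print(decode_utf8_html(with_bom) == decode_utf8_html(without_bom))  # True
```

Python's built-in "utf-8-sig" codec performs the same stripping on decode, so raw.decode("utf-8-sig") would be an equivalent shortcut.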
But note that so far, all of the weirdness seems to be with IE 5.2 for
Macintosh. I've never seen any of this with IE 5.5 or 6.0 for Windows.
(Indeed, my Web pages all used to begin with BOMs and I never noticed a
problem, but I removed the BOMs when Michael Everson told me they
displayed badly on his Mac.) So it seems only the Mac version of IE
needs "fixing."
I don't see anything wrong with IE allowing a BOM at the start of
UTF-8-encoded HTML files, even if it is not expressly allowed by the
HTML specification. Browser vendors have certainly gone farther than
that to "extend" the standard in the past; remember Netscape's notorious
<blink> element? But I also think the HTML Working Group should
consider explicitly allowing the BOM at the start of HTML files encoded
in UTF-8. (Note that it is explicitly allowed in XML.)
-Doug Ewell
Fullerton, California