From: Doug Ewell (dewell@adelphia.net)
Date: Tue Feb 18 2003 - 00:11:26 EST
Tex Texin <tex at i18nguy dot com> wrote:
> 2) Much of this discussion seems to take place without looking at the
> timelines of the various docs.
> The UTF-8 BOM is relatively recent addition to Unicode. Further it is
> not necessary, (IE provides no information of value to the browser) so
> modifying the specs to include it hardly seems worthwhile.
The BOM-as-encoding-signature dates back to the publication of Unicode
1.0, Volume 2 (p. 7) in 1992:
"The code value of FEFF is assigned a 'signature' role in Informative
Annex E to DIS 10646. Since UCS-2 is roughly equivalent to the Unicode
encoding, this convention for discerning between forms UCS-2 and UCS-4
is recommended to the attention of implementers of the Unicode standard.
...
"Note that a character stream starting off with bytes FE and FF is
unlikely to be ASCII text. Data streams (or files) that begin with
16-bit NULL followed by ZERO-WIDTH NO-BREAK SPACE could be considered as
likely to contain UCS-4 data; streams beginning with ZERO-WIDTH NO-BREAK
SPACE alone could be considered as likely to contain Unicode values. An
application receiving data streams of coded characters may either use
these signatures to identify the coded representation form, or may
ignore them and treat FEFF as the ZERO-WIDTH NO-BREAK SPACE character."
Of course, the ambiguity of allowing applications to treat U+FEFF as
either a BOM/signature *or* a ZWNBSP was later recognized to be a
greater problem than originally anticipated, eventually leading to the
creation of labels like "UTF-16LE." But there is at least a 10-year
history of using U+FEFF as a signature to distinguish between encoding
schemes, even though the original text only addressed the distinction
between "the Unicode encoding" and UCS-4. (Remember that UTF-8, then
called FSS-UTF, was not introduced until Unicode 1.1 in 1993.)
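
To make the signature convention concrete, here is a minimal Python sketch of how a receiver might use an initial U+FEFF to guess the encoding scheme. The names SIGNATURES and sniff_signature are hypothetical, and the table covers only the modern UTF-8/16/32 signatures rather than the exact UCS-2/UCS-4 wording quoted above. The "16-bit NULL followed by ZERO-WIDTH NO-BREAK SPACE" case corresponds to the bytes 00 00 FE FF.

```python
import codecs

# Longest signatures first: the UTF-32LE signature FF FE 00 00 begins
# with the UTF-16LE signature FF FE, so order matters.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def sniff_signature(data: bytes):
    """Return (label, signature_length), or (None, 0) if no signature."""
    for bom, label in SIGNATURES:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

A receiver that prefers to treat U+FEFF as ZWNBSP would simply skip this check, which is exactly the ambiguity described above.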
> 4) I don't see any real problems caused by the inconsistency of
> supporting a UTF-16 BOM and not supporting a UTF-8 BOM.
> Note that in HTML the BOM is only used to identify byte ordering. It
> is not used to indicate the encoding (unlike XML).
The HTML spec does say that, and that is a very good point. It is
frequently pointed out that UTF-8 does not need a byte order mark, which
is true but usually not relevant to discussions about using it as a
signature. But in the case of HTML, byte order really is the issue.
Hmm, and what about the case where a file begins with 0xEF 0xBB 0xBF but
then goes on to include a meta-charset declaration of, say, Latin-1? I
think I'm beginning to see a problem here.
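
The conflict can be illustrated with a rough Python sketch. This is only an illustration, not how any browser actually resolves the question: the function name signature_conflict, the simplified regular expression, and the 1024-byte scan window are all assumptions made for the example.

```python
import codecs
import re

# Very simplified pattern for a charset label in a meta declaration.
META_CHARSET = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)

def signature_conflict(data: bytes):
    """Return the declared charset label if the data starts with a UTF-8
    signature but declares something other than UTF-8; otherwise None."""
    if not data.startswith(codecs.BOM_UTF8):
        return None
    match = META_CHARSET.search(data[:1024])
    if match is None:
        return None
    declared = match.group(1).decode("ascii")
    try:
        # codecs.lookup normalizes aliases such as "UTF8" to "utf-8".
        if codecs.lookup(declared).name == "utf-8":
            return None
    except LookupError:
        pass  # unknown label: still not UTF-8, so report it
    return declared
```

For a page that begins with EF BB BF and then declares charset=ISO-8859-1, this returns "ISO-8859-1", and the receiver has two contradictory statements about the same bytes.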
> There are already 2 legal ways to declare an encoding HTTP, and the
> META content-type statement (ignoring the generally unsupported ANCHOR
> charset for links). We do not need a UTF-8 BOM which neither declares
> an encoding nor identifies a serialization.
It does declare an encoding, sort of, but not in the standard HTML way,
and it could conflict with the standard declaration as mentioned above.
Unfortunately, as someone said (sorry, I've already forgotten who), not
everyone is his own Webmaster and has control over what HTTP headers are
sent out. I certainly don't; that's up to Adelphia.
> 5) References to RFC 2279 are depressing. It is overdue for an update
> as it references 6 byte transformations.
This is beside the point of why Roozbeh and I mentioned it. (BTW, I
still prefer the RFC 2279 explanation of UTF-8 to anything I have seen
in the Unicode book or Web site.)
> 6) Doug you surprised me! I thought you were a supporter of
> standards... How can we have standards while recommending people
> modify their products to accommodate whatever characters or
> innovations suits them. The mistakes of browser vendors in the past
> is not a good justification for ad hoc changes today.
Well, I am a supporter of standards, and I thought I was suggesting only
a slight and relatively harmless bending of the HTML letter-of-the-law.
(The old maxim, "Be conservative in what you send and liberal in what
you accept.") I thought allowing an initial U+FEFF was far less
cavalier than some other things browsers do, and Deborah confirms that
browsers sometimes have to be liberal. But I concede that there is a
potential problem if the file starts with a UTF-8 signature and the
meta-charset declaration specifies something other than UTF-8.
I do think something may need to be done at Microsoft (I don't know
what) about the problem of Notepad writing UTF-8 files that contain a
signature and IE displaying them in an unexpected way. I don't think
Notepad is anybody's favorite editor in the world, but it's definitely
"good enough" for many purposes (not to mention free and ubiquitous).
The Notepad practice of automatically prepending a signature to UTF-8
files does have the major advantage that naïve users don't have to worry
about the file type when they load it back into Notepad later. The
"NESTLÉ®" problem shows that UTF-8 autodetection can't be guaranteed to
work 100% of the time. If a user saves a file and then reloads it and
it ends up corrupted because the autodetection failed, she may blame
Unicode rather than the editor. I'd say that by ensuring the success of
UTF-8 loads and saves, without requiring any intervention on the part of
the user, Notepad's UTF-8 signature convention may actually help spread
the use of Unicode among Windows users.
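
The failure mode is easy to reproduce. In the sketch below, which assumes the legacy file was saved in Windows-1252, the bytes for "É®" (C9 AE) happen to form a well-formed UTF-8 sequence, so a validity-based detector has no reason to reject UTF-8. The variable names are illustrative only.

```python
import codecs

text = "NESTL\u00c9\u00ae"  # "NESTLÉ®"

# Saved in a legacy code page: É is C9, ® is AE.
legacy = text.encode("cp1252")

# C9 AE is also valid UTF-8 (it encodes U+026E), so decoding as UTF-8
# raises no error and silently yields the wrong text.
misread = legacy.decode("utf-8")

# With a signature there is nothing to guess: the file starts with
# EF BB BF, and the "utf-8-sig" codec strips it again on load.
signed = text.encode("utf-8-sig")
roundtrip = signed.decode("utf-8-sig")
```

Here misread comes out as "NESTL" followed by U+026E rather than the original text, while roundtrip equals text.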
In summary,
1. OK, you're right: HTML files in UTF-8 should not begin with a
signature.
2. But that's only because the HTML spec says so, not because UTF-8
signatures are evil.
3. Notepad writes UTF-8 files with signatures for a good reason.
4. So for now at least, don't use Notepad for HTML. (Try SC UniPad
instead. :-)
5. None of this applies to Unix or Linux systems, which can't handle
any type of file signature.
> Please tell me it was just a case of your not having had
> your morning coffee yet... ;-)
In my case it would be tea; but yes, maybe that was the problem.
-Doug Ewell
Fullerton, California