From: Tex Texin (tex@i18nguy.com)
Date: Mon Feb 17 2003 - 05:30:42 EST
Dudes and Dudettes,
Not sure I read all of the thread, but:
1) The BOM is not only allowed but recommended in UTF-16 HTML documents.
See section 5.1:
http://www.w3.org/TR/REC-html40/charset.html
I am not sure what the comment about removing BOM is referring to. Is that
someone's explanation or is it in the standard somewhere?
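
To make the UTF-16 case concrete, here is a minimal Python sketch of what the BOM buys a consumer: the first two bytes say which byte order follows. The function name utf16_byte_order is just my label for illustration; it is not taken from the HTML spec or from any browser.

```python
import codecs

def utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"                           # no BOM: order must come from elsewhere

# A little-endian UTF-16 document with a BOM in front.
sample = "<html>".encode("utf-16-le")
with_bom = codecs.BOM_UTF16_LE + sample

print(utf16_byte_order(with_bom))   # byte order read from the BOM
print(with_bom.decode("utf-16"))    # the generic codec consumes the BOM
```

Without the BOM, the same bytes give no reliable hint about byte order, which is why the recommendation makes sense for UTF-16.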
2) Much of this discussion seems to take place without looking at the
timelines of the various docs.
The UTF-8 BOM is a relatively recent addition to Unicode. Further, it is not
necessary (i.e., it provides no information of value to the browser), so
modifying the specs to include it hardly seems worthwhile.
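
A small Python sketch of why the UTF-8 signature tells the browser nothing about byte order: U+FEFF has exactly one UTF-8 serialization (EF BB BF), so there is nothing to disambiguate. The helper strip_utf8_signature is a hypothetical name of mine, shown only to illustrate what a tolerant consumer would have to do.

```python
import codecs

def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

plain = "<html>".encode("utf-8")
signed = codecs.BOM_UTF8 + plain          # what some editors write

# U+FEFF always encodes to the same three bytes in UTF-8.
print("\ufeff".encode("utf-8"))

# A plain UTF-8 decode keeps U+FEFF as the first character of the text...
print(repr(signed.decode("utf-8")))
# ...while Python's "utf-8-sig" codec treats it as a signature and drops it.
print(repr(signed.decode("utf-8-sig")))
```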
3) Good idea about not bringing out the Microsoft haters.
The argument itself is weak enough to be laughable. Driving specifications
based on Notepad behavior, indeed.
4) I don't see any real problems caused by the inconsistency of supporting a
UTF-16 BOM and not supporting a UTF-8 BOM.
Note that in HTML the BOM is only used to identify byte ordering. It is not
used to indicate the encoding (unlike XML).
There are already 2 legal ways to declare an encoding: the HTTP header and the
META content-type statement (ignoring the generally unsupported ANCHOR charset
for links). We do not need a UTF-8 BOM, which neither declares an encoding nor
identifies a serialization.
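
As a rough illustration of those two declarations, here is a Python sketch that reads the charset from an HTTP Content-Type value and, failing that, from a META content-type statement. The names charset_from_content_type, MetaCharsetFinder and declared_charset are hypothetical helpers of mine; the HTTP-before-META order follows the priority given in the HTML 4 charset section, and a real browser does considerably more than this.

```python
from email.message import Message
from html.parser import HTMLParser

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()      # lowercased charset, or None

class MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first META http-equiv content-type."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "content-type":
            self.charset = charset_from_content_type(attr.get("content", ""))

def declared_charset(http_content_type, html_text):
    """HTTP declaration first, META declaration second, else None."""
    from_http = charset_from_content_type(http_content_type)
    if from_http:
        return from_http
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset

page = ('<html><head><META HTTP-EQUIV="Content-Type" '
        'CONTENT="text/html; charset=ISO-8859-1"></head></html>')
print(declared_charset(None, page))
print(declared_charset("text/html; charset=UTF-8", page))
```

Neither path needs to look at a BOM to learn the encoding, which is the point above.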
5) References to RFC 2279 are depressing. It is overdue for an update, as it
still references 6-byte transformations.
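
For anyone who has not looked at the RFC lately, this Python sketch shows the sequence lengths in the RFC 2279 scheme, which runs up to 6 bytes for values up to 0x7FFFFFFF. The function name rfc2279_length is mine, for illustration only. Nothing in Unicode's range (up to U+10FFFF) needs more than 4 bytes.

```python
def rfc2279_length(code_point: int) -> int:
    """Number of bytes the RFC 2279 scheme uses for a given value."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Upper bound of each length class, from 1 byte through 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise ValueError("value is beyond the 31-bit range of RFC 2279")

for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
    print(hex(cp), rfc2279_length(cp))

# Python's own UTF-8 codec agrees for the top of the Unicode range.
print(len(chr(0x10FFFF).encode("utf-8")))
```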
6) Doug, you surprised me! I thought you were a supporter of standards... How
can we have standards while recommending that people modify their products to
accommodate whatever characters or innovations suit them? The mistakes of
browser vendors in the past are not a good justification for ad hoc changes
today.
Just as with early Unicode, there were some difficulties doing everything you
needed with the web standards. Those days are gone. Let's insist vendors
comply with both W3C and Unicode standards, AS WRITTEN, or the world gets to
be an ugly place to develop software in. I like having one set of web pages
that work on multiple browsers and not having to do separate pages for
different browsers. Please tell me it was just a case of your not having had
your morning coffee yet... ;-)
tex
Doug Ewell wrote:
> (I would add to Michka's comment
> that it seems equally bizarre to allow U+000C FORM FEED at the start of
> an HTML file but not U+FEFF.)
>
> > PS: UTF-16 is an exception to that, since the BOM is not part of the
> > document and should be removed for processing.
>
> If this is true -- that U+FEFF is a kind of meta-character that doesn't
> really belong to the text per se -- then it should be equally true for
> UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
> and UTF-32 but not UTF-8) or as a signature (potentially useful in all
> Unicode CES's). Only in its evil-twin role as a zero-width no-break
> space is it truly part of the text, in which case the previous
> discussion about white-space characters applies.
>
> Michka is probably right that Notepad is one of the more popular HTML
> editors out there, but even though I'm sure he didn't mean it this way,
> I would prefer not to say anything that can be twisted into "the HTML
> specification should be changed to match the way Microsoft does things."
> That is bound to bring all the Microsoft haters out of the woodwork.
> Rather, I would stress the inconsistency of allowing U+FEFF at the
> beginning of an HTML file encoded in UTF-16 but not in one encoded in
> the much more common UTF-8.
> Roozbeh responded:
> RFC 2279 defines and describes the technical structure of UTF-8. Usage
> issues surrounding U+FEFF as either a signature or a ZWNBSP would have
> been out of scope. Most Unicode and WG2 documents do not discuss the
> BOM either.
>
Doug startles me with:
> I don't see anything wrong with IE allowing a BOM at the start of
> UTF-8-encoded HTML files, even if it is not expressly allowed by the
> HTML specification. Browser vendors have certainly gone farther than
> that to "extend" the standard in the past; remember Netscape's notorious
> <blink> element? But I also think the HTML Working Group should
> consider explicitly allowing the BOM at the start of HTML files encoded
> in UTF-8. (Note that it is explicitly allowed in XML.)
>
> -Doug Ewell
> Fullerton, California
-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------