From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sun Sep 24 2006 - 23:45:59 CST
On Sun, 24 Sep 2006, Doug Ewell wrote:
> A process that claims to be able to "support Unicode"
> should at least be able to follow the simple rule, "If the file or stream
> starts with EF BB BF, throw them away and treat the remainder of the file or
> stream as UTF-8."
No, that would be incorrect if the character encoding of the data has been
declared. It would be a mistake to start interpreting the octets of data
in a manner other than the declared encoding, at least as long as the data
is formally correct according to the encoding. If the declared encoding
is, say, ISO-8859-1, then EF BB BF has a well-defined meaning that has
absolutely nothing to do with BOM. Even if the data happens to violate a
higher-level protocol, such as the HTML specification, it would be wrong to
interpret it at the character level in a manner that violates fundamental
protocols.
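
To illustrate the point, here is a minimal Python sketch (the sample data and variable names are made up for the example). It decodes the same octets under two different encodings: in ISO-8859-1 the octets EF BB BF are three ordinary characters, while in UTF-8 they form the single character U+FEFF.

```python
# Hypothetical sample data: the octets EF BB BF followed by ASCII "abc".
data = bytes([0xEF, 0xBB, 0xBF]) + b"abc"

# Declared encoding ISO-8859-1: each octet maps to exactly one character,
# so the first three characters are U+00EF, U+00BB and U+00BF.
latin1_text = data.decode("iso-8859-1")

# UTF-8: the same three octets encode the single character U+FEFF.
# Python's plain "utf-8" codec keeps that character in the result.
utf8_text = data.decode("utf-8")

print([hex(ord(c)) for c in latin1_text[:3]])
print(hex(ord(utf8_text[0])))
```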
> Even the W3C FAQ says: "In some browsers, the presence of a UTF-8 signature
> will cause the browser to interpret the text as UTF-8 regardless of any
> character encoding declarations to the contrary." That's exactly what it
> should do.
No, it's definitely something that browsers must not do when the character
encoding has been declared, as it should be according to the protocols. In the absence
of declaration of encoding in any manner (HTTP header, meta tag, etc.),
the browser may guess, and will, for obvious reasons. _Then_ the octets
EF BB BF at the start of data may and should be treated as a good reason
to make the heuristic guess that the data is UTF-8 encoded.
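
The order of precedence described above could be sketched roughly as follows in Python. The function name choose_encoding, its parameters, and the ISO-8859-1 fallback are assumptions made for this illustration and do not describe how any particular browser works.

```python
import codecs
from typing import Optional


def choose_encoding(data: bytes,
                    declared: Optional[str] = None,
                    fallback: str = "iso-8859-1") -> str:
    """Pick an encoding for data (hypothetical helper, not a browser API)."""
    # A declared encoding (HTTP header, meta tag, etc.) always wins,
    # even if the data happens to start with EF BB BF.
    if declared is not None:
        return declared
    # No declaration: EF BB BF at the start is a good reason to guess UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise fall back to some default; the choice here is an assumption.
    return fallback
```

With this sketch, choose_encoding(b"\xef\xbb\xbfhi", declared="iso-8859-1") yields "iso-8859-1", whereas the same data with no declaration yields "utf-8".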
>
> The argument about accidentally throwing away a U+FEFF that was intended as a
> ZWNBSP is becoming increasingly irrelevant;
I'm not sure exactly which argument you are referring to. When performing
file insertion via SSI or otherwise, it is certainly safe and
advisable to drop any U+FEFF that appears at the start of an
included file. There's hardly any argument about this, though there might
be practical problems in implementing this (depending on how much control you
have over the insertion mechanism).
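
Where one does control the insertion mechanism, dropping the character is simple. A minimal Python sketch, assuming the included file has already been decoded to text (strip_leading_bom and include are hypothetical names, not part of SSI or any server):

```python
def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF if it is the very first character."""
    if text.startswith("\ufeff"):
        return text[1:]
    # A U+FEFF anywhere else in the text is left untouched.
    return text


def include(parent: str, marker: str, included: str) -> str:
    """Replace marker in parent with the included text, minus a leading U+FEFF."""
    return parent.replace(marker, strip_leading_bom(included))
```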
> U+2060 has been recommended over
> ZWNBSP for over 4 years now, and few applications used ZWNBSP anyway.
I'm afraid U+2060 is not widely supported, to put it mildly.
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/