From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Sep 23 2006 - 02:08:19 CDT
On Fri, 22 Sep 2006, Addison Phillips wrote:
> See: http://www.w3.org/International/questions/qa-utf8-bom
That page is not very specific in its statements about browser behavior.
It discusses BOM handling in both browsers and editors and mainly the
appearance of BOM at the start of data.
Indirectly, the statement "Note that a number of more recent browsers,
such as the latest versions of Internet Explorer (Win), Mozilla (Netscape)
and Opera, do not exhibit this behavior." seems to say that BOM in UTF-8
is not much of a problem in most browsing situations.
Checking some less common browsers, I noticed that Netscape 4.5 shows the
BOM as a square box (probably because it's trying to render it as a
visible character), and Lynx 2.8.5 shows it as the inverted question mark
character ¿ (don't ask me why - the browser can handle UTF-8 in general,
though it is often used in environments where the browser can _display_
e.g. ISO Latin 1 characters only).
(By the way, the page contains two descriptions of what a UTF-8 encoded
BOM looks like when misinterpreted as ISO-8859-1. The first one, ï»¿, in
the first paragraph is correct, whereas the second occurrence, ï«¿, has
got the guillemet changed: « instead of ».)
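For anyone who wants to check the byte-level claim, here is a minimal
sketch in Python (my choice of language for illustration; nothing on the
W3C page uses it). It assumes only the standard library. The names
bom_bytes, mojibake, with_bom and without_bom are made up for this example.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")
assert bom_bytes == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Decoding those bytes as ISO-8859-1 yields three visible characters:
# i with diaeresis, right-pointing guillemet, inverted question mark.
mojibake = bom_bytes.decode("iso-8859-1")
print(mojibake)  # ï»¿

# Plain "utf-8" keeps a leading BOM as U+FEFF in the decoded text,
# whereas the "utf-8-sig" codec removes it.
data = codecs.BOM_UTF8 + "Hello".encode("utf-8")
with_bom = data.decode("utf-8")
without_bom = data.decode("utf-8-sig")
print(repr(with_bom), repr(without_bom))
```

The middle character is U+00BB, the right-pointing guillemet, which is
the one that differs in the page's second rendering.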
> The BOM is often rendered in the page, throwing off other display elements.
I can't agree with the "often" adverb. And I didn't see any empty lines,
though I saw some other faulty renderings.
> While one might expect
> this to act as a "no-op" character, in practice, it isn't.
We might expect the BOM, i.e. U+FEFF, inside data to act as a control
character according to its old Unicode semantics (zero width no-break
space), which have been retained although the use of U+FEFF for that
purpose has been deprecated in favor of word joiner U+2060. That is, data
should not contain U+FEFF except at the start of data as a BOM, but
programs should still interpret it in a specific way.
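To illustrate the distinction, a small sketch in Python follows. It looks
up the two characters in the Unicode database that ships with the
interpreter and then applies the rule stated above to a string. The
function replace_interior_bom is a hypothetical helper written for this
message, not part of any library.

```python
import unicodedata

BOM = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE, also used as the BOM
WORD_JOINER = "\u2060"  # the recommended replacement inside text

# Both are invisible format characters (general category Cf).
for ch in (BOM, WORD_JOINER):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def replace_interior_bom(text: str) -> str:
    """Keep a leading U+FEFF (a BOM) and turn any later U+FEFF into U+2060.

    Hypothetical helper: it assumes the caller wants to preserve the old
    "no line break here" meaning rather than simply drop the character.
    """
    if text.startswith(BOM):
        head, rest = text[:1], text[1:]
    else:
        head, rest = "", text
    return head + rest.replace(BOM, WORD_JOINER)


sample = BOM + "non" + BOM + "breaking"
cleaned = replace_interior_bom(sample)
print([f"U+{ord(c):04X}" for c in cleaned if not c.isalpha()])
```

Whether a given browser honors U+2060 any better than U+FEFF is a
separate question that this sketch does not answer.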
Then again, HTML specifications do not require browsers to observe Unicode
semantics for characters in general. In fact, Internet Explorer, for
example, fails to do so for U+FEFF inside text. The browser does not try
to render the character in any visible way, which is good, but it does not
interpret it as forbidding line breaks before and after it. That's too
bad, since if it did, we would have a standards-conforming and relatively
safe way of forbidding a line break after a hyphen-minus, for example.
(Using the nonbreaking hyphen character is not a realistic option, because
it creates problems far too often, due to its absence in most fonts.)
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/