Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Mon, 16 Jul 2012 18:24:25 +0200

Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200:
> "Doug Ewell" <doug_at_ewellic.org> wrote:

> And:
>
> Q: Is the UTF-8 encoding scheme the same irrespective of whether
> the underlying processor is little endian or big endian?
> ...
> Where a BOM is used with UTF-8, it is only used as an ecoding
> signature to distinguish UTF-8 from other encodings — it has
> nothing to do with byte order.
>
> Fifteen years ago i think i would have put effort in including the
> BOM after reading this, for complete correctness! I'm pretty sure
> that i really would have done so.

I believe that most people who consciously insert the BOM do so
because, without it, Web browsers (with Chrome as the exception, at
least whenever the page contains non-ASCII characters) are unlikely to
sniff a UTF-8 encoded page as UTF-8. So it has nothing to do with
"complete correctness", but everything to do with complete safety.
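To make the point concrete, here is a minimal sketch of what "inserting the BOM" amounts to: prepending the three signature bytes EF BB BF to an otherwise ordinary UTF-8 document. The file content below is a hypothetical example; `utf-8-sig` is Python's name for the BOM-aware UTF-8 codec.

```python
# The UTF-8 encoding signature ("BOM") is the three bytes EF BB BF.
BOM = b"\xef\xbb\xbf"

# Hypothetical page content containing non-ASCII characters.
html = "<!DOCTYPE html><html><body>naïve café</body></html>"

# Prepend the signature so a sniffing consumer can identify the page
# as UTF-8 without trusting (or having) any charset label.
data = BOM + html.encode("utf-8")

# A BOM-aware decoder strips the signature transparently:
assert data.decode("utf-8-sig") == html
```

A plain `data.decode("utf-8")` would instead keep the signature as a leading U+FEFF character, which is exactly the stripping problem the quoted FAQ answer alludes to.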

> So, given that this page ranks 3 when searching for «utf-8 bom»
> from within Germany i would 1), fix the «ecoding» typo and 2)
> would change this to be less «neutral». The answer to «Q.» is
> simply «Yes. Software should be capable to strip an encoded BOM
> in UTF, because some softish Unicode processors fail to do so when
> converting in between different multioctet UTF schemes. Using BOM
> with UTF-8 is not recommended.»

The current text is much preferable. Also, you put the cart before the
horse: you place tools above users.

There is one reason to use the UTF-8 BOM which that FAQ point doesn't
mention, however: Chrome/Safari/WebKit plus IE treat a UTF-8 encoded
text/html page with a BOM differently from a UTF-8 encoded text/html
page without a BOM - even when the page is otherwise properly labelled
as UTF-8. For the former, the user is not able to override the
encoding manually. Whereas for a page without the BOM, the user can
override the encoding and shoot themselves (and others) in the foot.

> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
> tag, though optional. I like the term «extremely rare» sooo much!!
> :-)

What's the problem?

> If you know how to deal with UTF-8, you can deal with UTF-8.
> If you don't, no signature ever will help you, no?!

Do you mean that, instead of the wohoo, one should do more thorough
sniffing? I have no insight into how reliable such non-BOM sniffing
is, but I take it that it is much less reliable than BOM sniffing.
Hence it would be risky (?) to deny users the ability to override the
encoding of a non-BOM-sniffed page. Which, bottom line, means that the
BOM has an advantage.

> If you don't know the charset of some text, that comes from
> nowhere, i.e., no container format with meta-information, no
> filetype extension with implicit meta-information, as is used on
> Mac OS and DOS, then UTF-8 is still very easily identifieable by
> itself due to the way the algorithm is designed. Is it??

As I just said in a reply to Doug: of the Web browsers in current use,
Chrome is the very best here. This is, I think, because it, to a
higher degree than the competition, assumes UTF-8 whenever it finds
non-ASCII characters. Clearly, sniffing could improve, at least in the
browser world. But is that also true for command-line tools?
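The "easily identifiable by itself" claim in the quoted text can be sketched as a trial decode: UTF-8's strict lead-byte/continuation-byte pattern means that non-ASCII text in most legacy encodings fails UTF-8 validation. The function name and sample strings below are hypothetical; the caveat is that pure-ASCII input passes trivially, which is presumably why Chrome's heuristic only kicks in when non-ASCII bytes are present.

```python
# Sketch of BOM-less UTF-8 sniffing by validation. Valid UTF-8 is
# largely self-describing: multi-byte sequences must follow a strict
# lead/continuation pattern, so a trial decode is a strong (though not
# infallible) heuristic for non-ASCII input.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Non-ASCII text encoded as UTF-8 validates...
assert looks_like_utf8("blåbærsyltetøy".encode("utf-8"))
# ...while the same text in Latin-1 does not (0xE5 etc. are stray
# lead bytes with no continuation bytes after them).
assert not looks_like_utf8("blåbærsyltetøy".encode("latin-1"))
```

Pure-ASCII bytes, and occasional short Latin-1 sequences that happen to form valid UTF-8, are the residual ambiguity that makes this "much less reliable than BOM sniffing".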

-- 
Leif H Silli
Received on Mon Jul 16 2012 - 12:21:58 CDT