Re: pre-HTML5 and the BOM from Martin J. Dürst on 2012-07-17 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Tue, 17 Jul 2012 18:46:03 +0900

On 2012/07/14 1:33, Philippe Verdy wrote:
>> From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
>>> "When the BOM is used in web pages or editors for UTF-8 encoded content it
>>> can sometimes introduce blank spaces or short sequences of strange-looking
>>> characters (such as ï»¿). For this reason, it is usually best for
>>> interoperability to omit the BOM, when given a choice, for UTF-8 content."
>>>
>>> http://www.w3.org/International/questions/qa-byte-order-mark

> This statement for maximum interoperability may have been true in the
> past, where Unicode support was not so universal and still not adopted
> formally for all newer developments in RFCs published by the IETF. But
> now the situation is reversed: maximum interoperability is offered
> when BOMs are present, not really to indicate the byte order itself,
> but to confirm that the content is Unicode encoded and extremely
> likely to be text content and not arbitrary binary contents (that
> today almost always use a distinctive leading signature).

As you mention the IETF, what people in the IETF like most about UTF-8
is that it's upward-compatible with ASCII. Because the
protocol/syntax-relevant part is usually ASCII only, that means that a
lot of stuff can work just by making things 8-bit clean (which in this
day and age may mean essentially no work in some cases).
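
To illustrate the point about ASCII compatibility, here is a minimal sketch in Python. The header line and its "name: value" layout are made up for the example, not taken from any particular protocol. The parser only looks at ASCII bytes and passes everything else through untouched, which is all that "8-bit clean" needs to mean here.

```python
# Hypothetical protocol line: ASCII field name, UTF-8 encoded value.
raw = "Subject: Dürst über UTF-8\r\n".encode("utf-8")

# Parse on bytes only. Every byte of a multi-byte UTF-8 sequence is >= 0x80,
# so the ASCII colon (0x3A) can never appear inside a non-ASCII character.
name, _, value = raw.rstrip(b"\r\n").partition(b":")
value = value.strip()  # bytes.strip() removes ASCII whitespace only

print(name)                   # the syntax-relevant part, pure ASCII
print(value.decode("utf-8"))  # the payload, decoded only when needed
```

The parser never has to know that the value is UTF-8; it just must not mangle bytes above 0x7F.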

A BOM anywhere in a protocol therefore just removes the biggest
advantage of UTF-8. While it's usually okay to use a BOM at the start of
a whole file (or the file equivalent in transmission, which is a MIME
entity), anywhere else (e.g. in small protocol fields), a BOM is a big
no-no.
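
A small sketch of why this matters, again in Python and again with a made-up "name: value" field format; the helper names field_name and strip_leading_bom are hypothetical. A receiver that compares ASCII field names byte for byte stops recognizing a field once a BOM is put in front of it, whereas dropping a single BOM at the very start of a whole file or MIME entity is straightforward.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def field_name(line: bytes) -> bytes:
    """Return the part before the first colon of a 'name: value' line."""
    return line.partition(b":")[0]


plain = b"Content-Type: text/plain"
with_bom = BOM + plain

# The ASCII-only comparison succeeds without a BOM and fails with one,
# because the extracted name now starts with the three BOM bytes.
print(field_name(plain) == b"Content-Type")
print(field_name(with_bom) == b"Content-Type")


def strip_leading_bom(entity: bytes) -> bytes:
    """Drop one BOM at the start of a whole file or MIME entity, nowhere else."""
    return entity[len(BOM):] if entity.startswith(BOM) else entity
```

Applying strip_leading_bom once per file is cheap; having to apply it to every small protocol field is exactly the extra work that ASCII compatibility was supposed to avoid.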

Regards, Martin.
Received on Tue Jul 17 2012 - 04:48:54 CDT
