Re: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-18 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Wed, 18 Jul 2012 10:21:57 +0200

Martin,

"Martin J. Dürst", Wed, 18 Jul 2012 10:05:40 +0900:
> On 2012/07/18 4:35, Leif Halvard Silli wrote:
>> But is the Windows Notepad really to blame?
>
> Pretty much so. There may have been other products from Microsoft
> that also did it, but with respect to forcing browsers and XML
> parsers to accept an UTF-8 BOM as a signature, Notepad was definitely
> the main cause, by far.
>
>> OK, it was leading the way.
>> But can we think of something that could have worked "better", in
>> praxis? And, no, I don't mean 'better' as in 'not leaking the BOM into
>> HTML'. I mean 'better' as in 'spreading the UTF-8 to the masses'.
>
> UTF-8 is easy and cheap to detect heuristically. It takes a bit more
> work to scan the whole file than to just look at the first few bytes,
> but then I don't think anybody is/was editing 1MB files in Notepad.
> So the BOM/signature is definitely not the reason that UTF-8 spread
> on the Web and elsewhere.

(The file length issue is an issue on the Web too.)
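
To make the heuristic concrete, here is a minimal sketch of what "detecting UTF-8" can look like when the whole buffer is scanned instead of only the first three bytes. The function name sniff_utf8 and its return labels are made up for illustration; they do not describe what Notepad, TextEdit or any browser actually does.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF

def sniff_utf8(data: bytes) -> str:
    """Classify a byte string as 'utf-8-bom', 'ascii', 'utf-8' or 'unknown'.

    Hypothetical helper: the labels are illustrative only.
    """
    # Cheap check: only the first three bytes are inspected.
    if data.startswith(UTF8_BOM):
        return "utf-8-bom"
    # More expensive check: the whole buffer must be well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    # Pure ASCII is valid UTF-8 too, but says nothing about the author's intent.
    return "ascii" if data.isascii() else "utf-8"

print(sniff_utf8("Dürst".encode("utf-8")))
print(sniff_utf8("Dürst".encode("latin-1")))
```

The second check has to read every byte, which is the cost referred to above: it grows with the file length, while the BOM check does not.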

> The spread of UTF-8 is due to its strict US-ASCII compatibility.
> Every US-ASCII character/byte represents the same character, and only
> that character, in UTF-8. A plain ASCII file is an UTF-8 file. If
> syntax-significant characters are ASCII, then (close to) nothing may
> need to change when moving from a legacy encoding to UTF-8. On top of
> that, character synchronization is very easy because leading bytes
> and trailing bytes have strictly separate values. From that
> viewpoint, the BOM is a problem rather than a solution.
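
The two properties mentioned in the quote can be sketched in a few lines. This is only an illustration: the helper names is_continuation and resync are hypothetical, and resync assumes the input is well-formed UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 trailing bytes, which always have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def resync(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character containing it.

    Assumes data is well-formed UTF-8 and 0 <= pos < len(data).
    """
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

# ASCII compatibility: the same bytes decode to the same text either way.
assert b"<html>".decode("ascii") == b"<html>".decode("utf-8")

# Synchronization: landing inside a multi-byte character is easy to recover from.
data = "aé€".encode("utf-8")  # 1-byte, 2-byte and 3-byte characters
print(resync(data, 2), resync(data, 5))
```

Because leading and trailing bytes never share values, a parser that looks only for ASCII syntax characters such as "<" cannot match them in the middle of a multi-byte character.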

I was thinking about Notepad: what else could Notepad have done, other
than be turned into a different program, which would have delayed UTF-8
support altogether? The closest thing to Notepad on OS X is probably
TextEdit. On my OS X 10.5 computer, TextEdit does not sniff UTF-8 unless
there is a BOM. Yet TextEdit defaults to saving as UTF-8 (at least when
the situation calls for it), and it does so without including the BOM.
As a result, TextEdit fails to re-open the file as UTF-8.

On my OS X 10.7 computer, TextEdit does sniff UTF-8 (without the
BOM).
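
The save-then-reopen round trip can be sketched as follows. This is an illustration under stated assumptions, not a description of TextEdit's actual code: the function open_bom_only is hypothetical, and the mac_roman fallback is my assumption about what a BOM-only opener might use as its legacy default.

```python
import codecs

text = "Dürst"
with_bom = text.encode("utf-8-sig")   # the utf-8-sig codec prepends the BOM
without_bom = text.encode("utf-8")    # plain UTF-8, no signature

def open_bom_only(data: bytes, legacy: str = "mac_roman") -> str:
    """Hypothetical opener that trusts only the BOM (10.5-style behaviour).

    The legacy fallback encoding is an assumption made for this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(legacy)

print(open_bom_only(with_bom) == text)     # BOM present: read back as UTF-8
print(open_bom_only(without_bom) == text)  # no BOM: decoded with the legacy fallback
```

An opener that also validates the bytes as UTF-8 before falling back, as 10.7 appears to do, would read both files back correctly.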

Someone mentioned 'the cost of doing business'. And you have pointed
out that it takes time to realize ... That Notepad could have done
something else seems quite hypothetical to me.

PS: I have tried to argue (in a bug report) that WebKit should default
to UTF-8, including using UTF-8 detection. But I was shot down with the
words that WebKit should behave like all other browsers. So it seems one
needs a 'notepad' - such as Chrome - to lead the way.

> I think that a browser fully dedicated to HTML4 but not intending to
> implement HTML5 will eventually die out. If it exists today, it would
> indeed be reasonable to accept the BOM. But that's not because
> reading the spec(s) leads to that as the only conclusion, it's
> because there's content out there that starts with a BOM.

It seems we agree that in 2012, 'pre-HTML5 browsers' cannot be an
argument for issuing a warning in the W3C HTML validator or in W3C
documents.

-- 
Leif Halvard Silli

Received on Wed Jul 18 2012 - 03:34:03 CDT
