Re: BOM ambiguity? from Philippe Verdy on 2012-07-13 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 14 Jul 2012 03:03:12 +0200

No. Because there's no ambiguity for HTML, not even for its newer
version 5. Remember that HTML documents must still start with either a
DTD declaration, or an XML declaration (yes, HTML5 also exists with an
XML binding), or an HTML comment, or the "<html ...>" element: all of
them start with a LESS-THAN SIGN. The only intermediate characters
that may be accepted are newline controls and spaces. If there's a BOM,
it will not match this requirement in any supported encoding (not
even if the document was encoded in UTF-32, BE or LE).

Or if the presence of null bytes was a problem, then UTF-16 would also
be forbidden. Note that NULL characters (U+0000) are invalid in HTML
and XML (all schemas and versions). But null bytes are allowed to
support UTF-16.
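
The argument above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the helper names
candidate_encodings and plausible_markup_encodings are hypothetical,
and the check (after the BOM, only spaces or newlines may precede "<")
is a simplification of what a real HTML or XML parser does.

```python
import codecs

# BOMs known to Python's codecs module. The UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE), which is the ambiguity at issue.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def candidate_encodings(data):
    """Return every encoding whose BOM is a prefix of data."""
    return [name for bom, name in BOMS if data.startswith(bom)]


def plausible_markup_encodings(data):
    """Keep only the candidates under which the text after the BOM
    starts with '<', optionally preceded by spaces or newlines."""
    plausible = []
    for bom, name in BOMS:
        if not data.startswith(bom):
            continue
        try:
            text = data[len(bom):].decode(name)
        except UnicodeDecodeError:
            continue
        # U+0000 is not stripped here, so a leading NUL rules the
        # candidate out.
        if text.lstrip(" \t\r\n").startswith("<"):
            plausible.append(name)
    return plausible


doc32 = codecs.BOM_UTF32_LE + "<html>".encode("utf-32-le")
doc16 = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")

print(candidate_encodings(doc32))         # both LE BOMs match the prefix
print(plausible_markup_encodings(doc32))  # only one survives the '<' test
print(plausible_markup_encodings(doc16))
```

Read as UTF-16 LE, the UTF-32 document would begin with U+0000, which
is neither whitespace nor "<", so that reading is rejected.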

I think that the support for UTF-32 was cancelled only to limit the
choice of support in implementations and make sure that all of them
will at least fully support the minimum required. UTF-16 was judged to
be more useful than UTF-32, only to support modern Asian scripts with
an encoding that is a bit more compact for these scripts (also because
HTML documents will contain a lot of ASCII characters (U+0020..U+007E)
for the HTML tags, embedded CSS and scripts). Using UTF-32 would be
just a waste of space, even for pages fully written in old scripts
encoded outside the BMP (also because there will remain a significant
number of ASCII chars for the tagging)! UTF-16 will always be better
than UTF-32 in terms of data size. Data size is important only for
storing the web page in memory, but even for this case, it is possible
to store in memory only pages in a compressed form. Web servers can
also compress pages on the fly before sending them to the network when
replying to clients.
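
As a rough illustration of the size argument, the sketch below encodes
a small made-up page fragment in both forms and also compresses each
with zlib. The sample text (ASCII tags around Gothic letters, which
are encoded outside the BMP) and the repetition count are assumptions
chosen only for the demonstration; real pages will give different
numbers.

```python
import zlib

# Hypothetical sample: ASCII markup wrapped around six Gothic letters
# (U+10330..U+10335, outside the BMP), repeated to mimic a page body.
gothic = "".join(chr(0x10330 + i) for i in range(6))
page = ('<p class="text">' + gothic + "</p>\n") * 200

utf16 = page.encode("utf-16-le")
utf32 = page.encode("utf-32-le")

# Raw sizes: every code point costs 4 bytes in UTF-32, while in UTF-16
# the ASCII tags cost 2 bytes each and only the Gothic letters cost 4.
print("UTF-16:", len(utf16), "bytes")
print("UTF-32:", len(utf32), "bytes")

# Generic compression narrows the gap, whichever encoding is used.
print("UTF-16 compressed:", len(zlib.compress(utf16)), "bytes")
print("UTF-32 compressed:", len(zlib.compress(utf32)), "bytes")
```

Even with all of the text content outside the BMP, the ASCII tagging
alone keeps the UTF-16 form smaller than the UTF-32 one.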

That's not a good reason: the page will be stored in a file of the
underlying filesystem assigned to the user's sessions with the
browser, just for caching the page (because local storage will be much
faster). In fact some servers have strict policies for refusing
repeated requests to get the source page.
Generic data compression is always possible, including over the HTTP session.

Forbidding UTF-32 just means that it won't be available over the
Internet from a third party. But HTML parsers in browsers are not
really so restricted when they are managing their own local cache for
the viewing user. Even the content of that cache would be encrypted
and undecipherable by other users of any browser on the same host
instance.

So I would just say that support for UTF-32 was simply not
required. HTML5 was in fact designed to remove the constant need for
optional extensions (that's why SVG was also integrated in it, unlike
with XHTML 1.0). This has allowed a lot of code optimisation and
helped improve performance a lot in recent browsers, making it
also available for smaller smartphones with more limited computing
capacities.

But I don't see why a browser would refuse to parse UTF-32 encoded
documents (notably those adopting the XML binding format where it is
explicitly allowed by XML, even if it often requires an XML
declaration).

2012/7/14 John W Kennedy <jwkenne_at_attglobal.net>:
> On Jul 13, 2012, at 4:54 PM, Stephan Stiller wrote:
>> As an aside to the BOM discussion - something I've always been meaning to ask.
>>
>> So there is a BOM-ambiguity when a file starts with
>> FF FE
>> and then a couple of U+0000 characters, yes? Because this could be either UTF-16 or UTF-32 under little-endianness. Has this been pointed out and discussed beforehand?
>>
>> Because the set of BOMs in different encodings don't constitute a prefix-free code.
>
> Isn't this why UTF-32 is forbidden for HTML 5?
Received on Fri Jul 13 2012 - 20:07:30 CDT
