Re: UTF-8 isn't the default for HTML (was: xkcd: LTR) from Philippe Verdy on 2012-11-29 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 29 Nov 2012 19:11:42 +0100

- Method 1 (the BOM) is only good for UTF-16. It is not reliable for UTF-8,
which is still the default for XHTML (and where the BOM is not always
present).
- Method 2 works sometimes, but is not practical for the many servers that
you can't configure to change the content-type of specific pages all
having the same *.html extension, or for pages relayed by some proxies. It
also depends on the transport layer (HTTP here) being capable of offering
it (HTML files in file systems do not provide the info). But if it is
implemented it will take precedence, possibly indicating that the document
was reencoded (by a proxy, for example).
- Methods 3 and 4 are completely equivalent and share the same problem:
they require restarting the parsing. They are equally ugly (just like all
empty meta elements in the HTML header or in the body). Introducing another
attribute to the meta element (which already has name, http-equiv, and now
charset) is also a bad idea: data encoded in attributes that are part of
the document root breaks the concept of what is metadata. It also forbids
the reencoding of the document during processing if the document is
digitally signed for its content, independently of its encoding: to check
the document signature, you would not only have to parse it completely up
to the DOM level, but also ignore these specific meta elements (but not all
meta elements, like links).

- Where is method 5?

- Method 6 (sniffing) is a transitional solution (as long as HTML5 is not
released) or a last-chance palliative based only on a heuristic, which
sometimes fails. Not reliable.

- Method 7 (using the XML prolog) is excellent for XML. It will reliably
work with XHTML5, without needing reparsing.

- Method 8 (content-type set as "application/xhtml+xml" in the transport
layer) is exactly like method 2 (and suffers from the same problem), but
the content-type is not really intended for HTML5, not even XHTML5, as it
implies an application and the extensible schema that XHTML5 will not
parse. Method 8 for me implies the forced use of an XML parser, not an HTML
parser. All XML extensions (including namespaces) will be valid.
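
To make the comparison concrete, here is a minimal Python sketch of how
methods 1, 2 and 7 can each be read without building a DOM. The function
names are hypothetical, the regular expressions are deliberately simplified,
and the XML prolog check only covers ASCII-compatible encodings. It is an
illustration, not what any browser actually implements.

```python
import codecs
import re

# BOM signatures. UTF-32 LE is listed before UTF-16 LE because its BOM
# starts with the same two bytes (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified patterns (assumptions, not the full grammar of either standard).
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.I)
_XML_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def encoding_from_bom(data):
    """Method 1: return the encoding named by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def encoding_from_content_type(header):
    """Method 2: extract charset from a Content-Type header value, or None."""
    if not header:
        return None
    match = _CHARSET_RE.search(header)
    return match.group(1).lower() if match else None


def encoding_from_xml_prolog(data):
    """Method 7: read the encoding from an XML declaration at byte 0, or None."""
    match = _XML_DECL_RE.match(data)
    return match.group(1).decode("ascii").lower() if match else None
```

All three look only at the transport header or at the first bytes of the
file, which is why none of them forces the parser to restart.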

My method is a generalisation to HTML of the excellent method 7 for XHTML
(based on its standard and the XML standard). It requires absolutely no
reparsing, and supports the explicit versioning of HTML (for future
evolutions of its supported schema), without overwriting the independent
versioning of XML if it is used. It also does not require the new ugly
DOCTYPE, which indicates absolutely nothing significant, will not allow
versioning, and breaks SGML parsers as well as XML parsers. It takes
advantage of the fact that the prolog in method 7 does not break browsers
(even if some of them do not sniff the encoding from the XML prolog).
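
As a rough illustration of why no reparsing would be needed, here is a small
Python sketch that reads the proposed <?html version="5.0" encoding="utf-8">
prolog from the first bytes of a document and only then decodes the rest.
The prolog syntax is only the proposal quoted below, not part of any HTML
standard, and the function names, the accepted attribute syntax and the
UTF-8 fallback are assumptions made for this example.

```python
import re

# Hypothetical grammar: "<?html", then name="value" pairs in ASCII, then ">"
# (an XML-style "?>" terminator is tolerated as well).
_HTML_PROLOG_RE = re.compile(
    rb'^<\?html((?:\s+[A-Za-z]+\s*=\s*"[^"<>\x80-\xff]*")*)\s*\??>'
)
_ATTR_RE = re.compile(rb'([A-Za-z]+)\s*=\s*"([^"<>\x80-\xff]*)"')


def parse_html_prolog(data):
    """Return (attributes, remaining bytes); attributes is None if no prolog."""
    match = _HTML_PROLOG_RE.match(data)
    if match is None:
        return None, data
    attrs = {
        name.decode("ascii").lower(): value.decode("ascii")
        for name, value in _ATTR_RE.findall(match.group(1))
    }
    return attrs, data[match.end():]


def decode_document(data, fallback="utf-8"):
    """Decode the body once, using the prolog's encoding when one is present."""
    attrs, body = parse_html_prolog(data)
    encoding = (attrs or {}).get("encoding", fallback)
    return body.decode(encoding)
```

For example, decode_document(b'<?html version="5.0"
encoding="iso-8859-1"><p>caf\xe9</p>') decodes the body as ISO-8859-1 in a
single pass, because the encoding is known before any element is read.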

2012/11/29 Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>

> Philippe Verdy, Thu, 29 Nov 2012 16:10:14 +0100:
> > Thanks a lot, this was really hard to see and understand, because I
> > was only reading the XHTML specs, and the Validator did not complain.
>
> Glad to find we are on the same page!
>
> Philippe Verdy, Thu, 29 Nov 2012 16:27:13 +0100:
> > <?html version="5.0" encoding="utf-8">
>
> HTML5 already has 4 *conforming* methods for setting the UTF-8
> encoding:
>
> 1. byte-order mark
> 2. HTTP server,
>    Content-Type:text/html;charset=UTF-8
> 3. meta http-equiv,
>    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
> 4. meta charset,
>    <meta charset="UTF-8"/>
>    (Note that there is no content-type here, and thus the meta charset
>    method is more "clean" to use in a file served as XHTML.)
>
> In addition, other things have effect:
>
> 6. Sniffing is an official, but largely unimplemented method for
>    getting the encoding (Chrome and Opera use it, and Firefox
>    has it as an option and also uses it by default for some locales.)
> 7. The XML prologue (sic) takes effect in *some* browsers.
> 8. Simply serving the page as application/xhtml+xml is
>    yet another method of setting the encoding to UTF-8.
>
> Thus I can guarantee you that your idea about a method number 9 is
> not going to be met with enthusiasm.
> --
> leif halvard silli
>
Received on Thu Nov 29 2012 - 12:16:18 CST
