From: Doug Ewell (doug@ewellic.org)
Date: Wed Oct 06 2010 - 12:58:28 CDT
Bjoern Hoehrmann <derhoermi at gmx dot net> wrote:
> ... However, there are no or insufficient recommendations when
> protocols should allow [U+FEFF signatures], and which of the many
> signatures should be recognized when performing auto-detection.
I assume you have read http://unicode.org/faq/utf_bom.html#BOM .
Increasingly, protocols tend to discourage or forbid the use of U+FEFF
signatures, either to achieve poor-man's compatibility with 8-bit legacy
applications (like shell scripts), or out of fears that two encoding
declarations in the same document (e.g. U+FEFF signature plus XML
"encoding") might disagree.
This type of objection to in-band tagging mechanisms tends to assume
that all worthwhile data is in a high-level markup format, or that
processing these sequences is too difficult for 21st-century software.
> Furthermore, the signatures are ambiguous.
The only ambiguity I can think of is where "little-endian UTF-16 BOM
followed by U+0000" can be confused with "little-endian UTF-32 BOM."
Most text strings do not begin with U+0000, so even this case is more of
a theoretical problem than a real one.
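
To make the collision concrete, here is a minimal Python sketch of signature sniffing. The names SIGNATURES and sniff_signature are hypothetical, chosen for illustration only; the BOM constants come from Python's standard codecs module. Checking the longer UTF-32 signatures first is one possible policy, not something any specification mandates.

```python
import codecs

# Hypothetical lookup table. Longer signatures come first so that the
# UTF-32 LE signature (FF FE 00 00) is tested before the UTF-16 LE
# signature (FF FE), which is a prefix of it.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes):
    """Return (encoding name, signature length), or (None, 0) if no signature."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# "Little-endian UTF-16 BOM followed by U+0000" is byte-for-byte identical
# to the little-endian UTF-32 BOM, so no ordering of the table can tell
# the two apart.
ambiguous = codecs.BOM_UTF16_LE + "\u0000".encode("utf-16-le")
```

With this ordering, sniff_signature(ambiguous) reports UTF-32 LE. A detector that tested the UTF-16 signatures first would report UTF-16 LE for the same four bytes, which is the whole ambiguity.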
There are several possible byte sequences for the UTF-7 signature, but
this is more of an inconvenience than an ambiguity. UTF-7 signatures
tend to appear more in comprehensive tables of signatures than in actual
content.
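
The reason there are several UTF-7 sequences is that U+FEFF is 16 bits, while modified Base64 packs 6 bits per character, so the third Base64 character carries 4 bits of U+FEFF plus the top 2 bits of whatever follows. A short sketch that derives the four possible 4-byte prefixes (the function name utf7_signatures is hypothetical, and this only enumerates prefixes, it is not a UTF-7 decoder):

```python
# Standard Base64 alphabet, as used by UTF-7's modified Base64.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def utf7_signatures():
    """Enumerate the 4-byte prefixes that can encode U+FEFF in UTF-7."""
    bits = format(0xFEFF, "016b")       # the 16 bits of U+FEFF
    # "+" opens a Base64 run; the first 12 bits fill two Base64 characters.
    prefix = "+" + B64[int(bits[0:6], 2)] + B64[int(bits[6:12], 2)]
    tail = bits[12:16]                  # 4 leftover bits of U+FEFF
    # The third character takes its low 2 bits from the next character.
    return [
        (prefix + B64[int(tail + format(n, "02b"), 2)]).encode("ascii")
        for n in range(4)
    ]
```

The four results are the byte strings "+/v8", "+/v9", "+/v+" and "+/v/", which is why signature tables list UTF-7 several times.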
> This has led to a situation where protocols vary considerably leading
> to interoperability failures and potential security problems. For
> instance, it is common for XML processors to support UTF-32 and detect
> it properly, while other formats, like "HTML5", require treating
> documents with a UTF-32 LE signature as UTF-16 LE. Yet other formats,
> like JSON, are textual in nature and permit only various Unicode
> encodings, but do not permit the BOM.
HTML5, at least, deliberately forbids the use of certain encodings (like
SCSU) and auto-detection of others (like UTF-32), not only to prevent
cross-site scripting attacks, but out of a belief that supporting them
"just wastes developer time." See
http://lists.w3.org/Archives/Public/public-html-comments/2008Jan/0032.html
to see this viewpoint expressed by an HTML Working Group participant.
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s