RE: Eleventh hour check on XML 1.1 names

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Tue Aug 13 2002 - 07:25:48 EDT


On Tue, 13 August 2002, Marco Cimarosti wrote:

>
> John Cowan wrote:
> > The following characters were explicitly permitted by XML 1.0 but are
> > not in the "recommended" 1.1 set:
> >
> [...]
> > U+FEFF ZWNBSP
>
> How do parsers detect the endianness of XML files in UTF-16 (and the very
> fact that they are UTF-16)?

I assume that U+FEFF ZWNBSP is included in this list precisely because it is now used solely with
the semantics of a Byte Order Mark, and its original meaning as ZWNBSP is deprecated in favour of
U+2060 WORD JOINER.

My understanding is that this list only refers to characters that are not permitted within XML
names. The BOM is placed at the head of the XML file, before the XML declaration, and is thus the
first character encountered by the parser.

The parser works out the encoding and endianness of the XML file from the byte sequence that the
BOM takes at the very start of the file:
FE FF       = UTF-16 BE
FF FE       = UTF-16 LE
00 00 FE FF = UTF-32 BE
FF FE 00 00 = UTF-32 LE
EF BB BF    = UTF-8
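
As a rough illustration of the lookup above, here is a minimal Python sketch of BOM sniffing. It is
not taken from any particular XML parser: the function name sniff_bom and the returned labels are
made up for this example, and only the BOM constants come from Python's standard codecs module. The
UTF-32 marks are tested before the UTF-16 ones because the UTF-32 LE mark (FF FE 00 00) begins with
the same two bytes as the UTF-16 LE mark (FF FE).

```python
import codecs

# Hypothetical lookup table for this sketch. Order matters: the 4-byte
# UTF-32 marks come first so that FF FE 00 00 is not misread as UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (label, bom_length) for a leading BOM, or (None, 0) if absent."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0


if __name__ == "__main__":
    # "<?" encoded as UTF-16 LE, preceded by its BOM.
    sample = codecs.BOM_UTF16_LE + "<?".encode("utf-16-le")
    print(sniff_bom(sample))
```

A file without a BOM simply yields (None, 0) here; what a parser does in that case is outside the
scope of this sketch.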

Andrew


