On Tue, 13 August 2002, Marco Cimarosti wrote:
>
> John Cowan wrote:
> > The following characters were explicitly permitted by XML 1.0 but are
> > not in the "recommended" 1.1 set:
> >
> [...]
> > U+FEFF ZWNBSP
>
> How do parsers detect the endianness of XML files in UTF-16 (and the very
> fact that they are UTF-16)?

I assume that U+FEFF ZWNBSP is included in this list precisely because it is now used solely with
the semantics of a Byte Order Mark, and its original meaning as ZWNBSP is deprecated in favour of
U+2060 WORD JOINER.

My understanding is that this list only refers to characters that are not permitted within XML
names. The BOM is placed at the head of the XML file, before the XML declaration, and is thus the
first character encountered by the parser.

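The parser works out the encoding and endianness of the XML file from the bytes of the BOM
(shown here in file order):
FE FF       = UTF-16 BE
FF FE       = UTF-16 LE
00 00 FE FF = UTF-32 BE
FF FE 00 00 = UTF-32 LE
EF BB BF    = UTF-8
Note that the UTF-32 LE pattern begins with the same two bytes as the UTF-16 LE one, so the
four-byte patterns have to be tested before the two-byte ones.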
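
As a rough illustration of the lookup above, here is a minimal Python sketch. It is not taken
from any real XML parser. The function name sniff_bom and the table _BOMS are made up for this
example, and only the BOM constants from the standard codecs module are relied upon. It assumes
the document does not begin with U+0000 (which XML does not allow anyway), since otherwise a
UTF-16 LE file could be mistaken for UTF-32 LE.

```python
import codecs

# Hypothetical lookup table for this sketch. The UTF-32 entries come first
# because the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would then skip the BOM before decoding, for example
encoding, n = sniff_bom(raw) followed by raw[n:].decode(encoding). When no BOM is found the
sketch simply returns None, and a real parser would have to fall back on other means, such as
looking at the bytes of the XML declaration itself.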
Andrew
This archive was generated by hypermail 2.1.2 : Tue Aug 13 2002 - 05:34:08 EDT