From: Tom Gewecke (tom@bluesky.org)
Date: Mon Feb 17 2003 - 08:57:43 EST
>If this is true -- that U+FEFF is a kind of meta-character that doesn't
>really belong to the text per se -- then it should be equally true for
>UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
>and UTF-32 but not UTF-8) or as a signature (potentially useful in all
>Unicode CES's). Only in its evil-twin role as a zero-width no-break
>space is it truly part of the text, in which case the previous
>discussion comments about white-space characters apply.
For what it is worth, the XML 1.0 spec (Second Edition,
http://www.w3.org/TR/2000/REC-xml-20001006#sec-documents) says this about
the BOM:
>Entities encoded in UTF-16 must begin with the Byte Order Mark ... This is
>an encoding signature, not part of either the markup or the character data
>of the XML document. XML processors must be able to use this character to
>differentiate between UTF-8 and UTF-16 encoded documents.
The implication seems to be that in XML, at least, a UTF-8 entity is not
expected to carry a BOM (or an encoding declaration). Other parts of the
doc, especially Appendix F, seem to recognize that an entity may arrive
either with or without a BOM. An entity in any encoding other than UTF-8
or UTF-16 must have an encoding declaration.
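
To make the "signature, not part of the character data" idea concrete,
here is a rough Python sketch of how a processor might use a leading
U+FEFF to tell UTF-8 from UTF-16 and then drop it before handing the text
on. The function names sniff_bom and decode_entity are my own invention,
not anything from the XML spec, and the sketch assumes only UTF-8 and
UTF-16 input. It ignores UTF-32 and the encoding declaration entirely, and
falls back to UTF-8 when no BOM is present.

```python
import codecs


def sniff_bom(data: bytes) -> tuple[str, int]:
    """Guess (encoding, bom_length) from a leading U+FEFF signature.

    Hypothetical helper: handles only UTF-8 and UTF-16, and assumes
    UTF-8 when no signature is found.
    """
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    return "utf-8", 0                          # no signature seen


def decode_entity(data: bytes) -> str:
    """Decode bytes, treating a leading BOM as a signature and dropping it."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)
```

With this approach a U+FEFF at the very start is consumed as a signature,
while one appearing later in the data is left alone as a zero-width
no-break space, i.e. as part of the text.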