Shlomi Tal <shlompi@hotmail.com> wrote:
> Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
> ...
> There is another, very common Unicode encoding scheme called UTF-8,
> which maps the Unicode repertoire into sequences of bytes. Since
> the order of bytes (as opposed to words of more than one byte) is
> the same for all processors, UTF-8 does not require a BOM. It can
> have one, though.
Shlomi explains the "signature" function of the BOM much later in his
FAQ, but just to summarize, U+FEFF in its role as BYTE ORDER MARK -- as
opposed to ZERO WIDTH NO-BREAK SPACE (not "Non-Breaking") -- has two
(overlapping) purposes:
* as a true byte order mark
* as a text format signature
These two uses are explained in TUS 3.0, Section 13.6, "Specials" (p.
324).
UTF-8 does not need a byte order mark per se, of course, but in certain
environments it may benefit from a signature.  This point is regularly
missed by those who view BOMs in UTF-8 text as needless junk and who
regard Microsoft's use of this marker as evil (not necessarily Shlomi's
opinion, but certainly that of Markus Kuhn and many other Linux
faithful).
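To make the "signature" role concrete, here is a minimal Python sketch of
how a reader might sniff the leading bytes of a file.  The function name
sniff_signature and the table SIGNATURES are my own inventions for
illustration, and the sketch deliberately ignores UTF-32, so treat it as
one possible approach under those assumptions rather than a description
of what any particular product does.

```python
import codecs

# (signature bytes, encoding name) pairs; UTF-32 is deliberately left out
# to keep the sketch short, so FF FE 00 00 would be reported as UTF-16 LE.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_signature(data: bytes):
    """Return (encoding, signature_length), or (None, 0) if none is found."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

utf8_result = sniff_signature(codecs.BOM_UTF8 + b"hello")
utf16_result = sniff_signature(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
plain_result = sniff_signature(b"hello")
print(utf8_result, utf16_result, plain_result)
```

In the UTF-16 cases the same bytes also settle the byte order; in the
UTF-8 case they can only serve as a signature, which is the overlap
described above.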
It is true that Unix/Linux systems, even those that are configured to
use UTF-8, often expect the first 'n' bytes of a file to identify the
file type (e.g. "#!"), and may not work correctly if the file starts
with 0xEF 0xBB 0xBF instead.  But representing the issue as "U+FEFF is a
byte-order mark, therefore UTF-8 files don't need it" sheds no light on
the reasons why some vendors choose to include it.
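The "#!" problem can be shown with a rough Python sketch.  The helper
looks_like_script is hypothetical and is only a simplified stand-in for
whatever check a given system performs on the first bytes of a file.

```python
import codecs

script = b"#!/bin/sh\necho hello\n"
with_signature = codecs.BOM_UTF8 + script  # same text, preceded by EF BB BF

def looks_like_script(data: bytes) -> bool:
    # Simplified stand-in for a check on the first two bytes of the file.
    return data[:2] == b"#!"

plain_ok = looks_like_script(script)
signed_ok = looks_like_script(with_signature)

# Python's "utf-8-sig" codec drops one leading signature while decoding.
decoded = with_signature.decode("utf-8-sig")

print(plain_ok, signed_ok, decoded.startswith("#!"))
```

A reader that knows about the signature can skip it easily; one that
compares raw leading bytes cannot, which is the practical conflict here.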
> Web pages usually use UTF-8, and although they can handle the BOM,
> it may appear as a strange character (a blank square or a question
> mark) on a browser that doesn't recognize it, and may also cause
> the above troubles when the file is saved to the local disk.
There is no reason for a Unicode-compliant browser to display "a blank
square or a question mark" for U+FEFF instead of a zero-width no-break
space.  U+FEFF is one of the better-known Unicode characters and
legitimately has the ZWNBSP semantic, and will (perhaps regrettably)
continue to have it even after U+2060 WORD JOINER becomes widely
recognized as the preferred character for that function.
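For anyone who wants to check the character properties, a short Python
sketch using the standard unicodedata module is enough.  It only reports
what the Unicode Character Database bundled with the Python
installation says; it makes no claim about how a given browser renders
either character.

```python
import unicodedata

# Collect the name and general category of U+FEFF and U+2060.
info = {}
for ch in ("\ufeff", "\u2060"):
    info[f"U+{ord(ch):04X}"] = (unicodedata.name(ch), unicodedata.category(ch))

for code_point, (name, category) in info.items():
    print(code_point, name, category)
```

Both characters are in general category Cf (format), so neither has a
visible glyph of its own.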
> old 8-bit "ANSI" (Microsoft's non-standard name for its 8-bit
> Windows codepages
If it were up to me, I would dispense with the gratuitous, 15-year-old
jab at Microsoft for calling the Windows code pages "ANSI."  Their
reasons for doing so have been documented often.  Putting "ANSI" in
quotation marks might have been sufficient.  But I understand that this
FAQ is intended for a Unix/Linux audience, and that may simply be the
price of admission.
> Since UTF-16 text files are not meant for open transfer anyway,
> this is not an important issue. As for database applications and
> other situations where text files are merged, a Unicode-aware
> application should be able to discard all following U+FEFF
> characters.
The reference to UTF-16 being "not meant for open transfer" and the
statement about discarding non-initial U+FEFF are not strictly correct
(the latter because a non-initial U+FEFF could be a genuine ZWNBSP), but
in the limited context of this FAQ they are probably harmless.
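One cautious policy for merging, sketched in Python below, is to drop
only the single leading U+FEFF of each piece and leave any interior
occurrence alone.  The function merge_texts is a hypothetical name, and
this is one possible policy, not something the FAQ or the standard
prescribes.

```python
def merge_texts(pieces):
    """Join decoded strings, dropping one leading U+FEFF from each piece."""
    cleaned = []
    for text in pieces:
        if text.startswith("\ufeff"):
            text = text[1:]  # a leading U+FEFF is treated as a signature
        cleaned.append(text)
    return "".join(cleaned)

first = "\ufeffalpha\n"
second = "\ufeffbe\ufeffta\n"  # the interior U+FEFF may be a real ZWNBSP
merged = merge_texts([first, second])
print(repr(merged))
```

Discarding every U+FEFF after the first, as the FAQ suggests, would also
remove the one inside "second", which may have been intended as a
ZWNBSP.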
Finally, I agree with Shlomi that the references to UTF-7 are "not of
any importance," to the point where I am not sure why UTF-7 is even
mentioned.  As Shlomi points out, Microsoft products do not treat UTF-7
specially, except that IE recognizes the UTF-7 BOM and sets its encoding
accordingly (but this is true for any UTF-7 sequence, not just the BOM;
try loading a text file containing only the 11 ASCII characters
"M+APw-nchen").
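The sample string can be checked without a browser.  This small Python
sketch uses the standard "utf-7" codec to decode the same 11 bytes; it
shows only what the codec produces and says nothing about how IE decides
on an encoding.

```python
sample = b"M+APw-nchen"        # 11 ASCII bytes, no BOM
decoded = sample.decode("utf-7")

print(len(sample))
print(decoded)
```

The "+APw-" run is the UTF-7 encoding of U+00FC, so the decoded text is
the seven-character name of the city of Munich in German.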
-Doug Ewell
 Fullerton, California