Re: MS/Unix BOM FAQ again (small fix)

From: Shlomi Tal (shlompi@hotmail.com)
Date: Wed Apr 10 2002 - 00:58:12 EDT


>UTF-8 does not need a byte order mark per se, of course, but in certain
>environments it may benefit from a signature. This point is regularly
>missed by those who view BOMs in UTF-8 text as needless junk, and
>Microsoft's use of this marker as evil (not necessarily Shlomi's
>opinion, but certainly that of Markus Kuhn and many other Linux
>faithful).
>
>It is true that Unix/Linux systems, even those that are configured to
>use UTF-8, often expect the first 'n' bytes of a file to identify the
>file type (e.g. "#!"), and may not work correctly if the file starts
>with 0xEF 0xBB 0xBF instead. But representing the issue as "U+FEFF is a
>byte-order mark, therefore UTF-8 files don't need it" sheds no light on
>the reasons why some vendors choose to include it.

Well, I did receive a comment from a Linux user about how the UTF-8 BOM is
meant as a magic file type, not as a mark for the byte order, but such
tagging of plain-text files does sound weird even to me, as someone who uses
Windows 2000 most of the time. I may be completely off the mark here, but
tagging of plain-text files seems to me an eerie reminder of the ISO-2022
escape sequences. In UTF-16 it's sensible because the OS needs to interpret
the byte values differently (e.g. the bytes 0x26 0x65 as "black heart suit"
instead of "ampersand, small Latin letter e").
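
A minimal sketch of that difference, assuming Python 3 and its standard
codecs (the helper name `interpretations` is hypothetical, chosen only for
illustration). It decodes the same two bytes as 8-bit text and as UTF-16 in
each byte order, then shows the generic "utf-16" codec using a leading BOM
to pick the order:

```python
import unicodedata

def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under three different assumptions."""
    return {
        "ascii": raw.decode("ascii"),
        "utf-16-be": raw.decode("utf-16-be"),
        "utf-16-le": raw.decode("utf-16-le"),
    }

result = interpretations(b"\x26\x65")
for name, text in result.items():
    # Print the Unicode character names each interpretation produces.
    print(name, [unicodedata.name(ch) for ch in text])

# With a BOM in front, the generic "utf-16" codec chooses the byte order
# from it and drops the BOM from the decoded text.
with_bom = (b"\xfe\xff" + b"\x26\x65").decode("utf-16")
print(unicodedata.name(with_bom))
```

Without the BOM (or some outside knowledge of the byte order), nothing in
the two bytes themselves says which of the UTF-16 readings is intended.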

> > Web pages usually use UTF-8, and although they can handle the BOM,
> > it may appear as a strange character (a blank square or a question
> > mark) on a browser that doesn't recognize it, and may also cause
> > the above troubles when the file is saved to the local disk.
>
>There is no reason for a Unicode-compliant browser to display "a blank
>square or a question mark" for U+FEFF instead of a zero-width no-break
>space. U+FEFF is one of the better-known Unicode characters and
>legitimately has the ZWNBSP semantic, and will (perhaps regrettably)
>continue to have it even after U+2060 WORD JOINER becomes widely
>recognized as the preferred character for that function.

If Mozilla has already fixed that, then good. I have Mozilla on Win2K, but
on my Linux partition I have only Netscape 4.7, and that one does mishandle
the BOM.
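
What "mishandling" amounts to can be sketched in Python 3 (an assumption of
this example; it says nothing about how Netscape 4.7 is actually
implemented). A decoder that does not know about the signature keeps U+FEFF
as the first character of the text, while a signature-aware decoder such as
Python's "utf-8-sig" codec drops it:

```python
import codecs

# A UTF-8 page that starts with the signature bytes EF BB BF.
page = codecs.BOM_UTF8 + "<p>hello</p>".encode("utf-8")

naive = page.decode("utf-8")       # keeps U+FEFF as the first character
aware = page.decode("utf-8-sig")   # treats the leading EF BB BF as a signature

print(repr(naive[0]))   # the character a BOM-unaware renderer has to display
print(repr(aware[0]))   # the first real character of the page
```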

> > old 8-bit "ANSI" (Microsoft's non-standard name for its 8-bit
> > Windows codepages
>
>If it were up to me, I would dispense with the gratuitous, 15-year-old
>jab at Microsoft for calling the Windows code pages "ANSI." Their
>reasons for doing so have been documented often. Putting "ANSI" in
>quotation marks might have been sufficient. But I understand that this
>FAQ is intended for a Unix/Linux audience, and that may simply be the
>price of admission.

It wasn't in the original, and I don't have any particular grudge against
Microsoft. I added it after Markus Kuhn pointed out that the Microsoft term
"ANSI" was a misnomer. I'd rather avoid the term altogether, but it is
ubiquitous in Windows 2000/XP.

> > Since UTF-16 text files are not meant for open transfer anyway,
> > this is not an important issue. As for database applications and
> > other situations where text files are merged, a Unicode-aware
> > application should be able to discard all following U+FEFF
> > characters.
>
>The reference to UTF-16 being "not meant for open transfer" and the
>statement about discarding non-initial U+FEFF are not strictly correct
>(because non-initial U+FEFF could also be ZWNBSP), but in the limited
>context of this FAQ they are probably harmless.

I have UTF-16 text files on my machine, but across the Web, e-mail, and
newsgroups you won't see anything but UTF-8. That's what I meant by
"open transfer".

>Finally, I agree with Shlomi that the references to UTF-7 are "not of
>any importance," to the point where I am not sure why UTF-7 is even
>mentioned. As Shlomi points out, Microsoft products do not treat UTF-7
>specially, except that IE recognizes the UTF-7 BOM and sets its encoding
>accordingly (but this is true for any UTF-7 sequence, not just the BOM;
>try loading a text file containing only the 11 ASCII characters
>"M+APw-nchen").

I mentioned UTF-7 (as opposed to UTF-1, which is really obsolete) because it
still appears in current environments: mailers (Outlook & co), browsers
(Netscape, Mozilla, also an option in MSIE if you add the META tag to call
it), conversion routines (Win2K cmd.exe handles UTF-7 when you do "chcp
65000"). I haven't seen any frequent use of it, but mailers definitely still
support it, as I verified in a post of mine to misc.tests.

About "M+APw-nchen", are you quite sure? I drag a text file containing this
string into Internet Explorer 5.0, and it doesn't display the UTF-7 text
converted. It displays "small Latin letter u with diaeresis" correctly when
the text file contains the UTF-7 BOM: "+/v8-M+APw-nchen".
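
For reference, here is what the two strings mean once UTF-7 has already been
chosen as the encoding, sketched with Python 3's standard "utf-7" codec (an
assumption of this example). It says nothing about whether IE 5.0
autodetects UTF-7; it only shows that "+APw-" encodes U+00FC and "+/v8-"
encodes U+FEFF:

```python
plain = b"M+APw-nchen"            # the 11 ASCII characters from the quoted text
with_bom = b"+/v8-M+APw-nchen"    # the same text preceded by the UTF-7 BOM

decoded_plain = plain.decode("utf-7")
decoded_bom = with_bom.decode("utf-7")

print(repr(decoded_plain))
# Python's utf-7 codec does not strip the BOM, so U+FEFF stays in the result.
print(repr(decoded_bom))
```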




This archive was generated by hypermail 2.1.2 : Wed Apr 10 2002 - 01:54:50 EDT