RE: Unicode and Kermit

From: F. Avery Bishop (Exchange) (averyb@exchange.microsoft.com)
Date: Fri Aug 13 1999 - 14:15:39 EDT


Notepad on Windows 2000 does detect a BE BOM and read the file correctly. It also
writes the file back out in the same format it was read in by default, but
you can override that to be UTF-8 or BE Unicode if you want.

F. Avery Bishop
Program Manager, International Evangelism
averyb@microsoft.com

-----Original Message-----
From: John Cowan [mailto:cowan@locke.ccil.org]
Sent: Monday, August 09, 1999 9:35 AM
To: Unicode List
Subject: Re: Unicode and Kermit

Frank da Cruz scripsit:

> How about when receiving data to be stored as UCS-2? Which byte order
> should be used BY DEFAULT? If I am receiving the file on (say) Windows 98
> (which is Intel only), should I store it with little-endian byte order?
> Whereas on a Sparc (with any OS) I should write big-endian?

In that case I think the native order should win.
But never write a little-endian file without a BOM.
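
A minimal sketch of that rule in Python, for illustration only: write in
the machine's native byte order and always lead with a BOM. The function
name encode_utf16_native is hypothetical; sys.byteorder and the
codecs.BOM_UTF16_* constants come from the Python standard library.

```python
import codecs
import sys


def encode_utf16_native(text: str) -> bytes:
    """Encode text as UTF-16 in this machine's byte order, always with a BOM."""
    if sys.byteorder == "little":
        # Little-endian host: BOM bytes FF FE, then LE code units.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    # Big-endian host: BOM bytes FE FF, then BE code units.
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

Writing the BOM on big-endian hosts as well goes slightly beyond the rule
above, but it keeps the reader's job the same on every platform.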

> When you say "a BOM will be present" does this mean BOMs are mandatory for
> UCS-2/UTF-16 files on Windows 95/98/NT/2000? Is there a reference for
> this?

What is "mandatory"? Windows NT Notepad, with which Win32 users should
expect to interoperate, always writes LE with a BOM. Bogusly, it does
not detect a BE BOM and swap.

> For example, do Windows NT on Intel and MIPS (before it was canceled) use
> the same or opposite byte order (MIPS is big-endian)? If they use opposite
> byte order, what would that mean for file sharing? Ditto for (say) NFS
> mounts between platforms of opposite endianness.

NT on MIPS put the MIPS chip into LE mode.

> In the real world, do UCS-2 files always start with a BOM? Do all
> applications that handle UCS-2 handle the BOM and swap bytes if necessary?

Alas, no (see above). But they should.
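
What "handle the BOM and swap if necessary" could look like on the reading
side, as a Python sketch rather than any particular application's code.
The name read_utf16 and its default_order parameter are assumptions made
for this example; the fallback used when no BOM is present is a policy
choice, not something the standard dictates here.

```python
def read_utf16(data: bytes, default_order: str = "big") -> str:
    """Decode UTF-16 bytes, honoring a BOM in either byte order."""
    if data[:2] == b"\xff\xfe":
        # U+FEFF stored little-endian: decode the rest as LE.
        return data[2:].decode("utf-16-le")
    if data[:2] == b"\xfe\xff":
        # U+FEFF stored big-endian: decode the rest as BE.
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the caller's assumption (hypothetical default).
    codec = "utf-16-le" if default_order == "little" else "utf-16-be"
    return data.decode(codec)
```

Choosing the codec from the BOM has the same effect as byte-swapping a
file written in the opposite order.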

> > ... The swapped BOM is a non-character, so it can't appear in a
> > well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
> > file that appears to begin with a U+FFFE.
> >
> You shouldn't swap UTF-8 anyway, right? In fact, what's the point of the
> UTF-8 BOM since byte order is not an issue with UTF-8? (And since, after
> all, any file can begin with EF BB BF, or FFFE for that matter...)

In principle, yes. But neither of these is a *probable* sequence.
Some people want to see a UTF-8 BOM so they have more assurance that
a file really is UTF-8, but this is not standardized.
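
For illustration, a small Python sketch of treating EF BB BF as an optional
signature: strip it if present and report whether it was there. The
function name strip_utf8_signature is hypothetical; codecs.BOM_UTF8 is the
standard-library constant for those three bytes. Finding the signature
only raises confidence that the file is UTF-8, it does not prove it.

```python
import codecs
from typing import Tuple


def strip_utf8_signature(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_signature) for bytes that may start with EF BB BF."""
    if data.startswith(codecs.BOM_UTF8):
        # Drop the three signature bytes; no byte swapping is ever needed.
        return data[len(codecs.BOM_UTF8):], True
    return data, False
```

Python's "utf-8-sig" codec does the stripping implicitly on decode, if the
flag is not needed.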

> I suppose on a file system that contains only Unicode text, the BOMs serve
> to identify the transformation format, but on mixed file systems they are
> not a good indicator of anything unless we already know that it's Unicode
> text.

They are a probabilistic indicator of Unicode text, which as such is
very helpful. Windows NT Notepad assumes LE UTF-16 if it sees an LE BOM,
otherwise the current 8-bit code page.
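
The heuristic just described can be sketched in Python as follows. This is
a model of the behavior as stated above, not Notepad's actual code; the
name sniff_like_nt_notepad is hypothetical, and cp1252 merely stands in
for "the current 8-bit code page".

```python
def sniff_like_nt_notepad(data: bytes, code_page: str = "cp1252") -> str:
    """Decode as LE UTF-16 if an LE BOM is present, else as an 8-bit code page."""
    if data[:2] == b"\xff\xfe":
        # LE BOM seen: treat the rest as little-endian UTF-16.
        return data[2:].decode("utf-16-le")
    # Anything else, including a BE BOM, is read as 8-bit text.
    return data.decode(code_page, errors="replace")
```

Under this model a big-endian file with a BOM comes out as 8-bit garbage,
which is the failure complained about earlier.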

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin


