Re: Unicode and Kermit

From: John Cowan (cowan@locke.ccil.org)
Date: Mon Aug 09 1999 - 13:05:25 EDT

Next message: lisam@us.ibm.com: "Reminder - 15th International Unicode Conference"
Previous message: Mark Leisher: "Re: ISIRI-3342 to UNICODE"
In reply to: Frank da Cruz: "Re: Unicode and Kermit"
Next in thread: Mark Davis: "Re: Unicode and Kermit"

Frank da Cruz scripsit:

> How about when receiving data to be stored as UCS-2? Which byte order
> should be used BY DEFAULT? If I am receiving the file on (say) Windows 98
> (which is Intel only), should I store it with little-endian byte order?
> Whereas on a Sparc (with any OS) I should write big-endian?

In that case I think the native order should win.
But never write a little-endian file without a BOM.
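As a rough illustration of that rule (native order, always with a BOM), here is a minimal Python sketch. The function name `encode_ucs2_native` is made up for this example, and it relies on UCS-2 and UTF-16 being byte-identical for text with no characters outside the BMP.

```python
import codecs
import sys


def encode_ucs2_native(text: str) -> bytes:
    """Encode text as UTF-16 in the host's byte order, always led by a BOM.

    Hypothetical helper, not part of Kermit or any library. For text limited
    to the Basic Multilingual Plane the result is also valid UCS-2.
    """
    if sys.byteorder == "little":
        bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"   # FF FE
    else:
        bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"   # FE FF
    # The explicit-endian codecs never emit a BOM themselves, so add it here.
    return bom + text.encode(codec)
```

Writing the BOM unconditionally costs two bytes and lets a reader on the other kind of machine work out which order was used.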

> When you say "a BOM will be present" does this mean BOMs are mandatory for
> UCS-2/UTF-16 files on Windows 95/98/NT/2000? Is there a reference for this?

What is "mandatory"? Windows NT Notepad, with which Win32 users should
expect to interoperate, always writes LE with a BOM. Bogusly, it does
not detect a BE BOM and swap.

> For example, do Windows NT on Intel and MIPS (before it was canceled) use
> the same or opposite byte order (MIPS is big-endian)? If they use opposite
> byte order, what would that mean for file sharing? Ditto for (say) NFS
> mounts between platforms of opposite endianness.

NT on MIPS put the MIPS chip into LE mode.

> In the real world, do UCS-2 files always start with a BOM? Do all
> applications that handle UCS-2 handle the BOM and swap bytes if necessary?

Alas, no (see above). But they should.
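What "handle the BOM and swap bytes if necessary" amounts to on the reading side can be sketched in a few lines of Python. The name `decode_utf16_with_bom` is hypothetical, and treating a BOM-less file as big-endian is only an assumption for this example; a real application might prefer its native order.

```python
import codecs


def decode_utf16_with_bom(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, choosing the byte order from a leading BOM.

    Hypothetical helper. The `default` codec for BOM-less input is an
    assumption of this sketch, not something the thread settles.
    """
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE: little-endian
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF: big-endian
        return data[2:].decode("utf-16-be")
    # No BOM at all: fall back to the caller's default byte order.
    return data.decode(default)
```

A reader written this way accepts files from either kind of machine, which is exactly what Notepad fails to do with a BE BOM.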

> > ... The swapped BOM is a non-character, so it can't appear in a
> > well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
> > file that appears to begin with a U+FFFE.
> >
> You shouldn't swap UTF-8 anyway, right? In fact, what's the point of the
> UTF-8 BOM since byte order is not an issue with UTF-8? (And since, after
> all, any file can begin with EF BB BF, or FFFE for that matter...)

In principle, yes. But neither of these is a *probable* sequence.
Some people want to see a UTF-8 BOM so they have more assurance that
a file really is UTF-8, but this is not standardized.

> I suppose on a file system that contains only Unicode text, the BOMs serve
> to identify the transformation format, but on mixed file systems they are
> not a good indicator of anything unless we already know that it's Unicode
> text.

They are a probabilistic indicator of Unicode text, which as such is
very helpful. Windows NT Notepad assumes LE UTF-16 if it sees an LE BOM,
and otherwise assumes the current 8-bit code page.
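A sniffing routine along those lines, extended to recognize the BE and UTF-8 signatures as well, might look like the Python sketch below. The name `sniff_text_encoding` and the `cp1252` fallback are assumptions for illustration only; the check is a heuristic, since a non-Unicode file can in principle begin with the same bytes.

```python
import codecs


def sniff_text_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a Python codec name from the leading bytes of a file.

    Hypothetical helper. A BOM is treated as probable, not certain, evidence
    of Unicode text; `fallback` stands in for the current 8-bit code page.
    """
    if data.startswith(codecs.BOM_UTF8):          # EF BB BF
        return "utf-8-sig"                        # decoder strips the signature
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"                           # decoder reads the BOM for byte order
    return fallback


# Usage: decode with whatever the sniffer suggests.
sample = b"\xfe\xff\x00h\x00i"
text = sample.decode(sniff_text_encoding(sample))
```

The sketch does not try to tell UTF-16 LE apart from encodings that share the FF FE prefix, so it should be read as a starting point rather than a complete detector.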

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin

