Re: MS/Unix BOM FAQ again (small fix)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 09 2002 - 22:23:57 EDT


> I agree, there are different ways to look at it. But the statement
>
> > > > A Unicode text file beginning with FEFF is
> > > > big-endian, and a file beginning with FFFE (not a legal Unicode
> > > > character for any other purpose) is little-endian
>
> is just plain wrong, since UTF-32, for example, could start with bytes
> FE FF.

Um, not legally in open interchange.

Either you have big-endian UTF-32 <FE FF nn mm ..>, which would correspond
to U-FEFFnnmm ... -- and that is out of range for both Unicode and ISO/IEC 10646.

Or you have little-endian UTF-32 <FE FF nn 00 ..>, which would correspond
to U-00nnFFFE ..., where nn could be 00..10, but all such values are
noncharacters and cannot be used in open interchange.

So if serialized "Unicode text" starts off <FE FF ...> and purports to be legal,
it cannot be UTF-32 (in either byte order, per the above), it cannot be UTF-8
(the bytes FE and FF never occur in well-formed UTF-8), and it cannot be
little-endian.
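
The byte arithmetic above can be checked mechanically. What follows is only a
rough sketch, not anything taken from the standard's text: the helper names
(first_utf32_unit, is_noncharacter, classify, utf8_accepts) are made up for
illustration, and the check ignores surrogates, which do not arise in this case.

```python
import struct

MAX_UNICODE = 0x10FFFF  # highest Unicode code point


def first_utf32_unit(data: bytes, little_endian: bool) -> int:
    """Read the first four bytes as one 32-bit code unit (hypothetical helper)."""
    fmt = "<I" if little_endian else ">I"
    return struct.unpack(fmt, data[:4])[0]


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Say why a value could not open a text in open interchange, or 'ok'."""
    if cp > MAX_UNICODE:
        return "out of range"
    if is_noncharacter(cp):
        return "noncharacter"
    return "ok"


def utf8_accepts(data: bytes) -> bool:
    """Let Python's own UTF-8 decoder judge the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Big-endian reading of <FE FF nn mm>: nn, mm chosen arbitrarily here.
    be = first_utf32_unit(bytes([0xFE, 0xFF, 0x12, 0x34]), little_endian=False)
    print(f"UTF-32BE: {be:08X} -> {classify(be)}")

    # Little-endian reading of <FE FF nn 00> for every plane nn = 00..10.
    for nn in range(0x11):
        le = first_utf32_unit(bytes([0xFE, 0xFF, nn, 0x00]), little_endian=True)
        print(f"UTF-32LE: {le:08X} -> {classify(le)}")

    # UTF-8 never contains the bytes FE or FF.
    print("UTF-8 accepts FE FF:", utf8_accepts(b"\xfe\xff"))
```

Under those assumptions, every reading of a leading <FE FF> as UTF-32 or UTF-8
is rejected, which leaves big-endian UTF-16 as the only legal interpretation.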

--Ken


