Re: MS/Unix BOM FAQ again (small fix)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 09 2002 - 22:23:57 EDT

Previous message: Mark Davis: "Re: MS/Unix BOM FAQ again (small fix)"
Maybe in reply to: Shlomi Tal: "MS/Unix BOM FAQ again (small fix)"
Next in thread: Mark Davis: "Re: MS/Unix BOM FAQ again (small fix)"
Next in thread: Doug Ewell: "Re: MS/Unix BOM FAQ again (small fix)"
Reply: Mark Davis: "Re: MS/Unix BOM FAQ again (small fix)"

> I agree, there are different ways to look at it. But the statement
>
> > > > A Unicode text file beginning with FEFF is
> > > > big-endian, and a file beginning with FFFE (not a legal Unicode
> > > > character for any other purpose) is little-endian
>
> is just plain wrong, since UTF-32, for example, could start with bytes
> FE FF.

Um, not legally in open interchange.

Either you have big-endian UTF-32 <FE FF nn mm ..> which would correspond
to U-FEFFnnmm ... -- and that is out-of-range for both Unicode and 10646.

Or you have little-endian UTF-32 <FE FF nn 00 ..> which would correspond
to U-00nnFFFE ..., where nn could be 00..10, but all such values are
noncharacters, and cannot be used in open interchange.

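So if serialized "Unicode text" starts off <FE FF ...> and purports to be legal,
it cannot be UTF-32 in either byte order, it cannot be UTF-8 (the bytes FE and
FF never occur in UTF-8), and it cannot be little-endian UTF-16 (that would
make the first code unit U+FFFE, a noncharacter).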
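
For anyone who wants to check the arithmetic, here is a minimal Python sketch
of the two UTF-32 cases above. The function names (first_utf32_value,
classify) are made up for illustration and are not from any library; the
check only looks at the first four bytes and ignores the rest of the stream.

```python
# Illustrative sketch only: helper names are hypothetical, not a library API.

MAX_CODE_POINT = 0x10FFFF  # upper limit of the Unicode code space


def first_utf32_value(data: bytes, byteorder: str) -> int:
    """Read the first 4 bytes as one UTF-32 code unit ('big' or 'little')."""
    return int.from_bytes(data[:4], byteorder)


def classify(value: int) -> str:
    """Say whether a 32-bit value is usable in open interchange."""
    if value > MAX_CODE_POINT:
        return "out of range"
    # U+FDD0..U+FDEF, plus U+nnFFFE and U+nnFFFF in every plane
    if 0xFDD0 <= value <= 0xFDEF or (value & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "ok"


if __name__ == "__main__":
    # Big-endian reading: <FE FF nn mm> is U-FEFFnnmm, whatever nn and mm are.
    print(classify(first_utf32_value(bytes([0xFE, 0xFF, 0x12, 0x34]), "big")))
    # Little-endian reading: <FE FF nn 00> is U-00nnFFFE for nn in 00..10.
    for nn in range(0x11):
        value = first_utf32_value(bytes([0xFE, 0xFF, nn, 0x00]), "little")
        print(f"U-{value:08X}", classify(value))
```

Neither reading ever yields "ok", which is the point of the argument above.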

--Ken


This archive was generated by hypermail 2.1.2 : Wed Apr 10 2002 - 00:16:17 EDT