Re: MS/Unix BOM FAQ again (small fix)

From: Andy Heninger (andyh@jtcsv.com)
Date: Tue Apr 09 2002 - 18:43:58 EDT


It looks to me like Shlomi's chart and Mark's chart for interpreting
the BOMs are describing slightly different situations.

Mark's table assumes that you have the BOM and some additional
indication of the data's encoding - a charset= declaration, or an
XML encoding declaration, or whatever. The chart then indicates
whether the BOM is consistent with the declared encoding and whether
a ZWNBSP should be retained.

The other table would make sense for use on data when no other
indication of the encoding is available. In that case a distinction
between, for example, UTF-16 and UTF-16LE is not possible from the
leading bytes alone.
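
To make that second situation concrete, here is a rough sketch in
Python of sniffing the leading bytes when nothing else is known. The
function name sniff_bom and the (form, byte order) return shape are
made up for illustration, and the 00 00 FE FF case is not in either
chart; it is included only as the big-endian counterpart of
FF FE 00 00.

```python
from typing import Optional, Tuple


def sniff_bom(data: bytes) -> Optional[Tuple[str, str]]:
    """Guess (encoding form, byte order) from leading bytes alone.

    Hypothetical helper: with no declared encoding it can only report
    a byte order; it cannot tell "UTF-16" from "UTF-16LE".
    """
    # Test the 4-byte signatures first, because FF FE 00 00 also
    # starts with the 2-byte signature FF FE.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return ("UTF-32", "big")
    if data.startswith(b"\xff\xfe\x00\x00"):
        # Ambiguous in principle: this could also be little-endian
        # UTF-16 whose first character after the BOM is U+0000.
        return ("UTF-32", "little")
    if data.startswith(b"\xfe\xff"):
        return ("UTF-16", "big")
    if data.startswith(b"\xff\xfe"):
        return ("UTF-16", "little")
    # No recognized signature; the caller needs some other indication.
    return None
```

Something like sniff_bom(open(path, "rb").read(4)) would be enough
to drive it, since no signature checked here is longer than four
bytes.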

   -- Andy Heninger
      heninger@us.ibm.com

Mark Davis wrote:
> Shlomi Tal wrote:
> > A Unicode text file beginning with FEFF is
> > big-endian, and a file beginning with FFFE (not a legal Unicode
> > character for any other purpose) is little-endian.
>
> This is incorrect. Here is a summary of the meaning of those bytes at
> the start of text files with different Unicode encoding forms.
>
> beginning with bytes FE FF:
> - UTF-16 => big endian, omitted from contents
> - UTF-16BE => ZWNBSP
> - UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE => malformed, file
> corrupted
>
> beginning with bytes FF FE:
> - UTF-16 => little endian, omitted from contents
> - UTF-16LE => ZWNBSP
> - UTF-32 => little endian (if followed by bytes 00 00), omitted from
> contents
> - UTF-32LE => different code points, depending on following bytes
> - UTF-16BE, UTF-8, UTF-32BE => malformed, file corrupted
>
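
The quoted table can be read as a lookup keyed on the declared
encoding and the leading bytes. The following Python is only a
sketch of that reading: the names BOM_TABLE and
interpret_leading_bytes are invented for illustration, and treating
FF FE without a following 00 00 as malformed under a UTF-32
declaration is an assumption, since the table does not state that
case.

```python
from typing import Optional

MALFORMED = "malformed, file corrupted"

KNOWN = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
         "UTF-32", "UTF-32BE", "UTF-32LE"}

# Leading bytes -> {declared encoding: meaning}, transcribed from the
# quoted table. Encodings not listed for a signature are malformed.
BOM_TABLE = {
    b"\xfe\xff": {
        "UTF-16": "big endian, omitted from contents",
        "UTF-16BE": "ZWNBSP",
    },
    b"\xff\xfe": {
        "UTF-16": "little endian, omitted from contents",
        "UTF-16LE": "ZWNBSP",
        "UTF-32": "little endian, omitted from contents",
        "UTF-32LE": "different code points, depending on following bytes",
    },
}


def interpret_leading_bytes(declared: str, data: bytes) -> Optional[str]:
    """Return the table's verdict, or None if the data starts with
    neither FE FF nor FF FE (a case the table does not cover)."""
    declared = declared.upper()
    if declared not in KNOWN:
        raise ValueError("encoding not covered by the table: " + declared)
    for signature, meanings in BOM_TABLE.items():
        if data.startswith(signature):
            # Assumption: the UTF-32 row applies only when 00 00
            # follows; anything else is treated as malformed here.
            if (declared == "UTF-32" and signature == b"\xff\xfe"
                    and data[2:4] != b"\x00\x00"):
                return MALFORMED
            return meanings.get(declared, MALFORMED)
    return None
```

For example, interpret_leading_bytes("UTF-16BE", b"\xfe\xff") gives
"ZWNBSP", while the same bytes declared as UTF-8 give the malformed
verdict.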


