Re: MS/Unix BOM FAQ again (small fix)

From: Mark Davis (mark@macchiato.com)
Date: Tue Apr 09 2002 - 21:26:07 EDT


I agree, there are different ways to look at it. But the statement

> > > A Unicode text file beginning with FEFF is
> > > big-endian, and a file beginning with FFFE (not a legal Unicode
> > > character for any other purpose) is little-endian

is just plain wrong, since UTF-32, for example, could start with bytes
FF FE.

If you have no information at all about the encoding, then you have to
use some interesting heuristics to try to determine the encoding. If
you know at least that it is Unicode, and you know that it is one of
the UTFs that can take a BOM (UTF-8, UTF-16, UTF-32), then it is
easier -- except that none of those forms *have* to have a BOM.
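
A minimal sketch of the "known to be Unicode, but which UTF?" case,
assuming Python and its standard codecs module. The function name
sniff_bom and the returned labels are illustrative only. A result of
(None, 0) means no BOM was found, not that the data is not Unicode, and
FF FE 00 00 could also be UTF-16 with a BOM followed by U+0000, so the
answer is a guess rather than a proof.

```python
import codecs

# Longest signatures first: the UTF-32 little-endian BOM (FF FE 00 00)
# begins with the UTF-16 little-endian BOM (FF FE), so testing UTF-16
# first would misreport it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return (guessed byte serialization, BOM length), or (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

If no BOM is found, the caller still has to fall back on a declared
encoding or on heuristics.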

Mark

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Andy Heninger" <andyh@jtcsv.com>
To: <unicode@unicode.org>
Sent: Tuesday, April 09, 2002 15:43
Subject: Re: MS/Unix BOM FAQ again (small fix)

> It looks to me like Shlomi's chart and Mark's chart for interpreting
> the BOMs are describing slightly different situations.
>
> Mark's table assumes that you have the BOM and some other additional
> indication of the data's encoding - a charset= declaration, or an
> XML encoding declaration, or whatever. The chart
> then indicates whether the BOM is consistent with the declared
> encoding and whether a ZWNBSP should be retained.
>
> The other table would make sense for use on data when no other
> indication of the encoding is available. A distinction between,
> for example, UTF-16 and UTF-16LE is not possible.
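
To illustrate why that distinction matters, here is a small sketch
assuming Python's built-in "utf-16" and "utf-16-le" codecs. The same
four bytes yield different text depending on which label is declared,
and nothing in the bytes themselves says which label was intended.

```python
# FF FE followed by the little-endian code unit for "A".
data = b"\xff\xfeA\x00"

# Declared as UTF-16: FF FE is a byte order mark and is dropped.
as_utf16 = data.decode("utf-16")

# Declared as UTF-16LE: FF FE is U+FEFF (ZWNBSP) and stays in the text.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16), repr(as_utf16le))
```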
>
> -- Andy Heninger
> heninger@us.ibm.com
>
>
> Mark Davis wrote:
> > Shlomi Tal wrote:
> > > A Unicode text file beginning with FEFF is
> > > big-endian, and a file beginning with FFFE (not a legal Unicode
> > > character for any other purpose) is little-endian.
> >
> > This is incorrect. Here is a summary of the meaning of those bytes
> > at the start of text files with different Unicode encoding forms.
> >
> > beginning with bytes FE FF:
> > - UTF-16 => big endian, omitted from contents
> > - UTF-16BE => ZWNBSP
> > - UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE => malformed, file
> > corrupted
> >
> > beginning with bytes FF FE:
> > - UTF-16 => little endian, omitted from contents
> > - UTF-16LE => ZWNBSP
> > - UTF-32 => little endian (if followed by bytes 00 00), omitted
> > from contents
> > - UTF-32LE => different code points, depending on following bytes
> > - UTF-16BE, UTF-8, UTF-32BE => malformed, file corrupted
> >
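
The table above can be sketched as a lookup, assuming Python. The names
ENCODINGS, TABLE and interpret_prefix are hypothetical, and the result
strings are copied from the table. One case is an assumption of the
sketch rather than something the table states: declared UTF-32 starting
with FF FE but not followed by 00 00 is reported as malformed.

```python
ENCODINGS = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
             "UTF-32", "UTF-32BE", "UTF-32LE"}

MALFORMED = "malformed, file corrupted"

# first two bytes -> {declared encoding: meaning}; absent entries are malformed.
TABLE = {
    b"\xfe\xff": {
        "UTF-16": "big endian, omitted from contents",
        "UTF-16BE": "ZWNBSP",
    },
    b"\xff\xfe": {
        "UTF-16": "little endian, omitted from contents",
        "UTF-16LE": "ZWNBSP",
        "UTF-32LE": "different code points, depending on following bytes",
    },
}


def interpret_prefix(data: bytes, declared: str) -> str:
    """Look up what a leading FE FF or FF FE means for a declared encoding."""
    if declared not in ENCODINGS:
        raise ValueError(f"not covered by the table: {declared!r}")
    prefix = data[:2]
    if prefix not in TABLE:
        return "table does not apply"
    if prefix == b"\xff\xfe" and declared == "UTF-32":
        # Assumption: without the trailing 00 00 this is treated as malformed.
        if data[2:4] == b"\x00\x00":
            return "little endian, omitted from contents"
        return MALFORMED
    return TABLE[prefix].get(declared, MALFORMED)
```

For example, interpret_prefix(b"\xff\xfeA\x00", "UTF-16BE") reports the
file as malformed, while the same bytes declared as UTF-16 are little
endian with the BOM omitted from the contents.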



This archive was generated by hypermail 2.1.2 : Tue Apr 09 2002 - 22:22:23 EDT