Re: MS/Unix BOM FAQ again (small fix)

From: Mark Davis (mark@macchiato.com)
Date: Wed Apr 10 2002 - 16:45:04 EDT


Here is what I think the FAQ ought to say:

Suppose you know that the text is Unicode.
- Unicode can be represented in a number of different forms (UTFs):
  - some of them *may* start with a BOM (a byte sequence that would
    correspond to U+FEFF).
  - some cannot (in that case, a byte sequence that would correspond
    to U+FEFF is really a character, not a BOM).
  - none *must* start with a BOM.

- when one of the BOM-allowing UTFs starts with a BOM, you know the
encoding*, and you strip off the BOM when you get the content.

- otherwise, you can use heuristics to detect which UTF the text is
in. These rely on the fact that certain byte combinations are illegal,
or would represent unassigned code points. Because the format of the
UTFs is quite constrained, these heuristics are much faster and more
accurate than general encoding-detection heuristics.
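
The heuristic in the last point could be sketched roughly as follows.
This is only an illustration under stated assumptions: the names
guess_utf and suspicious are made up here, and the scoring rule
(count unassigned, private-use and unusual control characters, then
pick the candidate with the fewest) is one plausible choice, not
something the FAQ prescribes.

```python
import unicodedata
from typing import Optional

# Candidate UTFs in order of preference; ties go to the earlier entry.
CANDIDATES = ("utf-8", "utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le")


def suspicious(text: str) -> int:
    """Count characters that rarely occur in real text (assumed scoring rule)."""
    count = 0
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Cn", "Co"):                    # unassigned or private use
            count += 1
        elif cat == "Cc" and ch not in "\t\n\r":   # control chars such as NUL
            count += 1
    return count


def guess_utf(data: bytes) -> Optional[str]:
    """Guess the UTF of BOM-less bytes; None if no UTF can decode them."""
    best = None
    for name in CANDIDATES:
        try:
            text = data.decode(name)   # strict: illegal sequences raise
        except UnicodeDecodeError:
            continue                   # this UTF is ruled out entirely
        score = suspicious(text)
        if best is None or score < best[0]:
            best = (score, name)
    return best[1] if best else None
```

For example, guess_utf("Line 1\n".encode("utf-16-be")) rules out
UTF-32 (the length is not a multiple of four), penalizes UTF-8 for the
embedded NULs and UTF-16LE for the unassigned code points it would
produce, and so settles on "utf-16-be". Very short or unusual input
can still fool a rule like this.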

*assuming that no UTF-16 file has U+0000 as the first character;
otherwise FF FE 00 00 is ambiguous between UTF-32LE and UTF-16LE.
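
As a minimal sketch of the BOM check described above (the name
sniff_bom and the table layout are illustrative, not taken from any
FAQ text), the signatures can be tried longest first, so that
FF FE 00 00 is read as UTF-32LE before FF FE is read as UTF-16LE. That
ordering builds in the footnote's assumption. The sketch applies only
when the text is not already labelled as one of the forms that cannot
carry a BOM.

```python
import codecs

# Longest signatures first, so the 4-byte UTF-32 BOMs win over the
# 2-byte UTF-16 BOMs that share their leading bytes.
BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding, content) with the BOM stripped, or (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data   # no BOM: fall back to heuristics or outside info
```

The returned content can then be decoded with the returned name, for
instance content.decode(name), since the BOM has already been removed.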

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: <jarkko.hietaniemi@nokia.com>
To: <markus.scherer@jtcsv.com>; <unicode@unicode.org>
Sent: Wednesday, April 10, 2002 12:53
Subject: RE: MS/Unix BOM FAQ again (small fix)

> > If you look for any Unicode signature, then you look for FF
> > FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE).
>
> FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE
> BOM followed by a UTF-16 U+0000. Yes, the NULL is usually not thought
> of as "text", but there's no knowing what data people might be
> storing in UTF-16. So it's back again to either out-of-band
> information or heuristics.
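
The ambiguity in the quoted message can be checked directly. The short
sketch below uses Python's generic "utf-32" and "utf-16" codecs, which
read and drop a leading BOM; the variable names are only illustrative.

```python
data = b"\xff\xfe\x00\x00"

# Read as UTF-32: the four bytes are the little-endian BOM, nothing follows.
as_utf32 = data.decode("utf-32")

# Read as UTF-16: FF FE is the little-endian BOM, 00 00 is U+0000.
as_utf16 = data.decode("utf-16")

print(repr(as_utf32), repr(as_utf16))
```

Both decodes succeed, so the bytes alone cannot settle which reading
was intended.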



This archive was generated by hypermail 2.1.2 : Wed Apr 10 2002 - 15:15:46 EDT