Re: [unicode] More ways to encode U+FEFF (was: Re: Designing a multilin

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Sep 07 2000 - 02:42:31 EDT


David Starner <dvdeug@x8b4e53cd.dhcp.okstate.edu> wrote:

> On Tue, Jul 18, 2000 at 08:47:41PM -0800, Doug Ewell wrote:
>> Not even CLOSE to a complete list. From the forthcoming(1) bestseller
>> "The Quadrature of Unicode":

<snip>

> Do any of these actually use an initial BOM in practice? I'm about
> to write a Unicode signature detector for Ngeadal, and I may as well
> detect anything I can. (And since Ngeadal may end up supporting any
> of the above I can get specs on . . .)

My posting from July 18 was semi-serious.

UTF-1 has been removed from the Unicode Standard. Its advantages of C1
transparency and near-Latin-1 transparency were offset by its use of
7-bit ASCII characters in multibyte sequences and its computational
inefficiency. It has been superseded by UTF-8. In any case, any UTF-1
data that may exist in the real world probably would not have a BOM,
since widespread recommendation of the BOM-as-signature came after the
replacement of UTF-1 by UTF-8.

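Most UTF-7 data probably does not have a BOM either, but if it did, the
exact bytes would not necessarily be 2B 2F 76 38 2D. The 16 bits of
U+FEFF do not fill a whole number of 6-bit base64 groups, so the fourth
byte (and whether a 2D follows) depends on the character immediately
following the BOM.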
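
A minimal sketch of that dependency, in Python. The helper names
utf7_bom_prefix and looks_like_utf7_bom are made up for illustration,
and the sketch assumes the BOM opens a base64 run and that the next
UTF-16 code unit, if there is one, continues the same run:

```python
# Modified base64 alphabet used inside a UTF-7 "+...-" run.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def utf7_bom_prefix(next_unit=None):
    """Return the first four bytes of a UTF-7 run that starts with U+FEFF.

    next_unit is the 16-bit UTF-16 code unit that follows the BOM inside
    the same base64 run, or None if the run ends right after the BOM.
    """
    # 16 bits of U+FEFF plus two more bits make three 6-bit groups.
    bits = 0xFEFF << 2
    if next_unit is not None:
        # The two spare bits are the top two bits of the next code unit.
        bits |= (next_unit >> 14) & 0x3
    chars = [B64[(bits >> shift) & 0x3F] for shift in (12, 6, 0)]
    return ("+" + "".join(chars)).encode("ascii")


def looks_like_utf7_bom(data):
    """Hypothetical detector: '+/v' followed by one of '8', '9', '+', '/'."""
    return data[:3] == b"+/v" and data[3:4] in (b"8", b"9", b"+", b"/")
```

Under those assumptions a detector would match 2B 2F 76 followed by one
of 38, 39, 2B or 2F, rather than the fixed five-byte sequence.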

UTR #16, which specifies UTF-EBCDIC (it may be a UTS by now; I haven't
checked), does specify the use of a BOM-as-signature. So if there is
any UTF-EBCDIC data in the real world, you would probably want to check
for that signature.

UCN data probably will not have a BOM, but the sequence "\ufeff" (and
its case-shifted equivalents, such as "\uFEFF") at the start of a file
could hardly be intended as anything other than a signature.
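
A rough sketch of how a detector might check for these last two cases,
in Python. The function name guess_signature is invented for the
example, and the UTF-EBCDIC byte values are the ones commonly cited for
U+FEFF under UTR #16; they are an assumption here and should be
verified against the report itself:

```python
import re

# Assumed UTF-EBCDIC form of U+FEFF (verify against UTR #16).
UTF_EBCDIC_BOM = bytes([0xDD, 0x73, 0x66, 0x73])

# The ASCII text "\ufeff", with the four hex digits in either case.
UCN_BOM = re.compile(rb"\\u[Ff][Ee][Ff][Ff]")


def guess_signature(data):
    """Return a label for a recognized leading signature, else None."""
    if data.startswith(UTF_EBCDIC_BOM):
        return "UTF-EBCDIC"
    if UCN_BOM.match(data):  # match() anchors at the start of the data
        return "UCN"
    return None
```

This only looks at the very start of the data; a real detector would
presumably test the standard UTF-8 and UTF-16 signatures first.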

All the others are private or semi-private experiments, and regardless
of their merits or faults, you will almost certainly never encounter
any real-world data encoded in them.

-Doug Ewell
 Fullerton, California


