David Starner <dvdeug@x8b4e53cd.dhcp.okstate.edu> wrote:
> On Tue, Jul 18, 2000 at 08:47:41PM -0800, Doug Ewell wrote:
>> Not even CLOSE to a complete list. From the forthcoming(1) bestseller
>> "The Quadrature of Unicode":
<snip>
> Do any of these actually use an initial BOM in practice? I'm about
> to write a Unicode signature detector for Ngeadal, and I may as well
> detect anything I can. (And since Ngeadal may end up supporting any
> of the above I can get specs on . . .)

My posting from July 18 was semi-serious.

UTF-1 has been removed from the Unicode Standard. Its advantages of C1
transparency and near-Latin-1 transparency were offset by its use of
7-bit ASCII characters in multibyte sequences and its computational
inefficiency. It has been superseded by UTF-8. In any case, any UTF-1
data that may exist in the real world probably would not have a BOM,
since widespread recommendation of the BOM-as-signature came after the
replacement of UTF-1 by UTF-8.
Most UTF-7 data probably does not have a BOM either, but if it did, the
exact bytes would not necessarily be 2B 2F 76 38 2D. The first three
bytes are fixed, but the fourth (38, 39, 2B, or 2F) and anything after
it depend on the character immediately following the BOM.
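
To see why the UTF-7 bytes vary, here is a minimal Python sketch. It
assumes the BOM and the following character are encoded in the same
base64 run, which RFC 2152 permits but does not require. The helper
names utf7_run and looks_like_utf7_bom are made up for illustration
and do not come from any library.

```python
import base64

def utf7_run(text):
    """Encode text as one UTF-7 shifted run: '+', modified base64 of UTF-16BE, '-'."""
    raw = text.encode("utf-16-be")
    # Modified base64 is ordinary base64 without the '=' padding.
    return b"+" + base64.b64encode(raw).rstrip(b"=") + b"-"

def looks_like_utf7_bom(data):
    """True if data starts with 2B 2F 76 followed by 38, 39, 2B or 2F."""
    return data[:3] == b"+/v" and data[3:4] in (b"8", b"9", b"+", b"/")

# The BOM alone, then the BOM followed by characters whose leading
# two bits are 00, 01, 10 and 11 respectively.
for follower in ("", "\u00e9", "\u4e2d", "\uac00", "\uff21"):
    encoded = utf7_run("\ufeff" + follower)
    print(encoded[:4].hex(" "), looks_like_utf7_bom(encoded))
```

The fourth base64 character carries the last four bits of U+FEFF plus
the top two bits of the next UTF-16 code unit, so a detector can match
2B 2F 76 exactly and then accept any of the four possible fourth bytes.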
UTR #16, which specifies UTF-EBCDIC (it may be a UTS by now; I haven't
checked), does specify the use of a BOM-as-signature. So if there is
any UTF-EBCDIC data in the real world, you would probably want to check
for that signature.
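
A table-driven check along these lines might be enough for the fixed
signatures. This is only a sketch: the UTF-EBCDIC bytes DD 73 66 73
are my reading of UTR #16 and should be verified against the report,
the UTF-8, UTF-16 and UTF-32 rows are my addition for context, and the
names SIGNATURES and detect_signature are hypothetical.

```python
# (name, signature bytes). Longer signatures come first so that
# UTF-32LE (FF FE 00 00) is not reported as UTF-16LE (FF FE).
SIGNATURES = [
    ("UTF-32BE", bytes.fromhex("0000FEFF")),
    ("UTF-32LE", bytes.fromhex("FFFE0000")),
    ("UTF-EBCDIC", bytes.fromhex("DD736673")),  # assumed; check UTR #16
    ("UTF-8", bytes.fromhex("EFBBBF")),
    ("UTF-16BE", bytes.fromhex("FEFF")),
    ("UTF-16LE", bytes.fromhex("FFFE")),
]

def detect_signature(data):
    """Return (encoding name, signature length), or None if nothing matches."""
    for name, sig in SIGNATURES:
        if data.startswith(sig):
            return name, len(sig)
    return None

print(detect_signature(bytes.fromhex("DD736673C1")))
print(detect_signature(b"plain ASCII"))
```

Checking the longest signatures first is a design choice, not a rule:
FF FE 00 00 could also be a UTF-16LE BOM followed by U+0000, so a real
detector may want to look further into the data before deciding.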
UCN data probably will not have a BOM, but the sequence "\ufeff" (and
case-shifted equivalents) certainly seems as though it could only be
intended to have that meaning.
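
Detecting the UCN form could be as simple as the sketch below. It
assumes the C-style notation, where \u takes four hex digits and \U
takes eight. Accepting the eight-digit form is my own guess at what a
detector might want, and the names UCN_BOM and starts_with_ucn_bom are
hypothetical.

```python
import re

# A backslash, then either 'u' plus FEFF or 'U' plus 0000FEFF,
# with the hex letters accepted in either case.
UCN_BOM = re.compile(rb"\\(?:u[Ff][Ee][Ff][Ff]|U0000[Ff][Ee][Ff][Ff])")

def starts_with_ucn_bom(data):
    """True if the byte string begins with a UCN spelling of U+FEFF."""
    # match() only succeeds at the very start of the data.
    return UCN_BOM.match(data) is not None

print(starts_with_ucn_bom(b"\\uFEFFint main(void)"))
print(starts_with_ucn_bom(b"int main(void)"))
```

The 'u' itself is deliberately not case-folded, since \Ufeff with only
four digits would not be a well-formed UCN in that notation.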
All the others are private or semi-private experiments, and regardless
of their merits or faults, you will almost certainly never encounter
any real-world data encoded in them.
-Doug Ewell
Fullerton, California