Re: [unicode] More ways to encode U+FEFF (was: Re: Designing a multilingual

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 06 2000 - 14:52:57 EDT


David Starner asked:

> On Tue, Jul 18, 2000 at 08:47:41PM -0800, Doug Ewell wrote:
> > Not even CLOSE to a complete list. From the forthcoming(1) bestseller
> > "The Quadrature of Unicode":
> >
> > UTF-1: F7 64 4C
> > UTF-7: 2B 2F 76 38 2D "+/v8-"
> > UTF-7d5: BF FB FF
> > UTF-8C1: BB ED DF
> > UTF-9: 93 FD FF
> > UTF-EBCDIC: DD 73 66 73
> > UTF-mu(2): 9F 9B FF
> > UCN(3): 5C 75 66 65 66 66 "\ufeff"
> > DUCK(4): 81 FE FF
>
> Do any of these actually use an initial BOM in practice?

None of these except UTF-EBCDIC, UTF-7 (which is deprecated),
and the UCNs are used in practice. And UCNs are not really
an encoding scheme, but an escaping mechanism, comparable to
numeric character references. I.e., "5C 75 66 65 66 66" is
actually in UTF-8, but given the correct parser, a preprocessing
step on the text can pick this sequence out as an escaped reference
to another character, and replace it with the actual character,
namely U+FEFF (or 0xEF 0xBB 0xBF in this UTF-8 character stream).
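
To make the escaping point concrete, here is a minimal sketch in Python of
such a preprocessing step. The function name unescape_ucn and the simplified
four-digit \uXXXX syntax are assumptions for illustration only; real UCN
parsers (for example in C or Java compilers) follow their own language rules,
and the eight-digit \UXXXXXXXX form and surrogate handling are left out.

```python
import re

# Simplified, hypothetical UCN syntax: a backslash, 'u', and exactly four
# hex digits. The eight-digit \UXXXXXXXX form is deliberately not handled.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})")

def unescape_ucn(text: str) -> str:
    """Replace each \\uXXXX escape with the character it names."""
    return _UCN.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = bytes.fromhex("5C 75 66 65 66 66")  # the UCN bytes from the list above
escaped = raw.decode("utf-8")              # six ordinary characters: \ufeff
actual = unescape_ucn(escaped)             # a single character, U+FEFF

# Re-encoding the unescaped character gives the real UTF-8 form of U+FEFF.
print(actual.encode("utf-8").hex(" ").upper())  # EF BB BF
```

The input bytes decode as plain UTF-8 before any unescaping happens, which is
why the escape is a layer on top of the encoding rather than an encoding
scheme of its own.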

> I'm about
> to write a Unicode signature detector for Ngeadal, and I may as well
> detect anything I can. (And since Ngeadal may end up supporting any
> of the above I can get specs on . . .)

Not a good idea. Stick to support of the standard encoding schemes.
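
A signature detector restricted that way can be quite small. The following
is a rough sketch in Python, not anything from Ngeadal. The function name
detect_signature is hypothetical, and the choice of UTF-8, UTF-16 and UTF-32
as the schemes to cover is an assumption, since the list is not spelled out
here.

```python
from typing import Optional

# Assumed set of signatures: U+FEFF as serialized in UTF-8, UTF-16 and UTF-32.
# Longer signatures come first because the UTF-32LE signature (FF FE 00 00)
# begins with the UTF-16LE signature (FF FE).
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

def detect_signature(data: bytes) -> Optional[str]:
    """Return the scheme name whose signature starts `data`, or None."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None

print(detect_signature(b"\xef\xbb\xbfhello"))  # UTF-8
print(detect_signature(b"hello"))              # None
```

One ambiguity remains even in this small table: FF FE 00 00 could also be
UTF-16LE text that starts with a BOM followed by U+0000. The sketch resolves
it in favor of UTF-32LE, which is a judgment call.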

--Ken


