More ways to encode U+FEFF (was: Re: Designing a multilingual web site)

From: Doug Ewell (dewell@compuserve.com)
Date: Wed Jul 19 2000 - 00:52:43 EDT


Markus Scherer <markus.scherer@jtcsv.com> wrote:

> notepad always saves unicode-encoded files with the appropriate
> signature byte sequence, like most other microsoft-apps and many
> other well-behaved applications.
>
> they are the first 2 to 4 bytes in the text file, encode U+feff
> in the particular encoding scheme, and are as follows:
>
> utf-8: ef bb bf
> utf-16be: fe ff
> utf-16le: ff fe
> utf-32be: 00 00 fe ff
> utf-32le: ff fe 00 00 (check before utf-16le!)
> scsu: 0e fe ff (unfortunately rather rarely used)
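
For anyone who wants to act on the quoted list, here is a minimal Python sketch of signature sniffing. The name sniff_signature and the table layout are my own assumptions for illustration, not anything notepad or another application is known to do. The point it shows is the ordering: the 4-byte UTF-32 signatures are tried before the 2-byte UTF-16 ones, since ff fe is a prefix of ff fe 00 00.

```python
import codecs

# Signatures from the quoted list. The 4-byte UTF-32 entries come first
# so that UTF-32LE (ff fe 00 00) is tested before its prefix UTF-16LE (ff fe).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32be"),   # 00 00 fe ff
    (codecs.BOM_UTF32_LE, "utf-32le"),   # ff fe 00 00
    (codecs.BOM_UTF8, "utf-8"),          # ef bb bf
    (b"\x0e\xfe\xff", "scsu"),           # 0e fe ff
    (codecs.BOM_UTF16_BE, "utf-16be"),   # fe ff
    (codecs.BOM_UTF16_LE, "utf-16le"),   # ff fe
]


def sniff_signature(data: bytes):
    """Return (scheme name, signature length), or (None, 0) if no match.

    Hypothetical helper: it only looks at the leading bytes and makes
    no attempt to validate the rest of the data.
    """
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

One caveat with this ordering: UTF-16LE text that begins with U+FEFF followed by U+0000 has the same first four bytes as the UTF-32LE signature, so a sniffer like this one will report it as utf-32le.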

Not even CLOSE to a complete list. From the forthcoming(1) bestseller
"The Quadrature of Unicode":

UTF-1: F7 64 4C
UTF-7: 2B 2F 76 38 2D "+/v8-"
UTF-7d5: BF FB FF
UTF-8C1: BB ED DF
UTF-9: 93 FD FF
UTF-EBCDIC: DD 73 66 73
UTF-mu(2): 9F 9B FF
UCN(3): 5C 75 66 65 66 66 "\ufeff"
DUCK(4): 81 FE FF

Needless to say, most of these additional encoding forms/schemes range
from the sublime to the ridiculous. Don't use any of them in the real
world except UTF-7, UTF-EBCDIC, and UCN, and those only when you must.
(Although I'm considering recommending UTF-1 to people who insist on
C1 transparency and Latin-1 legibility in a UTF.)
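
Two of the entries above, UTF-7 and UCN, can be reproduced with stock Python, which is a quick way to check the table. This is only a sketch: the helper name hex_bytes is made up for illustration, and I am assuming that the "backslashreplace" error handler is an acceptable stand-in for the UCN convention. The other entries have no codec in the standard library that I know of.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, as in the table."""
    return " ".join(f"{b:02X}" for b in data)


BOM = "\ufeff"

# UTF-7: U+FEFF alone is written as a base64 run, "+/v8", closed by "-".
utf7 = BOM.encode("utf-7")

# UCN: ASCII cannot represent U+FEFF, so the backslashreplace handler
# emits the six-character escape \ufeff instead.
ucn = BOM.encode("ascii", errors="backslashreplace")

print("UTF-7:", hex_bytes(utf7), utf7.decode("ascii"))
print("UCN:  ", hex_bytes(ucn), ucn.decode("ascii"))
```

Note that the trailing "-" in the UTF-7 form only appears when U+FEFF stands alone. When more text follows, the base64 run continues and the bytes after "+/v8" depend on the next character.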

Notes:
(1) Don't look for it in your local bookstore any time soon.
(2) Submitted by a fellow list member (along with the book title).
(3) Universal Character Name convention, also known as Java escape
     sequences.
(4) Doug's Unicode Compression Kludge, invented in 1996 before I knew
     about any of the real UTFs. Nicknamed "UTF-Doug" by Peter
     Constable in a 1998 discussion.

-Doug Ewell
 Fullerton, California


