Markus Scherer <markus.scherer@jtcsv.com> wrote:
> notepad always saves unicode-encoded files with the appropriate
> signature byte sequence, like most other microsoft-apps and many
> other well-behaved applications.
>
> they are the first 2 to 4 bytes in the text file, encode U+feff
> in the particular encoding scheme, and are as follows:
>
> utf-8: ef bb bf
> utf-16be: fe ff
> utf-16le: ff fe
> utf-32be: 00 00 fe ff
> utf-32le: ff fe 00 00 (check before utf-16le!)
> scsu: 0e fe ff (unfortunately rather rarely used)
Not even CLOSE to a complete list. From the forthcoming(1) bestseller
"The Quadrature of Unicode":
UTF-1: F7 64 4C
UTF-7: 2B 2F 76 38 2D "+/v8-"
UTF-7d5: BF FB FF
UTF-8C1: BB ED DF
UTF-9: 93 FD FF
UTF-EBCDIC: DD 73 66 73
UTF-mu(2): 9F 9B FF
UCN(3): 5C 75 66 65 66 66 "\ufeff"
DUCK(4): 81 FE FF
Needless to say, most of these additional encoding forms/schemes range
from the sublime to the ridiculous. Don't use any of them in the real
world except UTF-7, UTF-EBCDIC, and UCN, and those only when you must.
(Although I'm considering recommending UTF-1 to people who insist on
C1 transparency and Latin-1 legibility in a UTF.)
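For anyone who wants to act on the two lists above, here is a minimal sketch of signature sniffing in Python. The byte values are copied from the lists as given; the names `SIGNATURES` and `sniff_signature` are made up for illustration, and the joke encodings are left out. Candidates are tried longest first, which takes care of the "check before utf-16le" remark for UTF-32LE.

```python
# Signature bytes as listed in the post, each being U+FEFF in that encoding.
# SIGNATURES and sniff_signature are hypothetical names for this sketch.
SIGNATURES = [
    (bytes.fromhex("efbbbf"), "UTF-8"),
    (bytes.fromhex("feff"), "UTF-16BE"),
    (bytes.fromhex("fffe"), "UTF-16LE"),
    (bytes.fromhex("0000feff"), "UTF-32BE"),
    (bytes.fromhex("fffe0000"), "UTF-32LE"),
    (bytes.fromhex("0efeff"), "SCSU"),
    (bytes.fromhex("f7644c"), "UTF-1"),
    (bytes.fromhex("2b2f76382d"), "UTF-7"),       # "+/v8-" as listed above
    (bytes.fromhex("dd736673"), "UTF-EBCDIC"),
]


def sniff_signature(data: bytes):
    """Return (encoding_name, signature_length), or (None, 0) if none matches.

    Longer signatures are tried first, so ff fe 00 00 is reported as
    UTF-32LE rather than as UTF-16LE.
    """
    by_length = sorted(SIGNATURES, key=lambda pair: len(pair[0]), reverse=True)
    for signature, name in by_length:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


if __name__ == "__main__":
    sample = "\ufeffhello".encode("utf-16-le")    # ff fe 68 00 ...
    print(sniff_signature(sample))
```

A signature match is only a hint. UTF-16LE text whose first character after the signature is U+0000 starts with the same four bytes as UTF-32LE, and a file with no signature at all tells this function nothing.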
Notes:
(1) Don't look for it in your local bookstore any time soon.
(2) Submitted by a fellow list member (along with the book title).
(3) Universal Character Name convention, also known as Java escape
sequences.
(4) Doug's Unicode Compression Kludge, invented in 1996 before I knew
about any of the real UTFs. Nicknamed "UTF-Doug" by Peter
Constable in a 1998 discussion.
-Doug Ewell
Fullerton, California