From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jan 22 2004 - 16:11:44 EST
From: <jcowan@reutershealth.com>
> Mark Crispin's UTF-9 (not to be confused with Jerome Abela's) is also
> excellent, although most of us don't have 36-bit systems, for which it
> makes sense. A precis:
>
> Code points (base 2)     UTF-9 code units (base 2)
> 0000000000000abcdefgh    0abcdefgh
> 00000abcdefghijklmnop    1abcdefgh 0ijklmnop
> abcdefghijklmnopqrstu    1000abcde 1fghijklm 0nopqrstu
>
> This is almost as good as Latin-1 for its repertoire, only minutely worse
> than UTF-16 for the rest of the BMP, and beats all other encodings for the
> other planes.
Is the other, competing UTF-9 from Jerome Abela this one?
21-bit code points (base 2) -> 9-bit UTF-9 code units (base 2)
0000000000000hgfedcba -> 0hgfedcba                      (Latin1: 8 bits)
000000onmlkjihgfedcba -> 10onmlkji 0hgfedcba            (low half-BMP: 15 bits)
utsrqponmlkjihgfedcba -> 110utsrqp 10onmlkji 0hgfedcba  (rest: 21 bits)
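To make the table in my question concrete, here is a minimal Python sketch of it. It follows only the three rows as I wrote them above, so it is not a confirmed description of Jerome Abela's proposal; the function names encode_second_form and sequence_length are made up for illustration.

```python
def encode_second_form(cp):
    """Encode one code point as a list of 9-bit units, per the table above."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x100:
        # 0hgfedcba: one unit, high bit clear
        return [cp]
    if cp < 0x8000:
        # 10onmlkji 0hgfedcba: 7 high bits, then 8 low bits
        return [0x100 | (cp >> 8), cp & 0xFF]
    # 110utsrqp 10onmlkji 0hgfedcba: 6 + 7 + 8 bits
    return [0x180 | (cp >> 15), 0x100 | ((cp >> 8) & 0x7F), cp & 0xFF]


def sequence_length(lead):
    """Number of units in a sequence, read from the lead unit alone."""
    if lead & 0x100 == 0:
        return 1          # 0xxxxxxxx
    if lead & 0x080 == 0:
        return 2          # 10xxxxxxx
    return 3              # 110xxxxxx


if __name__ == "__main__":
    for cp in (0x41, 0x4E00, 0x8000, 0xAC00):
        units = encode_second_form(cp)
        print(f"U+{cp:04X}", [format(u, "09b") for u in units])
```

With this layout the lead unit alone gives the sequence length, at the cost of only 15 payload bits in the two-unit form.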
The "excellent" UTF-9 encoding from Mark Crispin has the problem that a
decoder must look ahead to the second code unit to know whether a sequence
starting with base-2 '1000abcde' is encoded with 2 or 3 UTF-9 code units.
However, the high bit of the first code unit already indicates that it is
followed by at least one other code unit, so the decoder can always safely
look at the second code unit to see whether its highest bit is set or not.
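A small Python sketch may help show the look-ahead. It implements only the three-row table quoted from John Cowan's message (one continuation bit plus eight payload bits per unit); the names encode_utf9 and decode_utf9 are hypothetical, and nothing beyond that table is assumed.

```python
def encode_utf9(cp):
    """Encode one code point as 9-bit units, per the quoted table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x100:
        return [cp]                                   # 0abcdefgh
    if cp < 0x10000:
        return [0x100 | (cp >> 8), cp & 0xFF]         # 1abcdefgh 0ijklmnop
    return [0x100 | (cp >> 16),                       # 1000abcde
            0x100 | ((cp >> 8) & 0xFF),               # 1fghijklm
            cp & 0xFF]                                # 0nopqrstu


def decode_utf9(units):
    """Decode a list of 9-bit units; the high bit means 'more units follow'."""
    code_points = []
    cp = 0
    pending = False
    for unit in units:
        cp = (cp << 8) | (unit & 0xFF)   # append the 8 payload bits
        pending = bool(unit & 0x100)     # continuation bit
        if not pending:
            code_points.append(cp)
            cp = 0
    if pending:
        raise ValueError("truncated sequence")
    return code_points


if __name__ == "__main__":
    # Both sequences start with the same lead unit 100010000; only the
    # second unit tells a 2-unit sequence from a 3-unit one.
    print([format(u, "09b") for u in encode_utf9(0x1041)])
    print([format(u, "09b") for u in encode_utf9(0x104142)])
```

The decoder never needs to know the length in advance: it simply keeps accumulating while the high bit is set.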
The second encoding has the problem that it splits the basic CJK Ideograph
block in two parts at U+8000, requiring 3 code units instead of just 2 for
the last part of the CJK Ideograph block, all Hangul syllables, and the
compatibility characters (halfwidth/fullwidth forms, presentation forms,
Arabic contextual forms and ligatures). As there's no way to make the whole
basic CJK block fit in the second encoding form as it stands, the 15 bits
would be better used if that form excluded the Latin1 block and the CJK
block, but included the Hangul syllables and compatibility characters.
Another way is to exclude the CJK Ideograph Extension A block instead, to
make the basic CJK Ideograph block fit, since Hangul can also be represented
in NFD form without using any syllable in the upper half of the BMP.
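The split is easy to check with a few lines of Python. This sketch assumes the second table as given earlier in this message (1 unit below U+0100, 2 units below U+8000, 3 units otherwise) and uses the standard Unicode block ranges; the names units_needed and unit_counts are only illustrative.

```python
# Block ranges (first, last) as defined by the Unicode block boundaries.
BLOCKS = {
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
    "CJK Compatibility Ideographs": (0xF900, 0xFAFF),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Halfwidth and Fullwidth Forms": (0xFF00, 0xFFEF),
}


def units_needed(cp):
    """Units per code point under the assumed 8/15/21-bit layout."""
    if cp < 0x100:
        return 1
    if cp < 0x8000:
        return 2
    return 3


def unit_counts(first, last):
    """Sorted list of the distinct sequence lengths used inside a block."""
    return sorted({units_needed(cp) for cp in range(first, last + 1)})


if __name__ == "__main__":
    for name, (first, last) in BLOCKS.items():
        print(f"{name}: U+{first:04X}..U+{last:04X} -> {unit_counts(first, last)}")
```

Only the basic CJK Ideograph block comes out with two different lengths; Extension A stays entirely in the 2-unit form and the other blocks listed fall entirely in the 3-unit form.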
36-bit systems are not completely uncommon: there are some processors that
allow working in 32-bit mode with error correction code for external memory,
or in 36-bit mode for internal high-speed memory, where the extra bits are
usable to facilitate arithmetic on large numbers with extra
carry/borrow bits, or the internal evaluation of floating-point
expressions with higher intermediate precision. These processors are not the
most common ones, but let's not exclude them from reappearing later with a
72-bit processing model working in a compatibility mode with 64-bit code.
After all, 72 bits is also exactly 9 bytes and will work very well with
storage devices and many byte-oriented serialization protocols...