Re: Unicode forms for internal storage - BOCU-1 speed

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jan 22 2004 - 16:11:44 EST

    From: <jcowan@reutershealth.com>
    > Mark Crispin's UTF-9 (not to be confused with Jerome Abela's) is also
    > excellent, although most of us don't have 36-bit systems, for which it
    > makes sense. A precis:
    >
    > Code points (base 2)      UTF-9 code units (base 2)
    > 0000000000000abcdefgh     0abcdefgh
    > 00000abcdefghijklmnop     1abcdefgh 0ijklmnop
    > abcdefghijklmnopqrstu     1000abcde 1fghijklm 0nopqrstu
    >
    > This is almost as good as Latin-1 for its repertoire, only minutely worse
    > than UTF-16 for the rest of the BMP, and beats all other encodings for the
    > other planes.
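
    To put rough numbers on the size comparison quoted above, here is a
    minimal Python sketch. It assumes the three-row layout exactly as given in
    the quoted table, and it compares only against Latin-1, UTF-8, UTF-16 and
    UTF-32 (not against BOCU-1 or SCSU). The function names are made up for
    this illustration.

```python
def quoted_utf9_bits(cp):
    """Bits used by the quoted UTF-9 layout: 1, 2 or 3 nine-bit units."""
    units = 1 if cp <= 0xFF else 2 if cp <= 0xFFFF else 3
    return 9 * units


def codec_bits(cp, codec):
    """Bits used by a standard Python codec for one (non-surrogate) code point."""
    return 8 * len(chr(cp).encode(codec))


if __name__ == "__main__":
    # One Latin-1 letter, one BMP ideograph, one supplementary-plane symbol.
    for cp in (0xE9, 0x4E2D, 0x1D11E):
        print(f"U+{cp:04X}",
              "UTF-9:", quoted_utf9_bits(cp),
              "UTF-8:", codec_bits(cp, "utf-8"),
              "UTF-16:", codec_bits(cp, "utf-16-be"),
              "UTF-32:", codec_bits(cp, "utf-32-be"))
```

    For U+00E9 this gives 9 bits against 8 for Latin-1; for U+4E2D, 18 bits
    against 16 for UTF-16; for U+1D11E, 27 bits against 32 for UTF-8, UTF-16
    and UTF-32.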

    Is the competing UTF-9 from Jerome Abela this one?

    21-bit code points (base 2) -> 9-bit UTF-9 code units (base 2)
    0000000000000hgfedcba -> 0hgfedcba (Latin-1: 8 bits)
    000000onmlkjihgfedcba -> 10onmlkji 0hgfedcba (lower half of the BMP: 15 bits)
    utsrqponmlkjihgfedcba -> 110utsrqp 10onmlkji 0hgfedcba (rest: 21 bits)

    The "excellent" UTF-9 encoding from Mark Crispin has the drawback that a
    decoder must look ahead at the second code unit to know whether a sequence
    starting with base-2 '1000abcde' is encoded with 2 or 3 UTF-9 code units.
    However, the high bit of the first code unit already indicates that at
    least one other code unit follows, so the decoder can always safely look
    at the second code unit to see whether its highest bit is set.
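
    The look-ahead can be sketched in a few lines of Python. This is only an
    illustration of the three-row layout quoted by John Cowan, with 9-bit code
    units held as plain integers. The function names are hypothetical, and the
    decoder does not reject sequences longer than three units.

```python
def encode_quoted_utf9(cp):
    """Encode one code point into 9-bit units, following the quoted table."""
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("code point outside the 21-bit range")
    if cp <= 0xFF:
        return [cp]                                # 0abcdefgh
    if cp <= 0xFFFF:
        return [0x100 | (cp >> 8), cp & 0xFF]      # 1abcdefgh 0ijklmnop
    return [0x100 | (cp >> 16),                    # 1000abcde
            0x100 | ((cp >> 8) & 0xFF),            # 1fghijklm
            cp & 0xFF]                             # 0nopqrstu


def decode_quoted_utf9(units):
    """Decode a list of 9-bit units back into a list of code points."""
    out, i = [], 0
    while i < len(units):
        cp = units[i] & 0xFF
        # High bit set means "another unit follows": the sequence length is
        # only known once a unit with a clear high bit has been read.
        while units[i] & 0x100:
            i += 1
            if i >= len(units):
                raise ValueError("truncated sequence")
            cp = (cp << 8) | (units[i] & 0xFF)
        out.append(cp)
        i += 1
    return out
```

    For example, U+0141 encodes as [0x101, 0x041] and U+10330 as
    [0x101, 0x103, 0x030]: both start with the same code unit, and only the
    high bit of the second unit tells them apart.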

    The second encoding has the problem that it splits the basic Han ideograph
    block in two: the last part of the CJK Ideograph block (from U+8000 up)
    requires 3 code units instead of just 2, as do all Hangul syllables and
    the compatibility characters (narrow/fullwidth forms, presentation forms,
    Arabic contextual forms and ligatures). Since the basic CJK block cannot
    fit as-is in the two-unit form, the 15 bits would be better used if they
    excluded the Latin-1 block and the CJK block but included the Hangul
    syllables and the compatibility characters. Another option is to exclude
    the CJK Ideograph Extension A block so that the basic CJK Ideograph block
    fits; Hangul can then be represented in NFD form, without any syllable
    from the upper half of the BMP.
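
    The split is easy to see by counting code units. The sketch below encodes
    according to the second table as written in this message (which may or
    may not be Jerome Abela's actual proposal); the function name is invented
    for the example.

```python
def encode_second_form(cp):
    """Encode one code point using the 0... / 10... / 110... layout above."""
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("code point outside the 21-bit range")
    if cp <= 0xFF:
        return [cp]                                # 0hgfedcba
    if cp <= 0x7FFF:
        return [0x100 | (cp >> 8), cp & 0xFF]      # 10onmlkji 0hgfedcba
    return [0x180 | (cp >> 15),                    # 110utsrqp
            0x100 | ((cp >> 8) & 0x7F),            # 10onmlkji
            cp & 0xFF]                             # 0hgfedcba


if __name__ == "__main__":
    samples = [
        (0x4E00, "first CJK unified ideograph"),
        (0x7FFF, "last code point of the 15-bit range"),
        (0x8000, "CJK ideograph just past the split"),
        (0xAC00, "first Hangul syllable"),
        (0xFF21, "fullwidth Latin capital A"),
    ]
    for cp, label in samples:
        print(f"U+{cp:04X} {label}: {len(encode_second_form(cp))} units")
```

    U+4E00 and U+7FFF take 2 units, while U+8000, U+AC00 and U+FF21 each take
    3.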

    36-bit systems are not completely uncommon: some processors can work in
    32-bit mode with error-correcting code for external memory, or in 36-bit
    mode for internal high-speed memory, where the extra bits serve as extra
    carry/borrow bits for arithmetic on large numbers, or give higher
    intermediate precision when evaluating floating-point expressions. These
    processors are not the most common ones, but let's not rule out their
    reappearing later with a 72-bit processing model working in a
    compatibility mode with 64-bit code. After all, 72 bits is also exactly
    9 bytes, and will work very well with storage devices and many
    byte-oriented serialization protocols...
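
    As a small illustration of the 72-bit remark: eight 9-bit code units fill
    exactly nine 8-bit bytes. The sketch below assumes a big-endian packing,
    chosen arbitrarily for the example, and only handles whole groups of
    eight units; the function names are hypothetical.

```python
def pack_nonets(units):
    """Pack 9-bit units into bytes, 8 units per 9-byte (72-bit) group."""
    if len(units) % 8:
        raise ValueError("this sketch only packs whole groups of 8 units")
    out = bytearray()
    for i in range(0, len(units), 8):
        word = 0
        for u in units[i:i + 8]:
            if not 0 <= u <= 0x1FF:
                raise ValueError("code unit does not fit in 9 bits")
            word = (word << 9) | u          # first unit ends up most significant
        out += word.to_bytes(9, "big")
    return bytes(out)


def unpack_nonets(data):
    """Inverse of pack_nonets: split each 9-byte group into 8 nine-bit units."""
    if len(data) % 9:
        raise ValueError("input length must be a multiple of 9 bytes")
    units = []
    for i in range(0, len(data), 9):
        word = int.from_bytes(data[i:i + 9], "big")
        units.extend((word >> shift) & 0x1FF for shift in range(63, -1, -9))
    return units
```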


