Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 06 2004 - 15:21:41 CST


    ----- Original Message -----
    From: "Arcane Jill" <arcanejill@ramonsky.com>
    > Probably a dumb question, but how come nobody's invented "UTF-24" yet? I
    > just made that up, it's not an official standard, but one could easily
    > define UTF-24 as UTF-32 with the most-significant byte (which is always
    > zero) removed, hence all characters are stored in exactly three bytes and
    > all are treated equally. You could have UTF-24LE and UTF-24BE variants,
    > and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly
    > brilliant idea, but I just wonder why no-one's suggested it before.

    UTF-24 already exists as an encoding form (it is identical to UTF-32), if
    you consider that an encoding form only needs to be able to represent the
    valid code point range within a single code unit.
    UTF-32 is not meant to be restricted to 32-bit representations.

    However, it's true that UTF-24BE and UTF-24LE could be useful as encoding
    schemes for serialization to byte-oriented streams, dropping one
    unnecessary byte per code point.
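    For illustration only, here is a minimal Python sketch of what such a
    UTF-24BE/LE scheme could look like (the names utf24_encode/utf24_decode
    are invented for this example, nothing standard):

        # Hypothetical sketch of a UTF-24 encoding scheme: each code point is
        # serialized as exactly three bytes, big- or little-endian.
        def utf24_encode(text, byteorder="big"):
            out = bytearray()
            for ch in text:
                cp = ord(ch)
                if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                    raise ValueError("not a Unicode scalar value: U+%04X" % cp)
                out += cp.to_bytes(3, byteorder)
            return bytes(out)

        def utf24_decode(data, byteorder="big"):
            if len(data) % 3:
                raise ValueError("truncated UTF-24 stream")
            return "".join(chr(int.from_bytes(data[i:i+3], byteorder))
                           for i in range(0, len(data), 3))

        # Example: the supplementary-plane code point U+1D11D
        assert utf24_encode("\U0001D11D") == b"\x01\xd1\x1d"
        assert utf24_decode(b"\x01\xd1\x1d") == "\U0001D11D"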

    > (And then of course, there's UTF-21, in which blocks of 21 bits are
    > concatenated, so that eight Unicode characters will be stored in every 21
    > bytes - and not to mention UTF-20.087462841250343, in which a plain text
    > document is simply regarded as one very large integer expressed in radix
    > 1114112, and whose UTF-20.087462841250343 representation is simply that
    > number expressed in binary. But now I'm getting /very/ silly - please
    > don't take any of this seriously.) :-)

    I don't think that UTF-21 would be useful as an encoding form, but possibly
    as an encoding scheme where the 3 always-zero bits would be stripped,
    providing a tiny amount of compression that would only be justified for
    transmission over serial or network links.
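    Just to make the bit arithmetic concrete, a rough Python sketch of such a
    "UTF-21" packing could look like this (8 code points of 21 bits each fill
    exactly 21 bytes; the names are made up for the example):

        # Hypothetical "UTF-21" packing: concatenate 21-bit code points into a
        # continuous bit stream, padded with zero bits to a whole byte.
        def utf21_pack(text):
            bits = "".join(format(ord(ch), "021b") for ch in text)
            bits += "0" * (-len(bits) % 8)
            return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

        def utf21_unpack(data, count):
            bits = "".join(format(b, "08b") for b in data)
            return "".join(chr(int(bits[i*21:(i+1)*21], 2)) for i in range(count))

        s = "Unicode!"                       # 8 characters
        packed = utf21_pack(s)
        assert len(packed) == 21             # 8 * 21 bits = 168 bits = 21 bytes
        assert utf21_unpack(packed, len(s)) == s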

    However, I do think that such an "optimization" would have the effect of
    removing the byte alignment on which more powerful compressors rely. If
    you really need more effective compression, use SCSU or apply deflate or
    bzip2 compression to UTF-8, UTF-16, or UTF-24/32... (there's not much
    difference between compressing UTF-24 and UTF-32 with generic compression
    algorithms like deflate or bzip2).
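    Anyone curious can check this for themselves with a quick, unscientific
    Python experiment (the exact numbers depend entirely on the sample text;
    the point is only that generic compressors largely erase the size
    differences between the raw encoding forms):

        import zlib, bz2

        text = "Unicode text sample " * 500

        for name in ("utf-8", "utf-16-le", "utf-32-le"):
            raw = text.encode(name)
            print("%-10s raw=%6d deflate=%6d bzip2=%6d"
                  % (name, len(raw),
                     len(zlib.compress(raw, 9)), len(bz2.compress(raw, 9))))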

    > The "UTF-24" thing seems a reasonably sensible question though. Is it just
    > that we don't like it because some processors have alignment restrictions
    > or something?

    There still exist, even today, 4-bit processors and 1-bit processors,
    where the smallest addressable memory unit is smaller than 8 bits. They
    are used in low-cost micro-devices, notably to build automated robots for
    industry, or even many home/kitchen appliances. I don't know whether they
    need Unicode to represent international text, given that they often have a
    very limited user interface, incapable of inputting or outputting text,
    but who knows? Maybe they are used in some mobile phones, or within
    "smart" keyboards, tablets, or other input devices connected to PCs...

    There also exist systems where the smallest addressable memory cell is a
    9-bit byte. This is more of an issue here, because the Unicode standard
    does not specify whether encoding schemes (which serialize code points to
    bytes) should set the 9th bit of each byte to 0, or should fill every bit
    of memory, even if this means that the 8-bit bytes of UTF-8 will not stay
    aligned with the 9-bit memory bytes.

    Somebody already introduced UTF-9 in the past for 9-bit systems.
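    To illustrate the choice described above, here is a small Python sketch of
    the two serialization options on a machine whose smallest addressable unit
    is a 9-bit byte (memory is simulated as a list of integers below 512;
    Unicode defines neither option, this is purely hypothetical):

        def store_zero_padded(octets):
            # Option 1: one 8-bit byte per 9-bit cell, 9th (top) bit always 0.
            return [b for b in octets]

        def store_packed(octets):
            # Option 2: pack the 8-bit stream tightly into 9-bit cells, so the
            # UTF-8 byte boundaries no longer line up with memory cells.
            bits = "".join(format(b, "08b") for b in octets)
            bits += "0" * (-len(bits) % 9)
            return [int(bits[i:i+9], 2) for i in range(0, len(bits), 9)]

        utf8 = "héllo wörld".encode("utf-8")   # 13 octets
        print(store_zero_padded(utf8))         # 13 cells, top bit unused
        print(store_packed(utf8))              # 104 bits -> 12 cells of 9 bits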

    A 36-bit processor could just as well address memory in cells of 36 bits,
    where the 4 highest bits would either be used as CRC control bits
    (generated and checked automatically by the processor or a memory bus
    interface, in memory regions where this behavior is enabled), or be used
    to store supplementary bits of actual data (in unchecked regions that live
    in reliable and fast memory, such as the internal CPU cache or static CPU
    registers).

    For such systems, the impact of translating addressable memory widths
    across interfaces is not currently discussed in Unicode, which assumes
    that internal memory is necessarily addressed in units that are a power of
    2 and a multiple of 8 bits, and is then interchanged or stored using that
    byte unit.

    Today we are witnessing the constant expansion of bus widths to allow
    parallel processing instead of multiplying the clock frequency (and the
    energy spent and the heat generated, which creates other environmental
    problems), so why would the 8-bit byte remain the most efficient universal
    unit? If you look at IEEE floating-point formats, they are often
    implemented in FPUs working on 80-bit units, and an 80-bit memory cell
    could tomorrow just as well become a standard (compatible with the
    increasingly common 64-bit architectures of today) that would no longer be
    a power of 2 (even if it stays a multiple of 8 bits).

    On an 80-bit system, the easiest solution for handling UTF-32 without
    using too much space would be a 40-bit unit (i.e. two code points per
    80-bit memory cell). But if you consider that only 21 bits are used by
    Unicode, then each 80-bit memory cell could store three code points,
    leaving 17 bits unused in each addressable memory cell.

    Note that 64-bit systems could do the same: 3 code points per 64-bit unit
    require only 63 bits, which can be stored in a single positive 64-bit
    integer (the remaining bit would be the sign bit, always set to 0,
    avoiding problems related to sign extension). And even today's systems
    could use such a representation, given that most current 32-bit processors
    also have the internal capability to handle 64-bit integers natively.
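    The packing itself is trivial; a Python sketch (the same arithmetic also
    covers the 80-bit cell above, since 3 x 21 = 63 bits with 17 bits left
    over, and the helper names are invented for the example):

        def pack3(c0, c1, c2):
            # Three 21-bit code points in one 64-bit word, sign bit left clear.
            assert all(c <= 0x10FFFF for c in (c0, c1, c2))
            return (c0 << 42) | (c1 << 21) | c2

        def unpack3(word):
            return ((word >> 42) & 0x1FFFFF,
                    (word >> 21) & 0x1FFFFF,
                    word & 0x1FFFFF)

        # 'A', EURO SIGN, and a supplementary-plane code point
        w = pack3(ord("A"), 0x20AC, 0x10348)
        assert w < 2**63                     # still a positive 64-bit integer
        assert unpack3(w) == (0x41, 0x20AC, 0x10348)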

    Strings could also be encoded using 64-bit code units that would each
    store 1 to 3 code points, the unused positions being filled with an
    invalid code point outside the Unicode space (for example by setting all
    21 bits to 1, producing the out-of-range value 0x1FFFFF, used as a filler
    for missing code points, notably when the string to encode is not an exact
    multiple of 3 code points). Then these 64-bit code units could be
    serialized to byte streams as well, multiplying the number of
    possibilities: UTF-64BE and UTF-64LE? One advantage of such a scheme is
    that it would be more compact than UTF-32, because this UTF-64 encoding
    scheme would waste only 1 bit per 3 code points, instead of 1 byte and 3
    bits per code point with UTF-32!
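    Again for illustration only, a Python sketch of this hypothetical "UTF-64"
    encoding scheme could look like this (function names invented; nothing of
    the sort is standardized):

        FILLER = 0x1FFFFF                    # out-of-range value used as padding

        def utf64be_encode(text):
            cps = [ord(ch) for ch in text]
            while len(cps) % 3:
                cps.append(FILLER)
            out = bytearray()
            for i in range(0, len(cps), 3):
                unit = (cps[i] << 42) | (cps[i+1] << 21) | cps[i+2]
                out += unit.to_bytes(8, "big")
            return bytes(out)

        def utf64be_decode(data):
            cps = []
            for i in range(0, len(data), 8):
                unit = int.from_bytes(data[i:i+8], "big")
                cps += [(unit >> 42) & 0x1FFFFF,
                        (unit >> 21) & 0x1FFFFF,
                        unit & 0x1FFFFF]
            return "".join(chr(c) for c in cps if c != FILLER)

        s = "ABCD"                           # 4 code points -> 2 units -> 16 bytes
        assert len(utf64be_encode(s)) == 16
        assert utf64be_decode(utf64be_encode(s)) == s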

    You can imagine many other encoding schemes, depending on your architecture
    choices and constraints...


