Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 07 2004 - 14:37:23 CST


    Philippe stated, and I need to correct:

    > UTF-24 already exists as an encoding form (it is identical to UTF-32), if
    > you just consider that encoding forms need only be able to represent a
    > valid code range within a single code unit.

    This is false.

    Unicode encoding forms exist by virtue of their establishment
    as standard, by actions of the standardizing organization,
    the Unicode Consortium.

    > UTF-32 is not meant to be restricted on 32-bit representations.

    This is false. The definition of UTF-32 is:

      "The Unicode encoding form which assigns each Unicode scalar
       value to a single unsigned 32-bit code unit with the same
       numeric value as the Unicode scalar value."
       
    It is true that UTF-32 could be (and is) implemented on computers
    which hold 32-bit numeric types transiently in 64-bit registers
    (or even other size registers), but if an array of 64-bit integers
    (or 24-bit integers) were handed to some API claiming to be UTF-32,
    it would simply be nonconformant to the standard.
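
    Purely for illustration, here is a minimal sketch in C of what that
    definition requires of a UTF-32 code unit sequence in memory -- an
    array of unsigned 32-bit code units, each numerically equal to the
    scalar value it represents. (The type and variable names below are
    invented for the example, not taken from the standard.)

        #include <stdint.h>
        #include <stddef.h>

        /* A UTF-32 code unit: an unsigned 32-bit integer whose numeric
           value is the Unicode scalar value it represents. */
        typedef uint32_t utf32_unit;

        /* "abc" followed by U+10400 DESERET CAPITAL LETTER LONG I,
           in the UTF-32 encoding form. */
        static const utf32_unit example[] =
            { 0x0061, 0x0062, 0x0063, 0x10400 };
        static const size_t example_len =
            sizeof(example) / sizeof(example[0]);

    There is nothing more to the encoding form than that.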

    UTF-24 does not "already exist as an encoding form" -- it already
    exists as one of a large number of more or less idle speculations
    by character numerologists regarding other cutesy ways to handle
    Unicode characters on computers. Many of those cutesy ways are
    mere thought experiments or even simply jokes.

    > However it's true that UTF-24BE and UTF-24LE could be useful as encoding
    > schemes for serializations to byte-oriented streams, suppressing one
    > unnecessary byte per code point.

    "Could be", perhaps, but is not.

    Implementers who use UTF-32 for processing efficiency, but who have
    bandwidth constraints in some streaming context, should simply
    use one of the CES's with better size characteristics or apply
    compression to their data.
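
    To make concrete what the quoted suggestion amounts to (and what the
    standard does not define), a hypothetical "UTF-24BE" serializer would
    simply drop the always-zero high byte of each 32-bit code unit. The
    sketch below is mine; neither the name nor the function exists in
    any standard.

        #include <stdint.h>
        #include <stddef.h>

        /* Hypothetical big-endian, 3-bytes-per-code-point serialization
           of UTF-32 code units ("UTF-24BE" in the quoted message; not a
           Unicode encoding scheme).  Scalar values fit in 21 bits, so
           the high byte of each 32-bit unit is always zero and can be
           dropped.  out must have room for 3*n bytes. */
        static void put_utf24be(const uint32_t *units, size_t n,
                                uint8_t *out)
        {
            for (size_t i = 0; i < n; i++) {
                out[3*i]     = (uint8_t)((units[i] >> 16) & 0xFF);
                out[3*i + 1] = (uint8_t)((units[i] >>  8) & 0xFF);
                out[3*i + 2] = (uint8_t)( units[i]        & 0xFF);
            }
        }

    The saving over raw UTF-32 is exactly one byte in four, which the
    existing CES's or general-purpose compression already give you
    without inventing a new scheme.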

    > Note that 64-bit systems could do the same: 3 code points per 64-bit unit
    > require only 63 bits, which are stored in a single positive 64-bit integer
    > (the remaining bit would be the sign bit, always set to 0, avoiding problems
    > related to sign extensions). And even today's systems could use such a
    > representation as well, given that most 32-bit processors of today also have
    > the internal capabilities to manage 64-bit integers natively.

    This is just an incredibly bad idea.

    Packing instructions in large-word microprocessors is one thing. You
    have built-in microcode which handles that, hidden away from
    application-level programming, and carefully architected for
    maximal processor efficiency.

    But attempting to pack character data into microprocessor words, just
    because you have bits available, would just detract from the efficiency
    of handling that data. Storage is not the issue -- you want to
    get the characters in and out of the registers as efficiently as
    possible. UTF-32 works fine for that. UTF-16 works almost as well,
    in aggregate, for that. And I couldn't care less that when U+0061
    goes in a 64-bit register for manipulation, the high 57 bits are
    all set to zero.

    > Strings could be encoded as well using only 64-bit code units that would
    > each store 1 to 3 code points,

    Yes, and pigs could fly, if they had big enough wings.

    > the unused positions being filled with
    > invalid codepoints out of the Unicode space (for example by setting all 21 bits
    > to 1, producing the out-of-range code point 0x1FFFFF, used as a filler for
    > missing code points, notably when the string to encode is not an exact
    > multiple of 3 code points). Then, these 64-bit code units could be
    > serialized on byte streams as well, multiplying the number of possibilities:
    > UTF-64BE and UTF-64LE? One interest of such a scheme is that it would be more
    > compact than UTF-32, because this UTF-64 encoding scheme would waste only 1
    > bit for 3 codepoints, instead of 1 byte and 3 bits for each codepoint with
    > UTF-32!

    Wow!
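
    For the record, the arithmetic behind that claim is: a Unicode
    scalar value needs at most 21 bits, so UTF-32 leaves 11 bits
    (1 byte and 3 bits) of every 32-bit code unit unused, while packing
    three 21-bit values into the low 63 bits of a 64-bit word leaves
    1 bit unused. Here is a sketch of the packing step, taking the
    0x1FFFFF filler convention from the quoted message (none of this
    is defined anywhere):

        #include <stdint.h>
        #include <stddef.h>

        #define FILLER 0x1FFFFFu  /* out-of-range padding value, per the
                                     quoted proposal */

        /* Pack up to three 21-bit code points into the low 63 bits of
           a 64-bit word; the sign bit stays 0.  Purely illustrative. */
        static uint64_t pack3(const uint32_t *cp, size_t n)
        {
            uint64_t a = (n > 0) ? cp[0] : FILLER;
            uint64_t b = (n > 1) ? cp[1] : FILLER;
            uint64_t c = (n > 2) ? cp[2] : FILLER;
            return (a << 42) | (b << 21) | c;
        }

    Every subsequent character access then pays for that 1 saved bit
    with a shift and a mask, which is the efficiency point above.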

    > You can imagine many other encoding schemes, depending on your architecture
    > choices and constraints...

    Yes, one can imagine all sorts of strange things. I myself
    imagined UTF-17 once. But there is a difference between having
    fun imagining strange things and filling the list with
    confusing misinterpretations of the status and use of
    UTF-8, UTF-16, and UTF-32.

    --Ken


