Re: UTF-12!

From: Doug Ewell (doug@ewellic.org)
Date: Mon Feb 28 2011 - 11:12:55 CST

    Petr Tomasek <tomasek at etf dot cuni dot cz> wrote:

    > Hm, what about UTF-64? Almost everyone has 64-bit machines today...

    Marco Cimarosti, a former co-offender in creating experimental
    encodings, described UTF-64 in May 2001. It used 63 bits to encode a
    block of either (a) nine 7-bit Basic Latin characters or (b) three
    21-bit characters, one of which was presumably not Basic Latin, together
    with a 64th bit to indicate the type of block.
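
    As a rough illustration of that block structure, here is a minimal
    sketch in Python. It is not Marco's actual specification: the choice
    of the top bit as the type flag, the most-significant-first character
    order, and the names pack_block and unpack_block are all assumptions
    made for this example. Short blocks are rejected because the
    description above does not say how they would be padded.

```python
# Sketch of a UTF-64-style block: 63 payload bits plus 1 type bit.
# ASSUMPTIONS (not from the original description): the flag is the top bit,
# characters are stored most-significant first, and partial blocks are refused.

NARROW_BITS, NARROW_COUNT = 7, 9    # nine 7-bit Basic Latin characters
WIDE_BITS, WIDE_COUNT = 21, 3       # three 21-bit characters
WIDE_FLAG = 1 << 63                 # assumed: set means "three 21-bit characters"


def pack_block(chars: str) -> int:
    """Pack exactly nine Basic Latin chars, or exactly three chars, into 64 bits."""
    cps = [ord(c) for c in chars]
    if len(cps) == NARROW_COUNT and all(cp < 0x80 for cp in cps):
        width, flag = NARROW_BITS, 0
    elif len(cps) == WIDE_COUNT:
        width, flag = WIDE_BITS, WIDE_FLAG
    else:
        raise ValueError("block must be 9 Basic Latin characters or 3 characters")
    payload = 0
    for cp in cps:
        payload = (payload << width) | cp
    return flag | payload


def unpack_block(unit: int) -> str:
    """Recover the characters from one 64-bit block produced by pack_block."""
    if unit & WIDE_FLAG:
        width, count = WIDE_BITS, WIDE_COUNT
    else:
        width, count = NARROW_BITS, NARROW_COUNT
    payload = unit & (WIDE_FLAG - 1)     # drop the type bit
    mask = (1 << width) - 1
    cps = [(payload >> (width * (count - 1 - i))) & mask for i in range(count)]
    return "".join(chr(cp) for cp in cps)


narrow = pack_block("UTF-64 ok")   # nine Basic Latin characters
wide = pack_block("a\u00e9\u20ac")  # three characters, two outside Basic Latin
```

    Both cases use all 63 payload bits (9 x 7 = 3 x 21 = 63), which is
    why a single extra bit is enough to tell the two block types apart.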

    Van's sarcastic algorithm brings up a few additional goals to add to my
    list:

    • code units align with machine boundaries (8, 16, 32 bits)
    • unique encoded form for each character
    • unique encoded form for each character in context, or for each text
    • minimize or avoid state

    Remember that one point of this list is to demonstrate that not all
    goals can be met by a single encoding.

    Speaking of goals, Thomas' claim that UTF-c "avoids over-long forms of
    characters" turns out not to be true, since characters belonging to the
    selected 64-block can still be encoded using the long form. Encouraging
    users to use the shortest form (as UTF-8 does) is not the same as
    making a non-shortest form syntactically impossible (as UTF-16 and
    UTF-32 do).
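
    To make the distinction concrete for UTF-8, here is a small Python
    sketch. The helper names naive_decode_two_byte and strict_rejects are
    made up for this example. It shows that the two-byte pattern C0 AF
    carries the same bits as "/" (U+002F), so a decoder that skips the
    shortest-form check would accept it, while a conforming decoder such
    as Python's built-in one rejects it.

```python
def naive_decode_two_byte(b0: int, b1: int) -> int:
    """Decode a 110xxxxx 10xxxxxx pair WITHOUT checking for shortest form."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_rejects(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong_slash = bytes([0xC0, 0xAF])          # non-shortest form of U+002F
naive_value = naive_decode_two_byte(*overlong_slash)
shortest = "/".encode("utf-8")                # the one-byte shortest form

# In UTF-16 there is no alternative spelling to reject: U+002F is always
# exactly one 16-bit code unit.
utf16_slash = "/".encode("utf-16-be")
```

    The syntax of UTF-8 leaves room for the longer spelling, and it is
    the decoder's validation that rules it out. In UTF-16 and UTF-32 the
    question never arises.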

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

