Re: Proposing UTF-21/24

From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Mon Jan 22 2007 - 12:49:30 CST


    Mark Davis wrote:

    > This has the very significant problem of ASCII incompatibility: the
    > key advantage of UTF-8 is that values of 0..127 are never part of a
    > multibyte character. That is one of the reasons why the simple
    > approach of just using 7 bits of content with a bit to say "has
    > continuation", while considered, never got any traction.
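
To illustrate the quoted point, here is a minimal sketch (not from the original thread) of a "7 bits of content plus a continuation bit" scheme. The function names are hypothetical and the byte order (most significant group first) is an assumption; the quoted text does not specify one. It shows that the final octet of a multi-octet character falls in the ASCII range, which never happens in UTF-8.

```python
def encode_7bit_continuation(cp):
    """Hypothetical scheme: 7 payload bits per octet; a set high bit
    means 'more octets follow'. Most significant group comes first."""
    groups = []
    while True:
        groups.append(cp & 0x7F)
        cp >>= 7
        if cp == 0:
            break
    groups.reverse()
    # Every octet except the last carries the continuation bit.
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])


def ascii_octets_in_multibyte(encoded):
    """Return the octets below 0x80 that occur inside a multi-octet sequence."""
    if len(encoded) < 2:
        return []
    return [b for b in encoded if b < 0x80]


euro = 0x20AC  # EURO SIGN
naive = encode_7bit_continuation(euro)
utf8 = chr(euro).encode("utf-8")

print(naive.hex(), ascii_octets_in_multibyte(naive))
print(utf8.hex(), ascii_octets_in_multibyte(utf8))
```

In this sketch the euro sign ends in the octet 0x2C, an ASCII comma, so a byte-oriented parser splitting on commas would cut the character in half. The UTF-8 form uses only octets of 0x80 and above.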

    Yes, "get a 1:1 correspondence for the 128 ASCII octets" was another
    goal, in addition to "find something working for 31 bits". And let
    a single error destroy only one code point.
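
The error-containment goal can be checked with a small experiment. This is only a sketch using Python's built-in UTF-8 codec; the choice of test string and of which octet to corrupt are arbitrary assumptions for illustration.

```python
original = "a\u20acb"                      # 'a', EURO SIGN, 'b'
octets = bytearray(original.encode("utf-8"))

# Corrupt the middle octet of the three-octet euro sequence.
octets[2] = 0x41

# Decode leniently so that invalid sequences become U+FFFD
# instead of raising an exception.
damaged = bytes(octets).decode("utf-8", errors="replace")

# The neighbouring characters are unaffected by the damage.
print(damaged[0], damaged[-1])
```

Because lead octets, trail octets, and ASCII octets occupy disjoint ranges, the decoder resynchronizes at the next valid lead or ASCII octet, so the characters before and after the damaged sequence survive.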

    For UTF-1 a goal was to protect the 64 control characters, also fine,
    but unfortunately not what actually counts for some legacy protocols.
    And the modulo 190 in UTF-1 is stranger than the modulo 64 in UTF-8.
    Modulo 243 in BOCU-1 is the oddest, protecting 256-243=13 important
    ASCII characters.

    > IMO, the whole discussion of "UTF-24" is of only academic interest

    ACK, the field of compression is explored in almost all directions.
    My own experiments go in the opposite direction, expansion: protect
    224 Latin-1 characters (C0, G0, G1) instead of only ASCII (C0 + G0),
    and use the remaining 32 octets to encode any code point outside of
    the "visible Latin-1 or C0" set. Only legacy text applications can
    really use this for documents mostly in Latin-1. With that I got a
    modulo 16 (hex.) "UTF-4" scheme, otherwise the same design as UTF-8.
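
The message does not spell out the "UTF-4" octet layout, so the following is only a hypothetical sketch of the general idea, not the author's actual design. It assumes that U+0000..U+007F and U+00A0..U+00FF pass through as single octets, that a lead octet 0x80+n announces n trail octets, and that each trail octet 0x90..0x9F carries four bits. The length-prefixed lead octet is a simplification chosen here; it differs from UTF-8's unary length prefix. The names `utf4_encode` and `utf4_decode` are invented for illustration.

```python
def utf4_encode(text):
    """Hypothetical 'UTF-4'-style encoder: Latin-1 stays 1:1, everything
    else is escaped with octets from 0x80..0x9F only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80 or 0xA0 <= cp <= 0xFF:
            out.append(cp)                 # C0, G0, G1: one octet, unchanged
            continue
        nibbles = []
        while cp:
            nibbles.append(cp & 0xF)       # 4 payload bits per trail octet
            cp >>= 4
        nibbles.reverse()                  # most significant nibble first
        out.append(0x80 | len(nibbles))    # lead octet: number of trail octets
        out.extend(0x90 | n for n in nibbles)
    return bytes(out)


def utf4_decode(data):
    """Inverse of utf4_encode; overlong forms are not rejected in this sketch."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or b >= 0xA0:
            chars.append(chr(b))
            i += 1
        elif b < 0x90:
            n = b & 0x0F
            trail = data[i + 1:i + 1 + n]
            if n == 0 or len(trail) != n or any(t & 0xF0 != 0x90 for t in trail):
                raise ValueError("malformed sequence at offset %d" % i)
            cp = 0
            for t in trail:
                cp = (cp << 4) | (t & 0x0F)
            chars.append(chr(cp))
            i += 1 + n
        else:
            raise ValueError("unexpected trail octet at offset %d" % i)
    return "".join(chars)


sample = "Gr\u00fc\u00dfe, \u20ac5"
encoded = utf4_encode(sample)
print(encoded.hex(), utf4_decode(encoded) == sample)
```

Under these assumptions a pure Latin-1 text is octet-for-octet identical to its ISO 8859-1 form, while the euro sign expands to five octets, which matches the "expansion" trade-off described above.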

    But a decent escape mechanism with hex. XML NCRs is good enough, and
    so "UTF-4" is also only academic. At least it convinced me that it's
    impossible to "improve" UTF-8 without giving up one or more of its
    design goals.

    Frank


