Re: Proposing UTF-21/24

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Jan 21 2007 - 21:59:30 CST


    Ruszlan Gaszanov <ruszlan at ather dot net> wrote:

    >> Some of your arguments like "won't need a BOM anymore" don't make
    >> sense for me...
    >
    > Well, since conversion between UTF-21/24 and UTF-32 (and UTF-16 for
    > BMP characters) is very trivial - much more so than with UTF-8 -
    > some application designers might prefer to use the same byte order
    > for UTF-21/24 as they are using for UTF-16/32 in order to make
    > processing faster. Hence we might get BE/LE varieties of UTF-21/24
    > and have to deal with the BOM issue. Therefore, the error detection
    > mechanisms I proposed for the UTF-24 varieties also allow automatic
    > byte order detection.

    Conversion to and from UTF-8 is really quite simple. It may look like a
    lot of lines of code, but most of it is conditional -- only one of the
    branches runs for each lead byte.
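
    To put that in concrete terms, the branching looks roughly like the
    sketch below -- a minimal decoder with validation of continuation
    bytes and over-long forms omitted for brevity, not production code:

    #include <stdio.h>

    /* Decode one code point starting at s, store it in *cp, and return
     * the number of bytes consumed.  Only one branch executes per lead
     * byte; validity checks are left out to keep the sketch short. */
    static int utf8_decode(const unsigned char *s, unsigned long *cp)
    {
        if (s[0] < 0x80) {                    /* 0xxxxxxx: ASCII, 1 byte */
            *cp = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte form */
            *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte form */
            *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
                  ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        } else {                              /* 11110xxx: 4-byte form */
            *cp = ((unsigned long)(s[0] & 0x07) << 18) |
                  ((unsigned long)(s[1] & 0x3F) << 12) |
                  ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        /* "A", U+00E9, U+20AC, U+10300 */
        const unsigned char text[] = "A\xC3\xA9\xE2\x82\xAC\xF0\x90\x8C\x80";
        unsigned long cp;
        int i = 0;
        while (text[i]) {
            i += utf8_decode(text + i, &cp);
            printf("U+%04lX\n", cp);
        }
        return 0;
    }

    For any given character only one arm of that if/else chain runs, so
    the length of the listing says little about the per-character cost.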

    Ruszlán, take it from me: I was a well-known inventor of alternative
    UTFs several years ago, and as far back as 1998 I came up with a
    compression scheme that vaguely resembled SCSU window switching
    (simpler, but less efficient). Gradually and patiently, I was persuaded
    (and saw for myself) that these alternative schemes had no chance of
    widespread adoption. Even if they were better, they were not "better
    enough." Eventually, after learning quite a bit about encoding
    strategies and Unicode policy, I stopped invented and learned to embrace
    the existing encoding schemes.

    Your ideas reminded me of the variable-length scheme Frank mentioned.
    (I thought I had invented that one too, based on Mark Crispin's
    mostly-whimsical UTF-9 RFC.) I actually do use that scheme for some
    internal purposes, not all of which have to do with Unicode code points.
    For example, I'm working on a Unicode-enabled Huffman encoder that uses
    the variable-length scheme to store character frequencies. It has the
    advantage of not being limited to 0x10FFFF or any other number, and for
    this purpose the "false ASCII find" problem is not a problem at all.
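
    (Frank's scheme isn't spelled out in this thread, but a base-128
    "continuation bit" encoding along the lines sketched below behaves
    the way I described: no upper limit on the value, and final bytes
    that a naive byte-level search can mistake for ASCII.  Take it as an
    illustration of the general idea, not necessarily the exact layout
    Frank described or the one I actually use.)

    #include <stdio.h>

    /* Write v big-endian in 7-bit groups: every byte but the last has
     * the high bit set as a continuation flag.  Returns the byte count.
     * Because the final byte always falls in 0x00-0x7F, a byte search
     * for an ASCII character can match the tail of a larger value --
     * presumably the "false ASCII find" problem. */
    static int varint_encode(unsigned long v, unsigned char *out)
    {
        unsigned char tmp[10];
        int n = 0, i;
        do {
            tmp[n++] = v & 0x7F;
            v >>= 7;
        } while (v);
        for (i = 0; i < n; i++)      /* most significant group first */
            out[i] = tmp[n - 1 - i] | (i < n - 1 ? 0x80 : 0x00);
        return n;
    }

    static int varint_decode(const unsigned char *in, unsigned long *v)
    {
        int n = 0;
        *v = 0;
        do {
            *v = (*v << 7) | (in[n] & 0x7F);
        } while (in[n++] & 0x80);
        return n;
    }

    int main(void)
    {
        unsigned char buf[10];
        unsigned long back;
        /* nothing special happens at 0x10FFFF */
        int len = varint_encode(0x10FFFFUL + 1, buf);
        varint_decode(buf, &back);
        printf("%d bytes, round-trips to 0x%lX\n", len, back);
        return 0;
    }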

    But for storage purposes, you don't want to use 3 bytes for each
    character -- not with the overwhelming prevalence of BMP characters in
    almost all text. There's a reason why almost nobody uses UTF-32;
    cutting the storage from four bytes to three won't change that. And for
    interchange, you don't want the overhead of calculating or checking
    parity for each 3-byte series. It's not as computationally cheap as it
    seems, compared to decoding UTF-8 or even SCSU. (The complexity of
    decoding SCSU is vastly overstated, as I wrote in Unicode Technical Note
    #14.)
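
    To make that concrete: even with a simple layout -- say, 21 payload
    bits plus a stored parity bit, which is only a stand-in I am using
    for illustration and not necessarily Ruszlán's exact proposal -- the
    check comes to a handful of operations on every 3-byte unit:

    #include <stdio.h>

    /* Illustrative layout only: bit 23 of each 3-byte unit stores even
     * parity over the 21 payload bits (bits 21-22 ignored here).  The
     * real proposal may differ; the point is the per-unit work. */
    static int parity21(unsigned long bits)
    {
        bits &= 0x1FFFFF;          /* keep the 21 payload bits */
        bits ^= bits >> 16;        /* fold the word onto itself... */
        bits ^= bits >> 8;
        bits ^= bits >> 4;
        bits ^= bits >> 2;
        bits ^= bits >> 1;
        return (int)(bits & 1);    /* 1 if an odd number of bits set */
    }

    static int unit_ok(const unsigned char u[3])
    {
        unsigned long w = ((unsigned long)u[0] << 16) |
                          ((unsigned long)u[1] << 8) | u[2];
        return parity21(w) == (int)(w >> 23);  /* stored bit must match */
    }

    int main(void)
    {
        /* U+20AC has odd payload parity, so the stored bit is set */
        unsigned char unit[3] = { 0x80, 0x20, 0xAC };
        printf("unit accepted: %s\n", unit_ok(unit) ? "yes" : "no");
        return 0;
    }

    Multiply that by every character in a file, on both the sending and
    the receiving end, and it is no longer negligible next to one branch
    per UTF-8 lead byte.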

    It's true that you don't need a Byte Order Mark per se with a byte-based
    encoding such as this, but you might still want to be able to use U+FEFF
    as an encoding signature. All Unicode encodings have this defined. The
    problem with U+FEFF is not so much its use as a byte order mark or
    signature, but rather its parallel and conflicting use as a zero-width
    no-break space (which was never widely used and which is now
    deprecated).
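
    Handling the signature in a decoder is then just a matter of looking
    at the first decoded code point -- the same trivial check for any
    UTF, sketched here for completeness:

    #include <stdio.h>
    #include <stddef.h>

    /* Skip a leading U+FEFF when it is serving purely as an encoding
     * signature; cps is whatever array of code points the decoder for
     * the particular scheme produced. */
    static const unsigned long *skip_signature(const unsigned long *cps,
                                               size_t *len)
    {
        if (*len > 0 && cps[0] == 0xFEFFUL) {
            --*len;
            return cps + 1;
        }
        return cps;
    }

    int main(void)
    {
        unsigned long text[] = { 0xFEFFUL, 'H', 'i' };
        size_t len = 3;
        const unsigned long *body = skip_signature(text, &len);
        printf("%u code points left, first is U+%04lX\n",
               (unsigned)len, body[0]);
        return 0;
    }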

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

