Re: UTF-24

From: Doug Ewell (dewell@adelphia.net)
Date: Fri Apr 04 2003 - 01:36:54 EST

    Pim Blokland <pblokland at planet dot nl> wrote:

    > Why is there no UTF-24?
    >
    > See, these MathText characters take up a lot of space. No matter how
    > you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
    > long. Now if we had UTF-24, they would only take up 3 bytes.

    Yes, but supplementary characters will normally appear in one of two
    circumstances:

    (1) as part of a small alphabet (e.g. Deseret, Shavian, Osmanya),
    interspersed with spaces and punctuation in the U+00xx range, in which
    case an existing storage format, SCSU (the Standard Compression Scheme
    for Unicode), can encode them in only 1 byte each plus an initial
    3-byte overhead.

    (2) as part of a larger set (e.g. math symbols, CJK Extension B),
    interspersed with even more BMP characters, in which case the bytes
    saved on each supplementary character are overwhelmed by the bytes
    squandered on each BMP character.
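
The arithmetic behind case (2) can be checked with a rough sketch. It assumes a hypothetical "UTF-24" that spends exactly 3 bytes on every code point and ignores any byte order mark. The function name `encoded_sizes` and both sample strings are invented for illustration and do not come from the original message.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in each encoding form, without a BOM."""
    code_points = len(text)  # a Python 3 str is a sequence of code points
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-be")),
        "utf32": len(text.encode("utf-32-be")),
        # Assumption: the hypothetical UTF-24 uses 3 bytes per code point.
        "utf24": 3 * code_points,
    }


# Case (2): a few supplementary math letters scattered among BMP text.
mixed = "Let \U0001D465 + \U0001D466 = \U0001D467 for all real values."

# Best case for a 3-byte form: supplementary characters and nothing else.
supplementary_only = "\U0001D465\U0001D466\U0001D467"

for label, sample in (("mixed", mixed), ("supplementary only", supplementary_only)):
    print(label, encoded_sizes(sample))
```

In the supplementary-only string the 3-byte form saves one byte per character, but in the mixed string every BMP character costs 3 bytes instead of the 1 or 2 it would take in UTF-8 or UTF-16, so the total comes out larger.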

    > And since the Unicode character range is formally defined to run no
    > higher than U+10FFFD, which fits in 3 bytes, I see no reason why
    > no-one has ever gone to the trouble of defining a 3-byte storage
    > method.

    Most likely because no modern computer uses a 3-byte (24-bit) internal
    processing unit, and because it would be false economy for real-world
    Unicode text (see (1) and (2) above).

    > Implementation would be easy; there would be only two variants,
    > UTF-24LE and UTF-24BE, and that's it.

    I agree, it certainly was easy to implement. (Oops, did I say that out
    loud?)
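
For readers curious how little code such a scheme needs, here is a minimal sketch of a fixed-width 3-byte encoder and decoder. It is not the implementation hinted at above, and no standard defines UTF-24. The function names `utf24_encode` and `utf24_decode` and the `byteorder` parameter are assumptions made for this example.

```python
def utf24_encode(text: str, byteorder: str = "big") -> bytes:
    """Encode each code point as one 3-byte unit (hypothetical UTF-24BE/LE)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # Surrogate code points are not scalar values, so reject them.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} cannot be encoded")
        out += cp.to_bytes(3, byteorder)
    return bytes(out)


def utf24_decode(data: bytes, byteorder: str = "big") -> str:
    """Decode a sequence of 3-byte units back into a string."""
    if len(data) % 3 != 0:
        raise ValueError("input length is not a multiple of 3 bytes")
    chars = []
    for i in range(0, len(data), 3):
        cp = int.from_bytes(data[i:i + 3], byteorder)
        # 24 bits can hold values beyond the Unicode range, so validate.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point 0x{cp:06X}")
        chars.append(chr(cp))
    return "".join(chars)


sample = "A\U0001D465"
print(utf24_encode(sample, "big").hex())     # big-endian variant
print(utf24_encode(sample, "little").hex())  # little-endian variant
```

The only real decisions are byte order and validation, which matches the observation that there would be just two variants.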

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


