Re: length of text by different languages

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Mar 06 2003 - 02:59:20 EST

  • Next message: Chris Jacobs: "Re: The display of *kholam* on PCs"

    Yung-Fong Tang <ftang at netscape dot com> wrote:

    > I remember there were some studies showing that although UTF-8
    > encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese
    > usually use FEWER characters in writing to communicate information
    > than alphabet-based languages.
    >
    > Can anyone point me to such research? Martin, do you have some
    > paper about that?

    You are possibly thinking of a paper called "re-ordering.txt" by Bruce
    Thomson.

    In the IDN (internationalized domain name) working group, in late 2001,
    there was a proposal by Soobok Lee to improve the compression of domain
    names containing Hangul characters by reordering them so that the most
    common characters would be closer together. This was considered
    significant because of the 63-byte limit imposed on DNS labels. All IDN
    applications would have required huge mapping tables in order to
    implement this. Lee's proposal included reordering tables for other
    scripts, but it was obvious that his primary goal was to optimize
    compression for Hangul.
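    Punycode itself is easy to experiment with: Python ships a raw
    "punycode" codec (RFC 3492, without the "xn--" ACE prefix that IDNA
    adds). A minimal sketch, using arbitrary sample strings rather than
    anything from Lee's proposal, shows how encoded label length depends
    on the script involved:

```python
# Hedged sketch: compare raw Punycode (RFC 3492) output lengths for a
# Latin label and a Hangul label. The sample strings are illustrative
# only; they are not taken from Lee's reordering proposal.

samples = {
    "latin": "example",
    "hangul": "\ud55c\uad6d\uc5b4",  # "Korean language" in Hangul syllables
}

for name, text in samples.items():
    encoded = text.encode("punycode")  # pure-ASCII bytes
    # DNS labels are capped at 63 bytes, which is why the working group
    # cared about how densely Punycode packs a given script.
    print(f"{name}: {len(text)} chars -> {len(encoded)} bytes ({encoded!r})")
```

    The 63-byte ceiling applies to the final ASCII label, so every byte a
    denser encoding saved was what made the proposal seem significant.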

    Thomson's paper was basically a distillation of the working group's
    arguments for and against Lee's reordering proposal. It was intended to
    be neutral, but ended up refuting many of the pro-reordering arguments.

    One of Lee's claims was that Hangul was represented in Unicode in an
    unfairly inefficient way, because each Hangul syllable consumes 2 bytes
    in UTF-16 and 3 bytes in UTF-8, while direct encoding of jamos instead
    of syllables is even more inefficient. In response, Thomson wrote that
    the Book of Genesis in various languages requires:

    3088 characters in English using ASCII
    778 characters in Chinese using Han characters
    1201 characters in Korean using Hangul syllables

    and combined this data with the average compression achieved by
    AMC-ACE-Z (now called "Punycode") to derive meaningful comparisons.
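    The arithmetic behind such a comparison is simple enough to sketch.
    Assuming the nominal UTF-8 widths (1 byte for ASCII, 3 bytes for Han
    characters and precomposed Hangul syllables, all of which sit above
    U+0800 in the BMP), the counts above give:

```python
# Back-of-the-envelope totals from the character counts quoted above.
# UTF-8 widths: ASCII is 1 byte; Han characters and precomposed Hangul
# syllables encode to 3 bytes each.
assert len("\u4e2d".encode("utf-8")) == 3  # a Han character
assert len("\ud55c".encode("utf-8")) == 3  # a Hangul syllable

counts = {
    "English (ASCII)":           (3088, 1),
    "Chinese (Han)":             (778, 3),
    "Korean (Hangul syllables)": (1201, 3),
}

for lang, (chars, width) in counts.items():
    print(f"{lang}: {chars} chars x {width} B/char = {chars * width} bytes")
# -> 3088, 2334 and 3603 bytes respectively
```

    So even at 3 bytes per character, the Chinese text comes out shortest
    in UTF-8, because each character carries more information.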

    It stands to reason that a logographic or syllable-based script will
    pack more information into each character than an alphabetic one.

    I can provide a copy of Thomson's paper if Tang or anyone else is
    interested.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Thu Mar 06 2003 - 03:55:30 EST