Re: length of text by different languages

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Mar 06 2003 - 02:59:20 EST

  • Next message: Chris Jacobs: "Re: The display of *kholam* on PCs"

    Yung-Fong Tang <ftang at netscape dot com> wrote:

    > I remember there were some studies showing that although UTF-8
    > encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese
    > usually use FEWER characters in writing to communicate information
    > than alphabet-based languages.
    >
    > Can anyone point me to such research? Martin, do you have some
    > paper about that?

    You are possibly thinking of a paper called "re-ordering.txt" by Bruce
    Thomson.

    In the IDN (internationalized domain name) working group, in late 2001,
    there was a proposal by Soobok Lee to improve the compression of domain
    names containing Hangul characters by reordering them so that the most
    common characters would be closer together. This was considered
    significant because of the 63-byte limit imposed on DNS labels. All IDN
    applications would have required huge mapping tables in order to
    implement this. Lee's proposal included reordering tables for other
    scripts, but it was obvious that his primary goal was to optimize
    compression for Hangul.
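    Punycode itself is easy to experiment with: Python ships a raw
    "punycode" codec (RFC 3492, without the "xn--" ACE prefix that IDNA
    adds). A minimal sketch, using arbitrary sample strings rather than
    anything from Lee's proposal, shows how encoded label length depends
    on the script involved:

```python
# Hedged sketch: compare raw Punycode (RFC 3492) output lengths for a
# Latin label and a Hangul label. The sample strings are illustrative
# only; they are not taken from Lee's reordering proposal.

samples = {
    "latin": "example",
    "hangul": "\ud55c\uad6d\uc5b4",  # "Korean language" in Hangul syllables
}

for name, text in samples.items():
    encoded = text.encode("punycode")  # pure-ASCII bytes
    # DNS labels are capped at 63 bytes, which is why the working group
    # cared about how densely Punycode packs a given script.
    print(f"{name}: {len(text)} chars -> {len(encoded)} bytes ({encoded!r})")
```

    The 63-byte ceiling applies to the final ASCII label, so every byte a
    denser encoding saved was what made the proposal seem significant.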

    Thomson's paper was basically a distillation of the working group's
    arguments for and against Lee's reordering proposal. It was intended to
    be neutral, but ended up refuting many of the pro-reordering arguments.

    One of Lee's claims was that Hangul was represented in Unicode in an
    unfairly inefficient way, because each Hangul syllable consumes 2 bytes
    in UTF-16 and 3 bytes in UTF-8, while direct encoding of jamos instead
    of syllables is even more inefficient. In response, Thomson wrote that
    the Book of Genesis in various languages requires:

    3088 characters in English using ASCII
    778 characters in Chinese using Han characters
    1201 characters in Korean using Hangul syllables

    and combined this data with the average compression achieved by
    AMC-ACE-Z (now called "Punycode") to derive meaningful comparisons.
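    The arithmetic behind such a comparison is simple enough to sketch.
    Assuming the nominal UTF-8 widths (1 byte for ASCII, 3 bytes for Han
    characters and precomposed Hangul syllables, all of which sit above
    U+0800 in the BMP), the counts above give:

```python
# Back-of-the-envelope totals from the character counts quoted above.
# UTF-8 widths: ASCII is 1 byte; Han characters and precomposed Hangul
# syllables encode to 3 bytes each.
assert len("\u4e2d".encode("utf-8")) == 3  # a Han character
assert len("\ud55c".encode("utf-8")) == 3  # a Hangul syllable

counts = {
    "English (ASCII)":           (3088, 1),
    "Chinese (Han)":             (778, 3),
    "Korean (Hangul syllables)": (1201, 3),
}

for lang, (chars, width) in counts.items():
    print(f"{lang}: {chars} chars x {width} B/char = {chars * width} bytes")
# -> 3088, 2334 and 3603 bytes respectively
```

    So even at 3 bytes per character, the Chinese text comes out shortest
    in UTF-8, because each character carries more information.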

    It stands to reason that a logographic or syllable-based script will
    pack more information into each character than an alphabetic one.

    I can provide a copy of Thomson's paper if Tang or anyone else is
    interested.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Thu Mar 06 2003 - 03:55:30 EST