From: Yung-Fong Tang (ftang@netscape.com)
Date: Wed Mar 05 2003 - 18:55:18 EST
I remember there was a study showing that, although UTF-8 encodes each
Japanese/Chinese character in 3 bytes, Japanese and Chinese usually use
FEWER characters in writing to communicate the same information than
alphabet-based languages.
Can anyone point me to such research? Martin, do you have a paper
about that?
I would like to find out the average ratio between
English,
German,
French,
Japanese,
Chinese,
Korean
in terms of the number of characters, and in terms of the bytes needed
to encode them in UTF-8.
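To make the two measurements concrete, here is a minimal Python sketch.
It assumes that "characters" means Unicode code points and that the text
is in precomposed (NFC) form; the helper name utf8_stats and the sample
words are only illustrations, not measured data.

```python
def utf8_stats(text):
    """Return (code points, UTF-8 bytes) for a string."""
    chars = len(text)                    # len() of a str counts code points
    nbytes = len(text.encode("utf-8"))   # size after encoding to UTF-8
    return chars, nbytes

# Illustrative words only; a real comparison needs large parallel texts.
for sample in ["hello", "Grüße", "日本語", "한국어"]:
    chars, nbytes = utf8_stats(sample)
    print(sample, chars, nbytes, nbytes / chars)
```

ASCII letters take 1 byte each, accented Latin letters such as those in
German and French take 2, and the common Japanese, Chinese and Korean
characters take 3, so the bytes-per-character figure alone favors the
alphabetic languages. The open question is whether the lower character
count makes up for it.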
If such research has not been done, maybe one way to figure out the
result is to take translated Bibles for these languages from the SWORD
Project, strip out the XML tags to leave pure text, and measure the
size. Since all the Bible translations communicate the same information
and the volume is large enough, that could be a good way to find out
the result. Of course, the markup needs to be taken out to reduce the
noise.
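The measurement described above could be sketched roughly as follows in
Python. This is only an outline under assumptions: it takes each
translation as one well-formed XML string, uses English as the baseline,
and the names plain_text and size_ratios as well as the tiny sample
fragments are made up for illustration. It does not reflect the actual
markup used by the SWORD Project modules.

```python
import xml.etree.ElementTree as ET

def plain_text(xml_source):
    """Drop all tags, keep the text content, collapse runs of whitespace."""
    root = ET.fromstring(xml_source)
    return " ".join("".join(root.itertext()).split())

def size_ratios(samples, baseline="en"):
    """Map language -> (character ratio, UTF-8 byte ratio) vs. the baseline."""
    sizes = {}
    for lang, xml_source in samples.items():
        text = plain_text(xml_source)
        sizes[lang] = (len(text), len(text.encode("utf-8")))
    base_chars, base_bytes = sizes[baseline]
    return {lang: (c / base_chars, b / base_bytes)
            for lang, (c, b) in sizes.items()}

# Hypothetical fragments standing in for full translations.
samples = {
    "en": "<verse n='1'>In the beginning</verse>",
    "ja": "<verse n='1'>初めに</verse>",
}
for lang, (char_ratio, byte_ratio) in size_ratios(samples).items():
    print(lang, char_ratio, byte_ratio)
```

Note that spaces count as characters here, which matters because
English, German and French separate words with spaces while Japanese
and Chinese normally do not. Whether to count them is a choice to make
before comparing the numbers.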