Re: "Phonetic grouping" in UniHan

From: Thomas Chan (thomas@datexx.com)
Date: Mon Feb 04 2002 - 12:31:59 EST


On Mon, 4 Feb 2002, Marco Cimarosti wrote:

> I also take the occasion to suggest a new field that could be very useful:
> the frequency of usage of each character. This information may be derived
> from good on-line sources. E.g., for Chinese, from Chi-Ho Tsai's research
> (http://www.geocities.com/hao510/charfreq/) and, for Japanese, from the
> KanjiDic database, (http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html).
> (I don't know the licensing terms for using these data.)

I think that whatever frequency data are included, the particulars of how
they were arrived at (or a pointer to where that information can be found)
should be included too; e.g., Tsai's findings were based on 1993-1994 Big5
Usenet postings.

There's also frequency data buried in the kFenn field (as yet
unpopulated), where a letter A, B, C, D, E, F, G, H, I, or K ("J" is
omitted) indicates whether the character falls in the first, second,
third, etc. group of five hundred characters, based on "earliness of
occurrence in the textbooks of 1926". (A P code is also used, for
something that is not quite clear to me from the explanation in the
dictionary alone; I presume it might refer to characters in the
dictionary that were not in the 1926 study.)
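
A minimal sketch, in Python, of the letter-to-group mapping described
above. The names are hypothetical, and it assumes only what the
description says: ten letters A through K with "J" skipped, each covering
a block of five hundred characters. The P code is deliberately left
unmapped, since its meaning is unclear.

```python
# Group letters in order; "J" is skipped, as in the description above.
FENN_GROUP_LETTERS = "ABCDEFGHIK"
GROUP_SIZE = 500


def fenn_rank_range(letter):
    """Return the (first, last) rank covered by a group letter.

    Returns None for anything that is not one of the ten group letters,
    including "P", whose meaning is not settled here.
    """
    if len(letter) != 1:
        return None
    index = FENN_GROUP_LETTERS.find(letter)
    if index < 0:
        return None
    first = index * GROUP_SIZE + 1
    return (first, first + GROUP_SIZE - 1)
```

Under these assumptions, "A" covers ranks 1 to 500 and "K" covers ranks
4501 to 5000, so the scheme only distinguishes the first five thousand
characters.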

P.S. Recently you asked about estimates of usage of Plane 2
characters. Since a large percentage are CNS 11643-1992 characters (and
perhaps the oldest IT source), that may provide a clue. In the
"Concluding Remarks" section of "Taming the Masses"[1], Christian Wittern
notes that the higher CNS planes (ignore 1 and 2, which are in the BMP,
and perhaps some parts of 3) are rarely used in historic texts, and he
expects even lower usage in modern texts.

[1] http://www.gwdg.de/~cwitter/cw/taming.html

Thomas Chan
tc31@cornell.edu


