Re: "Phonetic grouping" in UniHan

From: Thomas Chan
Date: Mon Feb 04 2002 - 12:31:59 EST

On Mon, 4 Feb 2002, Marco Cimarosti wrote:

> I also take the occasion to suggest a new field that could be very useful:
> the frequency of usage of each character. This information may be derived
> from good on-line sources. E.g., for Chinese, from Chi-Ho Tsai's research
> and, for Japanese, from the KanjiDic database.
> (I don't know the licensing terms for using these data.)

I think that whatever frequency data are included, the particulars of how
they were arrived at (or where to find such information) should be
included too; e.g., Tsai's findings were based on 1993-1994 Big5 Usenet
postings.

There's also frequency data buried under the kFenn field (as yet
unpopulated), where the letters A, B, C, D, E, F, G, H, I, K ("J" is
omitted) indicate whether a character falls in the first, second, third,
etc. group of five hundred characters, based on "earliness of occurrence
in the textbooks of 1926". (The P code is also used for something that is
not quite clear to me from the explanation in the dictionary alone; I
presume it might refer to characters in the dictionary that were not in
the 1926 study.)
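
To make the letter scheme concrete, here is a rough Python sketch of how a
letter could be turned into a rank range, assuming each letter covers exactly
five hundred characters in the order given above. The names FENN_LETTERS,
GROUP_SIZE and fenn_group_range are made up for illustration, and the sketch
takes only the bare letter; it assumes nothing about how the letter is stored
in an actual kFenn value. P is treated as "no rank", since its meaning is
unclear.

```python
# Hypothetical helper: the names and the "None for P" choice are assumptions.
FENN_LETTERS = "ABCDEFGHIK"  # ten groups; "J" is omitted
GROUP_SIZE = 500             # each letter covers five hundred characters


def fenn_group_range(letter):
    """Return (first_rank, last_rank) for a frequency letter, else None."""
    if len(letter) != 1:
        return None
    index = FENN_LETTERS.find(letter)
    if index < 0:
        # Covers "J" (never used) and "P" (meaning unclear), among others.
        return None
    first = index * GROUP_SIZE + 1
    return (first, first + GROUP_SIZE - 1)
```

With this reading, fenn_group_range("A") gives the first five hundred
characters and fenn_group_range("K") the tenth group, while "J" and "P" yield
None.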

P.S. Recently you asked about estimates of usage of Plane 2 characters.
Since a large percentage of them are CNS 11643-1992 characters (and that
is perhaps the oldest IT source), that may provide a clue. In the
"Concluding Remarks" section of "Taming the Masses"[1], Christian Wittern
notes that the higher CNS planes (ignore 1 and 2, which are in the BMP,
and perhaps some parts of 3) are rarely used in historic texts, and he
expects even lower usage in modern texts.


Thomas Chan

This archive was generated by hypermail 2.1.2 : Mon Feb 04 2002 - 11:39:22 EST