Re: Erratum in Unicode book

From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Mon Jul 09 2001 - 21:33:23 EDT


Thomas Chan wrote:
>
> On Mon, 9 Jul 2001, Richard Cook wrote:
>
> > On a related note, I have 9000 word/char frequencies from Hanyu Pinlu
> > Cidian (a mainland text; I typed the entries in back in the early 90's,
> > and this is the freq data currently used in Wenlin). I'd be happy to
> > give the Consortium access to this data for the purpose of sorting
> > characters with identical rad/str numbers by frequency.
>
> Wouldn't that bias sorting according to Chinese language usage
> frequencies? e.g., \u7684, \u4f60, \u5403 are very common in Chinese, but
> rare or obscure in Japanese. Subsorting by pronunciation would also be
> language-dependent.

I did think about the language dependency of frequency data, but I'm not
sure how much it matters. The Kangxi radical system itself is biased
toward Chinese usage, albeit a widespread one.

But it might be interesting to get frequency lists for representative
CJKV usages, and average them for index sorting. How'd that be for
unbiased? John? Want to start collecting that data? :-)
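
To make the averaging idea concrete, here is a rough sketch in Python. It
assumes one relative-frequency table per language and a radical/residual-stroke
lookup; the names FREQ, RAD_STR, average_frequency and index_order are made up
for illustration, and every number in the tables is invented (this is not the
Hanyu Pinlu Cidian data). A character missing from a language's table is
counted as frequency zero, which is itself a design choice.

```python
from statistics import mean

# Hypothetical relative frequencies per language. All numbers are invented.
FREQ = {
    "zh": {"\u5403": 0.0009, "\u540c": 0.0020, "\u540d": 0.0015},
    "ja": {"\u5403": 0.0000, "\u540c": 0.0018, "\u540d": 0.0025},
    "ko": {"\u540c": 0.0010, "\u540d": 0.0012},
}

# Illustrative (radical number, residual strokes) values; check them against
# a real radical/stroke source before relying on them.
RAD_STR = {
    "\u5403": (30, 3),
    "\u540c": (30, 3),
    "\u540d": (30, 3),
    "\u4f60": (9, 5),
}


def average_frequency(ch, tables=FREQ):
    """Unweighted mean over all languages; a missing entry counts as 0."""
    return mean(table.get(ch, 0.0) for table in tables.values())


def sort_key(ch):
    """Radical, then residual strokes, then higher average frequency first."""
    radical, strokes = RAD_STR[ch]
    # The code point is a last-resort tie-break so the order is deterministic.
    return (radical, strokes, -average_frequency(ch), ord(ch))


def index_order(chars):
    return sorted(chars, key=sort_key)


order = index_order(["\u5403", "\u540c", "\u540d", "\u4f60"])
print(" ".join(f"U+{ord(c):04X}" for c in order))
```

An unweighted mean treats every language equally regardless of corpus size;
weighting by corpus size or by number of users would be a different (and
differently biased) choice.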
>
> For a language-neutral method of sorting characters with otherwise the
> same radical and # of residual strokes, how about the method used in the
> _Hanyu Da Zidian_ (and some other dictionaries) of sorting by the type of
> stroke of the first stroke, second stroke, etc., by whether it is one of
> the five basic types of strokes as exemplified in the first five Kangxi
> radicals? This requires such data be available for all 70,000+
> characters, though...

This is a good idea too.
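
A rough sketch of that stroke-type subsort, again in Python. It assumes each
character comes with its stroke sequence encoded as one letter per stroke
(H horizontal, S vertical, P left-falling, D dot, Z turning). The ranking of
the five types used here is an assumption and is easy to swap; the names
STROKE_RANK, STROKES and stroke_key are hypothetical, and the three sample
sequences are for illustration and should be checked against a real source.

```python
# Assumed ranking of the five basic stroke types; change it to match
# whichever dictionary's convention is wanted.
STROKE_RANK = {"H": 0, "S": 1, "P": 2, "D": 3, "Z": 4}

# Illustrative stroke-type sequences, one letter per stroke.
STROKES = {
    "\u5403": "SZHPHZ",
    "\u540c": "SZHSZH",
    "\u540d": "PZDSZH",
}


def stroke_key(ch):
    """Compare first stroke, then second stroke, and so on."""
    ranks = tuple(STROKE_RANK[s] for s in STROKES[ch])
    # Code point breaks ties between identical stroke sequences.
    return (ranks, ord(ch))


def subsort(chars):
    """Order characters that already share a radical and stroke count."""
    return sorted(chars, key=stroke_key)


order = subsort(["\u540d", "\u5403", "\u540c"])
print(" ".join(f"U+{ord(c):04X}" for c in order))
```

The sort itself is trivial; as Thomas says, the hard part is having the
stroke-sequence data for every character.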


