From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 13 2003 - 13:40:35 EST
Edward H Trager wrote:
> [...]
> If I were going to write such an algorithm, I would:
>
> * First, ensure that the incoming text stream to be classified was
> sufficiently long to be probabilistically classifiable. In other
> words, what's the shortest stream of Hanzi characters needed, on
> average, in a typical Chinese text (on the web, for example) in
> order to encounter at least one "ge" u+500B or u+4E2A? One "wei"
> u+70BA or u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take
> long to figure this out.
Lucky man! I was discussing a similar subject just yesterday, and
someone came up with this link:
http://lingua.mtsu.edu/chinese-computing/statistics/
The figures in file <total.html> make it easy to answer your question: in a
typical text, the character for "ge" accounts for 3.54%, "wei" for 1.96%,
"shuo" for 2.58%, etc.
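
As a rough sketch of how such figures translate into a minimum text length:
if a character makes up a fraction p of typical text, and we assume (as a
simplification) that occurrences are independent, the chance of seeing it at
least once in n characters is 1 - (1 - p)^n. Solving for n gives the shortest
stream that reaches a chosen confidence. The function name min_length and the
95% confidence level below are illustrative assumptions, not something taken
from the statistics page.

```python
import math

# Frequencies quoted above, as fractions of running text.
FREQ = {"ge": 0.0354, "wei": 0.0196, "shuo": 0.0258}

def min_length(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p)**n >= confidence.

    Assumes each character position is an independent draw with
    probability p, which real text only approximates.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

if __name__ == "__main__":
    for name, p in FREQ.items():
        print(name, min_length(p))
```

The rarest of the marker characters sets the bound, so a classifier that
wants to see all three would need the largest of these lengths.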
_ Marco