RE: traditional vs simplified chinese

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 13 2003 - 13:40:35 EST

Next message: Marco Cimarosti: "RE: Indic Vowel/Consonant combinations"

Previous message: Tom Gewecke: "Re: newbie: unicode (when used as a coding) = UTF16LE?"
Maybe in reply to: Paul Hastings: "traditional vs simplified chinese"
Next in thread: Rick Cameron: "RE: traditional vs simplified chinese"

Edward H Trager wrote:
> [...]
> If I were going to write such an algorithm, I would:
>
> * First, ensure that the incoming text stream to be classified was
> sufficiently long to be probabilistically classifiable. In other
> words, what's the shortest stream of Hanzi characters needed, on
> average, in a typical Chinese text (on the web, for example) in
> order to encounter at least one "ge" U+500B or U+4E2A? One "wei"
> U+70BA or U+4E3A? One "shuo" U+8AAC or U+8BF4? It wouldn't take
> long to figure this out.

Lucky man! I was discussing a similar subject just yesterday, and
someone came up with this link:

http://lingua.mtsu.edu/chinese-computing/statistics/

The figures in file <total.html> make it easy to answer your question: in a
typical text, "ge" accounts for 3.54% of characters, "wei" for 1.96%, "shuo"
for 2.58%, etc.
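
To turn those percentages into a stream length, here is a rough sketch in
Python. It takes the three figures quoted above at face value and assumes,
as a simplification of mine, that each character of the text is an
independent draw with a fixed probability p. Under that assumption the
average wait for the first occurrence is 1/p characters, and the shortest
length n that gives at least one occurrence with a chosen confidence c is
the smallest n with 1 - (1 - p)^n >= c. The names FREQ, expected_wait and
min_length are made up for this illustration, and the 95% confidence level
is an arbitrary choice.

```python
import math

# Frequencies quoted in the message, as fractions rather than percent.
# The pinyin keys are only labels; they are not tied to a particular
# traditional or simplified code point here.
FREQ = {"ge": 0.0354, "wei": 0.0196, "shuo": 0.0258}


def expected_wait(p: float) -> float:
    """Average number of characters read until the first occurrence,
    ASSUMING independent draws with probability p (geometric model)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def min_length(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p)**n >= confidence,
    under the same independence ASSUMPTION."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for name, p in FREQ.items():
        print(f"{name}: average wait {expected_wait(p):.1f} chars, "
              f"{min_length(p)} chars for 95% confidence")

    # If the three are distinct characters, a single position holds at
    # most one of them, so the chance of seeing any of them is the sum.
    p_any = sum(FREQ.values())
    print(f"any of the three: {min_length(p_any)} chars for 95% confidence")
```

Real text is not a sequence of independent draws (characters cluster by
topic and register), so these lengths are only a first estimate; a real
classifier would want to measure the figure on sample documents.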

_ Marco


This archive was generated by hypermail 2.1.5 : Thu Feb 13 2003 - 14:37:28 EST