From: Edward H Trager (ehtrager@umich.edu)
Date: Thu Feb 13 2003 - 11:06:22 EST
Hi, Paul,
On Thu, 13 Feb 2003, Zhang Weiwu wrote:
> ----- Original Message -----
> From: "Paul Hastings" <paul@tei.or.th>
> To: "Zhang Weiwu" <weiwuzhang@hotmail.com>
> Sent: Thursday, February 13, 2003 9:16 PM
> Subject: Re: traditional vs simplified chinese
>
> > >meaning "for" (wei in Mandarin pinyin) is the most significant recognizable
> > >one.
>
> Take it easy, if you find one 500B (the measure word) it is usually
> enough to say it is traditional Chinese, one 4E2A (measure word) is in
> simplified Chinese. They never happen together in a logically correct
> document.
So I think Zhang Weiwu is suggesting a heuristic algorithm for
classifying Unicode text that is already known, or assumed, to be Chinese
as either traditional or simplified.
If I were going to write such an algorithm, I would:
* First, ensure that the incoming text stream to be classified is
sufficiently long to be probabilistically classifiable. In other
words, what's the shortest stream of Hanzi characters needed, on
average, in a typical Chinese text (on the web, for example) in order
to encounter at least one "ge" u+500B or u+4E2A? One "wei" u+70BA or
u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take long to figure
this out.
* Secondly, as I imply above, I would test for the occurrences of
multiple common characters like "ge" u+500B, "wei" u+70BA, "shuo" u+8AAC.
Again, if I were doing this, I would want to know, statistically,
which characters that differ between the two forms are really the most
common. Maybe the top 10 such characters would be sufficient.
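
The two steps above can be sketched roughly as follows. This is only a
minimal illustration in Python, not a tested classifier: the marker table
holds just the three pairs named in this thread (with the code points as
quoted above), and the names MARKER_PAIRS, classify, min_hits, and
chars_until_first_marker are hypothetical. A real table would come from
actual character frequency statistics.

```python
# Hypothetical marker table: (traditional, simplified) pairs, using the
# code points quoted in this thread. A real table would be built from
# measured character frequencies.
MARKER_PAIRS = [
    ("\u500b", "\u4e2a"),  # "ge", the measure word
    ("\u70ba", "\u4e3a"),  # "wei"
    ("\u8aac", "\u8bf4"),  # "shuo"
]


def classify(text, min_hits=1):
    """Guess whether Chinese text is traditional or simplified.

    Returns "traditional", "simplified", "mixed" (equal evidence for
    both), or "unknown" (fewer than min_hits marker characters seen).
    """
    trad = sum(text.count(t) for t, _ in MARKER_PAIRS)
    simp = sum(text.count(s) for _, s in MARKER_PAIRS)
    total = trad + simp
    if total == 0 or total < min_hits:
        return "unknown"
    if trad > simp:
        return "traditional"
    if simp > trad:
        return "simplified"
    return "mixed"


def chars_until_first_marker(text):
    """Return how many characters are read before (and including) the
    first marker character, or None if the text contains no marker.
    Averaging this over a corpus gives a feel for the minimum useful
    text length."""
    markers = {c for pair in MARKER_PAIRS for c in pair}
    for position, char in enumerate(text, start=1):
        if char in markers:
            return position
    return None
```

Raising min_hits is one way to demand more evidence before committing to
an answer on short texts.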
In practice, such an algorithm would probably work very well. But, as
Marco Cimarosti has questioned, why do you need to classify text as being
simplified or traditional?
One reason I could think of doing that would be as a convenience for
visitors to a web site whose source documents were a mix of traditional
and simplified Chinese. Take, for example, a site that provided links to
news from Mainland, Taiwan, HK, etc. So, a visitor could choose whether
he wanted to see the site in traditional or simplified characters. It
wouldn't matter whether the source documents were in simplified or
traditional characters. The classification algorithm would classify a
document on the fly before display.
Based on the classification, a "conversion" algorithm would swap the set
of most common characters that are visually different between jianti
(simplified) and fanti (traditional) zi using a simple lookup table. I
don't remember how big this set of characters is. It wouldn't have to be
complete. And I would intentionally avoid the "problematic" characters --
i.e., those simplified characters that can map back to several different
traditional characters having different meanings. Converting just the
most common, non-problematic characters between simplified and traditional
would already be sufficient for fluent readers to guess, decipher, or
recall from the depths of their memories those few unconverted characters
with which they may be unfamiliar.
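
A lookup-table swap of this kind might look like the sketch below. Again
this is only an assumption-laden illustration in Python: the table holds
just the three pairs named in this thread, and the names TRAD_TO_SIMP,
SIMP_TO_TRAD, and convert are hypothetical. Characters not in the table
pass through unchanged, which is the intended "incomplete but good enough"
behaviour.

```python
# Illustrative table only: traditional -> simplified for the three pairs
# named in this thread. Ambiguous characters are deliberately left out,
# e.g. simplified U+53D1, which can map back to more than one traditional
# character with different meanings.
TRAD_TO_SIMP = {
    "\u500b": "\u4e2a",  # "ge"
    "\u70ba": "\u4e3a",  # "wei"
    "\u8aac": "\u8bf4",  # "shuo"
}

# The reverse table is safe to derive only because every pair above is
# one-to-one.
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}


def convert(text, target):
    """Swap the listed characters so the text leans toward `target`,
    which is "simplified" or "traditional". Characters that are not in
    the table are left as they are."""
    if target == "simplified":
        table = TRAD_TO_SIMP
    elif target == "traditional":
        table = SIMP_TO_TRAD
    else:
        raise ValueError("target must be 'simplified' or 'traditional'")
    return text.translate(str.maketrans(table))
```

Because the swap is character-for-character, the same function serves both
directions, and a page could call it with whichever target the reader
picks.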
So, basically all you would be doing is providing a convenience for your
readers, making it easier on their eyes to read your web documents in
either traditional or simplified according to their preference. I know
that something like that would help me -- sometimes I forget the
traditional version of a character, and sometimes I forget the simplified
version. It would be very cool if I could just press a button on a web
site to switch the display between the two ;-) .