From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu Feb 13 2003 - 13:29:06 EST
On Thu, 13 Feb 2003 09:48:45 -0800 (PST), "Zhang Weiwu" wrote:
> Take it easy, if you find one 500B (the measure word) it is usually enough to
> say it is traditional Chinese, one 4E2A (measure word) is in simplified
> Chinese. They never happen together in a logically correct document.
Marco is absolutely correct that Simplified and Traditional Chinese may
legitimately be found together on the same Web page (and I for one have several
pages where they do).
Just adding my two fen's worth: Traditional/Simplified is an artificial modern
distinction that has been exacerbated by the GB simplified-only coding standards
on the one hand and traditional-only coding standards such as Big5 on the other,
which forced people to use either Simplified or Traditional characters
exclusively. Most simplified characters have in fact been around for centuries,
and if you open the pages of any down-market commercial edition of a Chinese
book printed during the Yuan, Ming or Qing dynasties (the last 700 years or so) you are
likely to find plenty of "simplified" forms mixed up with "traditional" forms.
Certainly, I've seen "traditional" texts which mix U+500B with U+4E2A (and with
U+7B87 for that matter). With Unicode it is now possible to transcribe
traditional texts as they are written, rather than translating them into "traditional"
or "simplified". Take, for example, this Web page --
http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which
transcribes a short one-act play from the Cantonese Opera tradition, published
during the Qing dynasty (probably early 19th century). It has U+4E2A (simplified
ge4) but not U+500B (traditional ge4), and yet is written mostly in
"traditional" characters. How would your algorithm classify such a page?
Also, you should remember that a Chinese page written in Classical Chinese --
and there are plenty of electronic editions of the Classics on the Web -- might
have no instances of the vernacular character ge4 at all.
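To make the objection concrete, here is a minimal Python sketch of the kind of ge4-based test being proposed. It is only an illustration under my own assumptions, not anyone's actual algorithm: the function name classify_by_ge, the constant names and the return labels are all hypothetical.

```python
# Hypothetical sketch of a ge4-presence heuristic; names and labels are
# illustrative assumptions, not taken from any real implementation.
GE_TRADITIONAL = "\u500b"  # U+500B, the "traditional" measure word ge4
GE_SIMPLIFIED = "\u4e2a"   # U+4E2A, the "simplified" measure word ge4
GE_VARIANT = "\u7b87"      # U+7B87, another form of ge4 seen in older texts


def classify_by_ge(text: str) -> str:
    """Guess the script of `text` purely from which ge4 forms occur in it."""
    has_traditional = GE_TRADITIONAL in text
    has_simplified = GE_SIMPLIFIED in text
    if has_traditional and has_simplified:
        return "mixed"        # both forms occur, so the heuristic cannot decide
    if has_traditional:
        return "traditional"
    if has_simplified:
        return "simplified"
    return "unknown"          # no ge4 at all (U+7B87 is deliberately not counted)


# Mostly "traditional" characters, but with U+4E2A: labelled "simplified".
print(classify_by_ge("\u9019\u4e2a\u6232"))
# A Classical Chinese phrase with no ge4 at all: labelled "unknown".
print(classify_by_ge("\u5b78\u800c\u6642\u7fd2\u4e4b"))
# A text that uses both U+500B and U+4E2A: labelled "mixed".
print(classify_by_ge("\u4e00\u500b\u5169\u4e2a"))
```

The first sample is the Qing opera case in miniature: every other character is "traditional", yet the single U+4E2A decides the label. The second and third samples show that a text with no ge4, or with both forms, gives the test nothing to go on.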
Andrew