From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 13 2003 - 09:18:51 EST
Paul Hastings wrote:
> i suppose this is a really simple minded question but is
> there any way of telling if an incoming chunk of text
> (say from a browser form) is traditional or simplified
> chinese?
Please note that the classification you want is not always meaningful.
E.g., what if the incoming text is in Spanish? Would you classify it as
traditional or simplified Chinese?...
Anyway. You can obtain the base data for each Chinese character from the
file http://www.unicode.org/Public/UNIDATA/Unihan.txt, by checking for the
presence of the fields <kSimplifiedVariant> and <kTraditionalVariant>.
Any Unicode character falls into one of these four categories:
0) All characters not listed in Unihan.txt (i.e., non-Chinese
characters) are *neither* "Traditional" nor "Simplified";
1) All characters having <kSimplifiedVariant> but *no*
<kTraditionalVariant> are "Traditional";
2) All characters having <kTraditionalVariant> but *no*
<kSimplifiedVariant> are "Simplified";
3) All other characters listed in Unihan.txt are *both*
"Traditional" and "Simplified".
From these character-level categories, you can assign a category to the
input stream:
If at least one character has category 1 AND at least one character
has category 2, then:
    stream is both "Traditional" and "Simplified" (category 3);
Else, if at least one character has category 1, then:
    stream is "Traditional" (category 1);
Else, if at least one character has category 2, then:
    stream is "Simplified" (category 2);
Else, if at least one character has category 3, then:
    stream is both "Traditional" and "Simplified" (category 3 again);
Else (all characters have category 0):
    stream is neither "Traditional" nor "Simplified" (category 0);
End.
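The stream-level rules above could be sketched like this in Python. It
assumes the per-character categories (0 to 3) have already been computed
somehow; the name stream_category is just my own choice:

```python
from typing import Iterable


def stream_category(char_categories: Iterable[int]) -> int:
    """Combine per-character categories (0..3) into one stream category."""
    seen = set(char_categories)
    if 1 in seen and 2 in seen:
        return 3  # mixed: both "Traditional" and "Simplified"
    if 1 in seen:
        return 1  # "Traditional"
    if 2 in seen:
        return 2  # "Simplified"
    if 3 in seen:
        return 3  # only shared characters: both
    return 0  # no Chinese characters at all: neither
```

For instance, stream_category([0, 3, 1]) gives 1, while
stream_category([1, 2]) gives 3 and an empty stream gives 0.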
Anyway, I don't see how this information could be of any use for any
purpose...
_ Marco