RE: traditional vs simplified chinese

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 13 2003 - 09:18:51 EST

Next message: Zhang Weiwu: "Re: traditional vs simplified chinese"

Previous message: Tom Gewecke: "Re: newbie: unicode (when used as a coding) = UTF16LE?"
Maybe in reply to: Paul Hastings: "traditional vs simplified chinese"
Next in thread: John H. Jenkins: "Re: traditional vs simplified chinese"
Reply: John H. Jenkins: "Re: traditional vs simplified chinese"
Reply: Paul Hastings: "Re: traditional vs simplified chinese"

Paul Hastings wrote:
> i suppose this is a really simple minded question but is
> there any way of telling if an incoming chunk of text
> (say from a browser form) is traditional or simplified
> chinese?

Please note that the classification you want is not always meaningful.
E.g., what if the incoming text is in Spanish? Would you classify it as
traditional or simplified Chinese?...

Anyway. You can obtain the base data for each Chinese character from the
file http://www.unicode.org/Public/UNIDATA/Unihan.txt, by checking the
existence of fields <kSimplifiedVariant> and <kTraditionalVariant>.
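
As a rough sketch of that lookup, the Python below collects the code points that carry each field. It assumes the usual Unihan layout of tab-separated lines (U+XXXX, field name, value) with "#" comment lines. The function name load_variant_fields and the three-line sample are made up for illustration; the sample is hand-written in the same layout and is not the real file. If the data you download keeps the variant fields in a separate file, the "listed" set would have to be built from all the files and not just that one.

```python
import io


def load_variant_fields(lines):
    """Collect code points by variant field from Unihan-style lines.

    Each data line is assumed to look like: U+XXXX <TAB> kField <TAB> value
    """
    listed = set()           # every code point that appears in the data
    has_simplified = set()   # code points with a kSimplifiedVariant field
    has_traditional = set()  # code points with a kTraditionalVariant field
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) < 3 or not parts[0].startswith("U+"):
            continue  # skip anything that is not a data line
        code_point = int(parts[0][2:], 16)
        listed.add(code_point)
        if parts[1] == "kSimplifiedVariant":
            has_simplified.add(code_point)
        elif parts[1] == "kTraditionalVariant":
            has_traditional.add(code_point)
    return listed, has_simplified, has_traditional


# Hand-written sample in the assumed layout (not the real Unihan.txt).
SAMPLE = io.StringIO(
    "# comment line\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+56FD\tkTraditionalVariant\tU+570B\n"
    "U+570B\tkSimplifiedVariant\tU+56FD\n"
)
listed, has_simplified, has_traditional = load_variant_fields(SAMPLE)
```

For the real file you would pass an open file object in place of SAMPLE.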

Any Unicode character falls into one of these four categories:

0) All characters not listed in Unihan.txt (i.e., non-Chinese
characters) are *neither* "Traditional" nor "Simplified";

1) All characters having <kSimplifiedVariant> but *no*
<kTraditionalVariant> are "Traditional";

2) All characters having <kTraditionalVariant> but *no*
<kSimplifiedVariant> are "Simplified";

3) All other characters listed in Unihan.txt are *both*
"Traditional" and "Simplified".

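The four categories above could be coded roughly as follows. This is only a sketch: classify_char is a hypothetical name, and the three sets are assumed to have been built from Unihan.txt beforehand (here they are filled in by hand with three characters so that the example runs on its own).

```python
# Hand-filled sets standing in for data extracted from Unihan.txt (assumption).
LISTED = {0x4E00, 0x56FD, 0x570B}   # code points listed in the data
HAS_SIMPLIFIED = {0x570B}           # carry a kSimplifiedVariant field
HAS_TRADITIONAL = {0x56FD}          # carry a kTraditionalVariant field


def classify_char(ch, listed=LISTED, has_simplified=HAS_SIMPLIFIED,
                  has_traditional=HAS_TRADITIONAL):
    """Return the category number (0 to 3) of a single character."""
    code_point = ord(ch)
    if code_point not in listed:
        return 0  # not in Unihan.txt: neither
    simp = code_point in has_simplified
    trad = code_point in has_traditional
    if simp and not trad:
        return 1  # has a simplified variant only: "Traditional"
    if trad and not simp:
        return 2  # has a traditional variant only: "Simplified"
    return 3      # both fields or neither field: both


categories = [classify_char(c) for c in "\u570b\u56fd\u4e00\u00f1"]
```

With these sample sets, U+570B comes out as category 1, U+56FD as 2, U+4E00 as 3, and the Spanish letter U+00F1 as 0.
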
From these character-level categories, you can assign a category to the
input stream:

If at least one character has category 1 AND at least one character
has category 2, then:

    stream is both "Traditional" and "Simplified" (category 3);

Else, if at least one character has category 1, then:

    stream is "Traditional" (category 1);

Else, if at least one character has category 2, then:

    stream is "Simplified" (category 2);

Else, if at least one character has category 3, then:

    stream is both "Traditional" and "Simplified" (category 3
    again);

Else (all characters have category 0):

    stream is neither "Traditional" nor "Simplified" (category 0);

End.
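
The same decision chain, sketched in Python. The function name classify_stream is made up for this example; it takes the per-character category numbers (0 to 3) as input, however you obtained them, and returns the category of the whole stream.

```python
def classify_stream(categories):
    """Combine per-character categories (0 to 3) into one stream category."""
    seen = set(categories)
    if 1 in seen and 2 in seen:
        return 3  # mixed: both "Traditional" and "Simplified"
    if 1 in seen:
        return 1  # "Traditional"
    if 2 in seen:
        return 2  # "Simplified"
    if 3 in seen:
        return 3  # only shared characters: both
    return 0      # nothing but category 0 (or empty input): neither


mixed = classify_stream([1, 0, 2])
traditional = classify_stream([0, 1, 3])
spanish = classify_stream([0, 0, 0])
```

Note that an empty stream falls through to category 0, which seems the only sensible answer for it.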

Anyway, I don't see how this information could be of any use for any
purpose...

_ Marco

This archive was generated by hypermail 2.1.5 : Thu Feb 13 2003 - 10:03:51 EST