I am trying to improve my understanding of the relationship between
Traditional Chinese and Simplified Chinese, and the issues involved in
mapping between them, because of a persistent debate on the Internationalized
Domain Name (IDN) mailing list.
As an experiment, I tried to build a simple 1-to-1 table based on the
"kSimplifiedVariant" and "kTraditionalVariant" fields in the Unihan database.
In the process, I found some unusual entries that I hope one of the Unihan
experts on this list can explain for me.
Grepping through the Unicode 3.1 version of Unihan.txt, I discovered the
following:
U+4E48 kSimplifiedVariant U+9EBD
U+4E48 kTraditionalVariant U+9EBD
...
U+540E kSimplifiedVariant U+5F8C
U+540E kTraditionalVariant U+5F8C
...
U+5F8C kSimplifiedVariant U+540E
U+5F8C kTraditionalVariant U+540E
...
U+9EBD kSimplifiedVariant U+4E48
U+9EBD kTraditionalVariant U+4E48
This means that U+4E48 and U+9EBD are both simplified *and* traditional
variants of each other, and U+540E and U+5F8C are both simplified *and*
traditional variants of each other! Can this be true?
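For anyone who wants to reproduce this kind of check, here is a minimal Python sketch. It assumes each Unihan.txt record is a code point, a field name, and one or more values separated by whitespace, and that comment lines start with "#". The function names parse_variants and same_both_ways are hypothetical, and the sample data is just the entries quoted in this message rather than the full file.

```python
from collections import defaultdict

FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")

def parse_variants(lines):
    """Collect the simplified/traditional variant values for each code point."""
    table = defaultdict(dict)
    for line in lines:
        parts = line.split()
        # Skip blank lines, comments, and records with no value.
        if len(parts) < 3 or parts[0].startswith("#"):
            continue
        code, field, values = parts[0], parts[1], parts[2:]
        if field in FIELDS:
            table[code][field] = values
    return table

def same_both_ways(table):
    """Code points whose simplified and traditional variants are identical."""
    return sorted(
        code
        for code, fields in table.items()
        if "kSimplifiedVariant" in fields
        and fields["kSimplifiedVariant"] == fields.get("kTraditionalVariant")
    )

# Sample records copied from the entries quoted in this message.
sample = """\
U+4E48 kSimplifiedVariant U+9EBD
U+4E48 kTraditionalVariant U+9EBD
U+4F59 kSimplifiedVariant U+9980
U+4F59 kTraditionalVariant U+9918
U+540E kSimplifiedVariant U+5F8C
U+540E kTraditionalVariant U+5F8C
U+5F8C kSimplifiedVariant U+540E
U+5F8C kTraditionalVariant U+540E
U+9918 kSimplifiedVariant U+4F59
U+9980 kTraditionalVariant U+4F59
U+9EBD kSimplifiedVariant U+4E48
U+9EBD kTraditionalVariant U+4E48
""".splitlines()

suspects = same_both_ways(parse_variants(sample))
print(suspects)
```

On the sample above this reports the four code points in question and leaves out U+4F59, whose two variant fields point at different characters.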
I also noticed:
U+4F59 kSimplifiedVariant U+9980
U+4F59 kTraditionalVariant U+9918
...
U+9918 kSimplifiedVariant U+4F59
...
U+9980 kTraditionalVariant U+4F59
which seems strange. If the simplified variant of U+4F59 is U+9980, and the
traditional variant of U+4F59 is U+9918, then what is U+4F59?
In the Unicode 3.2 (beta) Unihan file, there is a new twist: characters whose
traditional equivalent is given as TWO characters:
U+836F kTraditionalVariant U+846F U+85E5
...
U+8721 kTraditionalVariant U+8721 U+881F
The existence of these two entries does seem to lend some weight to the
argument that TC/SC equivalence is not a simple 1-to-1 operation like Latin
case mapping, which some are claiming it is. The second example is
particularly interesting (A -> A B).
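Entries like these can be picked out mechanically. The following is only a sketch under the same assumption about the record layout (code point, field name, then one or more whitespace-separated values); the function name multi_valued is hypothetical, and the sample is limited to entries quoted in this message.

```python
def multi_valued(lines, field="kTraditionalVariant"):
    """Return {code point: [values]} for records of `field` with 2+ values."""
    result = {}
    for line in lines:
        parts = line.split()
        # A multi-valued record has the code point, the field, and 2+ values.
        if len(parts) >= 4 and parts[1] == field:
            result[parts[0]] = parts[2:]
    return result

# Sample records copied from the entries quoted in this message.
sample = [
    "U+4F59 kTraditionalVariant U+9918",
    "U+836F kTraditionalVariant U+846F U+85E5",
    "U+8721 kTraditionalVariant U+8721 U+881F",
]

found = multi_valued(sample)
for code, targets in found.items():
    # Flag the "A -> A B" case, where a character lists itself as a variant.
    note = " (includes itself)" if code in targets else ""
    print(f"{code} -> {' '.join(targets)}{note}")
```

Any table that stores a single target per character would have to drop one of the values in each of these records, which is the practical form of the 1-to-1 objection.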
Thanks in advance,
-Doug Ewell
Fullerton, California
This archive was generated by hypermail 2.1.2 : Wed Jan 23 2002 - 01:51:51 EST