TC/SC mapping

From: DougEwell2@cs.com
Date: Wed Jan 23 2002 - 02:06:54 EST

Previous message: John Cowan: "Re: Problems with viewing Hindi Unicode Page"
Next in thread: Marco Cimarosti: "RE: TC/SC mapping"
Reply: Marco Cimarosti: "RE: TC/SC mapping"
Reply: Thomas Chan: "RE: TC/SC mapping"
Reply: John H. Jenkins: "Re: TC/SC mapping"
Reply: Kenneth Whistler: "Re: TC/SC mapping"
Reply: DougEwell2@cs.com: "Re: TC/SC mapping"
Reply: Marco Cimarosti: "RE: TC/SC mapping"
Reply: John H. Jenkins: "Re: TC/SC mapping"

I am trying to improve my understanding of the relationship between
Traditional Chinese and Simplified Chinese, and the issues involved in
mapping between them, because of a persistent debate on the Internationalized
Domain Name (IDN) mailing list.

As an experiment, I tried to build a simple 1-to-1 table based on the
"kSimplifiedVariant" and "kTraditionalVariant" fields in the Unihan database.
In the process, I found some unusual entries that I hope one of the Unihan
experts on this list can explain for me.
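A minimal sketch of how such a table might be built is shown here. It assumes each relevant line has the form "code point, field name, one or more targets" separated by whitespace (the real Unihan.txt uses tabs, which split() also handles). The function name load_variants and the sample data layout are illustrative only; the sample lines are copied from the excerpts quoted later in this message.

```python
from collections import defaultdict

# Sample lines copied from the excerpts quoted in this message.
SAMPLE = """\
U+4E48 kSimplifiedVariant U+9EBD
U+4E48 kTraditionalVariant U+9EBD
U+4F59 kSimplifiedVariant U+9980
U+4F59 kTraditionalVariant U+9918
U+836F kTraditionalVariant U+846F U+85E5
"""

FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")

def load_variants(lines):
    """Return {field: {source: [targets...]}} for the two variant fields."""
    tables = {field: defaultdict(list) for field in FIELDS}
    for line in lines:
        parts = line.split()
        # Skip blank lines, comments, and every field other than the two of interest.
        if len(parts) < 3 or parts[0].startswith("#") or parts[1] not in FIELDS:
            continue
        source, field, targets = parts[0], parts[1], parts[2:]
        tables[field][source].extend(targets)
    return tables

tables = load_variants(SAMPLE.splitlines())
for field in FIELDS:
    for source, targets in sorted(tables[field].items()):
        print(field, source, "->", " ".join(targets))
```

Keeping a list of targets per source, rather than a single value, avoids silently discarding data if an entry turns out to name more than one character.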

Grepping through the Unicode 3.1 version of Unihan.txt, I discovered the
following:

    U+4E48 kSimplifiedVariant U+9EBD
    U+4E48 kTraditionalVariant U+9EBD
    ...
    U+540E kSimplifiedVariant U+5F8C
    U+540E kTraditionalVariant U+5F8C
    ...
    U+5F8C kSimplifiedVariant U+540E
    U+5F8C kTraditionalVariant U+540E
    ...
    U+9EBD kSimplifiedVariant U+4E48
    U+9EBD kTraditionalVariant U+4E48

This means that U+4E48 and U+9EBD are both simplified *and* traditional
variants of each other, and U+540E and U+5F8C are both simplified *and*
traditional variants of each other! Can this be true?
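One way to find such entries mechanically, sketched under the assumption that the data has already been reduced to (source, field, target) triples, is to intersect the two sets of pairs. The function name doubly_linked is hypothetical; the triples are the eight entries quoted above.

```python
# Entries quoted above, as (source, field, target) triples.
ENTRIES = [
    ("U+4E48", "kSimplifiedVariant", "U+9EBD"),
    ("U+4E48", "kTraditionalVariant", "U+9EBD"),
    ("U+540E", "kSimplifiedVariant", "U+5F8C"),
    ("U+540E", "kTraditionalVariant", "U+5F8C"),
    ("U+5F8C", "kSimplifiedVariant", "U+540E"),
    ("U+5F8C", "kTraditionalVariant", "U+540E"),
    ("U+9EBD", "kSimplifiedVariant", "U+4E48"),
    ("U+9EBD", "kTraditionalVariant", "U+4E48"),
]

def doubly_linked(entries):
    """Return (source, target) pairs that appear under both variant fields."""
    simplified = {(s, t) for s, f, t in entries if f == "kSimplifiedVariant"}
    traditional = {(s, t) for s, f, t in entries if f == "kTraditionalVariant"}
    return sorted(simplified & traditional)

for source, target in doubly_linked(ENTRIES):
    print(f"{source} lists {target} as both its simplified and traditional variant")
```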

I also noticed:

    U+4F59 kSimplifiedVariant U+9980
    U+4F59 kTraditionalVariant U+9918
    ...
    U+9918 kSimplifiedVariant U+4F59
    ...
    U+9980 kTraditionalVariant U+4F59

which seems strange. If the simplified variant of U+4F59 is U+9980, and the
traditional variant of U+4F59 is U+9918, then what is U+4F59?
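Cases like this one can be flagged with a short check, sketched here on the four entries quoted above. The dictionaries and the function name two_way_sources are illustrative, and the sketch assumes a single target per entry.

```python
# Entries quoted above, as {source: target} for each field.
SIMPLIFIED = {"U+4F59": "U+9980", "U+9918": "U+4F59"}
TRADITIONAL = {"U+4F59": "U+9918", "U+9980": "U+4F59"}

def two_way_sources(simplified, traditional):
    """Return code points that have both a simplified and a traditional
    variant, where the two targets differ."""
    return sorted(
        cp for cp in simplified.keys() & traditional.keys()
        if simplified[cp] != traditional[cp]
    )

for cp in two_way_sources(SIMPLIFIED, TRADITIONAL):
    print(cp, "simplified ->", SIMPLIFIED[cp], "| traditional ->", TRADITIONAL[cp])
```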

In the Unicode 3.2 (beta) Unihan file, there is a new twist: characters whose
traditional equivalent is given as TWO characters:

    U+836F kTraditionalVariant U+846F U+85E5
    ...
    U+8721 kTraditionalVariant U+8721 U+881F

The existence of these two entries does seem to lend some weight to the
argument that TC/SC equivalence is not a simple 1-to-1 operation like Latin
case mapping, which some are claiming it is. The second example is
particularly interesting (A -> A B).
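To make the contrast with a 1-to-1 table concrete, the following sketch classifies the two entries quoted above. The labels and the function name classify are hypothetical and are not part of Unihan.

```python
# kTraditionalVariant entries quoted above from the Unicode 3.2 beta file.
TRADITIONAL = {
    "U+836F": ["U+846F", "U+85E5"],
    "U+8721": ["U+8721", "U+881F"],
}

def classify(source, targets):
    """Describe why an entry does or does not fit a simple 1-to-1 table."""
    notes = []
    if len(targets) > 1:
        notes.append("one-to-many")
    if source in targets:
        notes.append("includes itself")
    return notes or ["one-to-one"]

for source, targets in TRADITIONAL.items():
    print(source, "->", " ".join(targets), ":", ", ".join(classify(source, targets)))
```

A table that stores only one target per source would have to drop one of the listed characters in both of these entries.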

Thanks in advance,

-Doug Ewell
Fullerton, California

This archive was generated by hypermail 2.1.2 : Wed Jan 23 2002 - 01:51:51 EST