RE: TC/SC mapping

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Wed Jan 23 2002 - 11:05:00 EST

Previous message: Patrick Andries: "Re: [Very-OT] Re: �"
Maybe in reply to: DougEwell2@cs.com: "TC/SC mapping"
Next in thread: John H. Jenkins: "Re: TC/SC mapping"
Reply: John H. Jenkins: "Re: TC/SC mapping"

On Wed, 23 Jan 2002, Marco Cimarosti wrote:
> Doug Ewell wrote:
> > U+4E48 kSimplifiedVariant U+9EBD (1)
> > U+4E48 kTraditionalVariant U+9EBD (2)
> > U+9EBD kSimplifiedVariant U+4E48 (7)
> > U+9EBD kTraditionalVariant U+4E48 (8)
> >
> > This means that U+4E48 and U+9EBD are both simplified *and*
> > traditional variants of each other, and U+540E and U+5F8C
> > are both simplified *and* traditional variants of each
> > other! Can this be true?
>
> As a matter of fact, U+4E48 (么) is the simplified form of U+9EBD (麽). So I
> guess that kSimplifiedVariant field for U+4E48 and the kTraditionalVariant
> field for U+9EBD are mistakes, and should simply be removed.

Yes, (1) and (8) are mistakes. (2) is correct if U+4E48 is being used to
write "me"; otherwise, if it is being used to write yao1 'small', then
its "traditional" form is U+5E7A or U+4E48 (itself). (7) is correct.

In other words,
  yao1 'small' TC U+4E48 or U+5E7A -> SC U+4E48
  me (as in shen2me 'what') TC U+9EBC or U+9EBD -> SC U+4E48
  mo2 (as in yao1mo2 'insignificant') TC U+9EBC or U+9EBD -> SC U+9EBD

This set is rather complex, since in the TC->SC process, you have 1)
reduction of the number of variant TC forms (all three cases), 2)
overloading a character (writing both yao1 and me with the character
formerly used only for yao1), and 3) conditional simplification
(U+9EBC/U+9EBD simplified differently depending on what word they are used
for writing).
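
The conditional case can be sketched as a lookup keyed on both the TC code point and the word being written. This is only an illustration built from the three cases listed above; the table name `TC_TO_SC` and the helpers `simplify` and `fmt` are hypothetical, and a real converter would need some way to determine the word from context, which this sketch does not attempt.

```python
# Hypothetical table: (TC code point, word) -> SC code point.
# Entries are taken only from the three cases listed in this message.
TC_TO_SC = {
    (0x4E48, "yao1"): 0x4E48,
    (0x5E7A, "yao1"): 0x4E48,
    (0x9EBC, "me"): 0x4E48,
    (0x9EBD, "me"): 0x4E48,
    (0x9EBC, "mo2"): 0x9EBD,
    (0x9EBD, "mo2"): 0x9EBD,
}


def simplify(tc: int, word: str) -> int:
    """Return the SC code point for a TC code point used to write `word`.

    Pairs that are not in the table are returned unchanged (an assumption
    made for this sketch, not a rule stated in the message).
    """
    return TC_TO_SC.get((tc, word), tc)


def fmt(cp: int) -> str:
    """Format a code point in the usual U+XXXX notation."""
    return f"U+{cp:04X}"


if __name__ == "__main__":
    for tc, word in TC_TO_SC:
        print(fmt(tc), word, "->", fmt(simplify(tc, word)))
```

The same TC code point (U+9EBC or U+9EBD) yields two different SC results depending on the word, which is exactly what a table keyed on the code point alone cannot express.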

> > U+540E kSimplifiedVariant U+5F8C (3)
> > U+540E kTraditionalVariant U+5F8C (4)
> > ...
> > U+5F8C kSimplifiedVariant U+540E (5)
> > U+5F8C kTraditionalVariant U+540E (6)

(3) appears to be a mistake. (4) is correct if U+540E is being used to
write hou4 'after'; otherwise, if U+540E is being used to write
hou4 'queen', then it should map to itself. (5) is correct: U+5F8C is used
to write hou4 'after'. (6) also appears to be a mistake.

In other words,
  hou4 'queen' TC U+540E -> SC U+540E
  hou4 'after' TC U+5F8C -> SC U+540E

This sort of conflation of multiple TC forms into one overloaded SC form is
common.
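
One consequence of this conflation is that the reverse direction (SC to TC) becomes one-to-many. A minimal sketch, using only the two hou4 cases above; the names `TC_TO_SC`, `invert`, `SC_TO_TC` and `is_ambiguous` are made up for illustration:

```python
from collections import defaultdict

# (TC, SC) pairs from the two hou4 cases above.
TC_TO_SC = [
    (0x540E, 0x540E),  # hou4 'queen'
    (0x5F8C, 0x540E),  # hou4 'after'
]


def invert(pairs):
    """Build a dict mapping each SC code point to its sorted TC candidates."""
    sc_to_tc = defaultdict(set)
    for tc, sc in pairs:
        sc_to_tc[sc].add(tc)
    return {sc: sorted(tcs) for sc, tcs in sc_to_tc.items()}


SC_TO_TC = invert(TC_TO_SC)


def is_ambiguous(sc):
    """True if more than one TC form simplifies to this SC code point."""
    return len(SC_TO_TC.get(sc, [])) > 1
```

Here `SC_TO_TC[0x540E]` holds both U+540E and U+5F8C, so converting SC U+540E back to TC requires knowing which word is meant.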

> > I also noticed:
> > U+4F59 kSimplifiedVariant U+9980 (9)
> > U+4F59 kTraditionalVariant U+9918 (10)
> > ...
> > U+9918 kSimplifiedVariant U+4F59 (11)
> > ...
> > U+9980 kTraditionalVariant U+4F59 (12)
> >
> > which seems strange. If the simplified variant of U+4F59 is
> > U+9980, and the traditional variant of U+4F59 is U+9918,
> > then what is U+4F59?
>
> Perhaps U+4F59 (余) is the *Japanese* simplified form, while U+9980 (馀) is
> the *Chinese* simplified form, both corresponding to the traditional form
> U+9918 (餘).

Let's not involve Japanese simplifications, if possible. (9) appears to
be a mistake. (10) is correct if U+4F59 is being used to write yu2
'surplus'; otherwise, if it is being used to write '(archaic) I', then it
should map to itself. (11) is correct if U+9918 is being used to write
yu2 'surplus'; otherwise, if it is being used to write the surname Yu2,
then it should map to U+9980. (12) is partially correct: what it is
trying to say is that when U+9980 is used atypically to write 'surplus',
conversion to a more "traditional" form would take it to U+4F59 (and
ultimately to U+9918).

In other words,
  yu2 '(archaic) I' TC U+4F59 -> SC U+4F59
  yu2 'surplus' TC U+9918 -> SC U+4F59
  (yu2 'surplus' TC U+9918 -> SC U+9980)
  Yu2 (surname) TC U+9918 -> SC U+9980
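
The pattern Doug noticed in records (1) through (8), where one character is listed as both the simplified and the traditional variant of another, can be screened for mechanically. The sketch below runs over the twelve records quoted in this thread, written as whitespace-separated text (the layout of the actual Unihan file is not assumed here). The name `both_ways` is hypothetical, and a flagged pair only marks a spot worth reviewing by hand; it does not say which of the two records is wrong.

```python
# The twelve records (1)-(12) as quoted in this thread.
RECORDS = """\
U+4E48 kSimplifiedVariant U+9EBD
U+4E48 kTraditionalVariant U+9EBD
U+9EBD kSimplifiedVariant U+4E48
U+9EBD kTraditionalVariant U+4E48
U+540E kSimplifiedVariant U+5F8C
U+540E kTraditionalVariant U+5F8C
U+5F8C kSimplifiedVariant U+540E
U+5F8C kTraditionalVariant U+540E
U+4F59 kSimplifiedVariant U+9980
U+4F59 kTraditionalVariant U+9918
U+9918 kSimplifiedVariant U+4F59
U+9980 kTraditionalVariant U+4F59
"""


def both_ways(text):
    """Return sorted (source, target) pairs where `target` is listed as both
    the simplified and the traditional variant of `source`."""
    seen = {"kSimplifiedVariant": set(), "kTraditionalVariant": set()}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and fields other than the two of interest.
        if len(parts) < 3 or parts[1] not in seen:
            continue
        for target in parts[2:]:
            seen[parts[1]].add((parts[0], target))
    return sorted(seen["kSimplifiedVariant"] & seen["kTraditionalVariant"])


if __name__ == "__main__":
    for source, target in both_ways(RECORDS):
        print(source, "lists", target, "as both simplified and traditional")
```

On these records the check flags the U+4E48/U+9EBD and U+540E/U+5F8C pairs in both directions and leaves the U+4F59 records alone, since there the two fields point at different characters.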

> A very well-known case of such triplets is the verb "to sell": Japanese
> simplified form U+58F2 (売), Chinese simplified form U+5356 (卖), traditional
> form U+8CE3 (賣).
> However, in this case, UniHan seems to express the relationship with the
> Japanese form only through the kZVariant field:
> U+05356 kTraditionalVariant U+08CE3
> U+05356 kZVariant U+08CE3
> ...
> U+058F2 kZVariant U+08CE3
> ...
> U+08CE3 kSimplifiedVariant U+05356
> U+08CE3 kZVariant U+058F2

That is potentially confusing, as Japan has its own set of "traditional"
and "simplified" forms, which are being expressed with kZVariant; however,
at other times, kZVariant expresses fully or semi-interchangeable TC forms
(and I presume possibly fully/semi-interchangeable SC forms or
non-standard SC forms), as well as forms that otherwise would have been
unified had it not been for source separation.

> > In the Unicode 3.2 (beta) UniHan file, there is a new twist:
> > characters whose traditional equivalent is given as TWO
> > characters:
> >
> > U+836F kTraditionalVariant U+846F U+85E5 (13)
>
> Gulp! But the format information included in the file reads:
>
> # kTraditionalVariant
> # The Unicode value for a (Chinese) traditional
> # variant for this character.
>
> So, there should be *a*, that is *one*, traditional variant...

It seems like the kTraditionalVariant field was defined too strictly.
(13) is correct:
  yao4 'medicine' TC U+85E5 -> SC U+836F
  yue4 'Dahurian angelica' TC U+846F -> SC U+836F

The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
mistake--the kTraditionalVariant value should only be U+881F.
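
Since a field can now carry more than one value, any code reading these records has to treat the value as a list. A rough sketch, assuming whitespace-separated records as they are quoted in this thread (not necessarily the exact layout of the file); `parse_variant_line` and `self_references` are hypothetical helper names:

```python
def parse_variant_line(line):
    """Split 'U+XXXX field U+YYYY [U+ZZZZ ...]' into (source, field, targets)."""
    source, field, *targets = line.split()
    return source, field, targets


def self_references(line):
    """Return the targets on this line that equal the source character."""
    source, _field, targets = parse_variant_line(line)
    return [t for t in targets if t == source]


if __name__ == "__main__":
    for record in (
        "U+836F kTraditionalVariant U+846F U+85E5",
        "U+8721 kTraditionalVariant U+8721 U+881F",
    ):
        print(parse_variant_line(record), self_references(record))
```

A self-reference such as the one in the U+8721 record is only a hint that the entry deserves a second look; as the yao1 and hou4 'queen' cases show, a character mapping to itself can be legitimate for a particular word.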

Thomas Chan
tc31@cornell.edu

This archive was generated by hypermail 2.1.2 : Wed Jan 23 2002 - 10:37:15 EST