Re: TC/SC mapping

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Wed Jan 23 2002 - 15:38:49 EST


On Wed, 23 Jan 2002, John H. Jenkins wrote:

> On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote:
> > In other words,
> > yao1 'small' TC U+4E48 or U+5E7A -> SC U+4E48
> > me (as in shen2me 'what') TC U+9EBC or U+9EBD -> SC U+4E48
> > mo2 (as in yao1mo2 'insignificant') TC U+9EBC or U+9EBD -> SC U+9EBD
>
> Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being
> different? The only dictionary I have which contains both is the
> (traditional) CiHai, and it claims they're variants of each other.

Well, first, the "Jianhuazi Zongbiao" that defines the PRC
simplifications juxtaposes U+9EBD and U+4E48 for the "me" pronunciation
of the former (non-"me" usages of the former are not simplified);
U+9EBC is not mentioned.

In the PRC's _Ci Hai_ from 1979 (the third dictionary to bear that
name), U+9EBC is a pointer to U+9EBD for all usages of U+9EBD.

In the _Hanyu Da Zidian_ (PRC, 1986), U+9EBD has the following
usages:
  1) mo2 'small'
  2) ma2 of gan4ma2 'what for'. (It says that nowadays this
       particular ma2 is written U+55CE.)
  3.1) ma, a particle, which can sometimes be written U+55CE.
  3.2) ma, a particle, which can sometimes be written U+561B.
  4) me of zhe4me 'so; like this'; also used as padding in songs.

However, for U+9EBC, it says it is the same as U+9EBD, but the
only examples given have the 'small' meaning, including one from
the _Shuowen Jiezi_ (China, AD 100) that says that U+9EBD is a
vulgar (su2) form of U+9EBC.

Apparently, U+9EBC is the more orthodox version as far as mo2
'small' is concerned, but U+9EBD has become more common,
including becoming used to write various modern/colloquial words.

I would revise the mapping as follows:
  me (as in shen2me 'what') TC U+9EBD -> SC U+4E48
  mo2 (as in yao1mo2 'insignificant') TC U+9EBC -> TC U+9EBD -> SC U+9EBD

I think the choice whether to regard U+9EBC and U+9EBD as different or not
depends on the application. I would lean towards treating them as the
same.
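
As a minimal sketch of the revised mapping above, assuming the reading of
each character is already known, the table can be keyed on (character,
reading). The names SIMPLIFY_BY_READING, TC_VARIANTS and simplify are
hypothetical, and the entries are only the ones discussed in this message,
not Unihan data.

```python
# Illustrative table only: keys are (traditional character, pinyin reading).
# Entries follow the revised mapping in this message; nothing here is Unihan data.
SIMPLIFY_BY_READING = {
    ("\u9EBD", "me"): "\u4E48",   # me as in shen2me 'what'
    ("\u9EBD", "mo2"): "\u9EBD",  # mo2 as in yao1mo2 'insignificant': unchanged
}

# Assumption: the application treats U+9EBC as the same character as U+9EBD.
TC_VARIANTS = {"\u9EBC": "\u9EBD"}


def simplify(char: str, reading: str) -> str:
    """Return the SC form of one TC character in a given reading."""
    canonical = TC_VARIANTS.get(char, char)
    # Characters (or readings) missing from the table pass through unchanged.
    return SIMPLIFY_BY_READING.get((canonical, reading), canonical)
```

An application that wants to keep U+9EBC and U+9EBD distinct would simply
leave TC_VARIANTS empty.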

 
> Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of
> the family. (KangXi says that anciently U+9EBC (麼) was written U+5C1B (尛).)
> Mathews and Sanseido also remind us that U+5E85 (庅) is another variant,
> and Sanseido *also* lists U+5692 (嚒).

In the _Hanyu Da Zidian_, U+5C1B points to U+9EBC. (I see on the same
page that U+21B6F also points to U+9EBC, and the _Hanyu Da Zidian_ is
citing this pointer from the same source.) It doesn't say, but I would
presume these refer only to the original mo2 'small' usage, given the age
of the cited source, _Longkan Shoujian_ (China, AD 997), and the
composition of U+5C1B (three 'smalls') and U+21B6F ('three' + 'small').

U+5E85 is understandable as an abbreviated form of U+9EBD, and I'll
add that it's also documented in Samuel Wells Williams' 1874
dictionary (which pushes back the usage given in Mathews by at least half a
century).

U+5692 seems understandable--it is just U+9EBC with a mouth radical
tacked on--I presume this is only for the modern/colloquial "me" usages,
and not mo2 'small'. (I wouldn't be surprised if somewhere there is
attested a U+9EBD with a mouth radical tacked on.)

I would further revise the (partial) mapping as follows:

  me (as in shen2me 'what'):
    TC U+9EBC -> TC U+9EBD -> TC U+5E85 -> SC U+4E48
    TC U+9EBC -> TC U+5692

  mo2 (as in yao1mo2 'insignificant'):
    TC U+9EBC -> TC U+9EBD -> SC U+9EBD

And this is not finished yet! The _Hanyu Da Zidian_ also lists
some other variant forms of U+9EBD--I suspect they are probably
all/mostly for the mo2 'small' usage. I should point out that the _Hanyu
Da Zidian_ is in no way the final word despite its comprehensiveness,
e.g., U+5E85 and U+5692 are not included in it.
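
One way to read the partial mapping above, when the reading of a character
is not known, is as a set of variant forms per usage; a character that
belongs to more than one set is ambiguous. The sketch below is only that
reading of it. The names VARIANTS_BY_USAGE, SC_BY_USAGE and possible_sc are
hypothetical, and U+5692 is left out because no SC target is given for it
here.

```python
# Per-usage variant sets, read off the partial mapping in this message.
# The keys ("me", "mo2") and this structure are illustrative assumptions.
# U+5692 is omitted: the message gives no SC target for it.
VARIANTS_BY_USAGE = {
    "me": {"\u9EBC", "\u9EBD", "\u5E85"},
    "mo2": {"\u9EBC", "\u9EBD"},
}
SC_BY_USAGE = {"me": "\u4E48", "mo2": "\u9EBD"}


def possible_sc(char: str) -> set[str]:
    """Return every SC form the character could map to, across usages."""
    return {
        SC_BY_USAGE[usage]
        for usage, forms in VARIANTS_BY_USAGE.items()
        if char in forms
    }
```

A result with more than one element means the character alone does not
decide the SC form, and the surrounding word has to be consulted.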

 
> So, Doug, you see that U+4E48 (么) could conceivably be a traditional
> character in its own right *or* the simplified form for no fewer than six
> (!) other ideographs.
>
> This is the kind of mess that has discouraged anybody from doing a
> systematic survey of simplifications for the Unihan database.

Part of this is because there is the orthogonal complexity of variant TC
forms. Before converting TC to SC, one should resolve all TC variants to
the most "common" or "standard" TC form (good luck deciding what that
means). In the above case, for example, one would resolve to U+9EBD.
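
The two stages can be sketched roughly as follows, assuming the variant
data is stored as pointers that may chain (U+5C1B to U+9EBC to U+9EBD, as
in the dictionaries cited above). The tables and the names VARIANT_POINTER,
TC_TO_SC, resolve_variant and tc_to_sc are hypothetical, and the stage 2
table holds only the unambiguous U+881F to U+8721 pair.

```python
# Hypothetical pointer table: each variant points at a more "standard" form.
# The chain U+5C1B -> U+9EBC -> U+9EBD follows the discussion in this message.
VARIANT_POINTER = {
    "\u5C1B": "\u9EBC",
    "\u9EBC": "\u9EBD",
    "\u5E85": "\u9EBD",
}

# Stage 2 table, deliberately tiny: U+881F -> U+8721 is unambiguous.
# U+9EBD is absent because its simplification depends on the word.
TC_TO_SC = {"\u881F": "\u8721"}


def resolve_variant(char: str) -> str:
    """Follow variant pointers to the end of the chain, guarding against cycles."""
    seen = {char}
    while char in VARIANT_POINTER:
        char = VARIANT_POINTER[char]
        if char in seen:
            raise ValueError("cycle in variant table")
        seen.add(char)
    return char


def tc_to_sc(text: str) -> str:
    """Stage 1: normalise TC variants. Stage 2: apply the simplification table."""
    normalised = (resolve_variant(c) for c in text)
    return "".join(TC_TO_SC.get(c, c) for c in normalised)
```

Collapsing the whole chain loses the information that a form such as
U+5C1B was probably only ever used for mo2 'small', so an application that
cares about that distinction would want to keep the original character
alongside the normalised one.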

I think we are also complicating things by treating the entire process of
variants and simplifications as operating solely on the orthography (cf.,
upper and lower case); in some cases, it is simpler to conceptualize it as
the "spelling" of words being changed.
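
To illustrate the "spelling" view, here is a small sketch that converts
whole words by greedy longest match and copies everything else unchanged.
It assumes the TC input has already been normalised to U+9EBD; WORD_TABLE
and respell are hypothetical names, and the two entries are only the
shen2me and zhe4me words mentioned in this thread.

```python
# Hypothetical word table: whole-word "respellings" from TC to SC.
WORD_TABLE = {
    "\u4EC0\u9EBD": "\u4EC0\u4E48",  # shen2me 'what'
    "\u9019\u9EBD": "\u8FD9\u4E48",  # zhe4me 'so; like this'
}
MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def respell(text: str) -> str:
    """Greedy longest-match replacement of known words; other text is copied."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += n
                break
        else:
            # No known word starts here: copy one character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this approach a bare U+9EBD outside any listed word is left alone,
which matches the mo2 'small' case where no simplification applies.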

 
> > The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> > mistake--the TraditionalVariant should only be U+881F.
>
> Actually, no. Both KangXi and the Cihai list U+8721 (蜡) as a traditional
> character in its own right, although I assume it's rare as I can't find it
> in my other dictionaries.

You're right. The presence of U+8721 in Big5 should have been a
preliminary hint to me that it may have had non-simplified usage.

Thomas Chan
tc31@cornell.edu


