Re: CJK fonts

From: Thomas Chan ([email protected])
Date: Mon Dec 16 2002 - 17:24:48 EST

Next message: Barry Caplan: "Re: Documenting in Tamil Computing"

Previous message: Andrew C. West: "Re: Mongolian Encoding"
In reply to: Andrew C. West: "Re: CJK fonts"
Next in thread: Andrew C. West: "Re: CJK fonts"

(I've merged Andrew's two messages--12/13 and 12/16--together, below.)

On Fri, 13 Dec 2002, Andrew C. West wrote:
> On Fri, 13 Dec 2002 01:33:08 -0800 (PST), Thomas Chan wrote:
> > I can't imagine where the yi4 reading comes from, although I note
>
> I was thinking along the same lines. The Kangxi Zidian gives U+3CBC a reading of
> YI4 (as does the Unihan database - the CHA4 reading seems to be as a variant
> form of U+6C4A).

What edition of the _Kangxi Zidian_ are you using that gives explicit
Mandarin readings like "yi4", or are you interpreting the fanqie notation
yourself? I use the 1958 edition, 1997 2nd printing published by
Zhonghua, ISBN 7-101-00518-7.

I find self-interpretation of fanqie to be fraught with peril, partly
because fanqie was never a perfect transcription system, and also because
fanqie from old dictionaries does not necessarily tell one anything about
contemporary pronunciation.

e.g., U+5B7B, is a Yue (Cantonese), Hakka, and Min character, meaning
'last (child)' (derived from 'last child of an old man', hence the
character's appearance as 'child' + 'to use up'), pronounced laai1 or lai1
in Cantonese.[1] However, the old dictionaries including Kangxi give a
fanqie of U+6CE5 U+53F0 U+5207, which would yield an artificial nai2 in
Mandarin, which is exactly what the _Hanyu Da Zidian_ says explicitly.
Either the pronunciation has changed from [n-] to [l-] and reading old
dictionaries fails to account for modern developments, or whoever chose
U+6CE5 to indicate the onset was pronouncing U+6CE5 as *l-.

[1] While there is a long-standing ongoing sound change in Cantonese from
[n-] to [l-], this word is probably no longer felt to be an instance of
it, and *naai1/nai1 would now be regarded as hypercorrection.
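
To make the mechanical part of the U+5B7B example concrete, here is a
deliberately naive sketch of "reading off" a fanqie in modern Mandarin:
onset from the upper character, final and tone from the lower one. The
modern readings ni2 (U+6CE5) and tai2 (U+53F0) and all function names are
my own assumptions for illustration; real fanqie interpretation has to go
through the historical sound categories, which is exactly why the result
is artificial.

```python
# Naive fanqie sketch: onset of the upper character + final and tone of
# the lower character, using modern Mandarin pinyin with tone digits.
# This ignores historical sound change on purpose; it only shows the
# mechanical step, not a reliable way to derive a reading.

# Two-letter initials come first so "zh" is not mistaken for "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable):
    """Split e.g. 'tai2' into (initial, final, tone digit)."""
    body, tone = syllable[:-1], syllable[-1]
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # zero-initial syllable


def naive_fanqie(upper, lower):
    """Combine the onset of `upper` with the final and tone of `lower`."""
    onset, _, _ = split_syllable(upper)
    _, final, tone = split_syllable(lower)
    return onset + final + tone


# U+6CE5 (assumed modern reading ni2) + U+53F0 (assumed tai2)
print(naive_fanqie("ni2", "tai2"))
```

The output matches the artificial nai2 discussed above, and says nothing
about the actual laai1/lai1 of the living word.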

[...]

> At any rate, what I think is important is that we do not assume that YI4
> is wrong and throw it out just because none of us recognise the reading ...
> though I guess if it is that obscure, it really hasn't got a place in the Unihan
> database.

But what if the character is obscure, and the reading thus also obscure?
I think there are diminishing benefits to over-proofreading the
unihan database for such characters--if they are so rare, then no one will
find the character by searching on an obscure/artificial reading, and if
it is so rare, then those interested should be consulting actual
comprehensive dictionaries (like the Kangxi or _Hanyu Da Zidian_) instead
of relying on a text file. In a way, we currently have this
situation--the Plane 2 characters are, on average, more obscure than the
BMP characters, and the lack of information is kind of saying "look it up
yourself if you really, really need to know".
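
For anyone who does search the text file by reading, a minimal sketch of
what that amounts to is below. It assumes the tab-separated layout
(code point, field name, value) with "#" comment lines; the sample lines
are illustrative, built from readings quoted in this thread, and are not
actual file contents. The function name is hypothetical.

```python
from collections import defaultdict

# Illustrative lines only, in an assumed "U+XXXX<TAB>field<TAB>value"
# layout; they are not copied from the real data file.
SAMPLE = (
    "# sample data for illustration\n"
    "U+5481\tkMandarin\tGEM4\n"
    "U+5514\tkMandarin\tWU2\n"
    "U+8985\tkMandarin\tBIAO4 PO4\n"
    "U+5481\tkCantonese\tGAM3\n"
)


def index_by_reading(text, field="kMandarin"):
    """Map each lower-cased reading in `field` to a list of code points."""
    index = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, key, value = line.split("\t", 2)
        if key != field:
            continue
        for reading in value.split():  # a value may hold several readings
            index[reading.lower()].append(codepoint)
    return dict(index)


index = index_by_reading(SAMPLE)
print(index.get("wu2", []))
print(index.get("po4", []))
```

A lookup like this can only be as good as the readings in the field, which
is the point: an obscure or artificial reading will rarely be the key
anyone searches on.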

> If Hanyu Da Zidian and Hanyu Da Cidian both give GAN4 for the modern
> reading of U+5481 I for one would prefer that reading to GEM4. Ci Hai
> also has such non-Mandarin syllables as NGU2 for U+5514. The principles
> of Pinyin are clearly defined (and like most PRC dictionaries Ci Hai
> includes a copy of the Hanyu Pinyin Fang'an as an appendix - even if it
> does not fully adhere to it), and syllables like GEM4 and NGU2 are
> simply not allowed.

I agree with your sentiment that "gem4" is an aberration, despite my
support of the _Cihai_ (PRC 1979) as evidence that the reading did not get
into the unihan database out of nowhere. When U+5481 was reinvented by the
Cantonese, it was patterned both graphically and phonologically on U+7518,
which is gan1 'sweet' in Mandarin (gam1 in Cantonese). U+5481 is in
Cantonese gam3 'so (quantity)' (3 = yinqu tone); hence "gan4" is an
appropriate Mandarin reflex.

"ngu2" for U+5514 is also an aberration--yet another case of a quixotic
attempt to mimic dialect pronunciation in Mandarin. Sure, it's m4 (a
syllabic nasal [m]) 'not' in Cantonese, but this is just a re-use of a
pre-existing semi-homophonous character, ng4 (another syllabic nasal;
considered close enough to m4 in Cantonese), a sound in singing. As that
is wu2 in Mandarin, 'not' should likewise be given an artificial *wu2
reading (which is what the unihan database has currently--no doubt that
piece of data was entered from a more sensible dictionary).
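
The sense in which "gem4" and "ngu2" are not allowed, while "gan4" and
"wu2" are, can be sketched as a purely combinatorial check: does the
syllable split into a listed initial plus a listed final? The tables below
are my own abbreviated transcription of the Pinyin scheme ("v" stands in
for u-umlaut), and the function name is hypothetical. Passing this check
does not mean the syllable occurs in standard Mandarin, only that its
spelling is well-formed.

```python
# Combinatorial pinyin check: listed initial + listed final, or one of
# the zero-initial spellings. Tables are an assumed, abbreviated
# transcription of the scheme, written with "v" for u-umlaut.

INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

FINALS = {
    "a", "o", "e", "i", "u", "v", "ai", "ei", "ao", "ou",
    "an", "en", "ang", "eng", "ong",
    "ia", "ie", "iao", "iu", "ian", "in", "iang", "ing", "iong",
    "ua", "uo", "uai", "ui", "uan", "un", "uang", "ue", "ve",
}

ZERO_INITIAL = {
    "a", "o", "e", "ai", "ei", "ao", "ou", "an", "en", "ang", "eng", "er",
    "yi", "ya", "ye", "yao", "you", "yan", "yin", "yang", "ying", "yong",
    "wu", "wa", "wo", "wai", "wei", "wan", "wen", "wang", "weng",
    "yu", "yue", "yuan", "yun",
}


def decompose(syllable):
    """Return (initial, final) if the spelling is well-formed, else None."""
    body = syllable.lower().rstrip("012345")  # drop the tone digit
    if body in ZERO_INITIAL:
        return "", body
    for initial in INITIALS:  # two-letter initials are listed first
        if body.startswith(initial) and body[len(initial):] in FINALS:
            return initial, body[len(initial):]
    return None


for s in ("gem4", "ngu2", "gan4", "wu2"):
    print(s, decompose(s))
```

Under these tables "gem4" fails because "em" is not a final, and "ngu2"
fails because "ng" is not an initial.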

But elsewhere, this battle is lost--U+5187 'to not have' (among other
meanings), perhaps the most recognizable Cantonese character to
non-Cantonese, is nowadays given the pronunciation mao3[2], even though
earlier dictionary compilers such as Samuel Wells Williams, in his 1877
dictionary, recognized it as derived from U+7121 with a tone change and
assigned it a Mandarin wu3 reading accordingly.

[2] I note that even "mao" is a poor approximation; "*mou" would've been
closer (and still a valid and normal Mandarin syllable).

> On the other hand, a reading of FIAO4 for the dialectal ideograph U+8985
> may sound odd to a Mandarin speaker, but it is perfectly acceptable
> according to the rules of Pinyin ("F" is a valid initial, and "IAO4" is
> a valid final). FIAO4 is the only reading for this ideograph given in
> Hanyu Da Cidian, Ci Hai and Xiandai Hanyu Cidian, but interestingly,
> Unihan gives it a reading of BIAO4 - not sure where that reading comes
> from.

Thank you for pointing out this Wu character to me.[3] The artificial
Mandarin reading of this character is a difficult case. Both the _Cihai_
and the _Hanyu Da Zidian_ seem to say that this character, which is a
contraction of 'do not want', is not a typical Wu syllable, though
apparently pronounceable (syllables existing on the borderline also exist
in Cantonese phonology, typically in loanwords or onomatopoeia), and
therefore U+8985 had to be created as a "ligature" of sorts by squishing
the constituents U+52FF U+8981 into the space normally occupied by one
character. Therefore, I don't think it odd that there is a
semi-questionable Mandarin fiao4 reading. The _Hanyu Da Zidian_ does not
try to give a Mandarin reading in this case, so we still don't know where
"biao4" came from in the unihan database (or "po4", for that
matter--unless that is a case of shuffled data that started this whole
thread). I do note that besides U+8985, a similar-looking interchangeable
character (but with the halves swapped) is right next to it in _Hanyu Da
Zidian_. On the Wu pronunciation, I can't comment on it myself, except
that I see in the _Hanyu Fangyan Cihui_, 2nd ed. that in Suzhou, they say
[fiae] ("ae" = <ae> ligature) and in Wenzhou, they say [fai]; however, for
the latter city, it is said to be a contraction of U+5426 U+8981 instead.
So in a way, the [f-] of the Mandarin reading is justifiable. (I don't know
enough to comment on the rest of the syllable or tone choice.)

unihan.txt says that U+8985 is in Morohashi--perhaps that might be where
"biao4" came from?--I don't have access to a Morohashi to check.

[3] A nice example of the sporadic and often accidental coverage of
non-Mandarin and non-Yue (Cantonese) characters in Unicode. Wu's U+8985
is in the BMP, yet a contemporary Mandarin character such as cei3
'ugly'/cei4 'to hit' winds up in Plane 2 as U+24B62.

>Sorry if this is getting somewhat OT.

The same here. I'm fine with taking this privately, but I thought there
might be some interest in sharing it here, as there are people who are
using kMandarin quite literally as an "informative" field as their
primary/sole data...

Thomas Chan
[email protected]


This archive was generated by hypermail 2.1.5 : Mon Dec 16 2002 - 13:54:56 EST