From: Ed Trager (ed.trager@gmail.com)
Date: Wed Jan 27 2010 - 15:29:53 CST
Hi, Unicoders,
I'm trying to get "reliable" Mandarin pronunciation data for as many
Chinese characters as possible ... (Heh heh heh ... I've read UAX#38
and already know how difficult this task may be, but I'm trying
anyway!). So, related to this, I have a few questions:
=> First, what is the source of the "kMandarin" field in Unihan.txt?
UAX#38 seems not to say ...
UAX#38 says this about the "kMandarin" field:
"The Mandarin pronunciation(s) for this character in pinyin;
Mandarin pronunciations are sorted in order of frequency, not
alphabetically."
Presumably this means that when multiple pronunciations are given,
the *first* pronunciation represents the most common pronunciation.
Looking at common, well-known characters in the Unified CJK section,
this indeed seems to be the case, e.g.:
| U+767D | 白 | bai2 bo2 |
| U+8BF4 | 说 | shuo1 shui4 tuo1 yue4 |
| U+8461 | 葡 | pu2 bei4 |
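A minimal sketch of how the first-listed reading could be pulled out of Unihan.txt. It assumes the file's tab-separated layout (code point, field name, value) and uses the lowercase tone-number spellings exactly as they appear in the rows above; the function name parse_unihan_lines and the sample lines are made up for illustration.

```python
def parse_unihan_lines(lines, field):
    """Collect {code point: [readings]} for one field from Unihan-style lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a "U+XXXX<TAB>field<TAB>value" row
        codepoint, name, value = parts
        if name == field:
            table[int(codepoint[2:], 16)] = value.split()
    return table

# Illustrative rows assembled from the examples in this message.
sample = [
    "# sample rows, not a real excerpt of Unihan.txt",
    "U+767D\tkMandarin\tbai2 bo2",
    "U+8BF4\tkMandarin\tshuo1 shui4 tuo1 yue4",
    "U+8461\tkMandarin\tpu2 bei4",
    "U+8461\tkHanyuPinyin\tbei4 pu2",
]

readings = parse_unihan_lines(sample, "kMandarin")
# If the field really is frequency-ordered, the first item is the common reading.
most_common = {chr(cp): r[0] for cp, r in readings.items()}
```

Taking r[0] is only meaningful to the extent that the frequency ordering described in UAX#38 actually holds for the character in question.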
When I compare the "kMandarin" field with the "kHanyuPinyin" field, I
notice that "kHanyuPinyin" has Mandarin pronunciation data for a
greater number of characters (34,131 for kHanyuPinyin vs. 25,549 for
kMandarin). So, at first glance, one is tempted to use the kHanyuPinyin
field as a source of pronunciation data since it has data on 8,582
more characters. BUT, on closer inspection, we find in the
"kHanyuPinyin" field that the order of the pronunciation data does
not seem to be uniformly based on frequency. For example:
| UVALUE | HAN | kMandarin | kHanyuPinyin   |
| U+8461 | 葡 | pu2 bei4  | bei4 pu2       | <= (pu2 is the common pronunciation for this character)
| U+4F46 | 但 | dan4      | tan3 dan4 yan4 | <= (dan4 is the common pronunciation for this character)
UAX#38 says that the order of the pronunciations in kHanyuPinyin is
the same as in 漢語大字典 "(for the most part reflecting relative
commonality)". Based on that wording and my inspection of
representative entries, I guess that I should avoid relying on
kHanyuPinyin for any kind of reliable relative pronunciation
frequency information.
Looking at the subset of characters that have exactly one given
pronunciation in both kMandarin and kHanyuPinyin, I find there are
870 disagreements between kHanyuPinyin and kMandarin, distributed as
follows:
+-----------+--------------+-------+
| Chinese | section | cases |
+-----------+--------------+-------+
| 扩展A | Extension A | 184 |
| 扩展B | Extension B | 21 |
| 統一字 | Unified CJK | 665 |
+-----------+--------------+-------+
I guess I expected to find discrepancies among the rare hanzi of
Extension A and Extension B. But I'm a bit more surprised to see the
number of discrepancies among the more "common" CJK in the Unified CJK
section.
Some of these are just discrepancies in tone, as in these cases (from
the Unified CJK subset):
| U+5862 | 塢 | wu4 | wu3 |
| U+58A6 | 墦 | fan2 | fan1 |
| U+8FBF | 辿 | chan1 | chan2 |
These characters are certainly less common (in Mandarin, at least)
even if they are in the Unified CJK section. And there are also cases
where kMandarin and kHanyuPinyin disagree completely (again, selecting
examples from the Unified CJK subset):
| U+56F2 | 囲 | wei2 | tong1 |
| U+50E0 | 僠 | fan1 | bo1 |
| U+4FE4 | 俤 | ti4 | di4 |
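The single-reading comparison described above could be sketched roughly as follows. This is an illustration, not the SQL actually used: it assumes both fields have already been normalized to tone-number pinyin (as in the tables in this message), and the helper names are invented. The block ranges are the standard ones for Extension A, the Unified CJK block, and Extension B.

```python
from collections import Counter

# Code point ranges for the three sections counted above.
BLOCKS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("Unified CJK", 0x4E00, 0x9FFF),
    ("Extension B", 0x20000, 0x2A6DF),
]

def block_of(codepoint):
    for name, lo, hi in BLOCKS:
        if lo <= codepoint <= hi:
            return name
    return "Other"

def kind_of_mismatch(a, b):
    """'tone' if only the trailing tone digit differs, else 'syllable'."""
    return "tone" if a[:-1] == b[:-1] else "syllable"

def single_reading_disagreements(k_mandarin, k_hanyu_pinyin):
    """Rows (code point, block, kind) where both fields hold exactly one,
    differing, reading. Multi-reading entries are skipped."""
    rows = []
    for cp in sorted(k_mandarin.keys() & k_hanyu_pinyin.keys()):
        m = k_mandarin[cp].split()
        h = k_hanyu_pinyin[cp].split()
        if len(m) == 1 and len(h) == 1 and m[0] != h[0]:
            rows.append((cp, block_of(cp), kind_of_mismatch(m[0], h[0])))
    return rows

# The six example rows from this message, plus 葡, which has two readings
# in each field and is therefore filtered out.
k_mandarin = {0x5862: "wu4", 0x58A6: "fan2", 0x8FBF: "chan1",
              0x56F2: "wei2", 0x50E0: "fan1", 0x4FE4: "ti4",
              0x8461: "pu2 bei4"}
k_hanyu_pinyin = {0x5862: "wu3", 0x58A6: "fan1", 0x8FBF: "chan2",
                  0x56F2: "tong1", 0x50E0: "bo1", 0x4FE4: "di4",
                  0x8461: "bei4 pu2"}

rows = single_reading_disagreements(k_mandarin, k_hanyu_pinyin)
by_block = Counter(block for _, block, _ in rows)
by_kind = Counter(kind for _, _, kind in rows)
```

Separating tone-only mismatches from whole-syllable mismatches makes it easier to see how many of the disagreements are minor variants and how many are genuinely different readings.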
Of course I've considered the possibility that some of my results may
be artifacts due to the fact that 漢語大字典 has "multiple locations" --
but note that I picked only cases with exactly one pronunciation in
kHanyuPinyin and likewise only one pronunciation in kMandarin. So I
think my results are correct unless I somehow messed things up when
prepping the data or running my SQL queries ... Unfortunately, I don't
have a paper copy of 漢語大字典 to use for backup verification of my
electronic manipulations.
In any case, UAX#38 gives this example, inter alia:
| U+5364 | 卤 | 10093.130: xī,lǔ 74609.020: lǔ,xī |
...which again seems to confirm that there really is no information in
the kHanyuPinyin field to tell me whether "卤" is more commonly
pronounced "xī" or "lǔ". Am I right?
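To make the point concrete, here is a small sketch that splits a kHanyuPinyin value of the shape shown above into (locations, readings) pairs and compares the first reading at each location. The regular expression is an assumption based on that one example: one or more comma-separated locations, a colon, optional whitespace, then comma-separated readings.

```python
import re

# Assumed entry shape, inferred from the UAX#38 example quoted above.
ENTRY = re.compile(r"((?:\d{5}\.\d{3},)*\d{5}\.\d{3}):\s*(\S+)")

def parse_hanyu_pinyin(value):
    """Return a list of (locations, readings) pairs, in field order."""
    entries = []
    for locations, readings in ENTRY.findall(value):
        entries.append((locations.split(","), readings.split(",")))
    return entries

value = "10093.130: xī,lǔ 74609.020: lǔ,xī"
entries = parse_hanyu_pinyin(value)

# First-listed reading at each dictionary location.
first_readings = [readings[0] for _, readings in entries]

# If the locations disagree on what comes first, the field alone gives
# no single "most common" reading.
ambiguous = len(set(first_readings)) > 1
```

For this example the two locations list the readings in opposite order, so ambiguous comes out True.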
Finally, in light of my original quest as stated at the beginning of
this email, does it make sense to follow this procedure:
(1) Preferentially use pronunciation data from kMandarin
whenever available
(2) Fall back to kHanyuPinyin data when kMandarin is missing
... or is there an altogether better way to do this?
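The two-step procedure above might look like this in code. It is only a sketch under the same assumptions as before (readings already in tone-number pinyin, one space-separated string per character); best_reading is a hypothetical name. Returning the source field alongside the reading lets a caller treat fallback results with more suspicion.

```python
def best_reading(codepoint, k_mandarin, k_hanyu_pinyin):
    """Return (reading, source field), preferring kMandarin.

    Falls back to the first kHanyuPinyin reading, whose position is not
    guaranteed to reflect frequency.
    """
    for source, table in (("kMandarin", k_mandarin),
                          ("kHanyuPinyin", k_hanyu_pinyin)):
        readings = table.get(codepoint, "").split()
        if readings:
            return readings[0], source
    return None, None

# Data for 但 and 葡 as given earlier in this message.
k_mandarin = {0x4F46: "dan4", 0x8461: "pu2 bei4"}
k_hanyu_pinyin = {0x4F46: "tan3 dan4 yan4", 0x8461: "bei4 pu2"}

preferred = best_reading(0x4F46, k_mandarin, k_hanyu_pinyin)

# Simulate a character with no kMandarin entry by passing an empty table:
# the fallback picks tan3, not the common reading dan4.
fallback = best_reading(0x4F46, {}, k_hanyu_pinyin)
```

The fallback case shows the risk in step (2): for 但 it would yield tan3, so readings that come only from kHanyuPinyin probably deserve a lower confidence flag.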
Best - Ed