What is the Source of "kMandarin" in Unihan.txt?

From: Ed Trager (ed.trager@gmail.com)
Date: Wed Jan 27 2010 - 15:29:53 CST

    Hi, Unicoders,

    I'm trying to get "reliable" Mandarin pronunciation data for as many
    Chinese characters as possible ... (Heh heh heh ... I've read UAX#38
    and already know how difficult this task may be, but I'm trying
    anyway!). So, related to this, I have a few questions:

    => First, what is the source of the "kMandarin" field in Unihan.txt?
    UAX#38 seems not to say ...

    UAX#38 says this about the "kMandarin" field:

          "The Mandarin pronunciation(s) for this character in pinyin;
          Mandarin pronunciations are sorted in order of frequency, not
          alphabetically."

    Presumably this means that when multiple pronunciations are given,
    the *first* pronunciation is the most common one. Looking at common,
    well-known characters in the Unified CJK section, this indeed seems
    to be the case, e.g.:

    | U+767D | 白 | bai2 bo2 |
    | U+8BF4 | 说 | shuo1 shui4 tuo1 yue4 |
    | U+8461 | 葡 | pu2 bei4 |
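
    A minimal sketch of that "first reading wins" interpretation, assuming
    the tab-separated "U+XXXX<TAB>field<TAB>value" layout of Unihan.txt.
    The helper name parse_unihan_lines is hypothetical, and the sample
    values are simply the three rows above, not a dump of the real file:

```python
def parse_unihan_lines(lines, field):
    """Return {character: [readings]} for one field of Unihan-style text."""
    readings = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        codepoint, name, value = line.split("\t", 2)
        if name == field:
            # "U+767D" -> 0x767D -> the character itself.
            readings[chr(int(codepoint[2:], 16))] = value.split()
    return readings


# Sample rows copied from the table above (illustrative only).
sample = [
    "U+767D\tkMandarin\tbai2 bo2",
    "U+8BF4\tkMandarin\tshuo1 shui4 tuo1 yue4",
    "U+8461\tkMandarin\tpu2 bei4",
]

mandarin = parse_unihan_lines(sample, "kMandarin")

# Under the frequency-ordering reading of UAX#38, the first entry is
# taken as the most common pronunciation.
most_common = {char: values[0] for char, values in mandarin.items()}
```

    This only encodes the assumption that position reflects frequency; it
    cannot verify that assumption for any given character.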

    When I compare the "kMandarin" field with the "kHanyuPinyin" field, I
    notice that "kHanyuPinyin" has Mandarin pronunciation data for a
    greater number of characters (34,131 for kHanyuPinyin vs. 25,549 for
    kMandarin). So, at first glance, one is tempted to use the kHanyuPinyin
    field as a source of pronunciation data, since it covers 8,582
    additional characters. BUT, on closer inspection, we find in the
    "kHanyuPinyin" field that the order of the pronunciation data does
    not seem to be uniformly based on frequency. For example:

    | UVALUE | HAN | kMandarin | kHanyuPinyin   |
    | U+8461 | 葡 | pu2 bei4  | bei4 pu2       | <= (pu2 is the common pronunciation for this character)
    | U+4F46 | 但 | dan4      | tan3 dan4 yan4 | <= (dan4 is the common pronunciation for this character)

    UAX#38 says that the order of the pronunciations in kHanyuPinyin is
    the same as in 漢語大字典 "(for the most part reflecting relative
    commonality)". Based on that wording and my inspection of
    representative entries, I guess that I should avoid relying on
    kHanyuPinyin for any kind of reliable relative pronunciation
    frequency information.

    Looking at the subset of characters that have exactly one given
    pronunciation in both kMandarin and kHanyuPinyin, I find there are
    870 disagreements between kHanyuPinyin and kMandarin, distributed as
    follows:

    +-------------------+-------------------+-------+
    | section (Chinese) | section (English) | cases |
    +-------------------+-------------------+-------+
    | 扩展A             | Extension A       |   184 |
    | 扩展B             | Extension B       |    21 |
    | 統一字            | Unified CJK       |   665 |
    +-------------------+-------------------+-------+
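
    A rough sketch of how such a per-block tally could be computed. This
    is not the SQL actually used for the counts above. It assumes both
    fields have already been loaded as {character: [readings]} and
    normalized to the same tone-number notation. The names block_of and
    count_disagreements are hypothetical, and the sample rows are taken
    from examples elsewhere in this message:

```python
from collections import Counter

# Code point ranges of the three blocks discussed here.
BLOCKS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("Unified CJK", 0x4E00, 0x9FFF),
    ("Extension B", 0x20000, 0x2A6DF),
]


def block_of(char):
    """Name the block a character falls in, or 'other'."""
    cp = ord(char)
    for name, low, high in BLOCKS:
        if low <= cp <= high:
            return name
    return "other"


def count_disagreements(mandarin, hanyu):
    """Count characters with exactly one reading in each field that differ."""
    counts = Counter()
    for char, m in mandarin.items():
        h = hanyu.get(char)
        if h and len(m) == 1 and len(h) == 1 and m[0] != h[0]:
            counts[block_of(char)] += 1
    return counts


# Illustrative rows only; the real comparison runs over the full data.
mandarin = {"塢": ["wu4"], "囲": ["wei2"], "但": ["dan4"], "葡": ["pu2", "bei4"]}
hanyu = {"塢": ["wu3"], "囲": ["tong1"], "但": ["tan3", "dan4", "yan4"],
         "葡": ["bei4", "pu2"]}

counts = count_disagreements(mandarin, hanyu)
```

    Here 但 and 葡 are skipped because at least one field lists several
    readings, which mirrors the "exactly one pronunciation in both"
    restriction.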

    I guess I expected to find discrepancies among the rare hanzi of
    Extension A and Extension B. But I'm a bit more surprised to see the
    number of discrepancies among the more "common" CJK in the Unified CJK
    section.

    Some of these are just discrepancies in tone, as in these cases (from
    the Unified CJK subset):

    | UVALUE | HAN | kMandarin | kHanyuPinyin |
    | U+5862 | 塢 | wu4 | wu3 |
    | U+58A6 | 墦 | fan2 | fan1 |
    | U+8FBF | 辿 | chan1 | chan2 |

    These characters are certainly less common (in Mandarin, at least)
    even if they are in the Unified CJK section. And there are also cases
    where kMandarin and kHanyuPinyin disagree completely (again, selecting
    examples from the Unified CJK subset):

    | UVALUE | HAN | kMandarin | kHanyuPinyin |
    | U+56F2 | 囲 | wei2 | tong1 |
    | U+50E0 | 僠 | fan1 | bo1 |
    | U+4FE4 | 俤 | ti4 | di4 |

    Of course I've considered the possibility that some of my results may
    be artifacts due to the fact that 漢語大字典 has "multiple locations" --
    but note that I picked only cases with exactly one pronunciation in
    kHanyuPinyin and likewise only one pronunciation in kMandarin. So I
    think my results are correct unless I somehow messed things up when
    prepping the data or running my SQL queries ... Unfortunately, I don't
    have a paper copy of 漢語大字典 to use for backup verification of my
    electronic manipulations.

    In any case, UAX#38 gives this example, inter alia:

    | U+5364 | 卤 | 10093.130: xī,lǔ 74609.020: lǔ,xī |

    ...which again seems to confirm that there really is no information in
    the kHanyuPinyin field to tell me whether "卤" is more commonly
    pronounced "xī" or "lǔ". Am I right?
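
    To make the point concrete, here is a small sketch that splits a
    kHanyuPinyin value into its per-location reading lists. It assumes the
    "location:reading,reading" layout shown in the UAX#38 example above
    and tolerates the optional space after the colon; parse_hanyu_pinyin
    is a hypothetical name:

```python
import re

# One entry is "<location>[,<location>...]:<reading>[,<reading>...]".
ENTRY = re.compile(r"([\d.,]+):\s*(\S+)")


def parse_hanyu_pinyin(value):
    """Return a list of (locations, readings) pairs in file order."""
    return [
        (locations.split(","), readings.split(","))
        for locations, readings in ENTRY.findall(value)
    ]


entries = parse_hanyu_pinyin("10093.130: xī,lǔ 74609.020: lǔ,xī")

# The leading reading differs from one dictionary location to the next,
# so position alone cannot rank the two readings.
first_readings = {readings[0] for _, readings in entries}
```

    With two locations giving opposite orders, the set of "first"
    readings contains both pronunciations.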

    Finally, in light of my original quest as stated at the beginning of
    this email, does it make sense to follow this procedure:

            (1) Preferentially use pronunciation data from kMandarin
                whenever available
            (2) Fall back to kHanyuPinyin data when kMandarin is missing
    ... or is there an altogether better way to do this?
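
    The two-step procedure could be sketched as follows, assuming both
    fields are already available as {character: [readings]} in a common
    notation. The function name preferred_readings is hypothetical, and
    the sample deliberately pretends kMandarin has no entry for 但 purely
    to exercise the fallback path:

```python
def preferred_readings(char, mandarin, hanyu):
    """Return (readings, source_field), preferring kMandarin."""
    if mandarin.get(char):
        return mandarin[char], "kMandarin"
    if hanyu.get(char):
        # Fallback: the order here follows the dictionary, so it should
        # not be read as a frequency ranking.
        return hanyu[char], "kHanyuPinyin"
    return [], None


# Illustrative data; 但 is omitted from kMandarin on purpose (hypothetical).
mandarin = {"葡": ["pu2", "bei4"]}
hanyu = {"葡": ["bei4", "pu2"], "但": ["tan3", "dan4", "yan4"]}

from_mandarin = preferred_readings("葡", mandarin, hanyu)
from_fallback = preferred_readings("但", mandarin, hanyu)
missing = preferred_readings("白", mandarin, hanyu)
```

    Returning the source field alongside the readings lets a caller treat
    fallback entries as unordered rather than as frequency-sorted.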

    Best - Ed


