Re: Transform for Hans with multiple pronunciations

From: Ed Trager
Date: Fri Jan 29 2010 - 08:26:05 CST

    On Fri, Jan 29, 2010 at 4:50 AM, Andrew West wrote:
    > 2010/1/29 Christoph Burgmer:
    >>> I'm using the ICU transform demo from:
    >>> But for now, the demo page can only show Latin transform of 行 as xíng,
    >> I doubt this lies inside the scope of the ICU project.
    > I disagree completely. If it is going to include Han -> Latin
    > transformation then it should do it properly or not at all. 行 -> xíng
    > is not ideal, but 银行 -> yín xíng is just plain wrong. It may be
    > troublesome to do it correctly (show multiple readings in frequency
    > order for single characters, and show the correct reading for compound
    > words if possible), but it can be done. Doing it wrong like this is
    > less than useful.

    Doing it correctly would require ICU to have a Chinese phrase
    dictionary available, so that it would "know" the correct
    pronunciation of characters in words consisting of more than one
    character (as in the 银行 yin hang / 自行车 zi xing che example).
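    To make the phrase-dictionary idea concrete, here is a rough Python
    sketch of a greedy longest-match lookup. This is not how ICU works;
    the PHRASES and SINGLE tables and the to_pinyin() name are
    hypothetical, with only enough data to cover the examples in this
    thread.

```python
# Hypothetical, illustration-only data: a real phrase dictionary would be
# far larger (see the size estimate discussed in this thread).
PHRASES = {
    "银行": ["yin2", "hang2"],
    "自行车": ["zi4", "xing2", "che1"],
}

# Assumed default reading used when a character is not part of a known phrase.
SINGLE = {"银": "yin2", "行": "xing2", "自": "zi4", "车": "che1"}


def to_pinyin(text, phrases=PHRASES, single=SINGLE):
    """Transliterate text, preferring the longest matching dictionary phrase."""
    max_len = max((len(p) for p in phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in phrases:
                out.extend(phrases[chunk])
                i += n
                break
        else:
            # No phrase matched: fall back to the single-character default,
            # or pass the character through unchanged if it is unknown.
            ch = text[i]
            out.append(single.get(ch, ch))
            i += 1
    return " ".join(out)


print(to_pinyin("银行"))
print(to_pinyin("自行车"))
```

    With a table like this, 行 inside 银行 comes out as hang2, while a
    lone 行 still falls back to the default reading. Greedy matching is
    only one possible segmentation strategy and can pick the wrong
    phrase when dictionary entries overlap.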

    While that would indeed be a nice feature, it seems to me that this
    kind of feature involves a degree of specialization which I am not
    sure is appropriate for incorporation into a "general" Unicode library
    like ICU, especially in light of the extra "bloat" that will be
    involved -- probably on the order of at least 300 KB for the bare
    minimum phrase dictionary, and likely a lot more for a high-quality,
    more comprehensive phrase list.

    > Another question is why it only transforms to pinyin, and does not
    > include (or allow selection of) Japanese, Korean and Vietnamese
    > readings where appropriate.

    For Japanese at least, one is going to have an even worse problem with
    multiple readings for each character than for Chinese! I'm not sure
    about the situation for Korean and Vietnamese.

    One "quick and dirty" solution which has often been used in the past,
    but is certainly not very satisfactory, is to place square brackets
    around the multiple readings of a given character, something like

              自行车 : zi4 [ xing2 hang2 ] che1

    This level of solution, although not ideal, is something that ICU
    could easily implement.
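
    As a rough illustration of the bracket approach, a few lines of
    Python suffice. The READINGS table and the bracketed() name are
    made up for this sketch, and the readings listed are only the ones
    used in the example above, not a complete list.

```python
# Hypothetical per-character reading table, limited to the example above.
READINGS = {
    "自": ["zi4"],
    "行": ["xing2", "hang2"],
    "车": ["che1"],
}


def bracketed(text, readings=READINGS):
    """Emit one reading per character, bracketing characters with several."""
    parts = []
    for ch in text:
        r = readings.get(ch)
        if not r:
            # Unknown character: pass it through unchanged.
            parts.append(ch)
        elif len(r) == 1:
            parts.append(r[0])
        else:
            # Several readings: list them all inside square brackets.
            parts.append("[ " + " ".join(r) + " ]")
    return " ".join(parts)


print(bracketed("自行车"))  # zi4 [ xing2 hang2 ] che1
```

    No phrase dictionary is needed here, only a per-character table,
    which is why this level of solution is cheap; the cost is that the
    reader has to pick the right reading.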

    > Andrew
