Re: Transform for Hans with multiple pronunciations

From: Ed Trager (ed.trager@gmail.com)
Date: Fri Jan 29 2010 - 08:26:05 CST

Next message: Ed Trager: "Re: FYI: Google blog on Unicode"

Previous message: Andrew West: "Re: Transform for Hans with multiple pronunciations"
In reply to: Andrew West: "Re: Transform for Hans with multiple pronunciations"
Next in thread: spir: "Re: Transform for Hans with multiple pronunciations"
Reply: spir: "Re: Transform for Hans with multiple pronunciations"

On Fri, Jan 29, 2010 at 4:50 AM, Andrew West <andrewcwest@gmail.com> wrote:
> 2010/1/29 Christoph Burgmer <cburgmer@ira.uka.de>:
>>>
>>> I'm using the ICU transform demo from:
>>> http://demo.icu-project.org/icu-bin/translit
>>> But for now, the demo page can only show Latin transform of 行 as xíng,
>>
>> I doubt this lies inside the scope of the ICU project.
>
> I disagree completely. If it is going to include Han -> Latin
> transformation then it should do it properly or not at all. 行 -> xíng
> is not ideal, but 银行 -> yín xíng is just plain wrong. It may be
> troublesome to do it correctly (show multiple readings in frequency
> order for single characters, and show the correct reading for compound
> words if possible), but it can be done. Doing it wrong like this is
> less than useful.
>

Doing it correctly would require ICU to have a Chinese phrase
dictionary available to it so that it would "know" the correct
pronunciation of characters in words consisting of more than one
character (as in the 银行 yin hang / 自行车 zi xing che example).

While that would indeed be a nice feature, it seems to me that this
kind of feature involves a degree of specialization which I am not
sure is appropriate for incorporation into a "general" Unicode library
like ICU, especially in light of the extra "bloat" that will be
involved -- probably on the order of at least 300 KB for the bare
minimum phrase dictionary, and likely a lot more for a high-quality,
more comprehensive phrase list.
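
To make the phrase-dictionary idea concrete, here is a minimal Python
sketch of a longest-match lookup. This is not how ICU works; the two
tables and the function name to_pinyin are hypothetical, and the toy
data covers only the characters discussed in this thread. Tones are
written as digits, as elsewhere in this message.

```python
# Hypothetical toy tables; a real phrase dictionary would be far larger.
CHAR_READINGS = {
    "银": "yin2",
    "行": "xing2",   # assumed default single-character reading
    "自": "zi4",
    "车": "che1",
}
PHRASE_READINGS = {
    "银行": ["yin2", "hang2"],
    "自行车": ["zi4", "xing2", "che1"],
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASE_READINGS)


def to_pinyin(text):
    """Transliterate text, preferring the longest matching phrase."""
    out = []
    i = 0
    while i < len(text):
        # Try phrases from longest to shortest (at least two characters).
        for n in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASE_READINGS:
                out.extend(PHRASE_READINGS[chunk])
                i += n
                break
        else:
            # No phrase matched: fall back to the per-character reading,
            # passing unknown characters through unchanged.
            ch = text[i]
            out.append(CHAR_READINGS.get(ch, ch))
            i += 1
    return " ".join(out)
```

With these tables, to_pinyin("银行") picks up hang2 from the phrase
entry, while a lone 行 still falls back to xing2. The quality of the
result depends entirely on how complete the phrase list is, which is
where the size cost comes from.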

>
> Another question is why it only transforms to pinyin, and does not
> include (or allow selection of) Japanese, Korean and Vietnamese
> readings where appropriate.
>

For Japanese at least, one is going to have an even worse problem with
multiple readings for each character than for Chinese! I'm not sure
about the situation for Korean and Vietnamese.

One "quick and dirty" solution which has often been used in the past,
but is certainly not very satisfactory, is to place square brackets
around the multiple readings of a given character, something like
this:

自行车： zi4 [ xing2 hang2 ] che1

A solution at this level, although not ideal, is something that ICU
could easily implement.
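
As a rough illustration of the bracket approach, here is a short Python
sketch. The READINGS table and the function name bracketed_pinyin are
hypothetical and contain only the readings used in the example above;
this is not an existing ICU transform.

```python
# Hypothetical per-character table listing every known reading.
READINGS = {
    "自": ["zi4"],
    "行": ["xing2", "hang2"],
    "车": ["che1"],
}


def bracketed_pinyin(text):
    """Emit one reading per character, bracketing ambiguous ones."""
    parts = []
    for ch in text:
        readings = READINGS.get(ch)
        if not readings:
            # Unknown character: pass it through unchanged.
            parts.append(ch)
        elif len(readings) == 1:
            parts.append(readings[0])
        else:
            # Several readings: list them all inside square brackets.
            parts.append("[ " + " ".join(readings) + " ]")
    return " ".join(parts)


print(bracketed_pinyin("自行车"))
```

No phrase dictionary is needed here, only a per-character table, which
is why the cost is so much lower than doing it properly.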

> Andrew
>
>
>

This archive was generated by hypermail 2.1.5 : Fri Jan 29 2010 - 08:31:24 CST