Re: Problems/Issues with CJK and Unicode

From: Glen Perkins (Glen.Perkins@NativeGuide.com)
Date: Sat Apr 08 2000 - 03:17:56 EDT

Next message: Glen Perkins: "Re: charset question"
Previous message: Doug Ewell: "Re: charset question"
Maybe in reply to: Mark.Conover@luminant.com: "Problems/Issues with CJK and Unicode"
Next in thread: jon@kanji.com: "Re: Problems/Issues with CJK and Unicode"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Only Korean is sorted by pronunciation. Its sorting is incomplete, though, and to accomplish even that, they put the same character at multiple code points if it has multiple readings, giving it an entry for every reading. Doing so causes other problems. A word can be written with the correct hanja but the incorrect code point. If so, it will be read correctly by human proofreaders, but mis-sorted by machine. It's even more ironic that they should do this given the additional pronunciation variation that occurs depending on the position of the character within a word: ryeok + sa = yeoksa, "history" (Japanese "rekishi"), but the character ryeok isn't given another code point at yeok, even though it has to be sorted as yeok in a dictionary. Dictionary makers get around this problem by using the hangul as the dictionary entry, with the hanja after it as part of the definition, which is the reverse of the Japanese system. The Korean dictionary makers can then sort on the hangul, which means it doesn't matter to them what order the hanja are in, so the multiple encoding of hanja is just a nuisance.
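
To make the "correct hanja, incorrect code point" problem concrete, here is a minimal Python sketch. It assumes the duplicate appears in Unicode as a CJK Compatibility Ideograph (U+F900, which canonically decomposes to the unified U+8C48), and it uses the standard unicodedata module to fold the duplicate away. The helper name fold_duplicates is made up for illustration, and this is one possible workaround, not something the Korean standard prescribes.

```python
import unicodedata

# Assumption for illustration: U+F900 is a CJK Compatibility Ideograph
# whose canonical decomposition is the unified ideograph U+8C48.
# The two are the same character to a human reader.
compat = "\uF900"
unified = "\u8C48"

def fold_duplicates(text: str) -> str:
    """Hypothetical helper: map duplicate encodings onto the unified code point."""
    return unicodedata.normalize("NFC", text)

# A machine comparing raw code points sees two different characters...
print(compat == unified)                    # False
# ...until the duplicate is folded onto the unified code point.
print(fold_duplicates(compat) == unified)   # True
```

Note that folding discards whichever reading the duplicate code point was meant to signal, which is consistent with the point above that the hangul headword, not the hanja, has to carry the sort key.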

Japanese kanji character sets are mostly not sorted by pronunciation. Within the same character set (JIS X 0208), the "JIS level 1" characters are sorted by just one of the many possible pronunciations of the character (no entries for any other pronunciations), and the "JIS level 2" characters in the same character set are ordered by radical & stroke count, with no pronunciation component at all. Likewise, JIS X 0212 has no kanji sorted by pronunciation at all, so you can hardly say Han Unification has created a problem for Japanese collation.

The only thing that really helps in Korean is having the precomposed hangulja in dictionary order. That can allow for non-table-based sorting of hangul. For South Koreans -- though not North Koreans -- Unicode offers that same benefit, so nothing of importance is lost in a switch to Unicode (and the rest of the world is gained).
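
As a rough sketch of what "non-table-based sorting" can mean here: the precomposed syllables in Unicode start at U+AC00 and are laid out arithmetically (19 leading consonants, 21 vowels, 28 trailing slots including "none"), so a plain code point comparison already yields the South Korean dictionary order, and the component jamo can be recovered with division alone. The function name decompose and the sample entries are made up for illustration.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable
LEADS, VOWELS, TAILS = 19, 21, 28

def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (lead, vowel, tail) indices of one precomposed syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < LEADS * VOWELS * TAILS:
        raise ValueError("not a precomposed hangul syllable")
    lead, rest = divmod(index, VOWELS * TAILS)
    vowel, tail = divmod(rest, TAILS)
    return lead, vowel, tail

# Dictionary-style entries: hangul headword first, hanja as annotation.
entries = [("한자", "漢字"), ("역사", "歷史"), ("국어", "國語")]

# No collation table: Python compares strings by code point, and the
# code point order of the syllables is the dictionary order.
by_headword = sorted(entries, key=lambda entry: entry[0])

print([headword for headword, _ in by_headword])
print(decompose("한"))
```

The hanja column never takes part in the comparison, so its encoding order is irrelevant to the result.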

__Glen Perkins__

  ----- Original Message -----
  From: Hoon Kim
  To: Unicode List
  Sent: Friday, April 07, 2000 10:58 AM
  Subject: RE: Problems/Issues with CJK and Unicode

  "Sort" would be one of those problems.
  (For Korean and Japanese, you would expect to sort by pronunciation, which would differ from the order in which the Unihan characters were placed.)

  Hoon Kim
  Basis Technology Corp.
    -----Original Message-----
    From: Mark.Conover@luminant.com [mailto:Mark.Conover@luminant.com]
    Sent: Friday, April 07, 2000 1:26 PM
    To: Unicode List
    Subject: Problems/Issues with CJK and Unicode

I have heard that there are "problems" with the way Unicode handles CJK scripts, perhaps due to the unification of some characters. Would someone on this list mind offering a bit more insight into this matter?

Thank you,

Mark Conover
Luminant/Seattle, USA


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT