Japanese text handling problem in Unicode Collation Algorithm

From: Satoshi Nakagawa (psychs@limechat.net)
Date: Mon Oct 12 2009 - 09:46:10 CDT


    Hi,

    I found a problem with how the Unicode Collation Algorithm handles
    Japanese text.

    In the current Unicode Collation Algorithm, [っ] (U+3063) is treated as
    a small form of [つ] (U+3064). As a result, the two characters compare
    as equal at the primary and secondary strengths, just as [a] and [A]
    do.

    As you know, in English, [abc] and [ABC] are treated as the same in a
    case-insensitive context.

    But in Japanese, [あった] and [あつた], for example, are different words
    in every context, because in Japanese [っ] is not considered a small
    form of [つ]. The two are never treated as the same character.
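
    To illustrate what "equal at the primary and secondary strengths"
    means, here is a minimal Python sketch of a multi-level comparison.
    The weights are made up for illustration and are not taken from the
    real collation table; TOY_WEIGHTS, sort_key, and equal_at are
    hypothetical names, not part of any library.

```python
# Toy model of a multi-level (UCA-style) comparison.
# ASSUMPTION: the weights below are invented for illustration only.
# Each character maps to (primary, secondary, tertiary) weights.
TOY_WEIGHTS = {
    "あ": (10, 32, 2),
    "つ": (20, 32, 2),
    "っ": (20, 32, 1),  # same primary/secondary as つ, different tertiary
    "た": (30, 32, 2),
}


def sort_key(text, strength):
    """Build a key from levels 1..strength (1=primary, 2=secondary, 3=tertiary).

    All weights of one level come before any weight of the next level,
    so a lower level only matters when the higher levels are identical.
    """
    elements = [TOY_WEIGHTS[ch] for ch in text]
    return tuple(
        tuple(element[level] for element in elements)
        for level in range(strength)
    )


def equal_at(a, b, strength):
    """Return True if a and b compare as equal at the given strength."""
    return sort_key(a, strength) == sort_key(b, strength)


print(equal_at("あった", "あつた", 1))  # True: primary weights match
print(equal_at("あった", "あつた", 2))  # True: secondary weights match too
print(equal_at("あった", "あつた", 3))  # False: tertiary weights differ
```

    In this model the two words only become distinct once the tertiary
    level is included, which is the behavior described above.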

    In the current Unicode Collation Algorithm, the character pairs in the
    following list compare as equal at the primary and secondary
    strengths. These pairs should always be treated as different
    characters.

    あ (U+3042) / ぁ (U+3041)
    い (U+3044) / ぃ (U+3043)
    う (U+3046) / ぅ (U+3045)
    え (U+3048) / ぇ (U+3047)
    お (U+304A) / ぉ (U+3049)
    か (U+304B) / ゕ (U+3095)
    け (U+3051) / ゖ (U+3096)
    つ (U+3064) / っ (U+3063)
    や (U+3084) / ゃ (U+3083)
    ゆ (U+3086) / ゅ (U+3085)
    よ (U+3088) / ょ (U+3087)
    わ (U+308F) / ゎ (U+308E)
    ア (U+30A2) / ァ (U+30A1)
    イ (U+30A4) / ィ (U+30A3)
    ウ (U+30A6) / ゥ (U+30A5)
    エ (U+30A8) / ェ (U+30A7)
    オ (U+30AA) / ォ (U+30A9)
    カ (U+30AB) / ヵ (U+30F5)
    ケ (U+30B1) / ヶ (U+30F6)
    ツ (U+30C4) / ッ (U+30C3)
    ヤ (U+30E4) / ャ (U+30E3)
    ユ (U+30E6) / ュ (U+30E5)
    ヨ (U+30E8) / ョ (U+30E7)
    ワ (U+30EF) / ヮ (U+30EE)
    ｱ (U+FF71) / ｧ (U+FF67)
    ｲ (U+FF72) / ｨ (U+FF68)
    ｳ (U+FF73) / ｩ (U+FF69)
    ｴ (U+FF74) / ｪ (U+FF6A)
    ｵ (U+FF75) / ｫ (U+FF6B)
    ﾔ (U+FF94) / ｬ (U+FF6C)
    ﾕ (U+FF95) / ｭ (U+FF6D)
    ﾖ (U+FF96) / ｮ (U+FF6E)
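
    The pairs in this list follow a pattern in the Unicode character
    names: the small letter's name contains "LETTER SMALL", and dropping
    "SMALL" gives the name of the full-size letter. A minimal Python
    sketch that uses only the standard unicodedata module to find such
    pairs might look like this. The helper names full_size_counterpart
    and small_kana_pairs are hypothetical, and the sketch is only meant
    for the kana ranges.

```python
import unicodedata


def full_size_counterpart(ch):
    """Return the full-size letter for a small kana, or None.

    Relies on the naming pattern "... LETTER SMALL X" -> "... LETTER X".
    Intended for kana code points only.
    """
    name = unicodedata.name(ch, "")
    if " LETTER SMALL " not in name:
        return None
    full_name = name.replace(" LETTER SMALL ", " LETTER ", 1)
    try:
        return unicodedata.lookup(full_name)
    except KeyError:
        # No character with the derived name exists.
        return None


def small_kana_pairs(first, last):
    """List (full-size, small) pairs for code points first..last inclusive."""
    pairs = []
    for code_point in range(first, last + 1):
        small = chr(code_point)
        full = full_size_counterpart(small)
        if full is not None:
            pairs.append((full, small))
    return pairs


# Print the hiragana pairs in the same layout as the list above.
for full, small in small_kana_pairs(0x3041, 0x3096):
    print(f"{full} (U+{ord(full):04X}) / {small} (U+{ord(small):04X})")
```

    A helper like this could be used to check whether two strings differ
    only by such pairs before trusting a comparison made at the primary
    or secondary strength.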

    The following page shows how the problem can be reproduced in Safari.

    http://limechat.net/report/unicode-collation-problem.html

    --
    Satoshi Nakagawa
    

