Re: Japanese text handling problem in Unicode Collation Algorithm

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Oct 12 2009 - 12:07:39 CDT

    UTS#10 does not necessarily match the sorting of any particular language. To
    match the requirements of a given language, see the Unicode CLDR collation
    data. To the best of my knowledge, that data follows the JIS standard in
    how it deals with these characters. If that is incorrect, please let us know.

    Mark

    On Mon, Oct 12, 2009 at 07:46, Satoshi Nakagawa <psychs@limechat.net> wrote:

    > Hi,
    >
    > I found a problem with how the Unicode Collation Algorithm handles
    > Japanese text.
    >
    > In the current Unicode Collation Algorithm, [っ] (U+3063) is treated as
    > a small form of [つ] (U+3064). As a result, the two characters compare
    > as equal at the primary and secondary strengths, just like [a] and [A].
    >
    > As you know, in English, [abc] and [ABC] are treated as the same in a
    > case-insensitive context.
    >
    > But in Japanese, for example, [あった] and [あつた] are different words in
    > any context, because in Japanese [っ] is not considered a small form
    > of [つ]. The two are never treated as the same character.
    >
    > In the current Unicode Collation Algorithm, the character pairs in the
    > following list compare as equal at the primary and secondary
    > strengths. These pairs should always be treated as different
    > characters.
    >
    > あ (U+3042) / ぁ (U+3041)
    > い (U+3044) / ぃ (U+3043)
    > う (U+3046) / ぅ (U+3045)
    > え (U+3048) / ぇ (U+3047)
    > お (U+304A) / ぉ (U+3049)
    > か (U+304B) / ゕ (U+3095)
    > け (U+3051) / ゖ (U+3096)
    > つ (U+3064) / っ (U+3063)
    > や (U+3084) / ゃ (U+3083)
    > ゆ (U+3086) / ゅ (U+3085)
    > よ (U+3088) / ょ (U+3087)
    > わ (U+308F) / ゎ (U+308E)
    > ア (U+30A2) / ァ (U+30A1)
    > イ (U+30A4) / ィ (U+30A3)
    > ウ (U+30A6) / ゥ (U+30A5)
    > エ (U+30A8) / ェ (U+30A7)
    > オ (U+30AA) / ォ (U+30A9)
    > カ (U+30AB) / ヵ (U+30F5)
    > ケ (U+30B1) / ヶ (U+30F6)
    > ツ (U+30C4) / ッ (U+30C3)
    > ヤ (U+30E4) / ャ (U+30E3)
    > ユ (U+30E6) / ュ (U+30E5)
    > ヨ (U+30E8) / ョ (U+30E7)
    > ワ (U+30EF) / ヮ (U+30EE)
    > ア (U+FF71) / ァ (U+FF67)
    > イ (U+FF72) / ィ (U+FF68)
    > ウ (U+FF73) / ゥ (U+FF69)
    > エ (U+FF74) / ェ (U+FF6A)
    > オ (U+FF75) / ォ (U+FF6B)
    > ヤ (U+FF94) / ャ (U+FF6C)
    > ユ (U+FF95) / ュ (U+FF6D)
    > ヨ (U+FF96) / ョ (U+FF6E)
    >
    > The following page shows how to reproduce this problem in Safari.
    >
    > http://limechat.net/report/unicode-collation-problem.html
    >
    > --
    > Satoshi Nakagawa
    >
    >
    >
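
The behaviour described in the quoted message can be illustrated with a toy
model. The sketch below is not the UCA and does not use real DUCET or CLDR
weights. It assumes only that a small kana shares its primary key with the
full-size letter and is told apart by a lower-level flag. The helper names
full_size and sort_key are hypothetical, and the full-size letter is derived
from the Unicode character name by dropping the word SMALL. The secondary
level (voicing marks) and the hiragana/katakana distinction are ignored.

```python
import unicodedata


def full_size(ch: str) -> str:
    """Return the full-size kana for a small kana, or ch unchanged."""
    name = unicodedata.name(ch, "")
    if " LETTER SMALL " in name:
        try:
            # e.g. HIRAGANA LETTER SMALL TU -> HIRAGANA LETTER TU
            return unicodedata.lookup(name.replace(" SMALL ", " ", 1))
        except KeyError:
            pass  # no full-size counterpart under that name
    return ch


def sort_key(text: str, strength: int):
    """Toy key: base letters first, small-form flags only at strength 3+."""
    primary = tuple(ord(full_size(ch)) for ch in text)
    if strength < 3:
        return (primary,)
    # 1 marks a small form; the direction of this flag is arbitrary here.
    small_flags = tuple(int(full_size(ch) != ch) for ch in text)
    return (primary, small_flags)


a, b = "あった", "あつた"
print(sort_key(a, 2) == sort_key(b, 2))  # True: same at strength 2
print(sort_key(a, 3) == sort_key(b, 3))  # False: differ at strength 3
```

In this model the two words only become distinct once the comparison
strength reaches the level that carries the small-form flag, which is the
effect the quoted message reports for comparisons at primary or secondary
strength.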


