Re: Japanese text handling problem in Unicode Collation Algorithm

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Oct 12 2009 - 12:07:39 CDT

Next message: Charlie Ruland ☘: "Error in UTF #10"

Previous message: Jon Hanna: "Re: Unicode Haiku Contest"
In reply to: Satoshi Nakagawa: "Japanese text handling problem in Unicode Collation Algorithm"
Next in thread: Satoshi Nakagawa: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Reply: Satoshi Nakagawa: "Re: Japanese text handling problem in Unicode Collation Algorithm"

UTS #10 does not necessarily match the sorting of any particular language. To
match the requirements of a given language, see the Unicode CLDR collation
data. To the best of my knowledge, the CLDR data follows the JIS standard for
how to deal with these characters. If that is incorrect, please let us know.

Mark

On Mon, Oct 12, 2009 at 07:46, Satoshi Nakagawa <psychs@limechat.net> wrote:

> Hi,
>
> I found a problem with Japanese text handling in the Unicode Collation Algorithm.
>
> In the current Unicode Collation Algorithm, [っ] (U+3063) is considered
> a small form of [つ] (U+3064). As a result, these characters are treated
> as the same at the primary and secondary strengths, just as [a] and [A]
> are.
>
> As you know, in English, [abc] and [ABC] are treated as the same in a
> case-insensitive context.
>
> But in Japanese, [あった] and [あつた], for example, are different words in
> any context, because in Japanese semantics [っ] is not considered
> a small form of [つ]. These characters are never treated as the same
> character.
>
> In the current Unicode Collation Algorithm, the character pairs in the
> following list are treated as the same at the primary and
> secondary strengths. But these character pairs should always be
> treated as different characters.
>
> あ (U+3042) / ぁ (U+3041)
> い (U+3044) / ぃ (U+3043)
> う (U+3046) / ぅ (U+3045)
> え (U+3048) / ぇ (U+3047)
> お (U+304A) / ぉ (U+3049)
> か (U+304B) / ゕ (U+3095)
> け (U+3051) / ゖ (U+3096)
> つ (U+3064) / っ (U+3063)
> や (U+3084) / ゃ (U+3083)
> ゆ (U+3086) / ゅ (U+3085)
> よ (U+3088) / ょ (U+3087)
> わ (U+308F) / ゎ (U+308E)
> ア (U+30A2) / ァ (U+30A1)
> イ (U+30A4) / ィ (U+30A3)
> ウ (U+30A6) / ゥ (U+30A5)
> エ (U+30A8) / ェ (U+30A7)
> オ (U+30AA) / ォ (U+30A9)
> カ (U+30AB) / ヵ (U+30F5)
> ケ (U+30B1) / ヶ (U+30F6)
> ツ (U+30C4) / ッ (U+30C3)
> ヤ (U+30E4) / ャ (U+30E3)
> ユ (U+30E6) / ュ (U+30E5)
> ヨ (U+30E8) / ョ (U+30E7)
> ワ (U+30EF) / ヮ (U+30EE)
> ｱ (U+FF71) / ｧ (U+FF67)
> ｲ (U+FF72) / ｨ (U+FF68)
> ｳ (U+FF73) / ｩ (U+FF69)
> ｴ (U+FF74) / ｪ (U+FF6A)
> ｵ (U+FF75) / ｫ (U+FF6B)
> ﾔ (U+FF94) / ｬ (U+FF6C)
> ﾕ (U+FF95) / ｭ (U+FF6D)
> ﾖ (U+FF96) / ｮ (U+FF6E)
>
> The following page shows how this problem can be reproduced in Safari.
>
> http://limechat.net/report/unicode-collation-problem.html
>
> --
> Satoshi Nakagawa
>
>
>
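
The behavior described in the quoted message can be illustrated with a toy
model. The sketch below is not the Unicode Collation Algorithm and does not
use DUCET or CLDR data; it only assumes, as the message states, that a small
kana and its full-size counterpart share their lower-level weights and differ
only at the tertiary level. The helper names (base_kana, toy_key) and the
flag values are hypothetical, chosen for illustration. The pairing is derived
from the Unicode character names, e.g. HIRAGANA LETTER SMALL TU and
HIRAGANA LETTER TU.

```python
import unicodedata


def base_kana(ch: str) -> str:
    """Return the full-size counterpart of a small kana, or ch unchanged.

    Relies on the Unicode naming convention "... LETTER SMALL X", which
    also covers the halfwidth katakana forms.
    """
    name = unicodedata.name(ch, "")
    if "HIRAGANA LETTER SMALL " in name or "KATAKANA LETTER SMALL " in name:
        try:
            return unicodedata.lookup(name.replace(" SMALL ", " ", 1))
        except KeyError:
            # No full-size character with the derived name exists.
            return ch
    return ch


def toy_key(text: str, strength: int) -> tuple:
    """Build a simplified sort key (illustrative only, not real UCA weights).

    Below strength 3 only the folded characters are compared, so small and
    full-size kana are indistinguishable. At strength 3 an extra level
    records which positions held a small kana; the flag values are arbitrary.
    """
    folded = tuple(base_kana(c) for c in text)
    if strength < 3:
        return (folded,)
    small_flags = tuple(1 if base_kana(c) != c else 0 for c in text)
    return (folded, small_flags)


if __name__ == "__main__":
    a, b = "あった", "あつた"
    print(toy_key(a, 1) == toy_key(b, 1))  # same key below the tertiary level
    print(toy_key(a, 3) == toy_key(b, 3))  # keys differ at the tertiary level
```

In this model the two words from the message collide at any strength below
tertiary, which is the effect the report describes; a language-specific
tailoring would have to separate them at an earlier level.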


This archive was generated by hypermail 2.1.5 : Mon Oct 12 2009 - 12:09:33 CDT