From: Satoshi Nakagawa (psychs@limechat.net)
Date: Mon Oct 12 2009 - 09:46:10 CDT
Hi,
I found a problem of Unicode Collation Algorithm in Japanese text handling.
In the current Unicode Collation Algorithm, [っ] (U+3063) is considered
as a small form of [つ] (U+3064). Then, these characters are treated as
the same just like [a] and [A] in the primary strength and in the
secondary strength.
As you know, in English, [abc] and [ABC] are treated as the same in a
case insensitive context.
But in Japanese, for example, [あった] and [あつた] are different words in
any contexts. Because in Japanese semantics, [っ] is not considered as
a small form of [つ]. These characters are never treated as the same
characters.
In the current Unicode Collation Algorithm, the character pairs in the
following list are treated as the same in the primary strength and in
the secondary strength. But these character pairs should be always
treated as different characters.
あ (U+3042) / ぁ (U+3041)
い (U+3044) / ぃ (U+3043)
う (U+3046) / ぅ (U+3045)
え (U+3048) / ぇ (U+3047)
お (U+304A) / ぉ (U+3049)
か (U+304B) / ゕ (U+3095)
け (U+3051) / ゖ (U+3096)
つ (U+3064) / っ (U+3063)
や (U+3084) / ゃ (U+3083)
ゆ (U+3086) / ゅ (U+3085)
よ (U+3088) / ょ (U+3087)
わ (U+308F) / ゎ (U+308E)
ア (U+30A2) / ァ (U+30A1)
イ (U+30A4) / ィ (U+30A3)
ウ (U+30A6) / ゥ (U+30A5)
エ (U+30A8) / ェ (U+30A7)
オ (U+30AA) / ォ (U+30A9)
カ (U+30AB) / ヵ (U+30F5)
ケ (U+30B1) / ヶ (U+30F6)
ツ (U+30C4) / ッ (U+30C3)
ヤ (U+30E4) / ャ (U+30E3)
ユ (U+30E6) / ュ (U+30E5)
ヨ (U+30E8) / ョ (U+30E7)
ワ (U+30EF) / ヮ (U+30EE)
ア (U+FF71) / ァ (U+FF67)
イ (U+FF72) / ィ (U+FF68)
ウ (U+FF73) / ゥ (U+FF69)
エ (U+FF74) / ェ (U+FF6A)
オ (U+FF75) / ォ (U+FF6B)
ヤ (U+FF94) / ャ (U+FF6C)
ユ (U+FF95) / ュ (U+FF6D)
ヨ (U+FF96) / ョ (U+FF6E)
In the following page, you can see how this problem reproduces in Safari.
http://limechat.net/report/unicode-collation-problem.html
-- Satoshi Nakagawa
This archive was generated by hypermail 2.1.5 : Mon Oct 12 2009 - 11:12:18 CDT