Japanese text handling problem in Unicode Collation Algorithm

From: Satoshi Nakagawa (psychs@limechat.net)
Date: Mon Oct 12 2009 - 09:46:10 CDT

Next message: Asmus Freytag: "Re: Unicode Haiku Contest"

Previous message: Chris Weber: "detecting ill-formed UTF-8 (was RE: [question] UTF-8 issue)"
Next in thread: Mark Davis ☕: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Reply: Mark Davis ☕: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Maybe reply: Kenneth Whistler: "Re: Japanese text handling problem in Unicode Collation Algorithm"

Hi,

I found a problem in how the Unicode Collation Algorithm handles
Japanese text.

In the current Unicode Collation Algorithm, [っ] (U+3063) is treated as
a small form of [つ] (U+3064). As a result, the two characters compare
as equal at the primary and secondary strengths, just like [a] and [A].

As you know, in English [abc] and [ABC] are treated as the same in a
case-insensitive context.

But in Japanese, [あった] and [あつた], for example, are different words
in every context, because in Japanese [っ] is not regarded as a small
form of [つ]. The two are never treated as the same character.
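
To make the strength levels concrete, here is a minimal Python sketch of
a multi-level comparison. It is only a toy model: the weight table, the
names TOY_WEIGHTS, sort_key and equal_at, and all the numbers are
invented for illustration and are not the real weights or API of any
collation library.

```python
# Toy collation elements: (primary, secondary, tertiary).
# These numbers are invented for illustration; they are NOT real
# Unicode Collation Algorithm weights.
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),    # differs from "a" only at the tertiary level
    "あ": (10, 0, 0),
    "つ": (20, 0, 0),
    "っ": (20, 0, 1),  # differs from "つ" only at the tertiary level
    "た": (30, 0, 0),
}


def sort_key(text, strength):
    """Build a key holding one tuple of weights per level, up to `strength`."""
    elements = [TOY_WEIGHTS[ch] for ch in text]
    return tuple(
        tuple(element[level] for element in elements)
        for level in range(strength)
    )


def equal_at(a, b, strength):
    """Return True if `a` and `b` compare as equal at the given strength."""
    return sort_key(a, strength) == sort_key(b, strength)


for strength in (1, 2, 3):
    print(strength, equal_at("あった", "あつた", strength))
```

In this model the two words only become distinguishable once the
tertiary level is included, which is the behavior described above.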

In the current Unicode Collation Algorithm, the characters in each of
the following pairs compare as equal at the primary and secondary
strengths. These pairs should always be treated as different
characters.

あ (U+3042) / ぁ (U+3041)
い (U+3044) / ぃ (U+3043)
う (U+3046) / ぅ (U+3045)
え (U+3048) / ぇ (U+3047)
お (U+304A) / ぉ (U+3049)
か (U+304B) / ゕ (U+3095)
け (U+3051) / ゖ (U+3096)
つ (U+3064) / っ (U+3063)
や (U+3084) / ゃ (U+3083)
ゆ (U+3086) / ゅ (U+3085)
よ (U+3088) / ょ (U+3087)
わ (U+308F) / ゎ (U+308E)
ア (U+30A2) / ァ (U+30A1)
イ (U+30A4) / ィ (U+30A3)
ウ (U+30A6) / ゥ (U+30A5)
エ (U+30A8) / ェ (U+30A7)
オ (U+30AA) / ォ (U+30A9)
カ (U+30AB) / ヵ (U+30F5)
ケ (U+30B1) / ヶ (U+30F6)
ツ (U+30C4) / ッ (U+30C3)
ヤ (U+30E4) / ャ (U+30E3)
ユ (U+30E6) / ュ (U+30E5)
ヨ (U+30E8) / ョ (U+30E7)
ワ (U+30EF) / ヮ (U+30EE)
ｱ (U+FF71) / ｧ (U+FF67)
ｲ (U+FF72) / ｨ (U+FF68)
ｳ (U+FF73) / ｩ (U+FF69)
ｴ (U+FF74) / ｪ (U+FF6A)
ｵ (U+FF75) / ｫ (U+FF6B)
ﾔ (U+FF94) / ｬ (U+FF6C)
ﾕ (U+FF95) / ｭ (U+FF6D)
ﾖ (U+FF96) / ｮ (U+FF6E)
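
The pairs above can be cross-checked against the Unicode character
names. The following Python sketch uses only the standard unicodedata
module; it derives the full-size counterpart by dropping the word SMALL
from the character name. That name-based rule, and the helper names
full_size and fold_small_kana, are my own assumptions for illustration
and are not how the collation algorithm itself works.

```python
import unicodedata

# The small kana listed in this message, in the same order.
SMALL_KANA = "ぁぃぅぇぉゕゖっゃゅょゎァィゥェォヵヶッャュョヮｧｨｩｪｫｬｭｮ"
# The full-size kana they are paired with, in the same order.
FULL_KANA = "あいうえおかけつやゆよわアイウエオカケツヤユヨワｱｲｳｴｵﾔﾕﾖ"


def full_size(ch):
    """Return the full-size kana for a small kana, else `ch` unchanged."""
    name = unicodedata.name(ch, "")
    for marker in ("HIRAGANA LETTER SMALL ", "KATAKANA LETTER SMALL "):
        if marker in name:
            candidate = name.replace(" SMALL ", " ", 1)
            try:
                return unicodedata.lookup(candidate)
            except KeyError:
                return ch  # no full-size character with that name
    return ch


def fold_small_kana(text):
    """Fold every small kana in `text` to its full-size counterpart."""
    return "".join(full_size(ch) for ch in text)


# Every listed small kana maps to the character it is paired with.
mismatches = [s for s, f in zip(SMALL_KANA, FULL_KANA) if full_size(s) != f]
print("mismatches:", mismatches)

# After folding, the two different words become indistinguishable.
print(fold_small_kana("あった") == "あつた")
```

Folding the small kana like this mimics the effect of comparing at the
primary or secondary strength: [あった] and [あつた] can no longer be
told apart.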

The following page shows how to reproduce this problem in Safari.

http://limechat.net/report/unicode-collation-problem.html

--
Satoshi Nakagawa

This archive was generated by hypermail 2.1.5 : Mon Oct 12 2009 - 11:12:18 CDT