Re: Japanese text handling problem in Unicode Collation Algorithm

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Oct 12 2009 - 16:16:24 CDT

Next message: Deborah Goldsmith: "Re: Japanese text handling problem in Unicode Collation Algorithm"

Previous message: Satoshi Nakagawa: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Maybe in reply to: Satoshi Nakagawa: "Japanese text handling problem in Unicode Collation Algorithm"
Next in thread: Deborah Goldsmith: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Reply: Deborah Goldsmith: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Reply: Satoshi Nakagawa: "Re: Japanese text handling problem in Unicode Collation Algorithm"

Satoshi Nakagawa said:

> I have checked the Unicode CLDR collation data, but it contains data
> only for the tertiary strength.
>
> IMHO, for example, [っ] (U+3063) and [つ] (U+3064) should be treated as
> different characters in the primary strength. Because these are never
> treated as the same characters in Japanese, even if these have similar
> glyphs.

Nobody is disputing that they are not treated as the same characters
in Japanese.

Note that for the purposes of weighting in the DUCET table, the only
difference between "A" and "a" is their tertiary weights -- but it
is quite clear to everyone that they are not the "same" characters
in English or any other language. However, that distinction is not carried
in the collation tables by forcing them to have primary weight distinctions.

I think your point is that U+3063 and U+3064 are not alternate
spellings in Japanese -- so they are lexically distinct in ways
that case pairs of Latin letters typically are not. However, even
for case differences, there are certainly lexical differences in
English (and other languages) where uppercase versus lowercase
are *not* optional, and do make systematic differences in meaning.
See, for example, German, where systematic uppercasing of nouns is
not optional, but a required aspect of spelling -- and where substituting
one character for the other would be considered simply wrong.

>
> I would suggest to modify the Default Unicode Collation Element Table.
>
> In http://www.unicode.org/Public/UCA/latest/allkeys.txt,
>
> 3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
> 3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
> 30C3 ; [.27B0.0020.000F.30C3] # KATAKANA LETTER SMALL TU
> FF6F ; [.27B0.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
> 30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
> FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
> 32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
> 3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
> 30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
>
> this part specifies [っ] (U+3063) and [つ] (U+3064) are treated as the
> same character in the primary strength and the secondary strength.
>
> My suggestion would be like this.
>
> 3063 ; [.3267.0020.000D.3063] # HIRAGANA LETTER SMALL TU
> 3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
> 30C3 ; [.3267.0020.000F.30C3] # KATAKANA LETTER SMALL TU
> FF6F ; [.3267.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
> 30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
> FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
> 32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
> 3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
> 30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
>
> Then [っ] (U+3063) and [つ] (U+3064) are always treated as different characters.

Those particular weights would end up with a collation order completely
unlike what you show there -- as a primary weight of "3267" for the
small tu letters (for that UCA 5.1 version of the table) would result in
small tu sorting after *every* other Japanese syllable (after te, after to,
after na, .... after wa... -- indeed after every other character in
every script except Han by default).
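
To make that concrete, here is a minimal Python sketch that builds simplified UCA-style sort keys from the collation elements quoted above and sorts a few kana under both tables. The weights for small tu, tu and du are copied from the quoted allkeys.txt lines. The entry for te is a placeholder I made up for illustration: only its position (after 27B0, before 3267) matters, and it is not the real DUCET value. The function and table names are likewise hypothetical.

```python
# Collation elements as (primary, secondary, tertiary), copied from the
# allkeys.txt excerpt quoted above (UCA 5.1).
CURRENT = {
    "\u3063": [(0x27B0, 0x0020, 0x000D)],  # HIRAGANA LETTER SMALL TU
    "\u3064": [(0x27B0, 0x0020, 0x000E)],  # HIRAGANA LETTER TU
    "\u3065": [(0x27B0, 0x0020, 0x000E),   # HIRAGANA LETTER DU =
               (0x0000, 0x018B, 0x0002)],  #   TU + voicing mark
    # ASSUMPTION: placeholder weights for HIRAGANA LETTER TE, not taken
    # from allkeys.txt. Only "greater than 27B0, less than 3267" matters.
    "\u3066": [(0x27B1, 0x0020, 0x000E)],
}

# The suggested change: small tu gets primary weight 3267.
PROPOSED = dict(CURRENT)
PROPOSED["\u3063"] = [(0x3267, 0x0020, 0x000D)]


def sort_key(text, table):
    """Simplified three-level sort key: all non-zero primaries, a zero
    separator, all non-zero secondaries, a separator, then tertiaries."""
    elements = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(3):
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0)  # level separator
    return tuple(key)


kana = ["\u3066", "\u3065", "\u3064", "\u3063"]  # te, du, tu, small tu
current_order = sorted(kana, key=lambda s: sort_key(s, CURRENT))
proposed_order = sorted(kana, key=lambda s: sort_key(s, PROPOSED))

print(current_order)   # small tu, tu, du, te
print(proposed_order)  # tu, du, te, small tu
```

With the current weights small tu stays next to tu; with a primary weight of 3267 it moves past te, and by the same logic past every other kana whose primary weight is below 3267.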

> And not only [っ] and [つ], all character pairs in my last mail should
> be also modified as well.
>
> Note that the JIS standard didn't tell about collation algorithm and
> sorting order as far as I know.

Mark is not talking about the JIS X 0208 (or JIS X 0212 or JIS X 0213)
character encoding standard. He is talking about the JIS X 4061-1996 Japanese
sorting standard.

And that standard does specify the distinction of small kana versus their
large kana forms as a *third level* distinction in the sorting.

More precisely:

Level 1: The basic syllabic ordering:

a i u e o ka ki ku ke ko ...

Level 2: Diacritic ordering.

   Voiceless kana << voiced kana << semi-voiced kana (if one exists)

   i.e. ka << ga, ha << ba << pa

Level 3:

Small kana <<< normal kana

Level 4:

Hiragana <<<< Katakana

There is more to it than that, of course, including handling of the
prolonged sound mark, and the iteration marks.
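
The four levels above can be sketched as a toy sort key in Python. This models only the description in this message, not the text of the standard: it covers a handful of kana, ignores the prolonged sound mark and the iteration marks, and the syllable ranks (5 for ka, 17 for tu) are simply illustrative positions in the usual syllabic order. All names are hypothetical.

```python
# Per-character attributes: (syllable rank, voicing, size, script)
#   voicing: 0 = voiceless, 1 = voiced
#   size:    0 = small,     1 = normal
#   script:  0 = hiragana,  1 = katakana
# ASSUMPTION: ranks are illustrative positions in the syllabic order.
KANA = {
    "\u304B": (5, 0, 1, 0),   # hiragana ka
    "\u304C": (5, 1, 1, 0),   # hiragana ga
    "\u3063": (17, 0, 0, 0),  # hiragana small tu
    "\u3064": (17, 0, 1, 0),  # hiragana tu
    "\u3065": (17, 1, 1, 0),  # hiragana du
    "\u30C3": (17, 0, 0, 1),  # katakana small tu
    "\u30C4": (17, 0, 1, 1),  # katakana tu
    "\u30C5": (17, 1, 1, 1),  # katakana du
}


def four_level_key(text):
    """One tuple per level; a later level is consulted only when all
    earlier levels compare equal over the whole string."""
    return tuple(
        tuple(KANA[ch][level] for ch in text)
        for level in range(4)
    )


tu_forms = ["\u30C4", "\u3065", "\u3064", "\u30C3", "\u30C5", "\u3063"]
sorted_tu = sorted(tu_forms, key=four_level_key)

words = ["\u304C\u3064", "\u304B\u3064", "\u304B\u3063"]  # gatu, katu, ka+small tu
sorted_words = sorted(words, key=four_level_key)

print(sorted_tu)
print(sorted_words)
```

In the second list, the voicing difference on the first syllable (level 2) outranks the small versus normal difference on the second syllable (level 3), which is the point of keeping the levels separate.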

But a pretty serious effort was made to get the DUCET table to
match the JIS X 4061 specification as closely as is feasible, given
the architectural constraints of UCA.

So I'm going to disagree with the premise that the DUCET table per se
is at fault here.

The issue, instead, seems to be that since ICU collation is built directly
on UCA and since certain open source (and proprietary) applications are
then built directly on ICU, they surface behavior that may not be optimal
for searching (or sorting) for all languages.

And that actually is not too surprising, either, because DUCET is not
designed to provide optimal behavior for any given language without
tailoring.

But in the case of Japanese, the issue for you seems to boil down to
the fact that a search on a Japanese string in Safari doesn't
distinguish between small and large kana. That amounts to a mistaken
assumption (IMO) that a tertiary distinction is not important in
distinguishing search terms for Japanese. In other words, because
the search is built on an ICU collator set to ignore tertiary distinctions
(i.e. it is effectively "case-folding" for matches), it is giving
false positive matches where you think it shouldn't.
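
A small Python model of that effect, using only the collation elements quoted earlier in this message. It is a sketch of the idea of comparing keys at a given strength, not of how ICU implements searching, and the names `Strength`, `key_at` and `matches` are hypothetical.

```python
from enum import IntEnum


class Strength(IntEnum):
    PRIMARY = 1
    SECONDARY = 2
    TERTIARY = 3


# (primary, secondary, tertiary), copied from the quoted allkeys.txt lines.
DUCET_EXCERPT = {
    "\u3063": [(0x27B0, 0x0020, 0x000D)],  # HIRAGANA LETTER SMALL TU
    "\u3064": [(0x27B0, 0x0020, 0x000E)],  # HIRAGANA LETTER TU
    "\u30C3": [(0x27B0, 0x0020, 0x000F)],  # KATAKANA LETTER SMALL TU
    "\u30C4": [(0x27B0, 0x0020, 0x0011)],  # KATAKANA LETTER TU
    "\u3065": [(0x27B0, 0x0020, 0x000E),   # HIRAGANA LETTER DU
               (0x0000, 0x018B, 0x0002)],
}


def key_at(text, strength):
    """Key truncated to the first `strength` levels; zero weights are
    ignorable at their level and therefore dropped."""
    elements = [ce for ch in text for ce in DUCET_EXCERPT[ch]]
    return tuple(
        tuple(ce[level] for ce in elements if ce[level] != 0)
        for level in range(strength)
    )


def matches(a, b, strength):
    """Two strings match when their keys agree up to `strength`."""
    return key_at(a, strength) == key_at(b, strength)


small_tu, tu, du, kata_tu = "\u3063", "\u3064", "\u3065", "\u30C4"

# Ignoring the tertiary level conflates small and normal kana,
# and hiragana with katakana.
print(matches(small_tu, tu, Strength.SECONDARY))
print(matches(tu, kata_tu, Strength.SECONDARY))
# Including the tertiary level keeps them apart.
print(matches(small_tu, tu, Strength.TERTIARY))
# Voicing is a secondary difference, so it already separates tu and du.
print(matches(tu, du, Strength.SECONDARY))
```

So a matcher that stops at the secondary level folds small tu into tu in the same way it folds "A" into "a", while one that includes the tertiary level distinguishes them without any change to the weights themselves.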

There are various ways to handle this, including tailoring and
language or script-specific differences in handling tertiary distinctions
for the purposes of search terms. But it seems clear to me that
it should *not* be "fixed" by changing the DUCET table at this point --
as that would be guaranteed to upset actual collation and sorting
by other applications, as well as disrupting the basis for any
Japanese tailorings for UCA that may already exist.

--Ken

