Re: Japanese text handling problem in Unicode Collation Algorithm

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Oct 12 2009 - 16:16:24 CDT

    Satoshi Nakagawa said:

    > I have checked the Unicode CLDR collation data, but it contains data
    > only for the tertiary strength.
    >
    > IMHO, for example, [っ] (U+3063) and [つ] (U+3064) should be treated as
    > different characters in the primary strength. Because these are never
    > treated as the same characters in Japanese, even if these have similar
    > glyphs.

    Nobody is disputing that they are not treated as the same characters
    in Japanese.

    Note that for the purposes of weighting in the DUCET table, the only
    difference between "A" and "a" is their tertiary weights -- but it
    is quite clear to everyone that they are not the "same" characters
    in English or any other language. However, that distinction is not carried
    in the collation tables by forcing them to have primary weight distinctions.

    I think your point is that U+3063 and U+3064 are not alternate
    spellings in Japanese -- so they are lexically distinct in ways
    that case pairs of Latin letters typically are not. However, even
    for case differences, there are certainly lexical differences in
    English (and other languages) where uppercase versus lowercase
    are *not* optional, and do make systematic differences in meaning.
    See, for example, German, where systematic uppercasing of nouns is
    not optional, but a required aspect of spelling -- and where substituting
    one character for the other would be considered simply wrong.

    >
    > I would suggest to modify the Default Unicode Collation Element Table.
    >
    > In http://www.unicode.org/Public/UCA/latest/allkeys.txt,
    >
    > 3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
    > 3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
    > 30C3 ; [.27B0.0020.000F.30C3] # KATAKANA LETTER SMALL TU
    > FF6F ; [.27B0.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
    > 30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
    > FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
    > 32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
    > 3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
    > 30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
    >
    > this part specifies [っ] (U+3063) and [つ] (U+3064) are treated as the
    > same character in the primary strength and the secondary strength.
    >
    > My suggestion would be like this.
    >
    > 3063 ; [.3267.0020.000D.3063] # HIRAGANA LETTER SMALL TU
    > 3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
    > 30C3 ; [.3267.0020.000F.30C3] # KATAKANA LETTER SMALL TU
    > FF6F ; [.3267.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
    > 30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
    > FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
    > 32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
    > 3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
    > 30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
    >
    > Then [っ] (U+3063) and [つ] (U+3064) are always treated as different characters.

    Those particular weights would end up with a collation order completely
    unlike what you show there -- as a primary weight of "3267" for the
    small tu letters (for that UCA 5.1 version of the table) would result in
    small tu sorting after *every* other Japanese syllable (after te, after to,
    after na, ... after wa ... -- indeed after every other character in
    every script except Han by default).
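
    As a rough illustration of that effect, here is a small Python sketch
    that sorts a few kana by primary weight only. The weights for tu (27B0)
    and for the proposed small tu (3267) are taken from the lines quoted
    above; the weights for te, to and wa are hypothetical placeholders,
    chosen only so that they fall between those two values in syllabary
    order. It is not how UCA builds real sort keys.

```python
# Primary weights under the proposed change.
# TU and SMALL TU come from the quoted allkeys.txt lines; the TE, TO and WA
# values are hypothetical placeholders, not real DUCET weights.
PROPOSED_PRIMARY = {
    "つ": 0x27B0,  # HIRAGANA LETTER TU (quoted)
    "っ": 0x3267,  # HIRAGANA LETTER SMALL TU (proposed value, quoted)
    "て": 0x27B1,  # hypothetical
    "と": 0x27B2,  # hypothetical
    "わ": 0x27E0,  # hypothetical
}


def primary_key(text):
    """Return the list of primary weights for a string of known kana."""
    return [PROPOSED_PRIMARY[ch] for ch in text]


words = ["っ", "わ", "と", "つ", "て"]
ordered = sorted(words, key=primary_key)
print(ordered)  # small tu lands last, after wa
```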

    > And not only [っ] and [つ], all character pairs in my last mail should
    > be also modified as well.
    >
    > Note that the JIS standard didn't tell about collation algorithm and
    > sorting order as far as I know.

    Mark is not talking about the JIS X 0208 (or JIS X 0212 or JIS X 0213)
    character encoding standard. He is talking about the JIS X 4061-1996 Japanese
    sorting standard.

    And that standard does specify the distinction of small kana versus their
    large kana forms as a *third level* distinction in the sorting.

    More precisely:

    Level 1: The basic syllabic ordering:

       a i u e o ka ki ku ke ko ...
       
    Level 2: Diacritic ordering.

       Voiceless kana << voiced kana << semi-voiced kana (if one exists)
       
       i.e. ka << ga, ha << ba << pa
       
    Level 3:

       Small kana <<< normal kana
       
    Level 4:

       Hiragana <<<< Katakana
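
    The four levels above can be sketched as a multi-level sort key. This
    is only a toy model in Python: the KANA table, its numeric ranks and
    the function name are made up for illustration, and it covers just a
    handful of characters. Each level is compared across the whole string
    before the next level is consulted.

```python
# Each entry: (level 1 syllable rank, level 2 voicing, level 3 size, level 4 script)
# voicing: 0 = voiceless, 1 = voiced, 2 = semi-voiced
# size:    0 = small, 1 = normal
# script:  0 = hiragana, 1 = katakana
# All ranks are illustrative values, not taken from JIS X 4061 or DUCET.
KANA = {
    "か": (0, 0, 1, 0),
    "が": (0, 1, 1, 0),
    "っ": (1, 0, 0, 0),
    "つ": (1, 0, 1, 0),
    "ッ": (1, 0, 0, 1),
    "ツ": (1, 0, 1, 1),
    "は": (2, 0, 1, 0),
    "ば": (2, 1, 1, 0),
    "ぱ": (2, 2, 1, 0),
}


def four_level_key(text):
    """Group the weights by level: all level 1 weights first, then level 2, etc."""
    levels = zip(*(KANA[ch] for ch in text))
    return tuple(tuple(level) for level in levels)


voicing_order = sorted(["ぱ", "ば", "は", "が", "か"], key=four_level_key)
tu_order = sorted(["ツ", "つ", "ッ", "っ"], key=four_level_key)
print(voicing_order)
print(tu_order)
```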
       
    There is more to it than that, of course, including handling of the
    prolonged sound mark, and the iteration marks.

    But a pretty serious effort was made to get the DUCET table to
    match the JIS X 4061 specification as closely as is feasible, given
    the architectural constraints of UCA.

    So I'm going to disagree with the premise that the DUCET table per se
    is at fault here.

    The issue, instead, seems to be that since ICU collation is built directly
    on UCA and since certain open source (and proprietary) applications are
    then built directly on ICU, they surface behavior that may not be optimal
    for searching (or sorting) for all languages.

    And that actually is not too surprising, either, because DUCET is not
    designed to provide optimal behavior for any given language without
    tailoring.

    But in the case of Japanese, the issue for you seems to boil down to
    the fact that a search on a Japanese string in Safari doesn't
    distinguish between small and large kana. That amounts to a mistaken
    assumption (IMO) that a tertiary distinction is not important in
    distinguishing search terms for Japanese. In other words, because
    the search is built on an ICU collator set to ignore tertiary distinctions
    (i.e. it is effectively "case-folding" for matches), it is giving
    false positive matches where you think it shouldn't.
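
    To make that concrete, here is a minimal Python sketch of matching at
    a limited strength. The collation elements are the primary, secondary
    and tertiary weights from the allkeys.txt lines quoted earlier; the
    function names and the simplified key construction are assumptions for
    illustration, and this is not the ICU API.

```python
# (primary, secondary, tertiary) weights from the quoted allkeys.txt lines.
CE = {
    "っ": (0x27B0, 0x0020, 0x000D),  # HIRAGANA LETTER SMALL TU
    "つ": (0x27B0, 0x0020, 0x000E),  # HIRAGANA LETTER TU
    "ッ": (0x27B0, 0x0020, 0x000F),  # KATAKANA LETTER SMALL TU
    "ツ": (0x27B0, 0x0020, 0x0011),  # KATAKANA LETTER TU
}


def key(text, strength):
    """Simplified key: weights grouped by level, truncated to `strength` levels."""
    levels = zip(*(CE[ch] for ch in text))
    return tuple(tuple(level) for level in levels)[:strength]


def matches(a, b, strength):
    """Two strings match if their keys are equal at the given strength."""
    return key(a, strength) == key(b, strength)


print(matches("っ", "つ", 2))  # secondary strength: tertiary differences ignored
print(matches("っ", "つ", 3))  # tertiary strength: small vs. normal kana differ
```

    With these weights, a comparison that stops at the secondary level
    reports small tu and tu as a match, and one that includes the tertiary
    level does not.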

    There are various ways to handle this, including tailoring and
    language or script-specific differences in handling tertiary distinctions
    for the purposes of search terms. But it seems clear to me that
    it should *not* be "fixed" by changing the DUCET table at this point --
    as that would be guaranteed to upset actual collation and sorting
    by other applications, as well as disrupting the basis for any
    Japanese tailorings for UCA that may already exist.

    --Ken


