Re: Japanese text handling problem in Unicode Collation Algorithm

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Wed Oct 14 2009 - 04:14:33 CDT

    On 2009-10-13 16.48, "Satoshi Nakagawa" <psychs@limechat.net> wrote:

    > My point is the difference between small kana letters and big kana
    > letters is weaker than the difference between uppercase and lowercase

    Do you mean "stronger"?

    > in latin alphabets.
    >
    > You can see the fact in Google search.
    >
    > [あつた]
    > http://www.google.com/search?q=%E3%81%82%E3%81%A4%E3%81%9F
    >
    > [あった]
    > http://www.google.com/search?q=%E3%81%82%E3%81%A3%E3%81%9F

    ok.

    > These two queries show completely different results, while [konig] and
    > [König] return the same results.

    I don't get exactly the same results, but ö and o do get mixed up
    (as do k and K).
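
    The kind of folding being discussed can be sketched as follows. This is
    only an illustration, not how any particular search engine works: the
    function name fold is hypothetical, and the approach (canonical
    decomposition, dropping combining marks, then case folding) is one
    common way to get this behaviour. Under it, "König" and "konig" become
    identical, while small っ (U+3063) and big つ (U+3064) stay distinct
    because they are separate code points with no decomposition.

```python
import unicodedata


def fold(text):
    """Hypothetical search folding: strip combining marks, then fold case."""
    # NFD splits "ö" into "o" followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Latin: the diacritic and the case difference both disappear.
print(fold("König") == fold("konig"))

# Kana: small tsu and big tsu are unrelated code points, so they differ.
print(fold("あった") == fold("あつた"))
```

    Note that this particular folding would also strip the voicing marks
    from kana such as が, which is probably not what a Japanese user wants.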

    And I think that is a major problem in web search these days.
    The letters ö and o are "completely different letters", and in my
    everyday usage they are as related as e and u. In my usage, ö
    also collates at the end of the alphabet, not at all near o.
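
    To make the ordering difference concrete, here is a small sketch that
    assumes the Swedish alphabet order (a to z, then å, ä, ö). The names
    SWEDISH_ALPHABET, swedish_key and folded_key are made up for this
    example, and the sort key is deliberately simplistic: it ignores
    everything a real tailored collation would handle beyond the basic
    letter order.

```python
import unicodedata

# Assumed letter order: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Sort key treating ö as its own letter at the end of the alphabet."""
    # Characters outside the assumed alphabet sort after all known letters.
    return [RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.lower()]


def folded_key(word):
    """Sort key that treats ö as a mere variant of o (accent folding)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["öl", "ost", "zebra", "orm"]

# ö as a distinct last letter: "öl" sorts after "zebra".
print(sorted(words, key=swedish_key))

# ö folded into o: "öl" sorts first, next to the o-words.
print(sorted(words, key=folded_key))
```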

    The case is similar for other (apparent) diacritics.

    Treating such letters as equivalent may be fine if you don't know
    how to spell a certain word. But if you do know the spelling, the
    current approach gives a lot of false positives for these kinds of
    searches.

        /kent k


