Re: Unicode and Java Questions

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Oct 02 2008 - 03:37:13 CDT


    2008/10/2 Matt Chu <matt.chu@gmail.com>:

    > For example, suppose that I use NFKC on text that has both halfwidth "ｶ" and
    > fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
    > (let's say). When I want to convert back to a Japanese-specific encoding, we
    > no longer know which ones are halfwidth and which ones are fullwidth. The
    > question is, how big of a deal is this in real-world, normal usage?

    Applying NFK(C|D) can change the meaning of text, e.g. 10³ becomes
    103. Don't apply it blindly. NFK(C|D) might be used in user-oriented
    searching, where e.g. 3 might match ³, but it must not be forced on
    all strings.
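
To make the point above concrete, here is a minimal Java sketch using the JDK's java.text.Normalizer. The class and helper names (NfkcDemo, nfkc, nfc, looseContains) are made up for illustration and do not come from the original thread. It shows NFKC folding the superscript into a plain digit while NFC leaves it alone, and a search-style comparison that folds only temporary copies of the strings.

```java
import java.text.Normalizer;

class NfkcDemo {
    // Compatibility normalization: folds variants such as superscripts
    // and halfwidth forms into their plain counterparts.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Canonical normalization: only unifies canonically equivalent sequences.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Hypothetical search helper: compares NFKC-folded copies and
    // leaves the stored text untouched.
    static boolean looseContains(String text, String query) {
        return nfkc(text).contains(nfkc(query));
    }

    public static void main(String[] args) {
        String power = "10\u00B3"; // "10" followed by SUPERSCRIPT THREE (U+00B3)
        System.out.println(nfkc(power).equals("103"));       // meaning changed
        System.out.println(nfc(power).equals(power));        // NFC keeps the superscript
        System.out.println(looseContains(power, "3"));       // search still matches
        System.out.println(nfkc("\uFF76").equals("\u30AB")); // halfwidth KA folds to fullwidth KA
    }
}
```

Each line should print true. The stored string keeps its superscript; only the copies used for matching are folded.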

    NF(C|D) should be pretty safe, although mixing normalization forms
    can cause problems. It is not true that all processes treat
    canonically equivalent strings as equal; reality is far from it.

    > can two code points be equal in some language and not equal in another;
    > I am assuming some normalization form here.

    I am not sure what you mean by equal here. In any case, NF(|K)(C|D)
    does not depend on locale.

    > If this is possible, doesn't
    > that mean String.equals(...) should have a parameter for locale?

    String.equals() in Java does not apply any normalization; it simply
    compares sequences of UTF-16 code units.
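
A small Java sketch of this behaviour, assuming the JDK's java.text.Normalizer is available (Java 6 or later). The class name EqualsDemo and the helper canonicallyEqual are hypothetical names chosen for the example. Two canonically equivalent spellings of "é" compare unequal with String.equals() and compare equal only after both sides are brought to the same normalization form.

```java
import java.text.Normalizer;

class EqualsDemo {
    // Hypothetical helper: normalize both sides to NFC before comparing.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // precomposed e with acute
        String decomposed = "e\u0301"; // e followed by a combining acute accent

        // Raw comparison of UTF-16 code units: the sequences differ.
        System.out.println(composed.equals(decomposed));

        // Comparison after normalizing both strings to the same form.
        System.out.println(canonicallyEqual(composed, decomposed));
    }
}
```

The first line should print false and the second true. NFD would work equally well as the common form; what matters is that both sides use the same one.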

    > 3) Does it make sense to have locale-agnostic case conversion? Currently I'm
    > using ICU4J's Transliterator.getInstance("Any-Lower") and
    > Transliterator.getInstance("Any-Upper"). Is this correct?

    It is wrong in a few rare cases. In Turkish and Azeri the case
    pairs are İ/i and I/ı. In Lithuanian, a lowercase i keeps its dot
    when an accent is placed above it.
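
The Turkish case can be seen with plain JDK calls. Note that this sketch uses String.toLowerCase(Locale) and String.toUpperCase(Locale) rather than the ICU4J Transliterator from the question, and the class name CaseDemo is made up for the example. It contrasts locale-agnostic conversion (Locale.ROOT) with conversion under a Turkish locale.

```java
import java.util.Locale;

class CaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        // Locale-agnostic mapping: I <-> i.
        System.out.println("I".toLowerCase(Locale.ROOT).equals("i"));
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I"));

        // Turkish mapping: I lowercases to dotless i (U+0131),
        // and i uppercases to dotted capital I (U+0130).
        System.out.println("I".toLowerCase(TURKISH).equals("\u0131"));
        System.out.println("i".toUpperCase(TURKISH).equals("\u0130"));
    }
}
```

Each line should print true. A locale-agnostic conversion is therefore fine for most text, but gives the wrong letters for Turkish or Azeri content.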

    -- 
    Marcin Kowalczyk
    qrczak@knm.org.pl
    http://qrnik.knm.org.pl/~qrczak/
    

