RE: Unicode and Java Questions

From: Phillips, Addison (addison@amazon.com)
Date: Thu Oct 02 2008 - 19:46:39 CDT



    1) There DOES exist language-dependent string equivalence, as well as Java's built-in language-independent string equivalence. That is, the following situation exists:

    x = "\uXXXX";
    y = "\uYYYY";
    if (locale == A) then x == y else x != y

    No. What we’re saying is you can have:

    Collator col = Collator.getInstance(locale);
    if (x.equals(y)) {
       // Strings that are equal always collate as equal.
       assert col.compare(x, y) == 0 : "this never throws an assertion error";
    } else {
       // Strings that are not equal may still collate as equal.
       assert col.compare(x, y) != 0 : "this can throw an assertion error because some unequal strings compare as equal";
    }

    The “equals” and “compareTo” methods in String are NOT locale sensitive. See http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#compareTo(java.lang.String)
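
    To make the distinction concrete, here is a minimal sketch (not part of the original message). The class and helper names are made up for illustration, and it assumes a US English collator at PRIMARY strength, which ignores case differences:

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorVsEquals {
    // Hypothetical helper: true when the collator for the given locale and
    // strength treats the two strings as equivalent.
    static boolean collatorEqual(String x, String y, Locale locale, int strength) {
        Collator col = Collator.getInstance(locale);
        col.setStrength(strength);
        return col.compare(x, y) == 0;
    }

    public static void main(String[] args) {
        String x = "abc";
        String y = "ABC";
        // String.equals is a plain code-unit comparison and ignores locale.
        System.out.println("equals:   " + x.equals(y));
        // At PRIMARY strength only base letters matter, so case is ignored.
        System.out.println("primary:  " + collatorEqual(x, y, Locale.US, Collator.PRIMARY));
        // At TERTIARY strength the case difference is significant.
        System.out.println("tertiary: " + collatorEqual(x, y, Locale.US, Collator.TERTIARY));
    }
}
```

    The strings are unequal according to String.equals, yet the collator can report them as equivalent, depending on its locale and strength.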


    2) Given that (1) is true and .equals changes based on locale, then doesn't that mean I have to override .hashCode in order to maintain the Java

    (1) is false.


    Map<String, Boolean> map = new HashMap<String, Boolean>(Locale.GERMAN);
    map.put("STRASSE", true);
    map.put("STRAßE", true);
    System.out.println("size = " + map.size()); // I want this to print ONE, not two

    Those strings are not equal. :-)
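
    If a single entry is what you want, one possible approach (an assumption on my part, not something the JDK does for you) is to normalize keys before they go into the map. The sketch below uses a hypothetical normalize() helper based on toUpperCase(Locale.ROOT), which maps "ß" to "SS"; on JDK 5, new Locale("", "") would stand in for Locale.ROOT:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FoldedKeyMap {
    // Hypothetical helper: normalizes a key with the root locale's
    // uppercase mapping before it is used in the map.
    static String normalize(String key) {
        return key.toUpperCase(Locale.ROOT);
    }

    // Puts every key (after normalization) into a HashMap and returns its size.
    static int sizeAfterPuts(String... keys) {
        Map<String, Boolean> map = new HashMap<String, Boolean>();
        for (String key : keys) {
            map.put(normalize(key), Boolean.TRUE);
        }
        return map.size();
    }

    public static void main(String[] args) {
        String a = "STRASSE";
        String b = "STRA\u00DFE"; // "STRAßE", written with an escape to avoid encoding issues
        System.out.println("equals = " + a.equals(b));          // the raw strings differ
        System.out.println("size = " + sizeAfterPuts(a, b));    // both keys normalize to the same string
    }
}
```

    Uppercasing is only a rough stand-in for real case folding; whether it is adequate depends on the data.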

    3) So I know that there exists some values locale1, locale2, and s such that:

    Locale locale1 = ...;
    Locale locale2 = ...;
    String s = "...";
    s.toLowerCase(locale1) != s.toLowerCase(locale2)


    is true.

    And I know that .toLowerCase()/.toUpperCase() is inherently language-dependent, where the locale is inferred from the JVM/environment.
    That’s correct, although we would say “locale-dependent”.
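
    The usual illustration is Turkish, where uppercase "I" lowercases to a dotless "ı" (U+0131). A short sketch, with made-up class and helper names, assuming Java 7 or later for Locale.forLanguageTag:

```java
import java.util.Locale;

public class LocaleSensitiveLowerCase {
    // Hypothetical helper: lowercases s using the locale for the given language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String s = "TITLE";
        String turkish = lower(s, "tr"); // "t\u0131tle": dotless i
        String english = lower(s, "en"); // "title"
        System.out.println(turkish.equals(english)); // false
    }
}
```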


    I'm trying to ask if *language-independent* case *conversions* (not case-folding) exists. That is:

    s.toLowerCase(Locale.NULL)

    or something like that. I guess I'm not sure on how to use the algorithms for case-folding with case conversion, and whether or not it's even appropriate. If case conversion is not appropriate, would I be correct in that the right way to do it is to wrap string in ICU4J's CaseInsensitiveString class?
    Sure: there is a default case-folding. Locale.ROOT (in 1.6 or later) or new Locale("", "") gives you this default case mapping.



    Also, I'm on JDK5, so I don't have Locale.ROOT, but I don't fully understand what new Locale("") does in toUpperCase/toLowerCase; is this the language-independent case conversion I'm looking for?
    The locale with the empty strings in the constructor (colloquially called the root locale) is something like what the C/POSIX locale is in C/C++. It is an English-like locale with mostly default behavior (where default behavior exists). It is also the locale where your root (source) resource bundles live.
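
    As a rough sketch of how that looks on JDK 5, the example below builds the root locale with the two-argument constructor and shows that the result does not follow the JVM default locale. The class name, the ROOT constant, and the lowerRoot() helper are made up for illustration; on newer JDKs the constructor is deprecated and Locale.ROOT is the usual spelling:

```java
import java.util.Locale;

public class RootLocaleCase {
    // Root locale built the pre-1.6 way (empty language and country).
    static final Locale ROOT = new Locale("", "");

    // Hypothetical helper: lowercases s with the root locale's default mapping.
    static String lowerRoot(String s) {
        return s.toLowerCase(ROOT);
    }

    public static void main(String[] args) {
        Locale saved = Locale.getDefault();
        try {
            // Simulate a JVM running with a Turkish default locale.
            Locale.setDefault(new Locale("tr", "TR"));
            // The no-argument overload follows the default locale ...
            System.out.println("TITLE".toLowerCase());
            // ... while the root locale gives the same answer everywhere.
            System.out.println(lowerRoot("TITLE"));
        } finally {
            Locale.setDefault(saved);
        }
    }
}
```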
    You also might want to take this thread to i18n-prog@yahoogroups.com (where internationalization programming is the topic of the day).
    Regards,
    Addison

    Addison Phillips
    Globalization Architect -- Lab126
    Chair -- W3C Internationalization Core WG

    Internationalization is not a feature.
    It is an architecture.




    --
    Naoto Sato


