RE: Unicode and Java Questions

From: Phillips, Addison (addison@amazon.com)
Date: Wed Oct 01 2008 - 21:16:28 CDT



    1) I want to standardize on a normalization form, but this sentence in Annex 15 (Unicode Normalization Forms) gave me pause:

    "Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text."


    If you are going to standardize on a normalization form, you should standardize on NFC. You should know that there are cases in which NFC alters data in ways incompatible with the best usage for certain (relatively rare, minority) languages. But, generally, NFC is safe.
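
    As a rough illustration of what standardizing on NFC buys you, here is a minimal sketch using the JDK's java.text.Normalizer. The class and method names (NfcDemo, toNfc) are made up for this example; only Normalizer itself comes from the JDK.

```java
import java.text.Normalizer;

final class NfcDemo {
    // Hypothetical helper: returns the NFC (canonical composed) form of the input.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        String composed = "\u00E9";    // precomposed LATIN SMALL LETTER E WITH ACUTE

        // The raw strings are different code point sequences.
        System.out.println(decomposed.equals(composed));        // false
        // After NFC, both spellings collapse to the same sequence.
        System.out.println(toNfc(decomposed).equals(composed)); // true
        // NFC leaves compatibility characters such as U+00BD (one half) alone.
        System.out.println(toNfc("\u00BD").equals("\u00BD"));   // true
    }
}
```

    Normalizing once at the boundary (on input or before storage) keeps later String.equals() calls meaningful without repeated normalization.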

    NFKC is NOT safe. It alters data in a variety of ways. It can be very useful in situations in which you mean to eliminate any ambiguity (namespaces are a good example), but you cannot apply it blindly. Losing round-trip conversion to legacy encodings is just one example of the damage.


    For example, suppose that I use NFKC on text that has both halfwidth "ｶ" and fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ" (let's say). When I want to convert back to a Japanese-specific encoding, we no longer know which ones are halfwidth and which ones are fullwidth. The question is, how big of a deal is this in real-world, normal usage?

    That might not be a big deal (although it can be), but other NFKC mappings alter the text much more deeply. For example, a circled digit becomes just a plain digit. Vulgar fractions such as ½ become a three-character sequence (digit 1, U+2044 FRACTION SLASH, digit 2), so the name of Plan 9’s old windowing system, 8½, would come out as 81/2 ☺. And so on and so forth. You should NEVER apply NFKC to data blindly. It’s a big deal.
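
    The losses described above can be seen directly with java.text.Normalizer. The following is only a sketch; the class and helper names (NfkcLossDemo, nfc, nfkc, codePoints) are invented for the example, and the output is printed as U+XXXX values so that console encoding does not hide the differences.

```java
import java.text.Normalizer;

final class NfkcLossDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Renders each code point as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(String.format("U+%04X ", cp)));
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\uFF76",  // HALFWIDTH KATAKANA LETTER KA
            "\u2460",  // CIRCLED DIGIT ONE
            "8\u00BD"  // digit 8 followed by VULGAR FRACTION ONE HALF
        };
        for (String s : samples) {
            // NFC keeps each sample as is; NFKC replaces the compatibility
            // characters with their "plain" equivalents.
            System.out.println(codePoints(s)
                + " | NFC: " + codePoints(nfc(s))
                + " | NFKC: " + codePoints(nfkc(s)));
        }
    }
}
```

    Once the NFKC result is stored, nothing in the string says which characters used to be halfwidth, circled, or a single fraction character, which is why the conversion cannot be undone.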


    2) Can string equivalence be both locale-agnostic and locale-sensitive? That is, can two code points be equal in some language and not equal in another; I am assuming some normalization form here. If this is possible, doesn't that mean String.equals(...) should have a parameter for locale?

    It helps to read the Javadoc here. String.equals() is a code-point-by-code-point comparison and knows nothing about locale. A given code point is always equal to itself in every locale, so any two strings that contain the same code points in the same order are equal regardless of locale, even in Collator. But String’s own comparisons are purely at the code point level.

    What Collator can do is treat strings as equal at a given strength (weight level) even when they are NOT equivalent code point sequences. For example, case differences or accent differences might be ignored (at some weighting level), so you might consider strings such as Muenchen/München or STRASSE/straße as equivalent.
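
    A minimal sketch of strength-dependent equality with java.text.Collator follows. The class and method names (CollatorEqualityDemo, equalAt) are hypothetical, and which differences a given strength ignores depends on the locale's collation rules, so no result is claimed here for the accent and ß cases.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorEqualityDemo {
    // Hypothetical helper: true if the locale's collator ranks a and b as
    // equal when differences below the given strength are ignored.
    static boolean equalAt(int strength, Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        // Treat precomposed and decomposed accents alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String a = "strasse";
        String b = "STRASSE";

        // String.equals() sees different code points.
        System.out.println(a.equals(b));                                       // false
        // At PRIMARY strength, case is ignored.
        System.out.println(equalAt(Collator.PRIMARY, Locale.GERMAN, a, b));    // true
        // At TERTIARY strength, case counts again.
        System.out.println(equalAt(Collator.TERTIARY, Locale.GERMAN, a, b));   // false

        // Accent and sharp-s handling depends on the collation rules in use.
        System.out.println(equalAt(Collator.PRIMARY, Locale.GERMAN, "M\u00FCnchen", "Munchen"));
        System.out.println(equalAt(Collator.PRIMARY, Locale.GERMAN, "stra\u00DFe", "strasse"));
    }
}
```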

    Also, doesn't this mean that all Collections should take a Collator as an argument?

    No. Sometimes you’ll want a strict code-point based comparison. It depends on what you’re doing with a Collection as to whether using a Collator is a good idea. Collators are expensive compared to String’s comparators, so if your code’s main purpose is merely to do things in *some* deterministic order (but not necessarily for presentation to users), .compareTo() or .compareToIgnoreCase() may very well be good enough. If you are, by contrast, sorting someone’s address book, yes, you’ll need a Collator.
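
    To make the trade-off concrete, here is a small sketch that sorts the same list both ways. The names (SortOrderDemo, sortedByCodeUnits, sortedForUsers) are invented for the example. String's natural order compares UTF-16 code units, which is cheap and deterministic but puts every uppercase ASCII letter before every lowercase one; a Collator orders by the locale's rules instead.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class SortOrderDemo {
    // Cheap, deterministic order based on String.compareTo().
    static List<String> sortedByCodeUnits(List<String> in) {
        List<String> copy = new ArrayList<>(in);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Locale-aware order intended for presentation to users.
    static List<String> sortedForUsers(List<String> in, Locale locale) {
        List<String> copy = new ArrayList<>(in);
        copy.sort(Collator.getInstance(locale)); // Collator is a Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("zebra", "Apple", "apple", "Zoo");

        // Code-unit order: [Apple, Zoo, apple, zebra]
        System.out.println(sortedByCodeUnits(names));
        // Collator order groups the "apple" variants first and puts "zebra" before "Zoo".
        System.out.println(sortedForUsers(names, Locale.ENGLISH));
    }
}
```

    For an internal index or a cache key, the first method is usually enough; for anything a person will read, the second is the one that matches their expectations.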

    3) Does it make sense to have locale-agnostic case conversion? Currently I'm using ICU4J's Transliterator.getInstance("Any-Lower") and Transliterator.getInstance("Any-Upper"). Is this correct?

    “It depends”

    Using Transliterator is probably overkill for most case-insensitive comparisons. There are equalsIgnoreCase() and compareToIgnoreCase() methods right in String that use default, locale-independent case mapping. Non-default case folding is very important in some locales (notably the Turkic languages Turkish and Azeri, plus Lithuanian; see SpecialCasing.txt in the UCD). But for many programmatic operations, you do not want locale-sensitive case folding. It depends on why you are doing the case folding. Is it for a language-specific presentation? Then, probably, you want to use the proper folding. Even then, I would REALLY question using ICU4J. I mean, isn’t String’s toUpperCase(Locale) good enough for you?

    Now, turning it around for a second: you should NEVER call String.toUpperCase() or String.toLowerCase() without passing a locale argument (new Locale("", ""), also available as Locale.ROOT since Java 6, is a good locale to use for locale-neutral behavior). The no-argument versions use the system default locale. If you expect a locale-insensitive result, you’ll get peculiar code failures when running in locales such as Turkish, where dotless and dotted “i” are separate letters.
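
    The Turkish failure mode is easy to reproduce. This sketch uses only String and Locale from the JDK; the class and method names (CaseMappingDemo, upper, lower) are hypothetical. Results are compared against escaped strings so the example does not depend on console encoding.

```java
import java.util.Locale;

final class CaseMappingDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Locale-neutral mapping: what protocol keywords, identifiers, etc. need.
        System.out.println(upper("title", Locale.ROOT).equals("TITLE"));     // true

        // Turkish mapping: 'i' uppercases to U+0130 (capital I with dot above).
        System.out.println(upper("title", TURKISH).equals("T\u0130TLE"));    // true
        // And 'I' lowercases to U+0131 (dotless i).
        System.out.println(lower("TITLE", TURKISH).equals("t\u0131tle"));    // true

        // So a keyword check written with the default locale breaks on a
        // machine whose default locale is Turkish.
        System.out.println(upper("title", TURKISH).equals("TITLE"));         // false
    }
}
```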


    Addison


    Addison Phillips
    Globalization Architect -- Lab126
    Chair -- W3C Internationalization Core WG

    Internationalization is not a feature.
    It is an architecture.



