Re: Unicode and Java Questions

From: Mark Davis (mark@macchiato.com)
Date: Thu Oct 02 2008 - 01:41:09 CDT


    Mark

    On Thu, Oct 2, 2008 at 4:16 AM, Phillips, Addison <addison@amazon.com> wrote:

    >
    > 1) I want to standardize on a normalization form, but this sentence in
    > Annex 15 (Unicode Normalization Forms) gave me pause:
    >
    > "Normalization Forms KC and KD must not be blindly applied to arbitrary
    > text. Because they erase many formatting distinctions, they will prevent
    > round-trip conversion to and from many legacy character sets, and unless
    > supplanted by formatting markup, they may remove distinctions that are
    > important to the semantics of the text."
    >
    >
    >
    >
    >
    > If you are going to standardize on a normalization form, you should
    > standardize on NFC. You should know that there are cases in which NFC alters
    > data in ways incompatible with the best usage for certain (relatively rare,
    > minority) languages. But, generally, NFC is safe.
    >
    >
    >
    > NFKC is NOT safe. It alters data in a variety of ways. It can be very useful
    > in situations in which you mean to eliminate any ambiguity (namespaces are a
    > good example), but you cannot apply it blindly. Legacy encodings are just one
    > example.
    >

    Just to amplify this: the main place for using NFKC is in identifiers, when
    you want to erase (permanently) the differences between, say, fullwidth and
    halfwidth forms.

    >
    >
    > For example, suppose that I use NFKC on text that has both halfwidth "ｶ"
    > and fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
    > (let's say). When I want to convert back to a Japanese-specific encoding, we
    > no longer know which ones are halfwidth and which ones are fullwidth. The
    > question is, how big of a deal is this in real-world, normal usage?
    >
    >
    >
    > That might not be a big deal (although it also can be), but other KC
    > normalizations deeply alter the text. For example, a circled digit becomes
    > just a number. Or the vulgar fractions like ½ become a sequence (1 / 2, so
    > Plan9's old windowing system would be 81/2 :-) ). And so on and so forth. You
    > should NEVER apply NFKC to data blindly. It's a big deal.
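
To make the difference concrete, here is a minimal sketch using the JDK's java.text.Normalizer (available since Java 6). The class and method names (NormalizationDemo, nfc, nfkc, codePoints) are made up for illustration; only Normalizer.normalize and the Form constants are real JDK API. It runs the examples from this thread through both forms, so you can see that NFC leaves the compatibility characters alone while NFKC rewrites them.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class; the helper names are illustrative only.
final class NormalizationDemo {

    /** Canonical composition only; compatibility characters are left alone. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Form.NFC);
    }

    /** Compatibility decomposition followed by canonical composition. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Form.NFKC);
    }

    /** Renders a string as U+XXXX code points so the differences are visible. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\uFF76",   // halfwidth katakana KA
            "\u30AB",   // ordinary (fullwidth) katakana KA
            "\u2460",   // circled digit one
            "8\u00BD"   // digit eight followed by vulgar fraction one half
        };
        for (String s : samples) {
            System.out.println(codePoints(s)
                    + " | NFC: " + codePoints(nfc(s))
                    + " | NFKC: " + codePoints(nfkc(s)));
        }
    }
}
```

After NFKC the halfwidth and fullwidth KA are the same code point, so nothing downstream can tell which one the user typed. That is the round-trip loss the Annex 15 sentence warns about.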
    >
    >
    >
    > 2) Can string equivalence be both locale-agnostic and locale-sensitive?
    > That is, can two code points be equal in some language and not equal in
    > another; I am assuming some normalization form here. If this is possible,
    > doesn't that mean String.equals(...) should have a parameter for locale?
    >
    >
    >
    > It helps to read the Javadoc here. String.equals() is about
    > code-point-by-code-point comparison. Any two code points, considered in
    > isolation, will always be equal in all locales. In addition, any two strings
    > that contain the same code points in the same order are equal, regardless of
    > locale (sensitivity or not), even in Collator. But String's comparisons are
    > purely at the code point level.
    >
    >
    >
    > What Collator can do is compare strings as equal for a given weight that
    > are NOT equivalent code point sequences. For example, case differences or
    > accent differences might be ignored (at some weighting level), so you might
    > consider strings such as Muenchen/München or STRASSE/straße as equivalent.
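
A minimal sketch of that distinction, using the JDK's java.text.Collator for the sake of a dependency-free example. The class and helper names (EqualityDemo, sameCodePoints, collatesEqual) are hypothetical. The case-only example follows the one in the Collator Javadoc; whether pairs such as STRASSE/straße compare equal depends on the collation rules and strength in use, so they are not asserted here.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; the helper names are illustrative only.
final class EqualityDemo {

    /** Equality in the String.equals() sense: same sequence, nothing ignored. */
    static boolean sameCodePoints(String a, String b) {
        return a.equals(b);
    }

    /** Equality at a chosen collation strength for a given locale. */
    static boolean collatesEqual(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Case differs, so the strings are not equal for String.equals().
        System.out.println(sameCodePoints("abc", "ABC"));
        // At PRIMARY strength the collator ignores the case difference.
        System.out.println(collatesEqual("abc", "ABC", Locale.US, Collator.PRIMARY));
        // At TERTIARY strength the case difference counts again.
        System.out.println(collatesEqual("abc", "ABC", Locale.US, Collator.TERTIARY));
        // Canonically equivalent forms are still different code point sequences.
        System.out.println(sameCodePoints("\u00E9", "e\u0301"));
    }
}
```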
    >
    >
    >
    > Also, doesn't this mean that all Collections should take a Collator as an
    > argument?
    >
    >
    >
    > No. Sometimes you'll want a strict code-point based comparison. It depends
    > on what you're doing with a Collection as to whether using a Collator is a
    > good idea. Collators are expensive compared to String's comparators, so if
    > your code's main purpose is merely to do things in **some** deterministic
    > order (but not necessarily for presentation to users), .compareTo() or
    > .compareToIgnoreCase() may very well be good enough. If you are, by
    > contrast, sorting someone's address book, yes, you'll need a Collator.
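
As a rough illustration of the two kinds of ordering, the sketch below sorts the same list once with String's natural order (compareTo) and once with a Collator. The names SortDemo, sortNatural and sortForDisplay are invented for this example, and the JDK Collator is used only to keep it self-contained.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the helper names are illustrative only.
final class SortDemo {

    /** Deterministic but not user-friendly: String.compareTo() order. */
    static List<String> sortNatural(List<String> input) {
        List<String> out = new ArrayList<>(input);
        out.sort(Comparator.naturalOrder());
        return out;
    }

    /** Locale-aware order, suitable for lists shown to users. */
    static List<String> sortForDisplay(List<String> input, Locale locale) {
        List<String> out = new ArrayList<>(input);
        out.sort(Collator.getInstance(locale)); // Collator is a Comparator<Object>
        return out;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("cherry", "apple", "Banana");
        // Uppercase letters sort before lowercase ones in compareTo() order.
        System.out.println(sortNatural(names));
        // The collator orders by letter first and treats case as a minor difference.
        System.out.println(sortForDisplay(names, Locale.ENGLISH));
    }
}
```

The first form is fine for keys in a sorted map or for reproducible output. The second is the one to use for an address book.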
    >
    >
    > 3) Does it make sense to have locale-agnostic case conversion? Currently
    > I'm using ICU4J's Transliterator.getInstance("Any-Lower") and
    > Transliterator.getInstance("Any-Upper"). Is this correct?
    >
    >
    >
    > "It depends"
    >
    >
    >
    > Using Transliterator is probably overkill for most case insensitive
    > comparisons. There are equalsIgnoreCase() methods right in String that use
    > default case folding. Non-default case folding is very important in some
    > locales (notably Turkic languages, Lithuanian, and a few others; see
    > SpecialCasing.txt in the UCD). But for many programmatic operations, you do
    > not want locale-sensitive case folding. It depends on why you are doing the
    > case folding. Is it for a language specific presentation? Then, probably,
    > you want to use the proper folding. Even then, I would REALLY question using
    > ICU4J. I mean, isn't String's toUpperCase(#locale) good enough for you?
    >

    The Transliterator in this case basically calls through to ICU's regular
    casing functions on UCharacter. So calling the latter would be faster. The
    time where you'd want to use the transliterator would be where you wanted
    some of the filtering or combination features, like

    Transliterator.getInstance("[:script=Greek:] Any-Lower"); // to only affect Greek script chars

    The UCharacter functions are roughly equivalent to the JDK functions on
    String; the difference with ICU4J is that you always get the most
    up-to-date versions of Unicode, and ICU4J functions in general are faster.

    For collation, you always want to use ICU4J - the JDK functionality is far
    out of date and does not conform to the Unicode collation algorithm, and
    ICU4J is *much* faster (up to 30x, depending on locale/text).

    >
    > Now, turning it around for a second, you definitely should NEVER use
    > String.toUpperCase() or String.toLowerCase() without passing a locale
    > argument (new Locale("","") is a good locale to use for default behavior).
    > These methods use the system default locale. If you expect a
    > locale-insensitive operation to follow, you'll have peculiar code failures
    > in locales such as Turkish, where dotless/dotted "i" exists.
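
The Turkish failure is easy to reproduce. The sketch below is only an illustration: CaseDemo and its helpers are hypothetical names. Locale.ROOT (added in Java 6) is the JDK constant for the same empty locale as new Locale("",""), and Locale.forLanguageTag("tr") (Java 7 and later) is used here to get a Turkish locale.

```java
import java.util.Locale;

// Hypothetical demo class; the helper names are illustrative only.
final class CaseDemo {

    /** Locale-insensitive uppercasing, safe for keys, protocol tokens and the like. */
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    /** Locale-sensitive uppercasing, for text presented in that language. */
    static String upperIn(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    /** Locale-sensitive lowercasing. */
    static String lowerIn(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Root locale: plain ASCII result.
        System.out.println(upperRoot("title"));
        // Turkish: "i" uppercases to dotted capital I (U+0130), not "I".
        System.out.println(upperIn("title", turkish));
        // Turkish: "I" lowercases to dotless i (U+0131), not "i".
        System.out.println(lowerIn("TITLE", turkish));
        // So a comparison that assumes ASCII results fails under a Turkish locale.
        System.out.println(upperIn("title", turkish).equals("TITLE"));
    }
}
```

With no argument, toUpperCase() and toLowerCase() behave like the Turkish calls above whenever the system default locale happens to be Turkish, which is why the failure tends to show up only on some users' machines.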
    >
    >
    >
    > Addison
    >
    >
    >
    >
    >
    > Addison Phillips
    >
    > Globalization Architect -- Lab126
    >
    > Chair -- W3C Internationalization Core WG
    >
    >
    >
    > Internationalization is not a feature.
    >
    > It is an architecture.
    >
    >
    >
    >
    >


