Re: Unicode and Java Questions

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Oct 02 2008 - 03:37:13 CDT

Next message: Mark Davis: "Re: Unicode and Java Questions"

Previous message: Phillips, Addison: "RE: Unicode and Java Questions"
In reply to: Matt Chu: "Unicode and Java Questions"
Next in thread: Mike: "Re: Unicode and Java Questions"
Reply: Mike: "Re: Unicode and Java Questions"

2008/10/2 Matt Chu <matt.chu@gmail.com>:

> For example, suppose that I use NFKC on text that has both halfwidth "ｶ" and
> fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
> (let's say). When I want to convert back to a Japanese-specific encoding, we
> no longer know which ones are halfwidth and which ones are fullwidth. The
> question is, how big of a deal is this in real-world, normal usage?

Applying NFK(C|D) can change the meaning of text, e.g. 10³ becomes
103. Don't do that. NFK(C|D) might be used in user-oriented searching,
where e.g. 3 might match ³, but it must not be forced on all strings.
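
A minimal sketch of the effect described above, assuming the standard
java.text.Normalizer class (available since Java 6). The class name
NfkcDemo and its helper methods are hypothetical, chosen only for
illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
class NfkcDemo {
    // Compatibility composition: folds compatibility variants.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Canonical composition: keeps compatibility variants distinct.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String power = "10\u00B3";      // "10" followed by SUPERSCRIPT THREE
        String halfwidthKa = "\uFF76";  // HALFWIDTH KATAKANA LETTER KA
        String fullwidthKa = "\u30AB";  // KATAKANA LETTER KA

        // NFKC turns the superscript into a plain digit, so the exponent is lost.
        System.out.println(nfkc(power).equals("103"));
        // NFC leaves the superscript alone.
        System.out.println(nfc(power).equals(power));
        // NFKC maps the halfwidth letter to the ordinary one; NFC does not.
        System.out.println(nfkc(halfwidthKa).equals(fullwidthKa));
        System.out.println(nfc(halfwidthKa).equals(halfwidthKa));
    }
}
```

Once NFKC has been applied, the original width or superscript cannot be
recovered from the string alone, which is why it suits search keys
better than stored text.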

NF(C|D) should be pretty safe, even though mixing normalization forms
can cause problems. It is not true that all processes treat
Unicode-equivalent strings as equal; in practice this is far from the case.

> can two code points be equal in some language and not equal in another;
> I am assuming some normalization form here.

I am not sure what you mean by equal here. In any case, NF(|K)(C|D)
does not depend on locale.

> If this is possible, doesn't
> that mean String.equals(...) should have a parameter for locale?

String.equals() in Java does not apply any normalization; it simply
compares sequences of UTF-16 code units.
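
To illustrate, here is a small sketch, again assuming
java.text.Normalizer. The class EqualsDemo and the helper
canonicallyEqual are hypothetical names, not part of any library.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
class EqualsDemo {
    // Compares two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301";  // "e" followed by COMBINING ACUTE ACCENT

        // Different code unit sequences, so equals() reports a mismatch.
        System.out.println(precomposed.equals(decomposed));
        // After normalizing both sides the strings compare equal.
        System.out.println(canonicallyEqual(precomposed, decomposed));
    }
}
```

The two strings are canonically equivalent, yet equals() treats them as
different unless the caller normalizes both sides first.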

> 3) Does it make sense to have locale-agnostic case conversion? Currently I'm
> using ICU4J's Transliterator.getInstance("Any-Lower") and
> Transliterator.getInstance("Any-Upper"). Is this correct?

It is wrong in very rare cases. In Turkish and Azeri the case pairs
are İ/i and I/ı. In Lithuanian, a lowercase i keeps its dot when it
carries an accent.
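
The Turkish case can be sketched with the JDK's own
String.toLowerCase(Locale) and String.toUpperCase(Locale) rather than
the ICU4J transliterators from the question; this is an illustration of
the locale dependence, not a statement about what "Any-Lower" does
internally. The class name CaseDemo and its helpers are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class; the names are illustrative only.
class CaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // Locale-neutral rules pair I with i.
        System.out.println(lower("I", Locale.ROOT).equals("i"));
        System.out.println(upper("i", Locale.ROOT).equals("I"));
        // Turkish rules pair I with dotless i, and i with dotted capital I.
        System.out.println(lower("I", TURKISH).equals("\u0131"));
        System.out.println(upper("i", TURKISH).equals("\u0130"));
    }
}
```

A locale-agnostic conversion therefore gives the wrong letter for
Turkish or Azeri text, while a Turkish-aware conversion gives the wrong
letter for everything else, so the locale has to come from the caller.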

-- 
Marcin Kowalczyk
qrczak@knm.org.pl
http://qrnik.knm.org.pl/~qrczak/

