From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Oct 02 2008 - 03:37:13 CDT
2008/10/2 Matt Chu <matt.chu@gmail.com>:
> For example, suppose that I use NFKC on text that has both halfwidth "ｶ" and
> fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
> (let's say). When I want to convert back to a Japanese-specific encoding, we
> no longer know which ones are halfwidth and which ones are fullwidth. The
> question is, how big of a deal is this in real-world, normal usage?
Applying NFK(C|D) can change the meaning of text, e.g. 10³ becomes
103. Don't do that. NFK(C|D) might be used in user-oriented searching
where e.g. 3 might match ³, but it must not be forced on all strings.
NF(C|D) should be pretty safe, even though mixing normalization forms
can cause problems. It is not true that all processes treat
Unicode-equivalent strings as equal; it's far from it in reality.
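
To make the difference concrete, here is a minimal sketch, assuming the JDK's
java.text.Normalizer is available. The class name and the looseContains helper
are made up for illustration. It shows NFKC folding a superscript digit and a
halfwidth katakana letter, NFC leaving the superscript alone, and NFKC being
applied only at comparison time for a search, so the stored text is unchanged.

```java
import java.text.Normalizer;

public class CompatibilityNormalizationDemo {
    // Compatibility normalization: folds superscripts, halfwidth forms and
    // similar presentation variants into their plain counterparts.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Canonical normalization: only composes or reorders canonical equivalents.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Hypothetical search helper: both sides are folded for the comparison
    // only; neither input string is modified or stored in folded form.
    static boolean looseContains(String haystack, String needle) {
        return nfkc(haystack).contains(nfkc(needle));
    }

    public static void main(String[] args) {
        String power = "10\u00B3"; // "10" followed by SUPERSCRIPT THREE

        // NFKC turns the exponent into an ordinary digit.
        System.out.println(nfkc(power));                     // 103

        // NFC leaves the superscript untouched.
        System.out.println(nfc(power).equals(power));        // true

        // HALFWIDTH KATAKANA LETTER KA folds to the ordinary KATAKANA LETTER KA.
        System.out.println(nfkc("\uFF76").equals("\u30AB")); // true

        // A user typing "3" can still find text containing a superscript three.
        System.out.println(looseContains("x\u00B3", "3"));   // true
    }
}
```

Once NFKC has been applied there is no way to tell which characters were
originally superscript or halfwidth, which is why it suits matching better
than storage.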
> can two code points be equal in some language and not equal in another;
> I am assuming some normalization form here.
I am not sure what you mean by equal here. In any case, NF(|K)(C|D)
does not depend on locale.
> If this is possible, doesn't
> that mean String.equals(...) should have a parameter for locale?
String.equals() in Java does not apply any normalization; it simply
compares sequences of UTF-16 code units.
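
A small sketch of what that means in practice, again assuming
java.text.Normalizer from the JDK. The class name and the canonicallyEqual
helper are invented for this example. Note that the normalization call takes
no locale argument.

```java
import java.text.Normalizer;

public class EqualsVersusNormalizationDemo {
    // Hypothetical helper: compares two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT

        // Different code unit sequences, so equals() reports a mismatch.
        System.out.println(precomposed.equals(decomposed));           // false

        // After normalizing both sides the strings compare equal.
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```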
> 3) Does it make sense to have locale-agnostic case conversion? Currently I'm
> using ICU4J's Transliterator.getInstance("Any-Lower") and
> Transliterator.getInstance("Any-Upper"). Is this correct?
It is wrong in a few rare, locale-specific cases. In Turkish and Azeri
the case pairs are İ/i and I/ı. In Lithuanian, an accented lowercase i
keeps its dot.
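
The Turkish case can be illustrated with a short sketch. It uses the JDK's
String.toLowerCase(Locale) and String.toUpperCase(Locale) rather than the
ICU4J Transliterator from the question, and the class and method names are
made up for the example.

```java
import java.util.Locale;

public class LocaleCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Locale-agnostic lowercasing.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lowercasing with Turkish rules: I maps to dotless i.
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Uppercasing with Turkish rules: i maps to dotted capital I.
    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    public static void main(String[] args) {
        // Root locale: the usual I/i pair.
        System.out.println(lowerRoot("I").equals("i"));         // true

        // Turkish: I lowercases to LATIN SMALL LETTER DOTLESS I.
        System.out.println(lowerTurkish("I").equals("\u0131")); // true

        // Turkish: i uppercases to LATIN CAPITAL LETTER I WITH DOT ABOVE.
        System.out.println(upperTurkish("i").equals("\u0130")); // true
    }
}
```

Which variant is right depends on whether the text is known to be Turkish or
Azeri; for identifiers and protocol keywords the root locale is usually the
safer choice.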
-- Marcin Kowalczyk qrczak@knm.org.pl http://qrnik.knm.org.pl/~qrczak/