From: Mark Davis (mark.davis@icu-project.org)
Date: Thu Sep 20 2007 - 12:44:58 CDT
A few observations.
1. IDNA does use NFKC. The mappings are duplicated in the spec
because case mapping is also applied, and they are filtered because
characters that are disallowed before or after don't need mappings.
NFKC works well for identifiers, even ones that are more
"human-language-like", since the characters that behave oddly are
typically not allowed anyway.
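As a concrete (if rough) illustration, here is a minimal Java sketch
of that order of operations -- map and case-fold first, then
normalize. The class and method names are mine, and
toLowerCase(Locale.ROOT) is only a crude stand-in for the real
Nameprep mapping tables:

    import java.text.Normalizer;
    import java.util.Locale;

    // A minimal sketch of the Nameprep order of operations: map
    // (here, just lowercasing, a rough stand-in for the real
    // case-folding tables), then normalize with NFKC.
    public final class IdnFoldSketch {
        static String foldLabel(String label) {
            String mapped = label.toLowerCase(Locale.ROOT);
            return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
        }

        public static void main(String[] args) {
            System.out.println(foldLabel("B\u00FCcher")); // "bücher"
            System.out.println(foldLabel("\u216B"));      // Ⅻ -> "xii"
        }
    }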
2. It's important to be clear that folding for *matching* is a
different kind of process from "normalization". When you are matching,
you don't actually alter the text that you store; instead, you
(logically) transform both the search text and the indexed text so
that a binary comparison erases distinctions that are less relevant to
matching. Matching may also be language-dependent -- for Danish, you
may want to match a-ring (å) against aa. You also want to match cases
that are not canonical or compatibility equivalences, such as curly
quotation marks against straight quotation marks. So while NFC is a
starting point for matching text, it isn't enough.
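By way of illustration, here is a rough Java sketch of folding both
sides for matching. All the names are mine; the quote mapping stands
in for folds beyond canonical/compatibility equivalence, and the
Danish check assumes the JRE ships a Danish tailoring that makes aa a
secondary/tertiary variant of å:

    import java.text.Collator;
    import java.text.Normalizer;
    import java.util.Locale;

    // Sketch: fold both the query and the indexed text the same way
    // and compare only the folded forms; the stored text is never
    // altered. The quote mapping is a fold that is neither a
    // canonical nor a compatibility equivalence.
    public final class MatchFoldSketch {
        static String fold(String s) {
            String t = Normalizer.normalize(s, Normalizer.Form.NFKC);
            t = t.toLowerCase(Locale.ROOT); // stand-in for case folding
            return t.replace('\u2018', '\'').replace('\u2019', '\'')
                    .replace('\u201C', '"').replace('\u201D', '"');
        }

        static boolean matches(String query, String indexed) {
            return fold(indexed).contains(fold(query));
        }

        public static void main(String[] args) {
            // Curly quotes match straight quotes after folding.
            System.out.println(matches("\"quote\"", "\u201Cquote\u201D"));

            // Language-dependent matching: a primary-strength Danish
            // collator treats å and aa as equal, assuming the JRE's
            // Danish collation rules tailor aa as a variant of å.
            Collator da = Collator.getInstance(new Locale("da", "DK"));
            da.setStrength(Collator.PRIMARY);
            System.out.println(da.equals("\u00E5", "aa"));
        }
    }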
3. Matching for search can tolerate a certain degree of imprecision.
You could alter the mapping for ½ to <space>1/2<space>, but it simply
doesn't matter much if 5½ folds to 51/2 for searching: you won't get
any appreciable number of false positives, and users will skip over
the vanishingly small number that are found.
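A hedged sketch of that alternative mapping, with an illustrative
(not exhaustive) fraction table, compared against plain NFKC:

    import java.text.Normalizer;
    import java.util.Map;

    // Sketch: NFKC maps ½ to "1⁄2" (with U+2044 FRACTION SLASH); a
    // search fold could instead map it to " 1/2 " so that word
    // boundaries survive. The table is illustrative, not exhaustive.
    public final class FractionFoldSketch {
        private static final Map<String, String> FRACTIONS = Map.of(
            "\u00BD", " 1/2 ",  // ½
            "\u00BC", " 1/4 ",  // ¼
            "\u00BE", " 3/4 "); // ¾

        static String fold(String s) {
            for (Map.Entry<String, String> e : FRACTIONS.entrySet()) {
                s = s.replace(e.getKey(), e.getValue());
            }
            return Normalizer.normalize(s, Normalizer.Form.NFKC);
        }

        public static void main(String[] args) {
            // Plain NFKC: "5½" -> "51⁄2"
            System.out.println(
                Normalizer.normalize("5\u00BD", Normalizer.Form.NFKC));
            // Custom table: "5½" -> "5 1/2 " (trailing space kept)
            System.out.println(fold("5\u00BD"));
        }
    }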
4. What we've found is that using most of the NFKC mappings, plus case
folding, plus some of the UCA mappings, plus a few others, gives a
pretty good result for language-independent matching.
(Language-dependent matching is more complicated.)
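For the language-independent case, one cheap way to borrow some UCA
mappings is primary-strength root collation. The sketch below is
blunter than the hand-picked combination described above (it erases
accent differences too), so treat it as a starting point rather than
the actual recipe:

    import java.text.CollationKey;
    import java.text.Collator;
    import java.util.Locale;

    // Sketch: compare primary-strength collation keys from the root
    // locale. Primary strength ignores case and accent differences,
    // which is more aggressive than a curated fold; a real matcher
    // would combine selected UCA mappings with NFKC and case folding.
    public final class UcaFoldSketch {
        public static void main(String[] args) {
            Collator root = Collator.getInstance(Locale.ROOT);
            root.setStrength(Collator.PRIMARY);
            CollationKey a = root.getCollationKey("resume");
            CollationKey b = root.getCollationKey("R\u00E9sum\u00E9");
            System.out.println(a.compareTo(b) == 0); // true at primary
        }
    }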