Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Richard Wordingham ([email protected])
Date: Thu Jun 08 2006 - 21:40:16 CDT

Next message: Doug Ewell: "Re: Case folding"

Previous message: Richard Wordingham: "Re: Case folding"
In reply to: Philippe Verdy: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Next in thread: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding - Correction on Lithuanian"
Reply: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding - Correction on Lithuanian"

Philippe Verdy wrote on Thursday, June 08, 2006 11:37 PM
Subject: Re: More Permanent Faults? - Unicode 5.0 Casefolding

> From: "Richard Wordingham" [email protected]

> Actually, to compare strings for canonical caseless matches, one must
> calculate the *closure* of the NFD() and toCasefold() transform functions.
> This means applying NFD() and toCasefold() alternately for as long as the
> result keeps changing. So this may mean:
> toCasefold(NFD(toCasefold(NFD(string)))) or even more applications of the
> functions.

The implication is that the composite map NFD·toCasefold·NFD is the closure;
do you have a counter-example in mind? I took the definition from TUS 3.13.
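
To make the two candidates concrete, here is a minimal Python sketch. It
assumes that Python's str.casefold() stands in for the default toCasefold()
and that unicodedata.normalize("NFD", ...) stands in for NFD(); both follow
the Unicode version bundled with the interpreter. The function names are
mine, not from TUS.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition, standing in for NFD()."""
    return unicodedata.normalize("NFD", s)


def canonical_caseless_key(s: str) -> str:
    """The composite map NFD(toCasefold(NFD(s)))."""
    return nfd(nfd(s).casefold())


def closure_key(s: str, max_rounds: int = 10) -> str:
    """Alternate toCasefold() and NFD() until the string stops changing.

    max_rounds is an arbitrary safety limit chosen for this sketch.
    """
    current = nfd(s)
    for _ in range(max_rounds):
        following = nfd(current.casefold())
        if following == current:
            return current
        current = following
    raise RuntimeError("no fixed point reached within max_rounds")


def first_disagreement(samples):
    """Return the first sample where the two keys differ, or None."""
    for s in samples:
        if canonical_caseless_key(s) != closure_key(s):
            return s
    return None
```

Calling first_disagreement() on a list of test strings is one way to hunt
for the kind of counter-example asked about; a None result only says that
none was found in that list.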

> I would certainly not do that; the common casefolding of a small dotless i
> is a small dotless i, not a normal (dotted) small i. This means that,
> outside of Turkic locales, a small dotless i does NOT match a normal
> (dotted) small i in caseless searches, but it does match a (possibly
> Turkic...or not) dotted capital I.

Matching small dotless I and dotted capital I has only symmetry to recommend
it.
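
For anyone wanting to check what a default (non-Turkic) folding actually
does with the four i variants, here is a small Python sketch. It assumes
str.casefold() implements the default full case folding of whatever Unicode
version the interpreter ships; the helper names are hypothetical.

```python
import unicodedata

# The four variants under discussion.
SMALL_I = "i"                    # U+0069
CAPITAL_I = "I"                  # U+0049
SMALL_DOTLESS_I = "\u0131"       # U+0131
CAPITAL_DOTTED_I = "\u0130"      # U+0130


def fold_key(s: str) -> str:
    """NFD, then default case folding, then NFD again."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_match(a: str, b: str) -> bool:
    """True if a and b have the same default caseless key."""
    return fold_key(a) == fold_key(b)


def match_table():
    """All pairwise comparisons of the four variants, keyed by code point."""
    variants = [SMALL_I, CAPITAL_I, SMALL_DOTLESS_I, CAPITAL_DOTTED_I]
    return {
        (f"U+{ord(a):04X}", f"U+{ord(b):04X}"): caseless_match(a, b)
        for a in variants
        for b in variants
    }
```

With the data in current Python releases, U+0131 folds to itself and so
matches neither U+0069 nor U+0049, while U+0130 folds to U+0069 U+0307.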

> But some caseless searches are implemented by actually comparing the
> result of:
> toUppercase(closure{NFD,toCaseFold}(string))

> This gives different results, because then it will match all 4 variants of
> i (small or capital, with or without dot). But this case is known and
> complicated to handle.

This looks like a way to get round a long-running inconsistency in the
casefolding of U+0131. I've been thinking a lot over the past few days
about just what casefolding means. The upper- and lower-casing functions
can be thought of as relationships on strings ('is the upper case form of'
and 'is the lower case form of'), and as such they generate an equivalence
relationship on strings. The casefolding function is then an idempotent
function such that
(a) 'is the casefolding form of' generates the above equivalence
relationship; and
(b) preserves canonical equivalence.
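
Condition (a) can be tested mechanically on a small sample. The Python
sketch below assumes str.upper(), str.lower() and str.casefold() stand in
for the default mappings; it builds the equivalence classes generated by
upper- and lower-casing with a union-find and reports any class on which the
folding is not constant. The sample and the function names are mine.

```python
def case_classes(universe):
    """Equivalence classes generated by 'upper case of' and 'lower case of'."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for s in universe:
        union(s, s.upper())
        union(s, s.lower())

    classes = {}
    for s in list(parent):
        classes.setdefault(find(s), set()).add(s)
    return list(classes.values())


def inconsistent_classes(classes):
    """Classes whose members do not all share one casefolded form."""
    return [c for c in classes if len({s.casefold() for s in c}) > 1]


# Small illustrative sample: the i variants plus k, K and KELVIN SIGN.
SAMPLE = ["i", "I", "\u0131", "\u0130", "k", "K", "\u212a"]
```

On this sample the only class reported is the one joining U+0069, U+0049
and U+0131: upper-casing links U+0131 to U+0049, yet the default folding
keeps U+0131 apart. That is the U+0131 inconsistency in miniature.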

> There are lots of discussions regarding this implicit dot (over small i or
> small j, and the way it is transformed in combination with other
> diacritics). In practice, the Turkic alternative for case mappings is not
> complete in SpecialCasing.txt, and I think it is not really normative for
> most protocols that need caseless compares: to be complete, one would need
> to make the combining dot above completely ignorable when it is used after a
> letter with an implicit dot in any of its letter cases.

I believe only the default case-mappings and case-folding are normative.
The Turkic mappings have to be incomplete until one can determine what SMALL
LETTER I WITH ACUTE and the like capitalise to. Lithuanian lower-casing
does not preserve canonical equivalence, for U+00CF LATIN CAPITAL LETTER I
WITH DIAERESIS and its decomposition lower-case inequivalently by the rules.
Nevertheless, one can derive a Lithuanian case folding. In fact, one can
derive several reasonable-looking equivalent ones that meet the definition
above. The one that is algorithmically derivable does not preserve
canonical equivalence. However, though I have not double-checked, there is an
equivalent Lithuanian case-folding that works by adding the following rules:

0307; L; After_Soft_Dotted; # COMBINING DOT ABOVE
00CC; L; 0069 0300; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; L; 0069 0301; # LATIN CAPITAL LETTER I WITH ACUTE
00EC; L; 0069 0300; # LATIN SMALL LETTER I WITH GRAVE
00ED; L; 0069 0301; # LATIN SMALL LETTER I WITH ACUTE
0128; L; 0069 0303; # LATIN CAPITAL LETTER I WITH TILDE
0129; L; 0069 0303; # LATIN SMALL LETTER I WITH TILDE

It preserves canonical equivalence if you fix the issue of U+0131.
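
The U+00CF inequivalence mentioned above can be reproduced with a small
Python sketch. It is a simplification under stated assumptions: the three
precomposed mappings and the More_Above rule for I, J and I WITH OGONEK are
my transcription of the Lithuanian section of SpecialCasing.txt, every other
character takes Python's default str.lower(), and lt_lower() and
more_above() are hypothetical names.

```python
import unicodedata

# Lithuanian lower-casing of precomposed capitals (assumed from SpecialCasing.txt).
LT_PRECOMPOSED = {
    "\u00cc": "i\u0307\u0300",  # I WITH GRAVE
    "\u00cd": "i\u0307\u0301",  # I WITH ACUTE
    "\u0128": "i\u0307\u0303",  # I WITH TILDE
}
LT_SOFT_DOTTED_CAPITALS = "IJ\u012e"  # I, J, I WITH OGONEK


def more_above(s: str, i: int) -> bool:
    """Simplified More_Above: a class-230 mark follows before any starter."""
    for ch in s[i + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 230:
            return True
        if ccc == 0:
            return False
    return False


def lt_lower(s: str) -> str:
    """Sketch of Lithuanian lower-casing that keeps the dot explicit."""
    out = []
    for i, ch in enumerate(s):
        if ch in LT_PRECOMPOSED:
            out.append(LT_PRECOMPOSED[ch])
        elif ch in LT_SOFT_DOTTED_CAPITALS and more_above(s, i):
            out.append(ch.lower() + "\u0307")
        else:
            out.append(ch.lower())
    return "".join(out)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


composed = "\u00cf"        # LATIN CAPITAL LETTER I WITH DIAERESIS
decomposed = "I\u0308"     # its canonical decomposition
equivalent_before = nfd(composed) == nfd(decomposed)
equivalent_after = nfd(lt_lower(composed)) == nfd(lt_lower(decomposed))
```

Under these assumptions the two inputs are canonically equivalent before
lower-casing but not after it, because only the decomposed form picks up
U+0307.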

> Then consider the case of the Dutch ij ligated letter: should it match the
> ij letter pair? then how do you consider the dots that are written above
> the ij ligated letter? couldn't it be perceived as a diaeresis above a ij
> pair of letters? We are exactly on borderline cases.

I was just considering the formal requirement. As you can't encode the
Dutch ligature that way, the issue doesn't arise. I agree that practically
you should do a compatibility decomposition on it, but that is not the
Unicode default casefolding.

> So, is the caseFolding() operation really normative?

Yes, the *default* toCasefold() is normative.

> shouldn't it be reformulated using the standard Unicode collation
> algorithm, which is much less ambiguous and can handle many more languages
> than what SpecialCasing.txt is currently providing?

This would probably be more useful. TUS already suggests that approach.

Richard.


This archive was generated by hypermail 2.1.5 : Thu Jun 08 2006 - 21:45:59 CDT