Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jun 09 2006 - 18:10:25 CDT

Next message: Michael Everson: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

Previous message: Mark Davis: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
In reply to: Mark Davis: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Next in thread: Michael Everson: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Reply: Michael Everson: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Reply: Mark Davis: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

Mark Davis wrote on Friday, June 09, 2006 at 9:10 PM

> 1. The specification of the process by which the case folding mappings are
> composed has already been fixed in Unicode 5.0, to note that the dotless i
> is [and always has been] an exception.

That removes the obvious bug status. Is there any way I could have known
what the 5.0 text is?

Of course, it seems perverse that the casefolding of the uppercasing of a
casefolding should not be canonically equivalent to the original
casefolding.

I had struggled to work out exactly what a casefolding was. As uppercasing,
titlecasing and lowercasing may be regarded as relations on strings, I
came to the conclusion that a casefolding was an idempotent function on
strings whose equivalence classes are exactly those of the equivalence
relation generated by the three casing functions.

Under this interpretation, the default full casefolding is a casefolding
derived from the default full lowercasing and the modification of the
default full uppercasing in which U+0131 is not uppercased to U+0049.
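
A minimal sketch of this interpretation, assuming Python's str.casefold() as a
stand-in for the default full casefolding (that choice, the helper name fold
and the sample strings are illustrative assumptions, not part of the thread).
It checks idempotence and agreement with the three casing functions on a few
samples, and then shows the dotless i exception discussed above:

```python
def fold(s: str) -> str:
    # Assumption: str.casefold() approximates the default full casefolding.
    return s.casefold()

# A few hand-picked samples; this is a spot check, not a proof.
samples = ["Straße", "STRASSE", "\u01C4", "\u01C5", "\u01C6"]

# Idempotence: folding a folded string changes nothing.
idempotent = all(fold(fold(s)) == fold(s) for s in samples)

# Uppercasing, lowercasing and titlecasing stay inside the same class.
same_class = all(
    fold(s) == fold(s.upper()) == fold(s.lower()) == fold(s.title())
    for s in samples
)

# The exception: U+0131 folds to itself, but uppercases to U+0049,
# which folds to U+0069, so the round trip leaves the class.
dotless = "\u0131"
round_trip = fold(fold(dotless).upper())
```

Here round_trip differs from fold(dotless), which is the sense in which the
casefolding of the uppercasing of a casefolding fails to match the original
casefolding.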

> If someone wants a case folding that
> handles Turkic they have to tailor the case folding mappings to handle
> them slightly differently.

And this may help discourage the use of the dotless small 'i' (U+0131) in
the Gaelic script!

> It would probably be a good idea to document this also
> in the data file in the future.

A good additional change would probably be to uncomment the line
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
in SpecialCasing.txt, with a comment that it applies to all Turkish casing
operations, and to add a clone for Azer(baijan)i.
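
For illustration, a Turkic tailoring of casefolding could be sketched as
below. The function name turkic_casefold and the pre-mapping table are
hypothetical; the two mappings follow the 'T' status lines of CaseFolding.txt
(U+0049 to U+0131 and U+0130 to U+0069), and str.casefold() is again assumed
to stand in for the default full casefolding:

```python
# Hypothetical pre-mapping applied before the default folding.
TURKIC_PREFOLD = {
    0x0049: 0x0131,  # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
    0x0130: 0x0069,  # LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
}

def turkic_casefold(s: str) -> str:
    # Apply the Turkic exceptions first, then the default full casefolding.
    return s.translate(TURKIC_PREFOLD).casefold()
```

With this tailoring, the uppercase of U+0131 folds back to U+0131, so the
asymmetry of the default folding does not arise for Turkish or Azerbaijani
text. It is only a sketch and should not be applied to text in other
languages.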

> 2. I don't think you're interpreting the stability clause correctly. What
> it says is that if you have a string that is in NFKC form, and only
> contains characters from Unicode version X, then its casefold will remain
> stable in versions after X.

I'm sorry you got the wrong impression. My big worry is that casefolding is
not correct enough to freeze. We don't really want to have to add the
notion of the 'human' locale so we can change it. One thing I am not
clear about, though, is how many different casefoldings will be stable. Is
it two - the default simple and full casefoldings?

> But the sources you are starting with are not canonically equivalent:
>
> toNFD(U+1FB3 U+0304) = U+03B1 U+0304 U+0345
> toNFD(U+1FB9 U+0399) = U+0391 U+0304 U+0399

But

toCasefold(toNFD(U+1FB3 U+0304)) = U+03B1 U+0304 U+03B9
toCasefold(toNFD(U+1FB9 U+0399)) = U+03B1 U+0304 U+03B9

so they are canonically caseless matches.
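
The comparison above can be reproduced with a short sketch. It assumes
Python's unicodedata.normalize for toNFD and str.casefold() for toCasefold;
the helper names nfd and canonical_caseless_match are illustrative only:

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def canonical_caseless_match(x: str, y: str) -> bool:
    # Decompose, fold, then decompose again before comparing.
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())

x = "\u1FB3\u0304"  # alpha with ypogegrammeni, then combining macron
y = "\u1FB9\u0399"  # capital alpha with macron, then capital iota

sources_equivalent = nfd(x) == nfd(y)
folded_x = nfd(x).casefold()
folded_y = nfd(y).casefold()
```

The two sources are not canonically equivalent, yet folded_x and folded_y are
the same string, so canonical_caseless_match(x, y) holds.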

The basic problem is that uppercasing and casefolding may not be
Unicode-compliant processes, for the meaning of the resultant string depends
on which of the canonically equivalent encodings is chosen. If the
codepoint by codepoint conversions are performed without adjustment, this
situation arises for:

1. Uppercasing of <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI,
U+0342 COMBINING GREEK PERISPOMENI> (not a normalised form). I believe this
should uppercase to <U+0391, U+0342, U+0399> and casefold to <U+03B1,
U+0342, U+03B9>, but the results should definitely at least be canonically
equivalent to them, i.e. canonically equivalent to the uppercasing and
casefolding of U+1FB7.

2. <U+1FB3, U+0304>, which is NFC and NFKC, should uppercase to <U+0391,
U+0304, U+0399> and casefold to <U+03B1, U+0304, U+03B9>.
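
The first case can be checked with a sketch like the following, assuming that
Python's str.upper() and str.casefold() apply the default full mappings code
point by code point with no adjustment (the helper names are illustrative).
It compares the unnormalised sequence with its canonical equivalent U+1FB7,
and then shows what decomposing to NFD before casing gives:

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def canonically_equivalent(a: str, b: str) -> bool:
    return nfd(a) == nfd(b)

a = "\u1FB3\u0342"  # alpha with ypogegrammeni, then combining perispomeni
b = "\u1FB7"        # alpha with perispomeni and ypogegrammeni

inputs_equivalent = canonically_equivalent(a, b)

# Code point by code point casing puts the promoted iota before the accent.
naive_upper_differs = not canonically_equivalent(a.upper(), b.upper())
naive_fold_differs = not canonically_equivalent(a.casefold(), b.casefold())

# Decomposing first reorders the marks, so the iota lands after the accent.
nfd_first_upper = nfd(a).upper()
nfd_first_fold = nfd(a).casefold()
```

Normalising to NFD first is only one possible adjustment; whether the default
algorithms ought to require it is the open question here.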

It is difficult to argue that the current definition requires that the
default uppercasing move the iota resulting from ypogegrammeni, and it seems
impossible to argue it for default casefolding. However, if the definition
is not modified for Unicode 5.0.0, it can never be corrected.

I fear I may owe Theodore Smith an apology for insisting that one had to put
the promoted iotas in the right place.

Richard.


This archive was generated by hypermail 2.1.5 : Fri Jun 09 2006 - 18:17:39 CDT