Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jun 09 2006 - 18:10:25 CDT


    Mark Davis wrote on Friday, June 09, 2006 at 9:10 PM

    > 1. The specification of the process by which the case folding mappings are
    > composed has already been fixed in Unicode 5.0, to note that the dotless i
    > is [and always has been] an exception.

    That removes the obvious bug status. Is there any way I could have known
    what the 5.0 text is?

    Of course, it seems perverse that the casefolding of the uppercasing of a
    casefolding should not be canonically equivalent to the original
    casefolding.
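
    The round trip can be checked with a short sketch. It assumes that
    Python's built-in str.upper() and str.casefold() implement the default
    full mappings; the names nfd, original and round_trip are only
    illustrative.

```python
import unicodedata

def nfd(s: str) -> str:
    """Canonical decomposition, used here to test canonical equivalence."""
    return unicodedata.normalize("NFD", s)

original = "\u0131".casefold()            # U+0131 has no default folding
round_trip = original.upper().casefold()  # U+0131 -> U+0049 -> U+0069

# False: the two foldings are not canonically equivalent.
print(nfd(original) == nfd(round_trip))
```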

    I had struggled to work out exactly what a casefolding was. As uppercasing,
    titlecasing and lowercasing may be regarded as relations on strings, I
    came to the conclusion that a casefolding was an idempotent function on
    strings that induces the same equivalence relation as the one generated
    by the three casing functions.

    Under this interpretation, the default full casefolding is a casefolding
    derived from the default full lowercasing and the modification of the
    default full uppercasing in which U+0131 is not uppercased to U+0049.
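
    One way to probe this interpretation is to look for characters whose
    folding disagrees with the folding of their cased forms. The sketch below
    assumes Python's str methods stand in for the default full mappings, and
    it scans only a small Latin range; fold_disagreements is a hypothetical
    helper, not part of any library.

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def fold_disagreements(code_points):
    """Characters whose folding differs, up to canonical equivalence,
    from the folding of their upper-, lower- or titlecased form."""
    found = []
    for cp in code_points:
        c = chr(cp)
        target = nfd(c.casefold())
        for cased in (c.upper(), c.lower(), c.title()):
            if nfd(cased.casefold()) != target:
                found.append(c)
                break
    return found

latin = range(0x0000, 0x0250)
exceptions = fold_disagreements(latin)

# Idempotence: folding an already folded string changes nothing.
idempotent = all(
    chr(cp).casefold().casefold() == chr(cp).casefold() for cp in latin
)

print([f"U+{ord(c):04X}" for c in exceptions], idempotent)
```

    Any character reported is one where the default folding does not generate
    the same classes as the unmodified casing functions; U+0131 is among them.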

    > If someone wants a case folding that
    > handles Turkic they have to tailor the case folding mappings to handle
    > them slightly differently.

    And this may help discourage the use of the dotless small 'i' (U+0131) in
    the Gaelic script!
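
    A tailoring of the kind described might look like the following sketch.
    It assumes the two Turkic ("T") rows of CaseFolding.txt, 0049 -> 0131 and
    0130 -> 0069, are applied ahead of the default folding; TURKIC_FOLD and
    casefold_turkic are hypothetical names.

```python
# Turkic rows of CaseFolding.txt, applied before the default folding.
TURKIC_FOLD = str.maketrans({
    "\u0049": "\u0131",  # LATIN CAPITAL LETTER I -> dotless i
    "\u0130": "\u0069",  # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
})

def casefold_turkic(s: str) -> str:
    """Default full case folding with the Turkic tailoring applied first."""
    return s.translate(TURKIC_FOLD).casefold()

print(casefold_turkic("DI\u015E"))  # capital I folds to dotless i
print("DI\u015E".casefold())        # default folding gives dotted i
```

    With this tailoring the uppercase of U+0131 folds back to U+0131, which
    the default folding does not do.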

    > It would probably be a good idea to document this also
    > in the data file in the future.

    A good addition would probably be to uncomment the line
           # 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
    in SpecialCasing.txt, with a comment that it applies to all Turkish casing
    operations, and to add a clone for Azer(baijan)i.

    > 2. I don't think you're interpreting the stability clause correctly. What
    > it says is that if you have a string that is in NFKC form, and only
    > contains characters from Unicode version X, then its casefold will remain
    > stable in versions after X.

    I'm sorry you got the wrong impression. My big worry is that casefolding is
    not correct enough to freeze. We don't really want to have to add the
    notion of the 'human' locale so we can change it. One thing I am not
    clear about, though, is how many different casefoldings will be stable. Is
    it two - the default simple and full casefoldings?

    > But the sources you are starting with are not canonically equivalent:
    >
    > toNFD(U+1FB3 U+0304) = U+03B1 U+0304 U+0345
    > toNFD(U+1FB9 U+0399) = U+0391 U+0304 U+0399

    But

    toCasefold(toNFD(U+1FB3 U+0304)) = U+03B1 U+0304 U+03B9
    toCasefold(toNFD(U+1FB9 U+0399)) = U+03B1 U+0304 U+03B9

    so they are canonically caseless matches.
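
    The same comparison as a sketch, assuming Python's unicodedata.normalize
    and str.casefold correspond to toNFD and toCasefold;
    canonical_caseless_match is an illustrative name for the test
    NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y))).

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def canonical_caseless_match(x: str, y: str) -> bool:
    """Compare NFD(toCasefold(NFD(X))) with NFD(toCasefold(NFD(Y)))."""
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())

a = "\u1FB3\u0304"  # alpha with ypogegrammeni, combining macron
b = "\u1FB9\u0399"  # capital alpha with macron, capital iota

print(nfd(a) == nfd(b))                # sources not canonically equivalent
print(canonical_caseless_match(a, b))  # but they match caselessly
```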

    The basic problem is that uppercasing and casefolding may not be
    Unicode-compliant processes, for the meaning of the resultant string depends
    on which of the canonically equivalent encodings is chosen. If the
    codepoint by codepoint conversions are performed without adjustment, this
    situation arises for:

    1. Uppercasing of <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI,
    U+0342 COMBINING GREEK PERISPOMENI> (not a normalised form). I believe this
    should uppercase to <U+0391, U+0342, U+0399> and casefold to <U+03B1,
    U+0342, U+03B9>, but the results should definitely at least be canonically
    equivalent to them, i.e. canonically equivalent to the uppercasing and
    casefolding of U+1FB7.

    2. <U+1FB3, U+0304>, which is NFC and NFKC, should uppercase to <U+0391,
    U+0304, U+0399> and casefold to <U+03B1, U+0304, U+03B9>.
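
    Both cases can be reproduced with a sketch, assuming Python's str.upper()
    and str.casefold() apply the default full mappings code point by code
    point without adjustment. Decomposing to NFD first is shown as one
    possible adjustment, not as the definition in the standard.

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

x = "\u1FB3\u0342"  # case 1, canonically equivalent to U+1FB7
y = "\u1FB7"
z = "\u1FB3\u0304"  # case 2

# Equivalent inputs, but code point by code point mapping leaves the
# promoted iota before the perispomeni, so the outputs are not equivalent.
print(nfd(x) == nfd(y))
print(nfd(x.upper()) == nfd(y.upper()))
print(nfd(x.casefold()) == nfd(y.casefold()))

# Decomposing first moves U+0345 after the other marks before it is mapped.
print([f"U+{ord(c):04X}" for c in nfd(x).upper()])
print([f"U+{ord(c):04X}" for c in nfd(z).upper()])
```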

    It is difficult to argue that the current definition requires the default
    uppercasing to move the iota resulting from ypogegrammeni, and it seems
    impossible to argue it for default casefolding. However, if the definition
    is not modified for Unicode 5.0.0, it can never be corrected.

    I fear I may owe Theodore Smith an apology for insisting that one had to put
    the promoted iotas in the right place.

    Richard.


