Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Jun 09 2006 - 19:10:30 CDT


    > The basic problem is that uppercasing and casefolding may not be
    > Unicode-compliant processes, for the meaning of the resultant string
    > depends on which of the canonically equivalent encodings is chosen.

    There is a slight misunderstanding here. C9 is carefully phrased:

    C9 A process shall not assume that the interpretations of two
    canonical-equivalent character sequences are distinct.
    • The implications of this conformance clause are twofold. First, a
    process is never required to give different interpretations to two
    different, but canonical-equivalent character sequences. Second, no
    process can assume that another process will make a distinction between
    two different, but canonical-equivalent character sequences.
    • Ideally, an implementation would always interpret two
    canonical-equivalent character sequences identically. There are practical
    circumstances under which implementations may reasonably distinguish them.

    C9 basically says that you should respect canonical equivalence, and you
    should be prepared for any other process to respect it. In the standard we
    supply case folding operations that do not, in themselves, require
    normalization, but in edge cases may not respect canonical equivalence.
    We strongly encourage all processing to respect canonical equivalence,
    but we recognize that for some common tasks like case folding, people may
    not want to take on the extra performance cost and code complexity of
    adding normalization just to handle a small number of edge cases. But we
    also define forms of case folding that do, in fact, respect canonical
    equivalence.
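
To make the edge case concrete, here is a minimal Python sketch. It assumes that Python's `str.casefold()` is an acceptable stand-in for the default full case folding, and that wrapping the fold in NFD normalization, NFD(toCasefold(NFD(X))), is one of the equivalence-respecting forms referred to above; that reading is an assumption, not something this message states. The helper name `canonical_casefold` is hypothetical.

```python
import unicodedata


def canonical_casefold(s: str) -> str:
    """Hypothetical helper: NFD, then full case folding, then NFD again."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


# Two canonically equivalent spellings of the same text.
precomposed = "\u1fb3\u0304"  # ALPHA WITH YPOGEGRAMMENI + COMBINING MACRON
decomposed = unicodedata.normalize("NFD", precomposed)  # U+03B1 U+0304 U+0345

# Plain folding works code point by code point, so the iota that comes out
# of the ypogegrammeni lands in a different position in each result.
plain_a = precomposed.casefold()
plain_b = decomposed.casefold()

same_plain = (unicodedata.normalize("NFD", plain_a)
              == unicodedata.normalize("NFD", plain_b))
same_canonical = canonical_casefold(precomposed) == canonical_casefold(decomposed)

print("plain folds canonically equivalent:", same_plain)
print("normalizing folds equal:", same_canonical)
```

The cost of the second form is the two extra normalization passes, which is the trade-off described in the paragraph above.
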

    Mark

    On 6/9/06, Richard Wordingham <richard.wordingham@ntlworld.com> wrote:
    > Mark Davis wrote on Friday, June 09, 2006 at 9:10 PM
    >
    > > 1. The specification of the process by which the case folding mappings
    > > are composed has already been fixed in Unicode 5.0, to note that the
    > > dotless i is [and always has been] an exception.
    >
    > That removes the obvious bug status. Is there any way I could have known
    > what the 5.0 text is?
    >
    > Of course, it seems perverse that the casefolding of the uppercasing of a
    > casefolding should not be canonically equivalent to the original
    > casefolding.
    >
    > I had struggled to work out exactly what a casefolding was. As
    > uppercasing, titlecasing and lowercasing may be regarded as relationships
    > on strings, I came to the conclusion that a casefolding was an idempotent
    > function on strings that generated the equivalence class that is the
    > equivalence class generated by the three casing functions.
    >
    > Under this interpretation, the default full casefolding is a casefolding
    > derived from the default full lowercasing and the modification of the
    > default full uppercasing in which U+0131 is not uppercased to U+0049.
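
The round trip described above (casefold, then uppercase, then casefold again) can be checked with a short Python sketch. It assumes Python's built-in `str.casefold()` and `str.upper()` follow the untailored default full mappings, so no Turkic tailoring is applied; the variable names are illustrative only.

```python
dotless = "\u0131"           # LATIN SMALL LETTER DOTLESS I

folded = dotless.casefold()  # default folding has no mapping for U+0131
upper = folded.upper()       # default uppercasing maps U+0131 to U+0049
refolded = upper.casefold()  # U+0049 folds to U+0069, not back to U+0131


def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


print("folded:  ", code_points(folded))
print("upper:   ", code_points(upper))
print("refolded:", code_points(refolded))

# The folding itself is idempotent; it is the detour through uppercasing
# that moves the string into a different equivalence class.
print("idempotent:", folded.casefold() == folded)
print("round trip preserved:", refolded == folded)
```
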
    >
    > > If someone wants a case folding that
    > > handles Turkic they have to tailor the case folding mappings to handle
    > > them slightly differently.
    >
    > And this may help discourage the use of the dotless small 'i' (U+0131) in
    > the Gaelic subscript!
    >
    > > It would probably be a good idea to document this also
    > > in the data file in the future.
    >
    > A good additional comment would probably be to uncomment the line
    > # 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
    > in SpecialCasing.txt with the comment that it applies for all Turkish
    > casing operations, and add a clone for Azer(baijan)i.
    >
    > > 2. I don't think you're interpreting the stability clause correctly.
    > > What it says is that if you have a string that is in NFKC form, and
    > > only contains characters from Unicode version X, then its casefold will
    > > remain stable in versions after X.
    >
    > I'm sorry you got the wrong impression. My big worry is that casefolding
    > is not correct enough to freeze. We don't really want to have to add the
    > notation of the 'human' locale so we can change it. One thing I am not
    > clear about, though, is how many different casefoldings will be stable.
    > Is it two - the default simple and full casefoldings?
    >
    > > But the sources you are starting with are not canonically equivalent:
    > >
    > > toNFD(U+1FB3 U+0304) = U+03B1 U+0304 U+0345
    > > toNFD(U+1FB9 U+0399) = U+0391 U+0304 U+0399
    >
    > But
    >
    > toCasefold(toNFD(U+1FB3 U+0304)) = U+03B1 U+0304 U+03B9
    > toCasefold(toNFD(U+1FB9 U+0399)) = U+03B1 U+0304 U+03B9
    >
    > so they are canonically caseless matches.
    >
    > The basic problem is that uppercasing and casefolding may not be
    > Unicode-compliant processes, for the meaning of the resultant string
    > depends on which of the canonically equivalent encodings is chosen. If
    > the codepoint by codepoint conversions are performed without adjustment,
    > this situation arises for:
    >
    > 1. Uppercasing of <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI,
    > U+0342 COMBINING GREEK PERISPOMENI> (not a normalised form). I believe
    > this should uppercase to <U+0391, U+0342, U+0399> and casefold to
    > <U+03B1, U+0342, U+03B9>, but the results should definitely at least be
    > canonically equivalent to them, i.e. canonically equivalent to the
    > uppercasing and casefolding of U+1FB7.
    >
    > 2. <U+1FB3, U+0304>, which is NFC and NFKC, should uppercase to <U+0391,
    > U+0304, U+0399> and casefold to <U+03B1, U+0304, U+03B9>.
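
As a rough check of case 2, the Python sketch below uppercases <U+1FB3, U+0304> code point by code point and compares the result with the uppercasing of its NFD form. It assumes Python's `str.upper()` applies the default full uppercase mappings without any reordering of the promoted iota; the helper `code_points` is hypothetical and only formats the output.

```python
import unicodedata


def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


source = "\u1fb3\u0304"  # ALPHA WITH YPOGEGRAMMENI + COMBINING MACRON
source_nfd = unicodedata.normalize("NFD", source)  # U+03B1 U+0304 U+0345

# U+1FB3 uppercases to <U+0391, U+0399>, so the macron ends up after the
# iota; in the NFD form the macron already precedes U+0345.
upper_direct = source.upper()
upper_via_nfd = source_nfd.upper()

print("direct: ", code_points(upper_direct))
print("via NFD:", code_points(upper_via_nfd))

equivalent = (unicodedata.normalize("NFD", upper_direct)
              == unicodedata.normalize("NFD", upper_via_nfd))
print("canonically equivalent:", equivalent)
```

Only the NFD route yields the <U+0391, U+0304, U+0399> sequence proposed in case 2; the direct route attaches the macron to the iota instead.
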
    >
    > It is difficult to argue that the current definition requires that the
    > default uppercasing move iota resulting from ypogegrammeni, and it seems
    > impossible to argue it for default casefolding. However, if the
    > definition is not modified for Unicode 5.0.0, it can never be corrected.
    >
    > I fear I may owe Theodore Smith an apology for insisting that one had to
    > put the promoted iotas in the right place.
    >
    > Richard.