Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Jun 09 2006 - 19:10:30 CDT

In reply to: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

> The basic problem is that uppercasing and casefolding may not be
> Unicode-compliant processes, for the meaning of the resultant string depends
> on which of the canonically equivalent encodings is chosen.

There is a slight misunderstanding here. C9 is carefully phrased:

C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
• The implications of this conformance clause are twofold. First, a process is
never required to give different interpretations to two different, but
canonical-equivalent character sequences. Second, no process can assume that
another process will make a distinction between two different, but
canonical-equivalent character sequences.
• Ideally, an implementation would always interpret two canonical-equivalent
character sequences identically. There are practical circumstances under which
implementations may reasonably distinguish them.

C9 basically says that you should respect canonical equivalence, and you
should be prepared for any other process to respect it. In the standard we
supply case folding operations that do not, in themselves, require
normalization, but that in edge cases may not respect canonical equivalence.
We strongly encourage all processing to respect canonical equivalence, but we
recognize that for some common tasks like case folding, people may not want to
take on the extra performance cost and code complexity of adding
normalization just to handle a small number of edge cases. But we also define
forms of case folding that do, in fact, respect canonical equivalence.
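
To make the edge case concrete, here is a minimal sketch in Python. It assumes
that a folding which respects canonical equivalence can be written as "NFD,
then default full case folding, then NFD again", and that Python's
str.casefold() and unicodedata.normalize() are acceptable stand-ins for the
default operations. The helper names canonical_casefold and
canonically_equivalent are made up for this illustration, and the interpreter's
Unicode data will be newer than 5.0.

```python
import unicodedata


def canonical_casefold(s: str) -> str:
    """Assumed formulation: NFD, then full case folding, then NFD again."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Two canonically equivalent spellings of the first example in the thread.
precomposed = "\u1FB3\u0304"  # alpha with ypogegrammeni + combining macron
decomposed = unicodedata.normalize("NFD", precomposed)  # U+03B1 U+0304 U+0345

# Folding without normalization: are the two results still equivalent?
plain_equal = canonically_equivalent(precomposed.casefold(), decomposed.casefold())

# Folding with normalization on both sides of the fold.
canonical_equal = canonical_casefold(precomposed) == canonical_casefold(decomposed)
```

With these inputs, plain_equal should come out False (the iota from the
folded U+1FB3 lands before the macron instead of after it), while
canonical_equal should come out True. The extra cost is the two normalization
passes, which is the trade-off described above.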

Mark

On 6/9/06, Richard Wordingham <richard.wordingham@ntlworld.com> wrote:
> Mark Davis wrote on Friday, June 09, 2006 at 9:10 PM
>
> > 1. The specification of the process by which the case folding mappings are
> > composed has already been fixed in Unicode 5.0, to note that the dotless i
> > is [and always has been] an exception.
>
> That removes the obvious bug status. Is there any way I could have known
> what the 5.0 text is?
>
> Of course, it seems perverse that the casefolding of the uppercasing of a
> casefolding should not be canonically equivalent to the original
> casefolding.
>
> I had struggled to work out exactly what a casefolding was. As uppercasing,
> titlecasing and lowercasing may be regarded as relationships on strings, I
> came to the conclusion that a casefolding was an idempotent function on
> strings that generated the equivalence class that is the equivalence class
> generated by the three casing functions.
>
> Under this interpretation, the default full casefolding is a casefolding
> derived from the default full lowercasing and the modification of the
> default full uppercasing in which U+0131 is not uppercased to U+0049.
>
> > If someone wants a case folding that
> > handles Turkic they have to tailor the case folding mappings to handle
> > them slightly differently.
>
> And this may help discourage the use of the dotless small 'i' (U+0131) in
> the Gaelic subscript!
>
> > It would probably be a good idea to document this also
> > in the data file in the future.
>
> A good additional comment would probably be to uncomment the line
> # 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
> in SpecialCasing.txt with the comment that it applies for all Turkish casing
> operations, and add a clone for Azer(baijan)i.
>
> > 2. I don't think you're interpreting the stability clause correctly. What
> > it says is that if you have a string that is in NFKC form, and only
> > contains characters from Unicode version X, then its casefold will remain
> > stable in versions after X.
>
> I'm sorry you got the wrong impression. My big worry is that casefolding is
> not correct enough to freeze. We don't really want to have to add the
> notation of the 'human' locale so we can change it. One thing I am not
> clear about, though, is how many different casefoldings will be stable. Is
> it two - the default simple and full casefoldings?
>
> > But the sources you are starting with are not canonically equivalent:
> >
> > toNFD(U+1FB3 U+0304) = U+03B1 U+0304 U+0345
> > toNFD(U+1FB9 U+0399) = U+0391 U+0304 U+0399
>
> But
>
> toCasefold(toNFD(U+1FB3 U+0304)) = U+03B1 U+0304 U+03B9
> toCasefold(toNFD(U+1FB9 U+0399)) = U+03B1 U+0304 U+03B9
>
> so they are canonically caseless matches.
>
> The basic problem is that uppercasing and casefolding may not be
> Unicode-compliant processes, for the meaning of the resultant string depends
> on which of the canonically equivalent encodings is chosen. If the
> codepoint by codepoint conversions are performed without adjustment, this
> situation arises for:
>
> 1. Uppercasing of <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI,
> U+0342 COMBINING GREEK PERISPOMENI> (not a normalised form). I believe this
> should uppercase to <U+0391, U+0342, U+0399> and casefold to <U+03B1,
> U+0342, U+03B9>, but the results should definitely at least be canonically
> equivalent to them, i.e. canonically equivalent to the uppercasing and
> casefolding of U+1FB7.
>
> 2. <U+1FB3, U+0304>, which is NFC and NFKC, should uppercase to <U+0391,
> U+0304, U+0399> and casefold to <U+03B1, U+0304, U+03B9>.
>
> It is difficult to argue that the current definition requires that the
> default uppercasing move iota resulting from ypogegrammeni, and it seems
> impossible to argue it for default casefolding. However, if the definition
> is not modified for Unicode 5.0.0, it can never be corrected.
>
> I fear I may owe Theodore Smith an apology for insisting that one had to put
> the promoted iotas in the right place.
>
> Richard.
