Tentative Definition of Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sat Jun 10 2006 - 18:09:15 CDT

Next message: SADAHIRO Tomoyuki: "Re: Tentative Definition of Casefolding"

Previous message: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
In reply to: Michael Everson: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Next in thread: SADAHIRO Tomoyuki: "Re: Tentative Definition of Casefolding"
Reply: SADAHIRO Tomoyuki: "Re: Tentative Definition of Casefolding"
Reply: Philippe Verdy: "Re: Tentative Definition of Casefolding"
Maybe reply: Keutgen, Walter: "RE: Tentative Definition of Casefolding"
Maybe reply: Keutgen, Walter: "RE: Tentative Definition of Casefolding"
Maybe reply: Keutgen, Walter: "RE: Tentative Definition of Casefolding"

Michael Everson wrote on Saturday, June 10, 2006 at 12:46 AM
Subject: Re: More Permanent Faults? - Unicode 5.0 Casefolding

> At 00:10 +0100 2006-06-10, Richard Wordingham wrote:

>>As uppercasing, titlecasing and lowercasing may be regarded as
>>relationships on strings, I came to the conclusion that a casefolding was
>>an idempotent function on strings that generated the equivalence class
>>that is the equivalence class generated by the three casing functions.

> Quite honestly, I think you ought to re-write this sentence in English
> which is comprehensible.

Here goes.

I came to the conclusion that a casefolding could be defined as follows.
Firstly, partition the set of all strings as finely as possible into
non-overlapping sets such that the uppercasing, lowercasing and titlecasing
of a string are all in the same set as the original string. Then a
casefolding is a function that maps all the members of each set to a member
of that set.
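As a rough illustration of the partition described above, here is a minimal Python sketch on a finite sample of strings. It assumes Python's built-in str.upper, str.lower and str.title stand in for uc, lc and tc (str.title is only an approximation of titlecasing), and the name casing_classes is hypothetical.

```python
def casing_classes(strings):
    """Group strings into the finest classes closed under upper/lower/title."""
    # First close the sample under the three casing operations.
    pool = set(strings)
    frontier = list(pool)
    while frontier:
        s = frontier.pop()
        for t in (s.upper(), s.lower(), s.title()):
            if t not in pool:
                pool.add(t)
                frontier.append(t)

    # Union-find: every string is merged with each of its casings.
    parent = {s: s for s in pool}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for s in pool:
        for t in (s.upper(), s.lower(), s.title()):
            parent[find(s)] = find(t)

    classes = {}
    for s in pool:
        classes.setdefault(find(s), set()).add(s)
    return list(classes.values())


classes = casing_classes(["dodo", "Cat"])
# One possible casefolding on this sample: pick a fixed member of each class.
fold = {s: min(c) for c in classes for s in c}
```

Any rule that picks one member per class gives a casefolding in the above sense; min() is used here only because it is deterministic.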

An equivalent, alternative formulation is:

Given an uppercasing function uc, a lowercasing function lc, and a
titlecasing function tc on strings, then a casefolding function is a
function f for which:

A: f(X) = f(uc(X)) = f(lc(X)) = f(tc(X)) = f(f(X)) for all strings X, and
B: f(X) = f(Y) only if g(X) = g(Y) for every function g satisfying
condition A.
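Condition A can be spot-checked on sample strings. The sketch below is only that: it assumes Python's str.upper, str.lower and str.title as uc, lc and tc, and satisfies_condition_a is a hypothetical name. Condition B quantifies over all functions satisfying A, so it is not tested here.

```python
def satisfies_condition_a(f, samples, uc=str.upper, lc=str.lower, tc=str.title):
    """Return True if f passes condition A on every sample string."""
    return all(
        f(x) == f(uc(x)) == f(lc(x)) == f(tc(x)) == f(f(x))
        for x in samples
    )


samples = ["dodo", "Dodo", "STRASSE", "stra\u00dfe"]
ok_casefold = satisfies_condition_a(str.casefold, samples)
ok_lower = satisfies_condition_a(str.lower, samples)
# A constant function passes A trivially, which is why B is needed.
ok_constant = satisfies_condition_a(lambda s: "", samples)
```

With Python's default casing, str.lower fails on the sample containing U+00DF, because lowercasing the uppercased string gives "strasse" while lowercasing the original leaves it unchanged.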

Further Comments:

As far as I am aware, repeating the sequence uppercase, lowercase until the
string no longer changes will provide a casefolding. I haven't investigated
pathological systems where this process does not converge. The Lithuanian
locale comes closest, with U+029D LATIN SMALL LETTER J WITH CROSSED-TAIL
followed by ten dots above taking ten cycles to converge to plain U+029D.
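The iteration just described might be sketched as follows. This assumes Python's locale-independent str.upper and str.lower, so it does not reproduce the Lithuanian behaviour; fold_by_iteration and the max_cycles guard are hypothetical additions.

```python
def fold_by_iteration(s, max_cycles=100):
    """Apply uppercase-then-lowercase until the string stops changing."""
    for _ in range(max_cycles):
        nxt = s.upper().lower()
        if nxt == s:
            return s
        s = nxt
    # Guard against a casing system in which the process never settles.
    raise RuntimeError("casing iteration did not converge")


folded = fold_by_iteration("Stra\u00dfe")
```

The guard is there only because convergence is not something I would take for granted in an arbitrary casing system.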

There are other useful properties that a casefolding should have, but I'm
not sure whether I have quite captured them or indeed whether there will
always be such a casefolding. I am assuming that the purpose of casefolding
is simply to strip out casing information, making as little other change as
possible.

C: Unachievable Target:
If X and Y are canonically equivalent, so are f(X) and f(Y). This can fail
because one of the casing operations does not preserve canonical
equivalence.
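One case that appears to show the problem, at least with Python's default uppercasing, involves U+0345 COMBINING GREEK YPOGEGRAMMENI. This is my own choice of illustration and not necessarily the case meant above; canon_eq is a hypothetical helper that compares NFD forms.

```python
import unicodedata


def canon_eq(a, b):
    """Canonical equivalence, tested by comparing NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Alpha with ypogegrammeni and acute, the two marks in either order.
x = "\u03b1\u0345\u0301"
y = "\u03b1\u0301\u0345"

equivalent_before = canon_eq(x, y)
# Uppercasing turns U+0345 into the base letter U+0399, which blocks
# the reordering that made x and y equivalent.
equivalent_after = canon_eq(x.upper(), y.upper())
```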

C: Draft form:
If uc(X) and uc(Y) are canonically equivalent, lc(X) and lc(Y) are
canonically equivalent, and tc(X) and tc(Y) are canonically equivalent, so
are f(X) and f(Y).
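The draft form of C can be checked for a given pair of strings along these lines. It is a sketch only, assuming Python's built-in casing methods for uc, lc and tc and NFD comparison for canonical equivalence; canon_eq and draft_c_holds are hypothetical names.

```python
import unicodedata


def canon_eq(a, b):
    """Canonical equivalence, tested by comparing NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def draft_c_holds(f, x, y):
    """Check draft property C for one pair; vacuously True if the premise fails."""
    premise = (
        canon_eq(x.upper(), y.upper())
        and canon_eq(x.lower(), y.lower())
        and canon_eq(x.title(), y.title())
    )
    return (not premise) or canon_eq(f(x), f(y))


# Precomposed e-acute against e followed by a combining acute.
holds = draft_c_holds(str.casefold, "\u00e9", "e\u0301")
```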

D: Unachievable Target:
If X is the concatenation of X1 and X2, then f(X) is the concatenation of
f(X1) and f(X2). This fails with context-sensitive casing. I need
something like it to stop 'dodo' from casefolding to 'doDO'.

D: Draft form:
If X is the concatenation of X1 and X2, lc(X) is the concatenation of lc(X1)
and lc(X2), uc(X) is the concatenation of uc(X1) and uc(X2), and tc(X) is
the concatenation of tc(X1) and lc(X2) [N.B. lc, not tc!] then f(X) is the
concatenation of f(X1) and f(X2).
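A sketch of the draft form of D for one split of a string, again assuming Python's built-in casing methods; draft_d_holds and bad_fold are hypothetical. The Greek example shows the premise failing because of final sigma, which is the kind of context-sensitive casing that rules out the unqualified target.

```python
def draft_d_holds(f, x1, x2, uc=str.upper, lc=str.lower, tc=str.title):
    """Check draft property D for the split x1 + x2."""
    x = x1 + x2
    premise = (
        lc(x) == lc(x1) + lc(x2)
        and uc(x) == uc(x1) + uc(x2)
        and tc(x) == tc(x1) + lc(x2)  # lc on the second part, not tc
    )
    return (not premise) or f(x) == f(x1) + f(x2)


def bad_fold(s):
    """A deliberately odd fold that sends 'dodo' to 'doDO'."""
    return "doDO" if s.lower() == "dodo" else s.lower()


good = draft_d_holds(str.casefold, "do", "do")
bad = draft_d_holds(bad_fold, "do", "do")

# Final sigma: lowercasing the whole word differs from lowercasing the parts.
word, head, tail = "\u039f\u0394\u039f\u03a3", "\u039f\u0394\u039f", "\u03a3"
premise_fails = word.lower() != head.lower() + tail.lower()
```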

K: Unachievable Target:
If X and Y are compatibility equivalent, so are f(X) and f(Y). This is
unachievable for all the reasons applicable to property (C).

K: Draft form:
If uc(X) and uc(Y) are compatibility equivalent, lc(X) and lc(Y) are
compatibility equivalent, and tc(X) and tc(Y) are compatibility equivalent,
so are f(X) and f(Y).

S: f(X) should not be more than three times the size of X, whether
measured in codepoints, UTF-8 code units or UTF-16 code units.
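Condition S is easy to test mechanically. The sketch below measures a string in all three units and compares before and after; sizes and satisfies_s are hypothetical names, and str.casefold is used only as a convenient candidate f.

```python
def sizes(s):
    """Length in code points, UTF-8 code units and UTF-16 code units."""
    return (
        len(s),
        len(s.encode("utf-8")),
        len(s.encode("utf-16-le")) // 2,
    )


def satisfies_s(f, x, factor=3):
    """True if f(x) is at most `factor` times the size of x in every unit."""
    return all(
        after <= factor * before
        for before, after in zip(sizes(x), sizes(f(x)))
    )


# U+0390 folds to three code points, so it sits exactly on the limit.
within_limit = satisfies_s(str.casefold, "\u0390")
```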

Condition S would have some implications for Lithuanian casing *modified*
to make it preserve canonical equivalence. As U+1E2E LATIN CAPITAL LETTER I
WITH DIAERESIS AND ACUTE would then lowercase to <U+0069, U+0307, U+0308,
U+0301>, it would have to fold to <U+0049 LATIN CAPITAL LETTER I, U+0308,
U+0301>, with many ramifications. (This was the issue that got me
interested in case-folding.) However, does anyone here know how Lithuanians
actually write a diaeresis on the letter 'i'?

Richard.


This archive was generated by hypermail 2.1.5 : Sat Jun 10 2006 - 18:17:02 CDT