From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 16 2003 - 18:14:26 EST
> Here's what happens exactly:
Note the rules in CaseFolding.txt:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I
Note also that the other 'i's are mapped to themselves by default.
No explicit case folding mapping is defined for them, so we currently also
have these defaults:
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
And we also have the explicitly dotted Turkic lowercase i, whose folding is
defined by the 5th of all rules above (thankfully, there's no canonical
equivalence between 0069 0307 and 0069):
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT
And for the decomposition of the Turkic dotted uppercase I, case folding is
defined by the 1st or 2nd of all rules above (note that 0049 0307 and 0130
should be canonically equivalent, and should produce identical case foldings
with the 3rd or 4th rules above, to preserve canonical equivalence):
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT
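As a rough illustration, the four explicit rules quoted above can be put in a small Python table and applied per code point. This is only a sketch: the names `RULES`, `fold`, `SIMPLE`, `FULL` and `TURKIC` are hypothetical, and it assumes that a Turkic fold tries T rules before C rules and that a full fold tries F rules before C rules. Code points without a rule fold to themselves, which also yields the defaults for the sequences with 0307 listed above.

```python
# Explicit rules quoted in this message, keyed by (code point, status).
# Status letters as in CaseFolding.txt: C = common, F = full, T = Turkic.
RULES = {
    (0x0049, "C"): [0x0069],
    (0x0049, "T"): [0x0131],
    (0x0130, "F"): [0x0069, 0x0307],
    (0x0130, "T"): [0x0069],
}

# Assumed priority orders (hypothetical names, not part of the UCD).
SIMPLE = ("C",)       # no S rule exists for these letters
FULL = ("F", "C")
TURKIC = ("T", "C")

def fold(text, statuses):
    """Fold each code point with the first matching rule; otherwise keep it."""
    out = []
    for ch in text:
        cp = ord(ch)
        for status in statuses:
            if (cp, status) in RULES:
                out.extend(RULES[(cp, status)])
                break
        else:
            out.append(cp)  # default: the code point maps to itself
    return "".join(chr(c) for c in out)

if __name__ == "__main__":
    for name, statuses in (("simple", SIMPLE), ("full", FULL), ("turkic", TURKIC)):
        for src in ("I", "\u0130", "I\u0307"):
            folded = fold(src, statuses)
            print(name, [f"{ord(c):04X}" for c in src],
                  "->", [f"{ord(c):04X}" for c in folded])
```

Running the script prints, for each folding type, the result for 0049, 0130 and 0049 0307, which can be compared against the class listings that follow.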
********************************************************
Now let's look at each CaseFolding type, and look at their result:
------------------------------------
(1) Mappings for Simple CaseFolding:
------------------------------------
(1.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
(1.2) Second class of source strings:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
(1.3) Third class of source strings:
0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
(1.4) Fourth class of source strings:
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT
Do these classes hold up (neither merging nor splitting) under the
uppercase/titlecase or lowercase mappings?
(1.1) 0131; lower=0131 ; upper/title=0131
(1.2) 0049; lower=0069 ; upper/title=0049
(1.2) 0069; lower=0069 ; upper/title=0049
(1.3) 0130; lower=0130 ; upper/title=0130
(1.4) 0049 0307; lower=0069 0307; upper/title=0049 0307
(1.4) 0069 0307; lower=0069 0307; upper/title=0049 0307
OK, there's no merge, so there is no problem with Simple CaseFolding, which
is stable under the case mappings.
------------------------------------
(2) Mappings for Turkic CaseFolding:
------------------------------------
(2.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
(2.2) Second class of source strings:
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I
(2.3) Third class of source strings:
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT
(2.4) Fourth class of source strings:
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT
Do these classes hold up (neither merging nor splitting) under the common
uppercase/titlecase or lowercase mappings?
(2.1) 0131; C; lower=0131 ; upper/title=0131
(2.1) 0049; C; lower=0069 ; upper/title=0049
(2.2) 0069; C; lower=0069 ; upper/title=0049
(2.2) 0130; C; lower=0130 ; upper/title=0130
(2.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
(2.4) 0069 0307; C; lower=0069 0307; upper/title=0049 0307
Problem here: uppercase mappings do not follow case folding rules.
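The mismatch can be checked mechanically. The following Python sketch copies the Turkic folding classes and the default lower/upper values exactly as tabulated in this message (they are not re-derived from the UCD), and lists the source strings whose case-mapped form lands in a different folding class. The names `TURKIC_FOLD`, `DEFAULT_LOWER`, `DEFAULT_UPPER` and `escapes` are hypothetical.

```python
# Tables copied from this message; keys and values are hex code point sequences.
TURKIC_FOLD = {
    "0131": "0131", "0049": "0131",
    "0069": "0069", "0130": "0069",
    "0049 0307": "0131 0307",
    "0069 0307": "0069 0307",
}
DEFAULT_LOWER = {
    "0131": "0131", "0049": "0069", "0069": "0069", "0130": "0130",
    "0049 0307": "0069 0307", "0069 0307": "0069 0307",
}
DEFAULT_UPPER = {
    "0131": "0131", "0049": "0049", "0069": "0049", "0130": "0130",
    "0049 0307": "0049 0307", "0069 0307": "0049 0307",
}

def escapes(fold, mapping):
    """List the sources whose case-mapped form folds differently from the source."""
    return [src for src, dst in mapping.items() if fold[dst] != fold[src]]

if __name__ == "__main__":
    print("lowercase leaves its class:", escapes(TURKIC_FOLD, DEFAULT_LOWER))
    print("uppercase leaves its class:", escapes(TURKIC_FOLD, DEFAULT_UPPER))
```

With these tables, the default lowercase moves 0049 and 0049 0307 out of their Turkic classes, and the default uppercase does the same for 0069 and 0069 0307.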
We would also need Turkic-specific mappings for upper/title case:
(2.1) 0131; T; upper/title=0049
(2.1) 0049; C; upper/title=0049
(2.2) 0069; T; upper/title=0130
(2.2) 0130; C; upper/title=0130
(2.3) 0049 0307; T; upper/title=0049 0307 (=0130 ?)
(2.4) 0069 0307; T; upper/title=0130 0307 (=0130 ?)
But we would then need to define canonical equivalence between 0130,
0049 0307 and 0130 0307 to preserve canonical equivalence... So Turkic
CaseFoldings would be broken, unless we say that Turkish texts should NOT be
encoded with 0307, but only with 0049, 0069, 0130 or 0131. So Turkic
CaseFolding rules should also avoid generating any 0307, whose behavior is
not clear.
If we just remove any 0307 from Turkic texts, there is absolutely no
problem with Turkic CaseFolding, provided that we also define
Turkic-specific uppercase mappings as done above, and don't use the default
locale-neutral uppercase mappings of the UCD.
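A minimal Python sketch of that proposal, restricted to text without 0307: the two tables hold the Turkic folding classes (2.1) and (2.2) and the Turkic-specific uppercase values suggested above. The function names and the sample words are illustrative assumptions; all other characters simply fall back to Python's built-in `str.upper()` and `str.casefold()`.

```python
# Turkic-specific tables for the four letters, as proposed in this message.
TURKIC_UPPER = {"\u0131": "I", "I": "I", "i": "\u0130", "\u0130": "\u0130"}
TURKIC_FOLD = {"\u0131": "\u0131", "I": "\u0131", "i": "i", "\u0130": "i"}

def turkic_upper(text):
    """Uppercase with the Turkic i/I pairs; other characters use the default."""
    return "".join(TURKIC_UPPER.get(ch, ch.upper()) for ch in text)

def turkic_fold(text):
    """Case-fold with the Turkic i/I pairs; other characters use the default."""
    return "".join(TURKIC_FOLD.get(ch, ch.casefold()) for ch in text)

if __name__ == "__main__":
    for word in ("istanbul", "\u0131rmak"):   # sample words, no 0307 present
        upper = turkic_upper(word)
        # Folding the uppercased word gives the same result as folding the original.
        print(word, upper, turkic_fold(upper) == turkic_fold(word))
```

For these inputs the folding class is preserved by the Turkic uppercase mapping, which is the property the default mappings fail to provide.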
------------------------------------
(3) Mappings for Full CaseFolding:
------------------------------------
(3.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
(3.2) Second class of source strings:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
(3.3) Third class of source strings:
0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT
Do these classes hold up (neither merging nor splitting) under the common
uppercase/titlecase or lowercase mappings?
(3.1) 0131; C; lower=0131 ; upper/title=0131
(3.2) 0049; C; lower=0069 ; upper/title=0049
(3.2) 0069; C; lower=0069 ; upper/title=0049
(3.3) 0130; C; lower=0130 ; upper/title=0130
(3.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
(3.3) 0069 0307; C; lower=0069 0307; upper/title=0049 0307
Here the Full CaseFolding rules seem to be broken, as they are not stable
under the uppercase mappings.
They would be valid only if the uppercase mappings were also altered, so
that the uppercase of 0130 (which is already uppercase) is 0049 0307
(impossible to do, as uppercase mappings in the UCD are restricted to
1 character).
The only remaining way to achieve it is to make them canonical equivalents
that represent an uppercase dotted I. Thankfully, we find this in the UCD,
which defines exactly that canonical equivalence:
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
Good. Full CaseFolding is not broken, but it requires support for the
canonical equivalence of the decompositions of dotted uppercase I. Using
Full CaseMapping correctly requires being able to apply normalization to
its output.
However, care must be taken because Turkic text may have been converted to
uppercase in the past using Turkic rules, and this information is lost if
the language is not clearly identifiable.
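The dependence on canonical equivalence can be illustrated with Python's standard library, whose `str.casefold()` applies full case folding and whose `unicodedata.normalize()` applies the normalization forms. The helper name `canonical_caseless_equal` is hypothetical. It normalizes to NFD before and after folding, so that 0130 and 0049 0307 compare as equal. This is a sketch of language-neutral matching only and does not apply the Turkic (T) rules.

```python
import unicodedata

def canonical_caseless_equal(a, b):
    """Compare two strings after NFD, full case folding, and NFD again."""
    def key(s):
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return key(a) == key(b)

if __name__ == "__main__":
    # 0130 decomposes canonically to 0049 0307, so both fold to 0069 0307.
    print(canonical_caseless_equal("\u0130", "I\u0307"))   # precomposed vs decomposed
    print(canonical_caseless_equal("\u0130", "i\u0307"))   # class (3.3)
    print(canonical_caseless_equal("\u0130", "i"))         # different classes
    print(canonical_caseless_equal("I", "\u0131"))         # (3.2) vs (3.1)
```

The first two comparisons succeed and the last two fail, matching classes (3.1) to (3.3) above.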
This archive was generated by hypermail 2.1.5 : Tue Dec 16 2003 - 18:59:16 EST