RE: Case mapping of dotless lowercase letters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 16 2003 - 18:14:26 EST


    > Here's what happens exactly:

    Note the rules in CaseFolding.txt:

    0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
    0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
    0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
    0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I

    Note also that the other 'i's map to themselves by default.
    No explicit CaseFolding mapping is defined for them, so we currently
    have these defaults as well:

    0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
    0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
    0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I

    And we also have the explicitly dotted Turkic lowercase i, whose folding is
    defined by the 5th of all rules above (fortunately, there is no canonical
    equivalence between 0069 0307 and 0069):

    0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

    And for the decomposition of the Turkic dotted uppercase I, case folding is
    defined by the 1st or 2nd of all rules above (note that 0049 0307 and 0130
    should be canonically equivalent, and should produce identical case foldings
    with the 3rd or 4th rules above, to preserve canonical equivalence):

    0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
    0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT
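
    The rules quoted above can be exercised with a small Python sketch. It is
    only an illustration: the RULES table holds just the four quoted entries,
    every other code point is assumed to fold to itself, and the names fold,
    simple_fold, full_fold and turkic_fold are made up here (they are not part
    of any library).

```python
# The four CaseFolding.txt rules quoted above, keyed by (code point, status).
# Assumption: every code point not listed here folds to itself.
RULES = {
    (0x0049, "C"): [0x0069],
    (0x0049, "T"): [0x0131],
    (0x0130, "F"): [0x0069, 0x0307],
    (0x0130, "T"): [0x0069],
}


def fold(text, statuses):
    """Fold each code point with the first matching status in `statuses`."""
    out = []
    for ch in text:
        cp = ord(ch)
        for status in statuses:
            if (cp, status) in RULES:
                out.extend(RULES[(cp, status)])
                break
        else:
            out.append(cp)  # default: the code point folds to itself
    return "".join(chr(cp) for cp in out)


def simple_fold(text):
    return fold(text, ("C",))       # common rules only


def full_fold(text):
    return fold(text, ("F", "C"))   # full rules, then common rules


def turkic_fold(text):
    return fold(text, ("T", "C"))   # Turkic rules override common rules
```

    With these helpers, turkic_fold("I") gives 0131 while simple_fold("I")
    gives 0069, and only full_fold turns 0130 into the sequence 0069 0307.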

    ********************************************************

    Now let's look at each CaseFolding type, and look at their result:

    ------------------------------------
    (1) Mappings for Simple CaseFolding:
    ------------------------------------
    (1.1) First class of source strings:
    0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
    (1.2) Second class of source strings:
    0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
    0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
    (1.3) Third class of source strings:
    0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
    (1.4) Fourth class of source strings:
    0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
    0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

    Are these classes stable (no merge, no split) under uppercase/titlecase or
    lowercase mappings?

    (1.1) 0131; lower=0131 ; upper/title=0131

    (1.2) 0049; lower=0069 ; upper/title=0049
    (1.2) 0069; lower=0069 ; upper/title=0049

    (1.3) 0130; lower=0130 ; upper/title=0130

    (1.4) 0049 0307; lower=0069 0307; upper/title=0049 0307
    (1.4) 0069 0307; lower=0069 0307; upper/title=0049 0307

    OK, there's no merge, so there is no problem with Simple CaseFolding: its
    classes are stable under case mappings.

    ------------------------------------
    (2) Mappings for Turkic CaseFolding:
    ------------------------------------
    (2.1) First class of source strings:
    0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
    0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
    (2.2) Second class of source strings:
    0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
    0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I
    (2.3) Third class of source strings:
    0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT
    (2.4) Fourth class of source strings:
    0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

    Are these classes stable (no merge, no split) under the common
    uppercase/titlecase or lowercase mappings?

    (2.1) 0131; C; lower=0131 ; upper/title=0131

    (2.1) 0049; C; lower=0069 ; upper/title=0049
    (2.2) 0069; C; lower=0069 ; upper/title=0049

    (2.2) 0130; C; lower=0130 ; upper/title=0130

    (2.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
    (2.4) 0069 0307; C; lower=0069 0307; upper/title=0049 0307

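    Problem here: the default uppercase mappings do not follow the Turkic case
    folding rules (0049 is in class 2.1, yet it is the default uppercase of
    0069, which is in class 2.2).
    We would also need Turkic-specific mappings for upper/title case: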

    (2.1) 0131; T; upper/title=0049
    (2.1) 0049; C; upper/title=0049

    (2.2) 0069; T; upper/title=0130
    (2.2) 0130; C; upper/title=0130

    (2.3) 0049 0307; T; upper/title=0049 0307 (=0130 ?)

    (2.4) 0069 0307; T; upper/title=0130 0307 (=0130 ?)

    But we would then need to define a canonical equivalence between 0130,
    0049 0307 and 0130 0307 to keep these mappings consistent... So Turkic
    CaseFolding would be broken, unless we say that Turkish texts should NOT be
    encoded with 0307, but only with 0049, 0069, 0130 or 0131. The Turkic
    CaseFolding rules should therefore also avoid generating any 0307, whose
    behavior is not clear.

    If we just remove any 0307 from the Turkic texts, there is absolutely no
    problem with Turkic CaseFolding, provided that we also define
    Turkic-specific uppercase mappings as done above, and don't use the default
    locale-neutral uppercase mappings of the UCD.
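
    The claim can be checked mechanically for the four single letters. The
    sketch below is an illustration under stated assumptions: texts contain no
    0307, the tables hold only the mappings listed in (2.1) and (2.2) above,
    DEFAULT_UPPER holds only the locale-neutral 0069 -> 0049 mapping, and the
    helper names are invented for this example.

```python
# Turkic fold (T rules) and the Turkic-specific uppercase mappings proposed
# above. Assumption: any character missing from a table maps to itself.
TURKIC_FOLD = {"I": "\u0131", "\u0130": "i"}
TURKIC_UPPER = {"\u0131": "I", "i": "\u0130"}
DEFAULT_UPPER = {"i": "I"}  # locale-neutral mapping: 0069 -> 0049

SAMPLES = ["I", "i", "\u0130", "\u0131"]


def apply(table, text):
    """Map each character through `table`, leaving unknown characters alone."""
    return "".join(table.get(ch, ch) for ch in text)


def fold_survives_upper(upper_table, samples):
    """True if uppercasing never moves a sample into another fold class."""
    return all(
        apply(TURKIC_FOLD, apply(upper_table, s)) == apply(TURKIC_FOLD, s)
        for s in samples
    )
```

    Here fold_survives_upper(TURKIC_UPPER, SAMPLES) holds, while the same
    check with DEFAULT_UPPER fails on 0069, whose default uppercase 0049 folds
    to 0131 under the Turkic rules.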

    ------------------------------------
    (3) Mappings for Full CaseFolding:
    ------------------------------------
    (3.1) First class of source strings:
    0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
    (3.2) Second class of source strings:
    0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
    0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
    (3.3) Third class of source strings:
    0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
    0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
    0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

    Are these classes stable (no merge, no split) under the common
    uppercase/titlecase or lowercase mappings?

    (3.1) 0131; C; lower=0131 ; upper/title=0131

    (3.2) 0049; C; lower=0069 ; upper/title=0049
    (3.2) 0069; C; lower=0069 ; upper/title=0049

    (3.3) 0130; C; lower=0130 ; upper/title=0130

    (3.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
    (3.3) 0069 0307; C; lower=0069 0307; upper/title=0049 0307

    Here the Full CaseFolding rules seem to be broken, as they are not stable
    under uppercase mappings.
    They would be valid only if the uppercase mappings were also altered, so
    that the uppercase of 0130 (which is already uppercase) is 0049 0307
    (impossible to do, as uppercase mappings in the UCD are restricted to
    1 character).

    The only remaining way to achieve this is to make them canonical
    equivalents that both represent an uppercase dotted I. Fortunately, we find
    this in the UCD, which defines exactly that canonical equivalence:

    0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;

    Good. Full CaseFolding is not broken, but it requires support for the
    canonical equivalence of the decompositions of dotted uppercase I. Using
    Full CaseMapping correctly requires being able to use normalization on its
    output.
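
    As an illustration of combining folding with normalization, here is a
    minimal Python sketch. It relies on the standard library only
    (unicodedata.normalize, unicodedata.decomposition and str.casefold, which
    applies full case folding); the helper name canonical_caseless_key is
    hypothetical, and normalizing both before and after folding is one
    possible choice, not something prescribed in this message.

```python
import unicodedata


def canonical_caseless_key(text):
    """Full case folding applied to, and followed by, canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


# 0130 and its canonical decomposition 0049 0307 get the same key.
precomposed = "\u0130"
decomposed = "I\u0307"
same_key = canonical_caseless_key(precomposed) == canonical_caseless_key(decomposed)
```

    Plain 0049 and 0131 keep keys of their own, so the three classes of
    section (3) stay distinct.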

    However, care must be taken, because Turkic text may have been converted
    to uppercase in the past using Turkic rules, and this information is lost
    if the language is not clearly identifiable.
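
    A short sketch of that ambiguity, assuming texts without 0307: the same
    uppercase string lowercases differently depending on whether the language
    is known to be Turkic. The function name lowercase and its turkic flag are
    invented for this example; only str.replace and str.lower are real
    library calls.

```python
def lowercase(text, turkic):
    """Lowercase `text`, using the Turkic I/i pairing when `turkic` is true."""
    if turkic:
        # Turkic pairing: 0130 -> 0069 and 0049 -> 0131, applied before the
        # locale-neutral lowercase mapping handles the remaining letters.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


word = "KIZ"  # uppercase text whose language tag has been lost
as_turkic = lowercase(word, turkic=True)
as_default = lowercase(word, turkic=False)
```

    The two results differ in the middle letter (0131 versus 0069), and
    nothing in the uppercase string itself says which one was intended.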






    This archive was generated by hypermail 2.1.5 : Tue Dec 16 2003 - 18:59:16 EST