Case Mapping Definitions (was: Adding Lowercase Letters)

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Wed May 09 2007 - 02:55:38 CDT


    Philippe Verdy wrote on Tuesday, May 08, 2007 at 8:57 PM
    Subject: RE: Adding Lowercase Letters (was: Uppercase ß is coming? (U+1E9E))

    > The special casing rules for Turkish do apply to the effect of case
    > mappings to lowercase or to uppercase or to titlecase. But do they apply
    > to the case folding (which is different from lowercase mapping)?

    I can't work out whether the Turkish rules are advisory or mandatory. The
    tables, unless you count the comments (and they appear to recommend
    non-conformance with subscript iota), are incomplete. The tables for
    Turkish do not respect canonical equivalence, as the comments caution. This
    said, CaseFolding.txt does address Turkish case folding.
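    To illustrate both points, here is a minimal Python sketch that assumes the
    two Turkic "T" entries of CaseFolding.txt (U+0049 to U+0131, and U+0130 to
    U+0069) and applies them before Python's default str.casefold(). The names
    TURKIC_T and turkic_casefold are hypothetical, and this is a naive
    illustration rather than a conformant implementation.

```python
import unicodedata

# Assumed excerpt: the two "T" (Turkic) entries of CaseFolding.txt.
TURKIC_T = {
    0x0049: "\u0131",  # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
    0x0130: "i",       # LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
}


def turkic_casefold(s: str) -> str:
    """Hypothetical helper: apply the T entries, then the default folding."""
    return s.translate(TURKIC_T).casefold()


composed = "\u0130"                                  # precomposed dotted capital I
decomposed = unicodedata.normalize("NFD", composed)  # <U+0049, U+0307>

# The two spellings are canonically equivalent but fold differently here.
print(ascii(turkic_casefold(composed)))    # 'i'
print(ascii(turkic_casefold(decomposed)))  # '\u0131\u0307'
```

    Applied character by character like this, the Turkic entries give different
    results for canonically equivalent input, which is the kind of problem the
    comments caution about.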

    > I'd like also to find a precise reply to this question:
    > Are the strings resulting from a case mapping to uppercase (or to
    > lowercase, or to titlecase) required to have the same case folding?
    > Id est:

    > Are we guaranteed to have, with existing normative Unicode definitions and
    > stability rules, for every string S in a locale L, the following
    > equalities, starting at some current or past version of the Unicode
    > standard and in all future versions:
    >
    > toCaseFold(toLowerCase(S, L), L)
    > = toCaseFold(toUpperCase(S, L), L)
    > = toCaseFold(toTitleCase(S, L), L)

    > Are there existing exceptions?

    Yes. U+0131 LATIN SMALL LETTER DOTLESS I lowercases and casefolds to
    itself, but uppercases and titlecases to U+0049 LATIN CAPITAL LETTER I,
    which then casefolds in the default casefolding to U+0069 LATIN SMALL LETTER
    I.
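    The exception can be checked with a short Python sketch. It assumes that
    Python's str.lower(), str.upper(), str.title() and str.casefold() stand in
    for the default (locale-independent) mappings; the variable names are
    illustrative only.

```python
# Python's str methods apply default, untailored Unicode case mappings.
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

via_lower = DOTLESS_I.lower().casefold()  # stays U+0131
via_upper = DOTLESS_I.upper().casefold()  # U+0049, then folded to U+0069
via_title = DOTLESS_I.title().casefold()  # U+0049, then folded to U+0069

print(ascii(via_lower), ascii(via_upper), ascii(via_title))
# '\u0131' 'i' 'i'
```

    The fold of the lowercased string differs from the folds of the uppercased
    and titlecased strings, so the proposed equalities do not hold for U+0131.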

    U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE misbehaves similarly (mutatis
    mutandis) in the default simple mappings.
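    A sketch of the U+0130 case follows. Python exposes only the full mappings,
    so the simple mappings are represented by a hand-copied excerpt that is
    assumed to match UnicodeData.txt (simple lowercase of U+0130 is U+0069) and
    CaseFolding.txt (no C or S entry for U+0130). The tables and the helper
    simple_map are hypothetical.

```python
# Assumed excerpts of the simple mappings, covering only the characters used here.
SIMPLE_LOWER = {"\u0130": "i"}  # simple lowercase of U+0130 is U+0069
SIMPLE_UPPER = {}               # U+0130 is already uppercase
SIMPLE_FOLD = {"I": "i"}        # U+0130 has no C or S entry, so it folds to itself


def simple_map(table: dict, s: str) -> str:
    """Apply a one-to-one mapping table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


DOTTED_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

fold_of_lower = simple_map(SIMPLE_FOLD, simple_map(SIMPLE_LOWER, DOTTED_I))
fold_of_upper = simple_map(SIMPLE_FOLD, simple_map(SIMPLE_UPPER, DOTTED_I))
print(ascii(fold_of_lower), ascii(fold_of_upper))  # 'i' '\u0130'

# For contrast, Python's built-in full mappings agree with each other here.
full_lower = DOTTED_I.lower().casefold()
full_upper = DOTTED_I.upper().casefold()
print(ascii(full_lower), ascii(full_upper))  # 'i\u0307' 'i\u0307'
```

    Under these assumptions the mismatch for U+0130 shows up with the simple
    mappings but not with the full ones.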

    > If so, are they bugs in the UCD to be corrected?

    The misbehaviour above is a deliberate choice.

    There does not appear to be a formal definition of case-folding for
    Lithuanian. The procedure for calculating case-folding given in TUS does
    not give perfect results. It does not really tell you that <U+0069, U+0307>
    should case fold to <U+0069>, and gives no hint on what to do with <U+0049,
    U+0307>.
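    For comparison, the default folding can be observed with a few lines of
    Python, assuming str.casefold() as the untailored full case folding. It
    shows only the default behaviour and does not attempt a Lithuanian
    tailoring.

```python
# Default (untailored) full case folding of the two sequences discussed above.
lower_seq = "i\u0307"  # <U+0069, U+0307>
upper_seq = "I\u0307"  # <U+0049, U+0307>

folded_lower = lower_seq.casefold()
folded_upper = upper_seq.casefold()

# Both results keep U+0307; neither collapses to a bare U+0069.
print(ascii(folded_lower), ascii(folded_upper))  # 'i\u0307' 'i\u0307'
```

    The default folding keeps the combining dot in both cases, so any rule that
    folds these sequences to <U+0069> would have to come from a tailoring.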

    Richard.


