From: George W Gerrity (g.gerrity@gwg-associates.com.au)
Date: Mon Jun 12 2006 - 03:41:00 CDT
On 2006-06-09, at 05:00, Richard Wordingham wrote:
> There appear to be bugs in the definition of the case-folding
> function toCasefold() as currently defined by
> http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt and
> Section 3.13 of TUS 4.1.0. (I am using the latter as I cannot find
> a more reliable draft of Section 3.13 in TUS 5.0.) This matters,
> for toCasefold() of NFKC strings valid in Unicode 5.0 is about to
> be frozen forever. Should these faults be made permanent?
>
> I have found two groups of NFKC grapheme clusters which fail to
> match their default uppercasings after conversion to NFD in one of
> the important 'case-insensitive' matching methods. I haven't
> reported these problems formally yet - I'd like to see what other
> people think first. It's conceivable that I'm the only person
> bothered by them.
>
> *Problem 1*
>
> The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>
>
> The problem with this only occurs when using the default mappings.
> A different can of worms opens up for Turkic locales - I don't know
> whether the behaviour is fully defined for Turkic locales. This
> grapheme cluster is in all four normalised forms. According to
> http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt
> and http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt ,
> its uppercasing (in all locales) is
>
> <U+0049 LATIN CAPITAL LETTER I>
>
> which is in all four normal forms.
>
> To compare these strings for 'canonical caseless matches', one
> calculates NFD(toCasefold(NFD(X))) for each string X. By
> http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt ,
> their default casefoldings, whether simple or full, are <U+0131> and
> <U+0069 LATIN SMALL LETTER I>. These are not canonically
> equivalent. QED.
>
> Incidentally, the definition of default casefolding contradicts the
> definition of casefolding given in TUS 4.1.0 Section 5.18.
>
> There are two alternative solutions:
> (a) Remove the upper- and title-casings for U+0131 from
> UnicodeData.txt and uncomment the Turkic data for U+0131 in
> SpecialCasing.txt, also making it apply to Azer(baijan)i.
> (b) Add two lines to CaseFolding.txt:
>
> 0131; C; 0069; # LATIN SMALL LETTER DOTLESS I
> 0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
Is it a legitimate solution to create a new codepoint for CAPITAL
DOTLESS I?
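
For anyone who wants to reproduce Problem 1, here is a minimal sketch in Python. It rests on two assumptions that are not in the original report: that Python's str.casefold() is an acceptable stand-in for the default full toCasefold(), and that the Unicode data bundled with the interpreter (which is newer than 5.0) still behaves the same way for these two characters. The helper names canonical_caseless_key and code_points are made up for illustration.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """Return NFD(toCasefold(NFD(s))); str.casefold() stands in for toCasefold()."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def code_points(s: str) -> str:
    """Format a string as a list of U+XXXX code points for display."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


dotless_i = "\u0131"              # LATIN SMALL LETTER DOTLESS I
uppercased = dotless_i.upper()    # default (non-Turkic) uppercasing

key_lower = canonical_caseless_key(dotless_i)
key_upper = canonical_caseless_key(uppercased)

print("uppercasing:   ", code_points(uppercased))
print("key of U+0131: ", code_points(key_lower))
print("key of upper:  ", code_points(key_upper))
print("caseless match:", key_lower == key_upper)
```

If the assumptions hold, the two keys differ, which is the failure described under Problem 1.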
> *Problem 2*
>
> The second group is probably much less troublesome, but is quite
> awkward. There are two plausible NFC and NFKC sequences
>
> <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0306
> COMBINING BREVE>
>
> <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0304
> COMBINING MACRON>
>
> The former might occur if one were using a breve (or brachy) to
> explicitly mark the lack of stress in polytonic Modern Greek, for
> example in explaining the meter of poetry. The second might occur
> if one decided to redundantly mark Classical Greek vowel length -
> the macron is redundant, for the subscript iota implies that the
> vowel is long. I don't have any examples of these combinations. I
> will work with the latter.
>
> Converted to NFD, it yields
>
> <U+03B1 GREEK SMALL LETTER ALPHA, U+0304, U+0345 COMBINING GREEK
> YPOGEGRAMMENI>
>
> The default uppercasing (not uncontroversial, but that's a
> linguistic matter) is
> <U+0391 GREEK CAPITAL LETTER ALPHA, U+0304, U+0399 GREEK CAPITAL
> LETTER IOTA>, whose NFC and NFKC form is
>
> <U+1FB9 GREEK CAPITAL LETTER ALPHA WITH MACRON, U+0399 GREEK CAPITAL
> LETTER IOTA>
>
> Now, the case-insensitive match whose outcome is guaranteed to be
> stable under the case-folding stability policy
> (http://www.unicode.org/standard/stability_policy.html) is given by
> toCasefold() of NFKC strings.
>
> toCasefold() of the starting point, <U+1FB3, U+0304>, is <U+03B1
> GREEK SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>,
> while toCasefold() of <U+1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER
> ALPHA WITH MACRON, U+03B9>. But the casefolded forms are not
> canonically equivalent!
>
> The problem here is that the definition of toCasefold() offers no
> hint that when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be
> hidden in a precomposed form, is detached as U+03B9, it should be
> moved to after any immediately following characters of non-zero
> combining class (and characters that decompose solely to such:
> U+0F73 and U+0F75). SpecialCasing.txt has at least an implication
> that such should be done when a U+0399 detaches itself, but I find
> it hard to read it as normative.
I read Church and NT Greek, but am no expert. However, it seems to me
that the way to solve the problem is to create a new codepoint, GREEK
CAPITAL LETTER ALPHA WITH YPOGEGRAMMENI, whose glyph is AI.
Lowercasing it would yield alpha with ypogegrammeni.
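
The folding mismatch in Problem 2 can be checked with a short Python sketch. As assumptions of mine, str.casefold() stands in for the default full toCasefold() and str.upper() for the default uppercasing, both using whatever Unicode data ships with the interpreter rather than the 5.0 drafts. The sketch takes the two code point sequences exactly as quoted; whether <U+1FB3, U+0304> is itself in NFC is not checked here.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


lower_seq = "\u1fb3\u0304"   # alpha with ypogegrammeni, combining macron
upper_seq = "\u1fb9\u0399"   # capital alpha with macron, capital iota

# The uppercasing chain from the quoted text: NFD, uppercase, then NFC.
uppercased = nfc(nfd(lower_seq).upper())

# Fold both sequences directly, without decomposing them first.
folded_lower = lower_seq.casefold()
folded_upper = upper_seq.casefold()

# Canonical equivalence is tested by comparing the NFD forms.
canonically_equivalent = nfd(folded_lower) == nfd(folded_upper)

print([f"U+{ord(c):04X}" for c in uppercased])
print([f"U+{ord(c):04X}" for c in folded_lower])
print([f"U+{ord(c):04X}" for c in folded_upper])
print("canonically equivalent:", canonically_equivalent)
```

The detached iota (U+03B9) is a starter, so normalisation cannot move the macron back in front of it, and the two folded forms stay distinct.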
> This type of problem does not occur with NF(K)D strings - the
> combining class of U+0345 forces it to the end of the cluster. It
> is for this reason that the formal definitions of canonical and
> compatibility caseless matches use NFD and NFKD respectively.
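
The point about NFD can be illustrated in the same way. This is a sketch under the same assumptions as before: str.casefold() stands in for the default full toCasefold(), and the interpreter's bundled Unicode data is used. The function name canonical_caseless_key is hypothetical. It applies the canonical caseless match, NFD(toCasefold(NFD(X))), to the two Greek sequences from Problem 2.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """Return NFD(toCasefold(NFD(s))); str.casefold() stands in for toCasefold()."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


lower_seq = "\u1fb3\u0304"   # alpha with ypogegrammeni, combining macron
upper_seq = "\u1fb9\u0399"   # capital alpha with macron, capital iota

# After NFD, U+0345 (combining class 240) already follows the macron
# (combining class 230), so folding it to U+03B9 leaves it in place.
decomposed_lower = unicodedata.normalize("NFD", lower_seq)

key_lower = canonical_caseless_key(lower_seq)
key_upper = canonical_caseless_key(upper_seq)

print([f"U+{ord(c):04X}" for c in decomposed_lower])
print([f"U+{ord(c):04X}" for c in key_lower])
print([f"U+{ord(c):04X}" for c in key_upper])
print("canonical caseless match:", key_lower == key_upper)
```

Decomposing first puts the ypogegrammeni at the end of the cluster before it is folded, so both sequences arrive at the same key.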