Cased and Case_Ignorable Not Disjoint
L2/09-355
2009-oct-21
Markus Scherer
Unicode 5.2 DerivedCoreProperties.txt provides data for the Cased and Case_Ignorable properties. Previously, these properties were defined in Unicode 5 chapter 3 only as derivations from other properties, without full lists of code points. (See definitions D135 & D136 on page 110 of Unicode 5.2.) It turns out that 117 characters in Unicode 5.2 have both the Cased and the Case_Ignorable property; most of them have both properties in earlier versions of Unicode as well.
Aside from being a logical contradiction, this overlap makes it unclear what the Final_Sigma casing context means. (See table 3-15 in the standard.) The Final_Sigma casing context is defined as \p{cased}(\p{case-ignorable})* before the character (capital sigma), together with the negated and mirrored regular expression after it: the capital sigma must not be followed by (\p{case-ignorable})*\p{cased}. Is the Final_Sigma casing context true or false when the capital sigma is preceded by a character that is both Cased and Case_Ignorable?
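The ambiguity can be illustrated with a minimal Python sketch. It covers only the "before" half of the context, and everything in it is an assumption made for illustration: the two tiny property sets are toy stand-ins for the real data, and the two functions are hypothetical ways an implementation might read the definition, not code from any existing library.

```python
# Toy stand-ins for the real properties; only characters used below are listed.
# U+02B0 MODIFIER LETTER SMALL H is deliberately in both sets.
CASED = {"A", "a", "\u02B0"}
CASE_IGNORABLE = {"'", "\u0301", "\u02B0"}

def before_by_regex(prefix):
    """Literal regex reading: prefix ends with cased (case-ignorable)*."""
    for ch in reversed(prefix):
        if ch in CASED:
            return True          # found a cased character; all later ones were ignorable
        if ch not in CASE_IGNORABLE:
            return False         # a character that is neither blocks the match
    return False

def before_by_skipping(prefix):
    """Loop reading: skip all case-ignorable characters, then test for cased."""
    chars = list(reversed(prefix))
    i = 0
    while i < len(chars) and chars[i] in CASE_IGNORABLE:
        i += 1
    return i < len(chars) and chars[i] in CASED

# Text before a capital sigma: a digit followed by U+02B0.
prefix = "1\u02B0"
print(before_by_regex(prefix), before_by_skipping(prefix))
```

For the prefix "1" followed by U+02B0, the literal regex reading treats U+02B0 as the cased character and reports a match, while the skipping loop consumes it as case-ignorable, reaches the digit, and reports no match. With disjoint properties the two readings would agree.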
I propose changing the definition of Case_Ignorable to remove Cased characters, making these two properties disjoint. I also propose adding a Unicode regression test to make sure these two properties remain disjoint.
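A regression test of this kind could look roughly like the following Python sketch. It assumes the usual UCD line layout of "code point or range ; property # comment"; the function names and the SAMPLE lines are hypothetical and hand-written for illustration, not an excerpt of the real data file.

```python
def parse_property(lines, name):
    """Collect the code points listed for property `name` in UCD-style lines."""
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != name:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result

def find_overlap(lines):
    """Return the sorted code points that are both Cased and Case_Ignorable."""
    cased = parse_property(lines, "Cased")
    ignorable = parse_property(lines, "Case_Ignorable")
    return sorted(cased & ignorable)

# Hand-written sample lines in the style of the data file (illustrative only).
SAMPLE = [
    "# sample data",
    "0041..005A    ; Cased # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "02B0..02B8    ; Cased # MODIFIER LETTER SMALL H..MODIFIER LETTER SMALL Y",
    "0027          ; Case_Ignorable # APOSTROPHE",
    "02B0..02B8    ; Case_Ignorable # MODIFIER LETTER SMALL H..MODIFIER LETTER SMALL Y",
]

overlap = find_overlap(SAMPLE)
print(["%04X" % cp for cp in overlap])
```

Run against the lines of the real DerivedCoreProperties.txt, the test would pass only when find_overlap returns an empty list.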
This is the set of 117 Unicode 5.2 characters which have both the Cased and the Case_Ignorable property:
02B0..02B8 # [9] (ʰ..ʸ) MODIFIER LETTER SMALL H..MODIFIER LETTER SMALL Y
02C0..02C1 # [2] (ˀ..ˁ) MODIFIER LETTER GLOTTAL STOP..MODIFIER LETTER REVERSED GLOTTAL STOP
02E0..02E4 # [5] (ˠ..ˤ) MODIFIER LETTER SMALL GAMMA..MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
0345 # (ͅ) COMBINING GREEK YPOGEGRAMMENI
037A # (ͺ) GREEK YPOGEGRAMMENI
1D2C..1D61 # [54] (ᴬ..ᵡ) MODIFIER LETTER CAPITAL A..MODIFIER LETTER SMALL CHI
1D78 # (ᵸ) MODIFIER LETTER CYRILLIC EN
1D9B..1DBF # [37] (ᶛ..ᶿ) MODIFIER LETTER SMALL TURNED ALPHA..MODIFIER LETTER SMALL THETA
2090..2094 # [5] (ₐ..ₔ) LATIN SUBSCRIPT SMALL LETTER A..LATIN SUBSCRIPT SMALL LETTER SCHWA
2C7D # (ⱽ) MODIFIER LETTER CAPITAL V
A770 # (ꝰ) MODIFIER LETTER US
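As a quick cross-check of the total, the ranges listed above can be expanded and counted. The following Python sketch copies only the code point columns from the list; the names RANGES and expand are chosen here for illustration.

```python
# Code point columns copied from the list above.
RANGES = """
02B0..02B8
02C0..02C1
02E0..02E4
0345
037A
1D2C..1D61
1D78
1D9B..1DBF
2090..2094
2C7D
A770
"""

def expand(spec):
    """Turn 'XXXX' or 'XXXX..YYYY' into the range of code points it covers."""
    first, _, last = spec.partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return range(start, end + 1)

overlap = {cp for spec in RANGES.split() for cp in expand(spec)}
print(len(overlap))  # 117
```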