L2/03-063
Re: Case-Sensitive Characters
From: Mark Davis
Date: 2003-02-19
When doing case-insensitive matching and other casing operations, it turns out to be very useful to have an internal property which contains exactly those characters that are either the source of a case mapping or in the target of a case mapping. When doing any casing operations, you can then ignore all characters that do not have that property, for a significant performance win. For the purpose of this document, I'll call such a property (and associated set of characters) Case_Sensitive.
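A minimal sketch of how such a set could be derived and used, assuming Python's built-in `str.lower()`, `str.upper()`, `str.title()` and `str.casefold()` are acceptable stand-ins for the full case mappings. The names `build_case_sensitive`, `CASE_SENSITIVE` and `caseless_equal` are hypothetical, and the result depends on the Unicode version shipped with the interpreter, so it will not match 2003 data exactly.

```python
import sys

def build_case_sensitive():
    """Collect every character that is the source of a case mapping
    or appears in the target of one (the Case_Sensitive set)."""
    members = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Full (possibly multi-character) mappings as implemented by Python.
        for mapped in (ch.lower(), ch.upper(), ch.title(), ch.casefold()):
            if mapped != ch:
                members.add(ch)          # source of a mapping
                members.update(mapped)   # every character in the target
    return members

# Built once; scanning all code points takes a moment.
CASE_SENSITIVE = build_case_sensitive()

def caseless_equal(a, b):
    """Case-insensitive comparison that skips folding entirely when
    neither string contains a Case_Sensitive character."""
    if CASE_SENSITIVE.isdisjoint(a) and CASE_SENSITIVE.isdisjoint(b):
        return a == b                    # fast path: folding would change nothing
    return a.casefold() == b.casefold()
```

The fast path is safe because any character that case folding would change is, by construction, a member of the set.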
In Unicode, we define a character to be Cased if it is either Uppercase, Lowercase, or Titlecase according to the UCD. In an ideal world, these two properties would be the same. As it turns out, however, there are characters that are Case_Sensitive but not Cased, and characters that are Cased, but not Case_Sensitive. The latter, while formally Cased characters, really function as if they are uncased (Lo) in terms of all operations.
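The two difference sets can be approximated with a short script. This is only a sketch: it assumes that CPython's single-character `str.islower()` and `str.isupper()` track the UCD Lowercase and Uppercase properties, treats general category `Lt` as Titlecase, and uses Python's built-in case operations as the case mappings. The variable names are hypothetical, and the output reflects the interpreter's Unicode version rather than the data this document was based on.

```python
import sys
import unicodedata

def is_cased(ch):
    """Cased = Uppercase, Lowercase, or Titlecase (approximated as noted above)."""
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"

cased = set()
case_sensitive = set()
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if is_cased(ch):
        cased.add(ch)
    for mapped in (ch.lower(), ch.upper(), ch.title(), ch.casefold()):
        if mapped != ch:
            case_sensitive.add(ch)         # source of a case mapping
            case_sensitive.update(mapped)  # characters in the target

cased_only = cased - case_sensitive    # Cased, but not Case_Sensitive
mapping_only = case_sensitive - cased  # Case_Sensitive, but not Cased

print(len(cased_only), "characters are Cased but not Case_Sensitive")
print(len(mapping_only), "characters are Case_Sensitive but not Cased")
```

With a recent Unicode version, some characters have since gained case mappings, so `cased_only` is expected to differ from a list compiled in 2003.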
I am not asking for any action from the UTC, but this is related to how we define cased characters, so I thought it would be worth bringing a comparison of the differences to the committee's attention. The differences are also worth examining in case there are places where we should have case mappings (or add characters to map to).
Cased, but not Case_Sensitive:
U+00AA # FEMININE ORDINAL INDICATOR
U+00BA # MASCULINE ORDINAL INDICATOR
U+0138 # LATIN SMALL LETTER KRA
U+0180 # LATIN SMALL LETTER B WITH STROKE
U+018D # LATIN SMALL LETTER TURNED DELTA
U+019A..U+019B # LATIN SMALL LETTER L WITH BAR..LATIN SMALL LETTER LAMBDA WITH STROKE
U+01AA..U+01AB # LATIN LETTER REVERSED ESH LOOP..LATIN SMALL LETTER T WITH PALATAL HOOK
U+01BA # LATIN SMALL LETTER EZH WITH TAIL
U+01BE # LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
U+0221 # LATIN SMALL LETTER D WITH CURL
U+0234..U+0236 # LATIN SMALL LETTER L WITH CURL..LATIN SMALL LETTER T WITH CURL
U+0250..U+0252 # LATIN SMALL LETTER TURNED A..LATIN SMALL LETTER TURNED ALPHA
U+0255 # LATIN SMALL LETTER C WITH CURL
U+0258 # LATIN SMALL LETTER REVERSED E
U+025A # LATIN SMALL LETTER SCHWA WITH HOOK
U+025C..U+025F # LATIN SMALL LETTER REVERSED OPEN E..LATIN SMALL LETTER DOTLESS J WITH STROKE
U+0261..U+0262 # LATIN SMALL LETTER SCRIPT G..LATIN LETTER SMALL CAPITAL G
U+0264..U+0267 # LATIN SMALL LETTER RAMS HORN..LATIN SMALL LETTER HENG WITH HOOK
U+026A..U+026E # LATIN LETTER SMALL CAPITAL I..LATIN SMALL LETTER LEZH
U+0270..U+0271 # LATIN SMALL LETTER TURNED M WITH LONG LEG..LATIN SMALL LETTER M WITH HOOK
U+0273..U+0274 # LATIN SMALL LETTER N WITH RETROFLEX HOOK..LATIN LETTER SMALL CAPITAL N
U+0276..U+027F # LATIN LETTER SMALL CAPITAL OE..LATIN SMALL LETTER REVERSED R WITH FISHHOOK
U+0281..U+0282 # LATIN LETTER SMALL CAPITAL INVERTED R..LATIN SMALL LETTER S WITH HOOK
U+0284..U+0287 # LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK..LATIN SMALL LETTER TURNED T
U+0289 # LATIN SMALL LETTER U BAR
U+028C..U+0291 # LATIN SMALL LETTER TURNED V..LATIN SMALL LETTER Z WITH CURL
U+0293..U+02B8 # LATIN SMALL LETTER EZH WITH CURL..MODIFIER LETTER SMALL Y
U+02C0..U+02C1 # MODIFIER LETTER GLOTTAL STOP..MODIFIER LETTER REVERSED GLOTTAL STOP
U+02E0..U+02E4 # MODIFIER LETTER SMALL GAMMA..MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
U+037A # GREEK YPOGEGRAMMENI
U+03D2..U+03D4 # GREEK UPSILON WITH HOOK SYMBOL..GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
U+03D7 # GREEK KAI SYMBOL
U+03F3 # GREEK LETTER YOT
U+04C0 # CYRILLIC LETTER PALOCHKA
U+10A0..U+10C5 # GEORGIAN CAPITAL LETTER AN..GEORGIAN CAPITAL LETTER HOE
U+1D00..U+1D6B # LATIN LETTER SMALL CAPITAL A..LATIN SMALL LETTER UE
U+2071 # SUPERSCRIPT LATIN SMALL LETTER I
U+207F # SUPERSCRIPT LATIN SMALL LETTER N
U+2102 # DOUBLE-STRUCK CAPITAL C
U+2107 # EULER CONSTANT
U+210A..U+2113 # SCRIPT SMALL G..SCRIPT SMALL L
U+2115 # DOUBLE-STRUCK CAPITAL N
U+2119..U+211D # DOUBLE-STRUCK CAPITAL P..DOUBLE-STRUCK CAPITAL R
U+2124 # DOUBLE-STRUCK CAPITAL Z
U+2128 # BLACK-LETTER CAPITAL Z
U+212C..U+212D # SCRIPT CAPITAL B..BLACK-LETTER CAPITAL C
U+212F..U+2131 # SCRIPT SMALL E..SCRIPT CAPITAL F
U+2133..U+2134 # SCRIPT CAPITAL M..SCRIPT SMALL O
U+2139 # INFORMATION SOURCE
U+213D..U+213F # DOUBLE-STRUCK SMALL GAMMA..DOUBLE-STRUCK CAPITAL PI
U+2145..U+2149 # DOUBLE-STRUCK ITALIC CAPITAL D..DOUBLE-STRUCK ITALIC SMALL J
U+1D400..U+1D454 # MATHEMATICAL BOLD CAPITAL A..MATHEMATICAL ITALIC SMALL G
U+1D456..U+1D49C # MATHEMATICAL ITALIC SMALL I..MATHEMATICAL SCRIPT CAPITAL A
U+1D49E..U+1D49F # MATHEMATICAL SCRIPT CAPITAL C..MATHEMATICAL SCRIPT CAPITAL D
U+1D4A2 # MATHEMATICAL SCRIPT CAPITAL G
U+1D4A5..U+1D4A6 # MATHEMATICAL SCRIPT CAPITAL J..MATHEMATICAL SCRIPT CAPITAL K
U+1D4A9..U+1D4AC # MATHEMATICAL SCRIPT CAPITAL N..MATHEMATICAL SCRIPT CAPITAL Q
U+1D4AE..U+1D4B9 # MATHEMATICAL SCRIPT CAPITAL S..MATHEMATICAL SCRIPT SMALL D
U+1D4BB # MATHEMATICAL SCRIPT SMALL F
U+1D4BD..U+1D4C3 # MATHEMATICAL SCRIPT SMALL H..MATHEMATICAL SCRIPT SMALL N
U+1D4C5..U+1D505 # MATHEMATICAL SCRIPT SMALL P..MATHEMATICAL FRAKTUR CAPITAL B
U+1D507..U+1D50A # MATHEMATICAL FRAKTUR CAPITAL D..MATHEMATICAL FRAKTUR CAPITAL G
U+1D50D..U+1D514 # MATHEMATICAL FRAKTUR CAPITAL J..MATHEMATICAL FRAKTUR CAPITAL Q
U+1D516..U+1D51C # MATHEMATICAL FRAKTUR CAPITAL S..MATHEMATICAL FRAKTUR CAPITAL Y
U+1D51E..U+1D539 # MATHEMATICAL FRAKTUR SMALL A..MATHEMATICAL DOUBLE-STRUCK CAPITAL B
U+1D53B..U+1D53E # MATHEMATICAL DOUBLE-STRUCK CAPITAL D..MATHEMATICAL DOUBLE-STRUCK CAPITAL G
U+1D540..U+1D544 # MATHEMATICAL DOUBLE-STRUCK CAPITAL I..MATHEMATICAL DOUBLE-STRUCK CAPITAL M
U+1D546 # MATHEMATICAL DOUBLE-STRUCK CAPITAL O
U+1D54A..U+1D550 # MATHEMATICAL DOUBLE-STRUCK CAPITAL S..MATHEMATICAL DOUBLE-STRUCK CAPITAL Y
U+1D552..U+1D6A3 # MATHEMATICAL DOUBLE-STRUCK SMALL A..MATHEMATICAL MONOSPACE SMALL Z
U+1D6A8..U+1D6C0 # MATHEMATICAL BOLD CAPITAL ALPHA..MATHEMATICAL BOLD CAPITAL OMEGA
U+1D6C2..U+1D6DA # MATHEMATICAL BOLD SMALL ALPHA..MATHEMATICAL BOLD SMALL OMEGA
U+1D6DC..U+1D6FA # MATHEMATICAL BOLD EPSILON SYMBOL..MATHEMATICAL ITALIC CAPITAL OMEGA
U+1D6FC..U+1D714 # MATHEMATICAL ITALIC SMALL ALPHA..MATHEMATICAL ITALIC SMALL OMEGA
U+1D716..U+1D734 # MATHEMATICAL ITALIC EPSILON SYMBOL..MATHEMATICAL BOLD ITALIC CAPITAL OMEGA
U+1D736..U+1D74E # MATHEMATICAL BOLD ITALIC SMALL ALPHA..MATHEMATICAL BOLD ITALIC SMALL OMEGA
U+1D750..U+1D76E # MATHEMATICAL BOLD ITALIC EPSILON SYMBOL..MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA
U+1D770..U+1D788 # MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA..MATHEMATICAL SANS-SERIF BOLD SMALL OMEGA
U+1D78A..U+1D7A8 # MATHEMATICAL SANS-SERIF BOLD EPSILON SYMBOL..MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA
U+1D7AA..U+1D7C2 # MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA..MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL OMEGA
U+1D7C4..U+1D7C9 # MATHEMATICAL SANS-SERIF BOLD ITALIC EPSILON SYMBOL..MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL
Case_Sensitive, but not Cased:
U+02BC # MODIFIER LETTER APOSTROPHE
U+02BE # MODIFIER LETTER RIGHT HALF RING
U+0300..U+0301 # COMBINING GRAVE ACCENT..COMBINING ACUTE ACCENT
U+0307..U+0308 # COMBINING DOT ABOVE..COMBINING DIAERESIS
U+030A # COMBINING RING ABOVE
U+030C # COMBINING CARON
U+0313 # COMBINING COMMA ABOVE
U+0331 # COMBINING MACRON BELOW
U+0342 # COMBINING GREEK PERISPOMENI
The above might seem puzzling. Here is a log of why each of them is in Case_Sensitive (basically due to defective case pairs, mostly Greek, where a character has to be decomposed to express the case mapping). The log only contains the first entry that would cause a character to be added.
Adding [\u0307] because of: U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE => U+0069 LATIN SMALL LETTER I, U+0307 COMBINING DOT ABOVE
Adding [\u0149\u02BC] because of: U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE => U+02BC MODIFIER LETTER APOSTROPHE, U+004E LATIN CAPITAL LETTER N
Adding [\u01F0\u030C] because of: U+01F0 LATIN SMALL LETTER J WITH CARON => U+004A LATIN CAPITAL LETTER J, U+030C COMBINING CARON
Adding [\u0301\u0308\u0390] because of: U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS => U+0399 GREEK CAPITAL LETTER IOTA, U+0308 COMBINING DIAERESIS, U+0301 COMBINING ACUTE ACCENT
Adding [\u0331\u1E96] because of: U+1E96 LATIN SMALL LETTER H WITH LINE BELOW => U+0048 LATIN CAPITAL LETTER H, U+0331 COMBINING MACRON BELOW
Adding [\u030A\u1E98] because of: U+1E98 LATIN SMALL LETTER W WITH RING ABOVE => U+0057 LATIN CAPITAL LETTER W, U+030A COMBINING RING ABOVE
Adding [\u02BE\u1E9A] because of: U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING => U+0041 LATIN CAPITAL LETTER A, U+02BE MODIFIER LETTER RIGHT HALF RING
Adding [\u0313\u1F50] because of: U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI => U+03A5 GREEK CAPITAL LETTER UPSILON, U+0313 COMBINING COMMA ABOVE
Adding [\u0300\u1F52] because of: U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA => U+03A5 GREEK CAPITAL LETTER UPSILON, U+0313 COMBINING COMMA ABOVE, U+0300 COMBINING GRAVE ACCENT
Adding [\u0342\u1F56] because of: U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI => U+03A5 GREEK CAPITAL LETTER UPSILON, U+0313 COMBINING COMMA ABOVE, U+0342 COMBINING GREEK PERISPOMENI
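A log of this kind can be regenerated along the following lines. This is a sketch, not the tool that produced the entries above: it assumes Python's built-in case operations as the mapping source, scans code points in ascending order, and records only the first mapping that introduces each uncased target character. The helper names `is_cased` and `first_causes` are hypothetical.

```python
import sys
import unicodedata

def is_cased(ch):
    """Approximate the Cased property: Lowercase, Uppercase, or Titlecase (Lt)."""
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"

def first_causes():
    """Map each uncased character found in a case-mapping target to the
    first (source, mapped string) pair that contains it."""
    causes = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        for mapped in (ch.lower(), ch.upper(), ch.title(), ch.casefold()):
            if mapped == ch:
                continue
            for target in mapped:
                if not is_cased(target) and target not in causes:
                    causes[target] = (ch, mapped)
    return causes

def describe(ch):
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

causes = first_causes()
for target, (source, mapped) in sorted(causes.items()):
    expansion = ", ".join(describe(c) for c in mapped)
    print(f"Adding [U+{ord(target):04X}] because of: {describe(source)} => {expansion}")
```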
[A-Za-z\u00B5\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u0137\u0139-\u017F\u0181-\u018C\u018E-\u0199 \u019C-\u01A9\u01AC-\u01B9\u01BC-\u01BD\u01BF\u01C4-\u0220\u0222-\u0233\u0253-\u0254\u0256-\u0257 \u0259\u025B\u0260\u0263\u0268-\u0269\u026F\u0272\u0275\u0280\u0283\u0288\u028A-\u028B\u0292\u0345 \u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03CE\u03D0-\u03D1\u03D5-\u03D6\u03D8-\u03F2 \u03F4-\u03F5\u03F7-\u03FB\u0400-\u0481\u048A-\u04BF\u04C1-\u04CE\u04D0-\u04F5\u04F8-\u04F9 \u0500-\u050F\u0531-\u0556\u0561-\u0587\u1E00-\u1E9B\u1EA0-\u1EF9\u1F00-\u1F15\u1F18-\u1F1D \u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC \u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC \u2126\u212A-\u212B\u2160-\u217F\u24B6-\u24E9\uFB00-\uFB06\uFB13-\uFB17\uFF21-\uFF3A\uFF41-\uFF5A \U00010400-\U0001044F]