L2/05-076

Stability of Case Folding

M. Davis, 2005-02-10

At the last UTC meeting we discussed the issue of stability under case folding, especially with regard to caseless programming language identifiers and similar formats or processing (such as Stringprep). The issue is this: if we change the case folding behavior of assigned characters, that could cause a problem for implementations and specifications that need to maintain backwards compatibility. While these problems can be dealt with by the implementations and specifications themselves, it would clearly simplify matters for them to be able to depend on stability.

The case foldings that are in Unicode have been reviewed extensively, so from that aspect there should be no problem in our adding a stability policy guaranteeing that they do not change. The open issue is to examine characters that do not currently have case foldings, but could conceivably need them if the other half of a case pair is added. Because the case folding is normally the toLowercase() value of a character, we can focus on only those characters that are either Uppercase or Titlecase. Because the Unicode recommendation for caseless identifiers is to use NFKC (which Stringprep also follows), we need only focus on those characters that are in NFKC, that is, characters that NFKC normalization leaves unchanged.

There are only a small number of such characters, the six below:

U+023A LATIN CAPITAL LETTER A WITH STROKE new in 4.1
U+023E LATIN CAPITAL LETTER T WITH DIAGONAL STROKE new in 4.1
U+03FD GREEK CAPITAL REVERSED LUNATE SIGMA SYMBOL new in 4.1
U+03FE GREEK CAPITAL DOTTED LUNATE SIGMA SYMBOL new in 4.1
U+03FF GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL new in 4.1
U+04C0 CYRILLIC LETTER PALOCHKA already in 4.0

# Total code points: 6
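
The selection criteria described above can be sketched in Python. This is a minimal illustration, not the procedure used to produce the list: the function names are hypothetical, "Uppercase or Titlecase" is approximated by general category Lu or Lt (which skips characters that are Uppercase only through Other_Uppercase), and "has a case folding" is approximated by str.casefold() changing the character. The results depend on the Unicode version bundled with the Python runtime (see unicodedata.unidata_version), so the output should not be expected to match the six characters listed for 4.1.

```python
import unicodedata

def is_upper_or_title(ch: str) -> bool:
    # Assumption: approximate "Uppercase or Titlecase" by general category.
    return unicodedata.category(ch) in ("Lu", "Lt")

def is_nfkc_stable(ch: str) -> bool:
    # "In NFKC" is read here as: NFKC normalization leaves the character unchanged.
    return unicodedata.normalize("NFKC", ch) == ch

def has_case_folding(ch: str) -> bool:
    # Assumption: a character "has a case folding" if casefold() changes it.
    return ch.casefold() != ch

def is_candidate(ch: str) -> bool:
    # Uppercase/Titlecase, unchanged by NFKC, and currently without a case folding.
    return is_upper_or_title(ch) and is_nfkc_stable(ch) and not has_case_folding(ch)

def find_candidates() -> list:
    # Scan every code point; the result reflects the runtime's Unicode data.
    return [
        f"U+{cp:04X} {unicodedata.name(chr(cp))}"
        for cp in range(0x110000)
        if is_candidate(chr(cp))
    ]

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for line in find_candidates():
        print(line)
```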

Once we resolve any issues with the above characters and set in place a careful review process for future cases, we could put in place a stability policy such as the following, for Unicode versions later than some version X.

D1. For all strings S containing characters only from Unicode Versions A and B

Of the above characters, here is a preliminary assessment:

U+023A LATIN CAPITAL LETTER A WITH STROKE new in 4.1
U+023E LATIN CAPITAL LETTER T WITH DIAGONAL STROKE new in 4.1

For these 2 characters, we should add corresponding lowercase characters ASAP, because there is a good chance that we will need them in the future.

U+03FD GREEK CAPITAL REVERSED LUNATE SIGMA SYMBOL new in 4.1
U+03FE GREEK CAPITAL DOTTED LUNATE SIGMA SYMBOL new in 4.1
U+03FF GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL new in 4.1
U+04C0 CYRILLIC LETTER PALOCHKA already in 4.0

These 4 characters are and will remain caseless characters; we should change the general category to Lo to reflect that.

If this assessment is agreed to, then we could offer a slightly weaker stability guarantee during the interim period before we can add the two letters:

D1'. For all strings S containing characters only from Unicode Versions A and B (and excluding U+023A and U+023E)
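
The statement above gives only the condition of the guarantee. As a rough sketch, and purely as an assumption about its intent, the check below treats the guarantee as "folding a string with version A's data gives the same result as folding it with version B's data", with an optional set of excluded characters for the weaker interim form. The tables are toy data, not real Unicode data, and the private-use code point U+E000 is a hypothetical stand-in for a lowercase letter that might be added later.

```python
def fold(s, table):
    # Characters absent from the table fold to themselves.
    return "".join(table.get(ch, ch) for ch in s)

def unstable_characters(table_a, table_b, shared, excluded=frozenset()):
    # Characters present in both versions (minus exclusions) whose folding differs.
    return sorted(
        ch for ch in shared - excluded
        if fold(ch, table_a) != fold(ch, table_b)
    )

# Hypothetical placeholder for a lowercase partner added in the later version.
HYPOTHETICAL_LOWER = "\uE000"

table_a = {"A": "a"}                                 # toy folding data, version A
table_b = {"A": "a", "\u023A": HYPOTHETICAL_LOWER}   # toy folding data, version B
shared = {"A", "a", "\u023A", "\u023E"}              # characters assigned in both

strict = unstable_characters(table_a, table_b, shared)
interim = unstable_characters(table_a, table_b, shared,
                              excluded={"\u023A", "\u023E"})
```

With this toy data, the strict check reports U+023A as unstable, while the check that excludes U+023A and U+023E reports nothing.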


B. For comparison, here are the Uppercase or Titlecase characters that are not in NFKC and do not have a case folding.

03D2..03D4 # L& [3] GREEK UPSILON WITH HOOK SYMBOL..GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
2102 # L& DOUBLE-STRUCK CAPITAL C
2107 # L& EULER CONSTANT
210B..210D # L& [3] SCRIPT CAPITAL H..DOUBLE-STRUCK CAPITAL H
2110..2112 # L& [3] SCRIPT CAPITAL I..SCRIPT CAPITAL L
2115 # L& DOUBLE-STRUCK CAPITAL N
2119..211D # L& [5] DOUBLE-STRUCK CAPITAL P..DOUBLE-STRUCK CAPITAL R
2124 # L& DOUBLE-STRUCK CAPITAL Z
2128 # L& BLACK-LETTER CAPITAL Z
212C..212D # L& [2] SCRIPT CAPITAL B..BLACK-LETTER CAPITAL C
2130..2131 # L& [2] SCRIPT CAPITAL E..SCRIPT CAPITAL F
2133 # L& SCRIPT CAPITAL M
213E..213F # L& [2] DOUBLE-STRUCK CAPITAL GAMMA..DOUBLE-STRUCK CAPITAL PI
2145 # L& DOUBLE-STRUCK ITALIC CAPITAL D
1D400..1D419 # L& [26] MATHEMATICAL BOLD CAPITAL A..MATHEMATICAL BOLD CAPITAL Z
1D434..1D44D # L& [26] MATHEMATICAL ITALIC CAPITAL A..MATHEMATICAL ITALIC CAPITAL Z
1D468..1D481 # L& [26] MATHEMATICAL BOLD ITALIC CAPITAL A..MATHEMATICAL BOLD ITALIC CAPITAL Z
1D49C # L& MATHEMATICAL SCRIPT CAPITAL A
1D49E..1D49F # L& [2] MATHEMATICAL SCRIPT CAPITAL C..MATHEMATICAL SCRIPT CAPITAL D
1D4A2 # L& MATHEMATICAL SCRIPT CAPITAL G
1D4A5..1D4A6 # L& [2] MATHEMATICAL SCRIPT CAPITAL J..MATHEMATICAL SCRIPT CAPITAL K
1D4A9..1D4AC # L& [4] MATHEMATICAL SCRIPT CAPITAL N..MATHEMATICAL SCRIPT CAPITAL Q
1D4AE..1D4B5 # L& [8] MATHEMATICAL SCRIPT CAPITAL S..MATHEMATICAL SCRIPT CAPITAL Z
1D4D0..1D4E9 # L& [26] MATHEMATICAL BOLD SCRIPT CAPITAL A..MATHEMATICAL BOLD SCRIPT CAPITAL Z
1D504..1D505 # L& [2] MATHEMATICAL FRAKTUR CAPITAL A..MATHEMATICAL FRAKTUR CAPITAL B
1D507..1D50A # L& [4] MATHEMATICAL FRAKTUR CAPITAL D..MATHEMATICAL FRAKTUR CAPITAL G
1D50D..1D514 # L& [8] MATHEMATICAL FRAKTUR CAPITAL J..MATHEMATICAL FRAKTUR CAPITAL Q
1D516..1D51C # L& [7] MATHEMATICAL FRAKTUR CAPITAL S..MATHEMATICAL FRAKTUR CAPITAL Y
1D538..1D539 # L& [2] MATHEMATICAL DOUBLE-STRUCK CAPITAL A..MATHEMATICAL DOUBLE-STRUCK CAPITAL B
1D53B..1D53E # L& [4] MATHEMATICAL DOUBLE-STRUCK CAPITAL D..MATHEMATICAL DOUBLE-STRUCK CAPITAL G
1D540..1D544 # L& [5] MATHEMATICAL DOUBLE-STRUCK CAPITAL I..MATHEMATICAL DOUBLE-STRUCK CAPITAL M
1D546 # L& MATHEMATICAL DOUBLE-STRUCK CAPITAL O
1D54A..1D550 # L& [7] MATHEMATICAL DOUBLE-STRUCK CAPITAL S..MATHEMATICAL DOUBLE-STRUCK CAPITAL Y
1D56C..1D585 # L& [26] MATHEMATICAL BOLD FRAKTUR CAPITAL A..MATHEMATICAL BOLD FRAKTUR CAPITAL Z
1D5A0..1D5B9 # L& [26] MATHEMATICAL SANS-SERIF CAPITAL A..MATHEMATICAL SANS-SERIF CAPITAL Z
1D5D4..1D5ED # L& [26] MATHEMATICAL SANS-SERIF BOLD CAPITAL A..MATHEMATICAL SANS-SERIF BOLD CAPITAL Z
1D608..1D621 # L& [26] MATHEMATICAL SANS-SERIF ITALIC CAPITAL A..MATHEMATICAL SANS-SERIF ITALIC CAPITAL Z
1D63C..1D655 # L& [26] MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A..MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL Z
1D670..1D689 # L& [26] MATHEMATICAL MONOSPACE CAPITAL A..MATHEMATICAL MONOSPACE CAPITAL Z
1D6A8..1D6C0 # L& [25] MATHEMATICAL BOLD CAPITAL ALPHA..MATHEMATICAL BOLD CAPITAL OMEGA
1D6E2..1D6FA # L& [25] MATHEMATICAL ITALIC CAPITAL ALPHA..MATHEMATICAL ITALIC CAPITAL OMEGA
1D71C..1D734 # L& [25] MATHEMATICAL BOLD ITALIC CAPITAL ALPHA..MATHEMATICAL BOLD ITALIC CAPITAL OMEGA
1D756..1D76E # L& [25] MATHEMATICAL SANS-SERIF BOLD CAPITAL ALPHA..MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA
1D790..1D7A8 # L& [25] MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL ALPHA..MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA

# Total code points: 470
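
The characters in list B drop out of the analysis because NFKC replaces them with other characters before any case folding is applied. A simplified sketch of a caseless comparison key illustrates this. The function name is hypothetical, and the exact sequence of operations used by any given specification is an assumption here; real profiles may normalize more than once or apply further mappings.

```python
import unicodedata

def caseless_key(identifier: str) -> str:
    # Normalize to NFKC first, then apply case folding to the result.
    return unicodedata.normalize("NFKC", identifier).casefold()

# U+2102 DOUBLE-STRUCK CAPITAL C normalizes to "C", which then folds to "c".
same_as_plain_c = caseless_key("\u2102") == caseless_key("C")
```

Because the compatibility character never reaches the folding step, its lack of a case folding of its own does not affect the comparison.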

C. For comparison, here are the remaining Uppercase/Titlecase characters, the ones that do have a case folding. To make the list shorter, a four-dot ellipsis represents every other code point between the two values.

0041..005A # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00C0..00D6 # L& [23] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER O WITH DIAERESIS
00D8..00DE # L& [7] LATIN CAPITAL LETTER O WITH STROKE..LATIN CAPITAL LETTER THORN
0100 # L& LATIN CAPITAL LETTER A WITH MACRON
....
0136 # L& LATIN CAPITAL LETTER K WITH CEDILLA
0139 # L& LATIN CAPITAL LETTER L WITH ACUTE
....
0147 # L& LATIN CAPITAL LETTER N WITH CARON
014A # L& LATIN CAPITAL LETTER ENG
....
0176 # L& LATIN CAPITAL LETTER Y WITH CIRCUMFLEX
0178..0179 # L& [2] LATIN CAPITAL LETTER Y WITH DIAERESIS..LATIN CAPITAL LETTER Z WITH ACUTE
017B # L& LATIN CAPITAL LETTER Z WITH DOT ABOVE
017D # L& LATIN CAPITAL LETTER Z WITH CARON
0181..0182 # L& [2] LATIN CAPITAL LETTER B WITH HOOK..LATIN CAPITAL LETTER B WITH TOPBAR
0184 # L& LATIN CAPITAL LETTER TONE SIX
0186..0187 # L& [2] LATIN CAPITAL LETTER OPEN O..LATIN CAPITAL LETTER C WITH HOOK
0189..018B # L& [3] LATIN CAPITAL LETTER AFRICAN D..LATIN CAPITAL LETTER D WITH TOPBAR
018E..0191 # L& [4] LATIN CAPITAL LETTER REVERSED E..LATIN CAPITAL LETTER F WITH HOOK
0193..0194 # L& [2] LATIN CAPITAL LETTER G WITH HOOK..LATIN CAPITAL LETTER GAMMA
0196..0198 # L& [3] LATIN CAPITAL LETTER IOTA..LATIN CAPITAL LETTER K WITH HOOK
019C..019D # L& [2] LATIN CAPITAL LETTER TURNED M..LATIN CAPITAL LETTER N WITH LEFT HOOK
019F..01A0 # L& [2] LATIN CAPITAL LETTER O WITH MIDDLE TILDE..LATIN CAPITAL LETTER O WITH HORN
01A2 # L& LATIN CAPITAL LETTER OI
01A4 # L& LATIN CAPITAL LETTER P WITH HOOK
01A6..01A7 # L& [2] LATIN LETTER YR..LATIN CAPITAL LETTER TONE TWO
01A9 # L& LATIN CAPITAL LETTER ESH
01AC # L& LATIN CAPITAL LETTER T WITH HOOK
01AE..01AF # L& [2] LATIN CAPITAL LETTER T WITH RETROFLEX HOOK..LATIN CAPITAL LETTER U WITH HORN
01B1..01B3 # L& [3] LATIN CAPITAL LETTER UPSILON..LATIN CAPITAL LETTER Y WITH HOOK
01B5 # L& LATIN CAPITAL LETTER Z WITH STROKE
01B7..01B8 # L& [2] LATIN CAPITAL LETTER EZH..LATIN CAPITAL LETTER EZH REVERSED
01BC # L& LATIN CAPITAL LETTER TONE FIVE
01C4..01C5 # L& [2] LATIN CAPITAL LETTER DZ WITH CARON..LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
01C7..01C8 # L& [2] LATIN CAPITAL LETTER LJ..LATIN CAPITAL LETTER L WITH SMALL LETTER J
01CA..01CB # L& [2] LATIN CAPITAL LETTER NJ..LATIN CAPITAL LETTER N WITH SMALL LETTER J
01CD # L& LATIN CAPITAL LETTER A WITH CARON
....
01DB # L& LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE
01DE # L& LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
....
01EE # L& LATIN CAPITAL LETTER EZH WITH CARON
01F1..01F2 # L& [2] LATIN CAPITAL LETTER DZ..LATIN CAPITAL LETTER D WITH SMALL LETTER Z
01F4 # L& LATIN CAPITAL LETTER G WITH ACUTE
01F6..01F8 # L& [3] LATIN CAPITAL LETTER HWAIR..LATIN CAPITAL LETTER N WITH GRAVE
01FA # L& LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
....
0232 # L& LATIN CAPITAL LETTER Y WITH MACRON
023B # L& LATIN CAPITAL LETTER C WITH STROKE
023D # L& LATIN CAPITAL LETTER L WITH BAR
0241 # L& LATIN CAPITAL LETTER GLOTTAL STOP
0386 # L& GREEK CAPITAL LETTER ALPHA WITH TONOS
0388..038A # L& [3] GREEK CAPITAL LETTER EPSILON WITH TONOS..GREEK CAPITAL LETTER IOTA WITH TONOS
038C # L& GREEK CAPITAL LETTER OMICRON WITH TONOS
038E..038F # L& [2] GREEK CAPITAL LETTER UPSILON WITH TONOS..GREEK CAPITAL LETTER OMEGA WITH TONOS
0391..03A1 # L& [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
03A3..03AB # L& [9] GREEK CAPITAL LETTER SIGMA..GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA
03D8 # L& GREEK LETTER ARCHAIC KOPPA
....
03EE # L& COPTIC CAPITAL LETTER DEI
03F4 # L& GREEK CAPITAL THETA SYMBOL
03F7 # L& GREEK CAPITAL LETTER SHO
03F9..03FA # L& [2] GREEK CAPITAL LUNATE SIGMA SYMBOL..GREEK CAPITAL LETTER SAN
0400..042F # L& [48] CYRILLIC CAPITAL LETTER IE WITH GRAVE..CYRILLIC CAPITAL LETTER YA
0460 # L& CYRILLIC CAPITAL LETTER OMEGA
....
0480 # L& CYRILLIC CAPITAL LETTER KOPPA
048A # L& CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
....
04BE # L& CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
04C1 # L& CYRILLIC CAPITAL LETTER ZHE WITH BREVE
....
04CD # L& CYRILLIC CAPITAL LETTER EM WITH TAIL
04D0 # L& CYRILLIC CAPITAL LETTER A WITH BREVE
....
04F8 # L& CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
0500 # L& CYRILLIC CAPITAL LETTER KOMI DE
....
050E # L& CYRILLIC CAPITAL LETTER KOMI TJE
0531..0556 # L& [38] ARMENIAN CAPITAL LETTER AYB..ARMENIAN CAPITAL LETTER FEH
10A0..10C5 # L& [38] GEORGIAN CAPITAL LETTER AN..GEORGIAN CAPITAL LETTER HOE
1E00 # L& LATIN CAPITAL LETTER A WITH RING BELOW
....
1EF8 # L& LATIN CAPITAL LETTER Y WITH TILDE
1F08..1F0F # L& [8] GREEK CAPITAL LETTER ALPHA WITH PSILI..GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI
1F18..1F1D # L& [6] GREEK CAPITAL LETTER EPSILON WITH PSILI..GREEK CAPITAL LETTER EPSILON WITH DASIA AND OXIA
1F28..1F2F # L& [8] GREEK CAPITAL LETTER ETA WITH PSILI..GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI
1F38..1F3F # L& [8] GREEK CAPITAL LETTER IOTA WITH PSILI..GREEK CAPITAL LETTER IOTA WITH DASIA AND PERISPOMENI
1F48..1F4D # L& [6] GREEK CAPITAL LETTER OMICRON WITH PSILI..GREEK CAPITAL LETTER OMICRON WITH DASIA AND OXIA
1F59 # L& GREEK CAPITAL LETTER UPSILON WITH DASIA
....
1F5F # L& GREEK CAPITAL LETTER UPSILON WITH DASIA AND PERISPOMENI
1F68..1F6F # L& [8] GREEK CAPITAL LETTER OMEGA WITH PSILI..GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI
1F88..1F8F # L& [8] GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI..GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1F98..1F9F # L& [8] GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI..GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FA8..1FAF # L& [8] GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI..GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FB8..1FBC # L& [5] GREEK CAPITAL LETTER ALPHA WITH VRACHY..GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FC8..1FCC # L& [5] GREEK CAPITAL LETTER EPSILON WITH VARIA..GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FD8..1FDB # L& [4] GREEK CAPITAL LETTER IOTA WITH VRACHY..GREEK CAPITAL LETTER IOTA WITH OXIA
1FE8..1FEC # L& [5] GREEK CAPITAL LETTER UPSILON WITH VRACHY..GREEK CAPITAL LETTER RHO WITH DASIA
1FF8..1FFC # L& [5] GREEK CAPITAL LETTER OMICRON WITH VARIA..GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
2126 # L& OHM SIGN
212A..212B # L& [2] KELVIN SIGN..ANGSTROM SIGN
2160..216F # Nl [16] ROMAN NUMERAL ONE..ROMAN NUMERAL ONE THOUSAND
24B6..24CF # So [26] CIRCLED LATIN CAPITAL LETTER A..CIRCLED LATIN CAPITAL LETTER Z
2C00..2C2E # L& [47] GLAGOLITIC CAPITAL LETTER AZU..GLAGOLITIC CAPITAL LETTER LATINATE MYSLITE
2C80 # L& COPTIC CAPITAL LETTER ALFA
....
2CE2 # L& COPTIC CAPITAL LETTER OLD NUBIAN WAU
FF21..FF3A # L& [26] FULLWIDTH LATIN CAPITAL LETTER A..FULLWIDTH LATIN CAPITAL LETTER Z
10400..10427 # L& [40] DESERET CAPITAL LETTER LONG I..DESERET CAPITAL LETTER EW

# Total code points: 893
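
The four-dot ellipsis notation used in list C can be expanded mechanically. The helper below is a small sketch with a hypothetical name; it takes the code points on the lines immediately before and after an ellipsis and returns every other code point between them, inclusive.

```python
def expand_every_other(first: int, last: int) -> list:
    # The two endpoints must be an even distance apart for the notation to apply.
    if last < first or (last - first) % 2 != 0:
        raise ValueError("endpoints must be ordered and an even distance apart")
    return list(range(first, last + 1, 2))

# Example: the run written as "0100", "....", "0136" in list C.
run = expand_every_other(0x0100, 0x0136)
labels = [f"{cp:04X}" for cp in run]
```

For the run from 0100 to 0136 this yields 28 code points: 0100, 0102, 0104, and so on up to 0136.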