L2/09-199

Title: Casing Issue for Parenthesized Latin Letter Symbols
Author: Ken Whistler
Date: May 7, 2009
Action: For Consideration by the UTC

Amongst the ARIB symbols now approved in Amendment 6, and hence soon to be published in Unicode 5.2, is a set of parenthesized uppercase Latin letter symbols. These pose an interesting problem for casing properties, because they contrast with the long-encoded set of parenthesized *lowercase* Latin letter symbols that were encoded for compatibility with KS X 1001 (and Code Page 949).

1F110;PARENTHESIZED LATIN CAPITAL LETTER A;So;0;L;<compat> 0028 0041 0029;;;;N;;;;;
...
1F129;PARENTHESIZED LATIN CAPITAL LETTER Z;So;0;L;<compat> 0028 005A 0029;;;;N;;;;;

contrasted with:

249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;L;<compat> 0028 0061 0029;;;;N;;;;;
...
24B5;PARENTHESIZED LATIN SMALL LETTER Z;So;0;L;<compat> 0028 007A 0029;;;;N;;;;;

Here are the current, relevant property values for U+249C..U+24B5:

  Simple case mappings: none
  Case folding: default (maps to self)
  Other_Lowercase = False
  Other_Alphabetic = False
  Lowercase = False
  Alphabetic = False

The simplest way to handle the new characters U+1F110..U+1F129 would be to treat them as opaque symbols, completely parallel to the way U+249C..U+24B5 are currently handled. So they would also have no simple case mappings, would case fold to themselves, and would not be assigned Other_Uppercase or Other_Alphabetic values. Such an approach would also be the least disruptive, since it would disturb no existing properties or mappings for U+249C..U+24B5.

The downside, of course, is that handling U+1F110..U+1F129 that way ignores the obvious fact (already observed by folks starting to review the character properties for Unicode 5.2) that U+1F110..U+1F129 are the uppercase versions of U+249C..U+24B5.

So the high-level decision the UTC needs to take is:

1. Should U+1F110..U+1F129 be formally recognized (in terms of property assignments) as having a casing relationship to U+249C..U+24B5, or not?

If the answer to that question is no, then the property assignments are easy, and the case relationship can be dealt with just as a mention in the text and by annotation in the names list.

If the answer to that question is yes, then the details of the property assignments also have to be determined. In particular:

2. Should the characters be *Cased*? I.e., do we assign Other_Lowercase and Other_Uppercase to them?

3. Should the characters have formal simple case mappings given to map each set to the other?

4. If simple case mappings are not given to the characters, should the two sets case fold anyway? I.e., should the new uppercase symbols fold to the existing lowercase symbols?

Note that the handling of the Other_Alphabetic property would follow from these other decisions, on the basis of property invariants. In particular, the UTC already decided that symbols which come in alphabetic casing pairs should also be treated as Alphabetic.

For precedent, consider also the *circled* Latin letter symbols:

24B6;CIRCLED LATIN CAPITAL LETTER A;So;0;L;<circle> 0041;;;;N;;;;24D0;
...
24CF;CIRCLED LATIN CAPITAL LETTER Z;So;0;L;<circle> 005A;;;;N;;;;24E9;
24D0;CIRCLED LATIN SMALL LETTER A;So;0;L;<circle> 0061;;;;N;;;24B6;;24B6
...
24E9;CIRCLED LATIN SMALL LETTER Z;So;0;L;<circle> 007A;;;;N;;;24CF;;24CF

The differences for those are:

a. They were encoded together as a set a long time ago.

b. They decompose to a *single* Latin letter, instead of a sequence.

Their properties are:

  Simple case mappings: as shown, to each other
  Case folding: fold to the lowercase
  Other_Alphabetic = True
  Other_Lowercase = True (or Other_Uppercase = True)
  Alphabetic = True
  Lowercase = True (or Uppercase = True)
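
The contrast between the two precedents can be read directly off the UnicodeData.txt records quoted above. The following is a minimal sketch, not part of any proposal: it assumes the standard UnicodeData.txt field layout (simple uppercase, lowercase, and titlecase mappings in fields 12, 13, and 14, counting from 0), and the helper names `parse_record` and `has_simple_case_mapping` are hypothetical, chosen only for illustration.

```python
# Field positions in a UnicodeData.txt record (0-based).
CODE, NAME, DECOMP, UPPER, LOWER, TITLE = 0, 1, 5, 12, 13, 14

# One representative record from each range discussed in the text.
# Decomposition tags (<compat>, <circle>) are written as in UnicodeData.txt.
RECORDS = """\
1F110;PARENTHESIZED LATIN CAPITAL LETTER A;So;0;L;<compat> 0028 0041 0029;;;;N;;;;;
249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;L;<compat> 0028 0061 0029;;;;N;;;;;
24B6;CIRCLED LATIN CAPITAL LETTER A;So;0;L;<circle> 0041;;;;N;;;;24D0;
24D0;CIRCLED LATIN SMALL LETTER A;So;0;L;<circle> 0061;;;;N;;;24B6;;24B6
"""


def parse_record(line):
    """Split one record and keep only the fields relevant to casing."""
    fields = line.split(";")
    # Drop an optional decomposition tag such as <compat> or <circle>,
    # leaving just the code points of the decomposition.
    decomp = [t for t in fields[DECOMP].split() if not t.startswith("<")]
    return {
        "code": fields[CODE],
        "name": fields[NAME],
        "decomposition": decomp,
        "upper": fields[UPPER],
        "lower": fields[LOWER],
        "title": fields[TITLE],
    }


def has_simple_case_mapping(rec):
    """True if any of the three simple case mapping fields is non-empty."""
    return any(rec[key] for key in ("upper", "lower", "title"))


records = {}
for line in RECORDS.splitlines():
    rec = parse_record(line)
    records[rec["code"]] = rec

for code, rec in records.items():
    status = "mapped" if has_simple_case_mapping(rec) else "no simple case mapping"
    print(code, rec["name"], "decomposition length:", len(rec["decomposition"]), status)
```

Run against these four records, the sketch reports simple case mappings only for the circled letters, which also decompose to a single code point; the parenthesized letters have empty mapping fields and a three-code-point decomposition. Answering "yes" to question 3 would amount to filling in fields 12 through 14 for the parenthesized ranges in the same way.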