L2/09-199


Title:  Casing Issue for Parenthesized Latin Letter Symbols

Author: Ken Whistler

Date:   May 7, 2009

Action: For Consideration by the UTC


Among the ARIB symbols now approved in Amendment 6
and hence soon to be published in Unicode 5.2 is a
set of parenthesized uppercase Latin letter symbols.
These pose an interesting problem for casing properties,
because they contrast with the long-encoded set of
parenthesized *lowercase* Latin letter symbols that
were encoded for compatibility with KS X 1001 (and Code
Page 949).

1F110;PARENTHESIZED LATIN CAPITAL LETTER A;So;0;L;<compat> 0028 0041 0029;;;;N;;;;;
...
1F129;PARENTHESIZED LATIN CAPITAL LETTER Z;So;0;L;<compat> 0028 005A 0029;;;;N;;;;;

contrasted with:

249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;L;<compat> 0028 0061 0029;;;;N;;;;;
...
24B5;PARENTHESIZED LATIN SMALL LETTER Z;So;0;L;<compat> 0028 007A 0029;;;;N;;;;;
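As an illustration of how the records above can be read, here is a minimal Python sketch that splits a UnicodeData-style line into its 15 semicolon-separated fields. The field names and the helper functions parse_record and decomposition_text are hypothetical labels chosen for this example, not part of any Unicode tool. It shows that both sets decompose to a three-character sequence and that the case mapping fields are empty.

```python
# Field names are illustrative labels for the 15 UnicodeData.txt fields.
FIELDS = ("code", "name", "gc", "ccc", "bidi", "decomp", "decimal", "digit",
          "numeric", "mirrored", "old_name", "comment", "upper", "lower", "title")

def parse_record(line):
    """Split one UnicodeData-style record into a dict keyed by field name."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

def decomposition_text(record):
    """Return the decomposition as a string, dropping the <tag> if present."""
    parts = record["decomp"].split()
    if parts and parts[0].startswith("<"):
        parts = parts[1:]
    return "".join(chr(int(cp, 16)) for cp in parts)

cap_a = parse_record(
    "1F110;PARENTHESIZED LATIN CAPITAL LETTER A;So;0;L;<compat> 0028 0041 0029;;;;N;;;;;")
small_a = parse_record(
    "249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;L;<compat> 0028 0061 0029;;;;N;;;;;")

# Both decompose to a parenthesized letter sequence.
print(decomposition_text(cap_a), decomposition_text(small_a))
# The simple uppercase, lowercase and titlecase mapping fields are all empty.
print(repr(cap_a["upper"]), repr(cap_a["lower"]), repr(small_a["upper"]))
```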

Here are the current, relevant property values for U+249C..U+24B5.

Simple case mappings: none

Case folding: default (maps to self)

Other_Lowercase  = False
Other_Alphabetic = False
Lowercase  = False
Alphabetic = False

The simplest way to handle the new characters U+1F110..U+1F129 would
be to treat them as opaque symbols, completely parallel to the
way U+249C..U+24B5 are currently handled. So they would also have
no simple case mappings, would case fold to themselves, and would
not be assigned Other_Uppercase or Other_Alphabetic values. Such
an approach would also be the least disruptive, since it would
disturb no existing properties or mappings for U+249C..U+24B5.

The downside, of course, is that handling U+1F110..U+1F129 this way
ignores the obvious fact (already observed by folks starting to review
the character properties for Unicode 5.2) that U+1F110..U+1F129
are the uppercase versions of U+249C..U+24B5.

So the high-level decision the UTC needs to take is:

1. Should U+1F110..U+1F129 be formally recognized (in terms of
   property assignments) as having a casing relationship to
   U+249C..U+24B5, or not?
   
If the answer to that question is no, then the property assignments
are easy, and the case relationship can be dealt with just as
a mention in the text and by annotation in the names list.

If the answer to that question is yes, then the details of the
property assignments have to also be determined. In particular:

2. Should the characters be *Cased*? I.e., do we assign
   Other_Lowercase and Other_Uppercase to them?
   
3. Should the characters have formal simple case mappings
   given to map each set to the other?
   
4. If simple case mappings are not given to the characters,
   should the two sets case fold together anyway? I.e., should the
   new uppercase symbols fold to the existing lowercase symbols?
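
To make the difference between questions 3 and 4 concrete, the following Python sketch models the options with plain lookup tables. Everything in it is hypothetical: the tables, the build_tables and apply_table helpers, and the assumption that folding follows the lowercase mapping whenever one exists. It is not proposed data, only a way to see what each answer would mean for a two-character sample.

```python
# Hypothetical data only: these tables model the options, not any approved assignment.
UPPER = range(0x1F110, 0x1F129 + 1)   # parenthesized capital A..Z
LOWER = range(0x249C, 0x24B5 + 1)     # parenthesized small a..z

def build_tables(simple_mappings, folding):
    """Return (to_lower, to_upper, fold) dicts for one combination of answers."""
    to_lower, to_upper, fold = {}, {}, {}
    for up, lo in zip(UPPER, LOWER):
        if simple_mappings:              # question 3 answered yes
            to_lower[up] = lo
            to_upper[lo] = up
        # Assumption: folding follows the lowercase mapping when one exists.
        if simple_mappings or folding:   # question 4 answered yes
            fold[up] = lo
    return to_lower, to_upper, fold

def apply_table(table, text):
    """Map each character through the table; unmapped characters map to themselves."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)

sample = "\U0001F110\u249C"              # capital A symbol, then small a symbol

# Opaque symbols: folding leaves the sample unchanged.
_, _, fold_none = build_tables(False, False)
print(apply_table(fold_none, sample) == sample)

# Folding without simple mappings: lowercasing is a no-op, but folding
# makes the two symbols compare equal.
to_lower_q4, _, fold_q4 = build_tables(False, True)
print(apply_table(to_lower_q4, sample) == sample)
print(apply_table(fold_q4, sample) == "\u249C\u249C")
```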
   
Note that the handling of the Other_Alphabetic property would follow
from these other decisions, on the basis of property invariants.
In particular, the UTC already decided that symbols which come
in alphabetic casing pairs should also be treated as Alphabetic.
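
The dependency described above can be sketched as follows. This is a simplified model: it assumes the derived properties are computed as Lowercase = Ll + Other_Lowercase, Uppercase = Lu + Other_Uppercase, and Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic, and the function names derive and satisfies_invariant are made up for this example. Under that assumption, a symbol (gc=So) that is given Other_Uppercase would need Other_Alphabetic as well to stay consistent with the UTC decision.

```python
# Simplified model; the derivation formulas below are an assumption for illustration.
ALPHA_GC = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def derive(gc, other_lowercase=False, other_uppercase=False, other_alphabetic=False):
    """Compute derived Lowercase, Uppercase and Alphabetic from contributory properties."""
    return {
        "Lowercase": gc == "Ll" or other_lowercase,
        "Uppercase": gc == "Lu" or other_uppercase,
        "Alphabetic": gc in ALPHA_GC or other_alphabetic,
    }

def satisfies_invariant(props):
    """Check that a character which is Lowercase or Uppercase is also Alphabetic."""
    cased = props["Lowercase"] or props["Uppercase"]
    return props["Alphabetic"] or not cased

opaque = derive("So")                                        # treated as a plain symbol
cased_only = derive("So", other_uppercase=True)              # cased but not alphabetic
cased_alpha = derive("So", other_uppercase=True, other_alphabetic=True)

print(satisfies_invariant(opaque))       # nothing assigned, nothing violated
print(satisfies_invariant(cased_only))   # violates the invariant
print(satisfies_invariant(cased_alpha))  # consistent again
```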
   
For precedent, consider also the *circled* Latin letter symbols:

24B6;CIRCLED LATIN CAPITAL LETTER A;So;0;L;<circle> 0041;;;;N;;;;24D0;
...
24CF;CIRCLED LATIN CAPITAL LETTER Z;So;0;L;<circle> 005A;;;;N;;;;24E9;

24D0;CIRCLED LATIN SMALL LETTER A;So;0;L;<circle> 0061;;;;N;;;24B6;;24B6
...
24E9;CIRCLED LATIN SMALL LETTER Z;So;0;L;<circle> 007A;;;;N;;;24CF;;24CF

The differences for those are:

   a. They were encoded together as a set a long time ago.
   b. They decompose to a *single* Latin letter, instead of a sequence.
   
Their properties are:

Simple case mappings: as shown, to each other

Case folding: fold to the lowercase

Other_Alphabetic = True
Other_Lowercase  = True (or Other_Uppercase = True)
Alphabetic = True
Lowercase  = True (or Uppercase = True)
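
The behavior of the circled letters can be observed from a programming language that implements Unicode case operations. The sketch below uses Python's built-in string methods and the standard unicodedata module; it assumes the Unicode data bundled with the Python runtime carries the long-standing mappings for U+24B6..U+24E9 quoted above. It checks only the case mappings, case folding, general category, and decomposition, not the Other_* properties.

```python
import unicodedata

CAP_A, SMALL_A = "\u24B6", "\u24D0"   # circled capital A, circled small a

# Simple case mappings pair the two sets with each other.
print(CAP_A.lower() == SMALL_A, SMALL_A.upper() == CAP_A)

# Case folding goes to the lowercase symbol.
print(CAP_A.casefold() == SMALL_A, SMALL_A.casefold() == SMALL_A)

# The general category is still So, and the decomposition is a single letter.
print(unicodedata.category(CAP_A), unicodedata.decomposition(CAP_A))
```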