L2/07-026
From: | Mark Davis |
Date: | 2007-01-14 |
Re: | Property and Value Alias Issues |
Eric and I have been looking at properties in connection with the XML work that Eric has been doing. In doing so, a number of items have come up. I've captured these below for discussion in the UTC.
# All code points not explicitly listed for Age
# have the value unassigned.
# @missing: 0000..10FFFF; unassigned
But we don't do that for the string values. Recommendations are in Table 2 below. Generally, the result should be some name if it is a catalog-like property, "" (empty) if it is information about a string (such as bmg), and # (the source character itself) if it is a folding (since unaffected characters should be left alone). This also needs to be applied to the Unihan provisional properties.
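As an illustration of how a consumer might use an @missing line like the one quoted for Age, here is a minimal Python sketch. The function names `parse_missing` and `lookup`, and the single explicit entry, are assumptions made for the example; they are not part of any data file or tool.

```python
import re

# Matches a default-value comment such as:
#   # @missing: 0000..10FFFF; unassigned
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(.*?)\s*$"
)

def parse_missing(line):
    """Return (first, last, default) for an @missing line, else None."""
    m = MISSING_RE.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)

def lookup(cp, explicit, missing):
    """Return the explicitly listed value for cp, or the @missing default."""
    if cp in explicit:
        return explicit[cp]
    first, last, default = missing
    if first <= cp <= last:
        return default
    raise ValueError("code point outside the declared range")

missing = parse_missing("# @missing: 0000..10FFFF; unassigned")
explicit = {0x0041: "1.1"}  # illustrative entry only, not a full Age table
```

With these definitions, `lookup(0x0378, explicit, missing)` falls through to the default and returns `unassigned`.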
cp=CE31, dm=<CE20 11B8>, not <110E 1173 11B8>
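The difference between the two values can be checked arithmetically. The sketch below assumes the standard Hangul syllable constants and computes both the one-step mapping and the full decomposition; the function names `hangul_dm` and `hangul_full` are invented for this example.

```python
# Standard Hangul syllable arithmetic constants.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = 19 * N_COUNT        # 11172

def hangul_dm(cp):
    """One-step decomposition mapping of a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: maps to <LV syllable, trailing consonant>
        return [cp - t_index, T_BASE + t_index]
    # LV syllable: maps to <leading consonant, vowel>
    return [L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT]

def hangul_full(cp):
    """Full decomposition, obtained by applying hangul_dm recursively."""
    out = []
    for c in hangul_dm(cp):
        if S_BASE <= c < S_BASE + S_COUNT:
            out.extend(hangul_full(c))
        else:
            out.append(c)
    return out
```

For U+CE31, `hangul_dm` yields the pair CE20 11B8, while `hangul_full` yields 110E 1173 11B8.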
blk; n/a ; Arabic_Presentation_Forms-A => blk; n/a ; Arabic_Presentation_Forms_A; Arabic_Presentation_Forms-A
Note: an alternative is to just replace them, since we specify that name matching ignores case differences.
dt ; can ; Canonical => dt ; Can ; Canonical ; can
Note: an alternative is to just replace them, since we specify that name matching ignores those characters.
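To show why the old and new spellings would still match under loose comparison, here is a simplified Python sketch. It assumes the ignored characters are space, underscore, and hyphen, and it does not attempt to reproduce every detail of the official matching rule; `loose_key` and `same_alias` are hypothetical names.

```python
def loose_key(name):
    """Fold an alias for loose comparison (simplified: case, ' ', '_', '-')."""
    return "".join(ch for ch in name.lower() if ch not in " _-")

def same_alias(a, b):
    """True if two aliases are equal under the simplified loose matching."""
    return loose_key(a) == loose_key(b)
```

Under this sketch, `Arabic_Presentation_Forms-A` and `Arabic_Presentation_Forms_A` compare equal, as do `can` and `Can`.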
Note: The simple lowercase may be omitted in the data file if the lowercase is the same as the code point itself.
We need to document this for the other foldings:
cf ; Case_Folding
dm ; Decomposition_Mapping
FC_NFKC ; FC_NFKC_Closure
lc ; Lowercase_Mapping
scc ; Special_Case_Condition
sfc ; Simple_Case_Folding
tc ; Titlecase_Mapping
uc ; Uppercase_Mapping
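The omission rule in the note on simple lowercase (a code point with no listed mapping maps to itself) might be implemented as in the following sketch. The table contents are a two-entry illustration, and `simple_lowercase` is a hypothetical helper, not an existing API.

```python
def simple_lowercase(cp, listed):
    """Return the listed simple lowercase mapping, or cp itself if omitted."""
    return listed.get(cp, cp)

# Illustrative subset only; a real table would be loaded from the data file.
listed = {0x0041: 0x0061, 0x0130: 0x0069}
```

The same lookup-with-identity-default would apply to the other foldings once the convention is documented for them.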
sfc ; Simple_Case_Folding => scf ; Simple_Case_Folding ; sfc
CaseFolding.txt
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0049; T; 0131; # LATIN CAPITAL LETTER I
SpecialCasing.txt
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
...
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
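For reference, the SpecialCasing.txt lines above follow the layout code; lower; title; upper; optional condition list; comment. A minimal Python sketch of a parser for that layout follows; `parse_special_casing` and the dictionary keys are names chosen for this example only.

```python
def parse_special_casing(line):
    """Parse one SpecialCasing.txt data line; return None for non-data lines."""
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty piece after the final ';'
    if len(fields) < 4:
        return None

    def cps(field):
        # A mapping field is a space-separated list of hex code points.
        return [int(x, 16) for x in field.split()]

    return {
        "code": int(fields[0], 16),
        "lower": cps(fields[1]),
        "title": cps(fields[2]),
        "upper": cps(fields[3]),
        "conditions": fields[4].split() if len(fields) > 4 else [],
        "comment": comment.strip(),
    }
```

The `conditions` list is where values such as Final_Sigma or az end up, which is the information the scc property is meant to expose.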
Name | Rec. Regex for Allowable Values |
kCheungBauerIndex | /[0-9]{3}\.[0-9]{2}/ |
kFennIndex | /[1-9][0-9]{0,2}\.[01][0-9]/ |
kGSR | /[0-9]{4}[a-vx-z]'?/ |
kHDZRadBreak | /[\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]/ |
kIRGDaeJaweon | /([0-9]{4}\.[0-9]{2}[01])|(0000\.555)/ |
kPhonetic | /[1-9][0-9]{0,3}[A-D]?\*?/ |
kTang | /\*?[A-Za-z\(\)\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+/ |
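As a rough illustration of how the recommended expressions could be used to validate data, the Python sketch below checks whole values against four of the patterns from the table. Only patterns without the \x{...} notation are included, since Python's `re` module does not accept that syntax and those would need rewriting first. The sample values in the usage note are invented for the example and are not taken from Unihan.

```python
import re

# Patterns copied from the table above (delimiting slashes removed).
PATTERNS = {
    "kCheungBauerIndex": r"[0-9]{3}\.[0-9]{2}",
    "kFennIndex": r"[1-9][0-9]{0,2}\.[01][0-9]",
    "kGSR": r"[0-9]{4}[a-vx-z]'?",
    "kPhonetic": r"[1-9][0-9]{0,3}[A-D]?\*?",
}

def is_allowed(prop, value):
    """True if the entire value matches the recommended pattern for prop."""
    return re.fullmatch(PATTERNS[prop], value) is not None
```

For example, `is_allowed("kCheungBauerIndex", "055.01")` is true, while `is_allowed("kCheungBauerIndex", "55.01")` is false because the pattern requires three leading digits.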
Abbr | Name | Rec. Regex for Allowable Values | Rec. Value for Unlisted |
age | Age | /([0-9]+\.[0-9]|unassigned)/ | unassigned (already defined) |
nv | Numeric_Value | /-?[0-9]+\.[0-9]+/ | NaN |
blk | Block | /[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/ | No_Block |
sc | Script | # | |
dm | Decomposition_Mapping | /[\x{0}-\x{10FFFF}]+/ | |
FC_NFKC | FC_NFKC_Closure | ||
cf | Case_Folding | /[\x{0}-\x{10FFFF}]+/ | |
lc | Lowercase_Mapping | ||
tc | Titlecase_Mapping | ||
uc | Uppercase_Mapping | ||
sfc | Simple_Case_Folding | /[\x{0}-\x{10FFFF}]/ | |
slc | Simple_Lowercase_Mapping | ||
stc | Simple_Titlecase_Mapping | ||
suc | Simple_Uppercase_Mapping | ||
bmg | Bidi_Mirroring_Glyph | /[\x{0}-\x{10FFFF}]?/ | "" |
isc | ISO_Comment | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\ |
|
na1 | Unicode_1_Name | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*(\ \((CR|FF|LF|NEL)\))?)?/ | <reserved> |
na | Name | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\ |