Eric and I have been looking at properties in connection with the XML work
that Eric has been doing. In doing so, a number of items have come up. I've
captured these below for discussion in the UTC.
- The regexes in the Unihan descriptions are useful for testing. Eric has
noted, however, that they need some fixing: see Table 1 below. Two other items:
- The regex notation in Unihan.html and Unihan.txt should use a standard regex notation for codepoint
literals, such as Perl: \x{...}
- There is an error with kFourCornerCode for U+6F5E, "3716. 3716.4"
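The kFourCornerCode error above is the kind of problem a regex check catches. A minimal sketch follows, assuming that one value is four digits optionally followed by "." and one more digit, and that multiple values are space-separated. The pattern and the function name are assumptions for illustration, not quoted from Unihan.html.

```python
import re

# Assumed syntax for a single kFourCornerCode value (hypothetical pattern).
FOUR_CORNER = re.compile(r"[0-9]{4}(\.[0-9])?")

def invalid_values(field):
    """Return the space-separated values in `field` that fail the pattern."""
    return [v for v in field.split() if not FOUR_CORNER.fullmatch(v)]

# The value reported above for U+6F5E.
print(invalid_values("3716. 3716.4"))  # ['3716.']
```

Under that assumed pattern, the trailing "." in the first value is what gets flagged.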
- For non-enumerated regular properties, it would be useful to have regexes
as well, perhaps in PropertyValueAliases.txt. Table 2 has a draft set for
discussion. Recommend adding the regexes to UCD.html as patterns for use in
testing, after review by the ed committee.
- We need to be more explicit about some of the string values, since
reasonable people can currently differ in their interpretation. One issue is what
the value of the property is for code points not listed, such as unassigned
code points. For other properties, we now document that in the data files,
such as in DerivedAge.txt:
# All code points not explicitly listed for Age
# have the value unassigned.
# @missing: 0000..10FFFF; unassigned
But we don't do that for the string values. Recommendations are in
Table 2 below; the proposal is to document them in UCD.html and PropertyAliases.txt.
Generally the value should be some name if the property is catalog-like,
"" (empty) if it is information about a string (such as bmg),
and # (the source character itself) if it is a folding (since unaffected
characters should be left alone). This also needs to be applied to the
Unihan provisional properties.
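To make the effect of a documented default concrete, here is a minimal sketch of how a consumer might apply an @missing line. The function names are hypothetical, the listed entries are a tiny stand-in for real file data, and the second @missing line is invented to illustrate the "#" convention proposed above; it is not taken from any data file.

```python
import re

# Matches a data-file comment such as "# @missing: 0000..10FFFF; unassigned".
MISSING = re.compile(r"#\s*@missing:\s*([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*(\S.*)")

def parse_missing(line):
    """Return (first, last, default) for an @missing line, else None."""
    m = MISSING.match(line.strip())
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)

def value_of(cp, listed, missing):
    """Explicit value if the code point is listed, else the documented default."""
    if cp in listed:
        return listed[cp]
    first, last, default = missing
    if not first <= cp <= last:
        raise ValueError("code point outside the @missing range")
    # Assumption: "#" stands for the code point itself, as proposed above.
    return chr(cp) if default == "#" else default

age_default = parse_missing("# @missing: 0000..10FFFF; unassigned")
print(value_of(0x0041, {0x0041: "1.1"}, age_default))  # 1.1
print(value_of(0x0378, {0x0041: "1.1"}, age_default))  # unassigned

# Hypothetical @missing line for a folding-style string property.
fold_default = parse_missing("# @missing: 0000..10FFFF; #")
print(value_of(0x0031, {0x0041: "a"}, fold_default))   # 1
```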
- We do not make clear in PropertyValueAliases.txt what the default
notation is for booleans. Eric chose N/Y on the pattern of NFD_Quick_Check,
while I'd been using F/T. We should document whatever we choose in PropertyValueAliases.txt (and
probably Eric's choice is the better one). Note that this places no
requirement on APIs; it is just the format we choose for relaying
information. Document in UCD.html and PropertyAliases.txt the use of N/Y
for booleans, and the convention of using the presence or absence of the
property in listings in .txt files.
- The Jamo property was not done for 5.0, as per the following action. It
should be fixed in the next version. This needs no action from the UTC,
since we already have an action to do it.
- [106-C20]
Consensus: Document the Jamo_Short_Name property as a "contributory"
property for Unicode 5.0 in UCD.html, PropertyAliases.txt and
PropertyValueAliases.txt. Ref
L2/05-379R.
- We need to document in UCD.html and in the text of U5.0.1 that the algorithmic decomposition mapping values
for Hangul syllables are not the full ones but the pair-wise ones. These
correspond to all the other decomposition mappings for NFC. Example:
cp=CE31, dm=<CE20 11B8>, not <110E 1173 11B8>
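The difference between the pair-wise and the full mapping can be sketched as follows. This is an illustration using the standard Hangul syllable constants; the function names are hypothetical, and the full form is obtained here simply by applying the pair-wise mapping recursively.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables

def is_syllable(cp):
    return 0 <= cp - S_BASE < S_COUNT

def pairwise_dm(cp):
    """Pair-wise mapping: LVT -> <LV, T>, and LV -> <L, V>."""
    if not is_syllable(cp):
        raise ValueError("not a Hangul syllable")
    s_index = cp - S_BASE
    t_index = s_index % T_COUNT
    if t_index:
        return [cp - t_index, T_BASE + t_index]
    return [L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT]

def full_dm(cp):
    """Full decomposition, by applying pairwise_dm until only jamo remain."""
    out = []
    for c in pairwise_dm(cp):
        out.extend(full_dm(c) if is_syllable(c) else [c])
    return out

def show(cps):
    return "<" + " ".join(f"{c:04X}" for c in cps) + ">"

print(show(pairwise_dm(0xCE31)))  # <CE20 11B8>
print(show(full_dm(0xCE31)))      # <110E 1173 11B8>
```

The first line of output is the value proposed for dm; the second is what NFD produces for the same syllable.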
- Eric found a problem in CompositionExclusions in a comment: "if you look
at the character count for pile #3, it says 924. I believe it should be
1030. If you just add the four largest ranges, you already get more than
924: 542+106+59+270 = 977." Fixed by Ken.
- The intention for the canonicalized block names in
http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt
is for them to be suitable for use as identifiers. But some of them have "-" in
the name. The proposal is to make the old value an alias and add the fixed
new value. Here is an example of the change.
blk; n/a ; Arabic_Presentation_Forms-A
=>
blk; n/a ; Arabic_Presentation_Forms_A; Arabic_Presentation_Forms-A
Note: an alternative is to just replace them, since we specify that name
matching ignores hyphens.
- The canonical names in
http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt are
all titlecase or uppercase, except for those of Decomposition_Type (dt). It would be more
uniform if we fixed them. Here is an example of the change.
dt ; can ; Canonical
=>
dt ; Can ; Canonical ; can
Note: an alternative is to just replace them, since we specify that name
matching ignores case differences.
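Both notes above rely on loose matching of names. A minimal sketch, assuming that matching ignores case, hyphens, underscores, and spaces (the function names are hypothetical):

```python
def loose_key(name):
    """Fold a property value name for comparison: drop case, '-', '_' and spaces."""
    return "".join(ch for ch in name.lower() if ch not in "-_ ")

def same_name(a, b):
    return loose_key(a) == loose_key(b)

# Old and new block names compare equal, as do the dt abbreviations.
print(same_name("Arabic_Presentation_Forms-A", "Arabic_Presentation_Forms_A"))  # True
print(same_name("can", "Can"))                                                  # True
```

Under such matching, a parser that already folds names would accept either spelling, which is why plain replacement is a viable alternative to adding aliases.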
- The values of the String properties need to be better documented
regarding blank values in the source files. Where the value in
UnicodeData.txt is blank, that indicates that the code point maps to itself.
Thus Lowercase_Mapping("a") = "a", not the empty string. We document this for the
case of the simple lower/title/uppercase mappings (as below, from UCD.html):
Note: The simple lowercase may be omitted in the data file if the
lowercase is the same as the code point itself.
We need to document this for the other foldings:
cf ; Case_Folding
dm ; Decomposition_Mapping
FC_NFKC ; FC_NFKC_Closure
lc ; Lowercase_Mapping
scc ; Special_Case_Condition
sfc ; Simple_Case_Folding
tc ; Titlecase_Mapping
uc ; Uppercase_Mapping
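The "blank means maps to itself" reading can be sketched for the simple case mappings in UnicodeData.txt, where fields 12 to 14 (counting from 0) hold the simple uppercase, lowercase, and titlecase mappings. The function name is hypothetical.

```python
def simple_case_mappings(line):
    """Return (cp, upper, lower, title) for one UnicodeData.txt line.

    A blank field is read as "maps to itself", not as an empty string.
    """
    fields = line.split(";")
    cp = int(fields[0], 16)

    def value(field):
        return int(field, 16) if field else cp

    return cp, value(fields[12]), value(fields[13]), value(fields[14])

cp, uc, lc, tc = simple_case_mappings(
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041")
print(f"{cp:04X}: suc={uc:04X} slc={lc:04X} stc={tc:04X}")
# 0061: suc=0041 slc=0061 stc=0041
```

The lowercase field for U+0061 is blank in the file, yet the reported slc is 0061 rather than empty.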
- The abbreviation sfc for Simple_Case_Folding has two letters reversed.
Thus it should be fixed to:
sfc ; Simple_Case_Folding
=>
scf ; Simple_Case_Folding ; sfc
- The numeric values given in
http://unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt are
in decimal format (e.g., for U+00BD nv="0.5"), while the format in UCD.html is
rational numbers (e.g., "1/2"). We should consider fixing this
inconsistency, probably by changing the
DerivedNumericValues.txt format.
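A short sketch of why the two formats are not interchangeable, using Python's fractions module. The function names and the denominator bound are assumptions chosen for illustration.

```python
from fractions import Fraction

def rational_to_decimal(text):
    """Convert a rational string such as "1/2" to a decimal string."""
    return str(float(Fraction(text)))

def decimal_to_rational(text, max_denominator=1000):
    """Recover a small rational from a decimal string (a heuristic)."""
    return str(Fraction(text).limit_denominator(max_denominator))

print(rational_to_decimal("1/2"))          # 0.5
print(decimal_to_rational("0.5"))          # 1/2
# A truncated decimal is not equal to the rational it came from.
print(Fraction("0.3333") == Fraction("1/3"))  # False
```

Values like 1/2 survive the round trip, but a value like 1/3 has no exact finite decimal form, so recovering it from a decimal listing needs a heuristic such as the denominator bound above. Listing the rational directly avoids that step.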
- The scc / Special_Case_Condition property is not really well defined in
terms of its values. The overall recommendation from the ed committee is
that this be retracted as a property, and that the following information be
characterized in UCD.html and in the XML version as "conditional casing
data" instead of a formal property:
CaseFolding.txt
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0049; T; 0131; # LATIN CAPITAL LETTER I
SpecialCasing.txt
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
...
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
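As a sketch of how such conditional casing data could be separated from the unconditional mappings, the following reads SpecialCasing.txt lines and treats a non-empty fifth field as the condition list. The function name and the returned dictionary layout are hypothetical.

```python
def parse_special_casing(line):
    """Split one SpecialCasing.txt line into its mappings and conditions."""
    data = line.split("#", 1)[0]                  # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    code, lower, title, upper = fields[:4]
    # A non-empty fifth field holds the space-separated condition list.
    conditions = fields[4].split() if len(fields) > 4 else []
    return {"code": code, "lower": lower.split(), "title": title.split(),
            "upper": upper.split(), "conditions": conditions}

lines = [
    "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA",
    "0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I",
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
]
for entry in map(parse_special_casing, lines):
    kind = "conditional" if entry["conditions"] else "unconditional"
    print(entry["code"], kind, entry["conditions"])
```

Only the first two entries carry conditions; the third is an ordinary full case mapping.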
- Script value Unknown
The question of the rules for these values came up during the meeting.
I looked at them and here's what I found.
- For the script property, we currently say in the UAX that
script=Unknown (Zzzz) is {unassigned, private-use, and noncharacter}.
- What we actually have as property values includes all of those, but
adds surrogates, i.e., {unassigned, private-use, surrogates, and
noncharacter}.
I suggest that we leave the property the way it is, and change the
documentation.
Table 2. Proposed Regex and Unlisted values
Abbr | Name | Rec. Regex for Allowable Values for the listing of properties in our data files | Rec. Value for Unlisted |
age | Age | /([0-9]+\.[0-9]|unassigned)/ | unassigned (already defined) |
nv | Numeric_Value | /-?[0-9]+\.[0-9]+/ NEEDS fixing for fractions | NaN |
blk | Block | /[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/ | No_Block (add Script - Unknown) |
sc | Script | The code point itself, but # can be used to represent that in certain circumstances. |
dm | Decomposition_Mapping | /[\x{0}-\x{10FFFF}]+/ |
FC_NFKC | FC_NFKC_Closure |
cf | Case_Folding | /[\x{0}-\x{10FFFF}]+/ |
lc | Lowercase_Mapping |
tc | Titlecase_Mapping |
uc | Uppercase_Mapping |
sfc | Simple_Case_Folding | /[\x{0}-\x{10FFFF}]/ |
slc | Simple_Lowercase_Mapping |
stc | Simple_Titlecase_Mapping |
suc | Simple_Uppercase_Mapping |
bmg | Bidi_Mirroring_Glyph | /[\x{0}-\x{10FFFF}]?/ | "" |
isc | ISO_Comment | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\)?/ Asmus/Ken to supply actual value |
na1 | Unicode_1_Name | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*(\ \((CR|FF|LF|NEL)\))?)?/ look also at the angle brackets. | "" for na1; null or empty should be the default in properties: in display the following can be used: <reserved>, <control>, <private-use>, <surrogate>, <noncharacter> |
na | Name | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\)?/ |
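As a sketch of how the Table 2 patterns could be used in testing, the following checks a few sample values against three of the regexes, copied without the surrounding slashes. The helper name and the sample list are hypothetical; only the patterns come from the table.

```python
import re

# Patterns copied from Table 2, without the surrounding slashes.
PATTERNS = {
    "age": r"([0-9]+\.[0-9]|unassigned)",
    "nv": r"-?[0-9]+\.[0-9]+",
    "blk": r"[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*",
}

def allowed(prop, value):
    """True if the whole value matches the draft pattern for the property."""
    return re.fullmatch(PATTERNS[prop], value) is not None

samples = [
    ("age", "5.0"), ("age", "unassigned"),
    ("nv", "0.5"), ("nv", "1/2"),
    ("blk", "Arabic_Presentation_Forms_A"),
    ("blk", "Arabic_Presentation_Forms-A"),
]
for prop, value in samples:
    print(f"{prop:4} {value:30} {allowed(prop, value)}")
```

With these drafts, "1/2" is rejected for nv, which is the fraction gap flagged in the table, and the hyphenated block name is rejected for blk while the underscore form passes.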