L2/02-360
Re: Property file changes (remainder after August meeting)
From: Mark Davis
Date: 2002-10-25
This document contains the remaining items from L2/02-267R3
that we did not finish in the August meeting. I am not going to bother making
everything pretty: the items we have already discussed are simply truncated and struck-through.
I believe the time has come to use UTF-8 consistently in all of our property data files. Currently Unihan.txt and NormalizationTest.txt are in UTF-8, a couple of files are in Latin-1, and most files are in ASCII. However, importantly:
This means that parsers that strip comments don't even need to know that the file is UTF-8 (unless they parse Unihan.txt); they can just treat it as ASCII. If we continue to follow these two principles, the switchover is almost unnoticeable. Initially, this would only matter in the few files that contain non-ASCII Latin-1 characters. Later, we could add real, readable annotations in comments to some of the files, e.g.:
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
could become:
00DF; 00DF; 0053 0073; 0053 0053; # ß; ß; Ss; SS; LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # İ; i̇; İ; İ; LATIN CAPITAL LETTER I WITH DOT ABOVE
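As a rough illustration of why the switch would be nearly invisible to existing tools, here is a minimal Python sketch of a comment-stripping parser that never decodes the comment text. It assumes that non-ASCII characters occur only after a `#`, which is how the annotations above are written; the function name `parse_data_lines` is hypothetical and not part of any Unicode tool.

```python
def parse_data_lines(raw: bytes):
    """Yield the semicolon-separated fields of each data line.

    Assumption (not guaranteed by the data files themselves): any
    non-ASCII text appears only in comments, i.e. after a '#'.
    """
    for line in raw.splitlines():
        # Drop the comment. In UTF-8, bytes below 0x80 never occur inside
        # a multi-byte sequence, so splitting on b"#" cannot cut a
        # character in half.
        data = line.split(b"#", 1)[0].strip()
        if not data:
            continue  # blank or comment-only line
        # The remainder is expected to be pure ASCII; decode() raises
        # UnicodeDecodeError if that assumption is violated.
        yield [field.strip() for field in data.decode("ascii").split(";")]


sample = (
    "00DF; 00DF; 0053 0073; 0053 0053; # ß; ß; Ss; SS; LATIN SMALL LETTER SHARP S\n"
    "# comment-only line with non-ASCII text: İ\n"
).encode("utf-8")

records = list(parse_data_lines(sample))
```

Because the data ends with a trailing semicolon, each record in this sketch carries an empty final field; a real parser would decide whether to keep or drop it.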
As a general rule, we should not have the fallback value for a property (the one that we give code points that are not explicitly mentioned) require computation; it should be a single value. Otherwise, it is too error-prone; too easy for programmers to make mistakes when processing the data files.
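To make the point concrete, the following Python sketch shows how simple a lookup becomes when the fallback is one constant. The class name `PropertyTable`, the ranges, and the values `"A"`, `"B"`, and `"N"` are all hypothetical and chosen only for illustration; they are not taken from any data file.

```python
from bisect import bisect_right


class PropertyTable:
    """Explicitly listed ranges plus ONE fallback value for everything else."""

    def __init__(self, ranges, default):
        # ranges: (first, last, value) tuples, inclusive and non-overlapping
        self._ranges = sorted(ranges)
        self._starts = [first for first, _, _ in self._ranges]
        self._default = default

    def value(self, cp: int) -> str:
        # Find the last range whose start is <= cp, then check its end.
        i = bisect_right(self._starts, cp) - 1
        if i >= 0:
            first, last, val = self._ranges[i]
            if first <= cp <= last:
                return val
        # No computation, no special cases: a single constant.
        return self._default


# Hypothetical data, for illustration only.
table = PropertyTable(
    [(0x0600, 0x0603, "A"), (0x0041, 0x005A, "B")],
    default="N",
)
```

With a computed fallback, the final `return` would instead need its own range logic, which is exactly the part that implementers tend to get wrong or omit.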
The way that the BIDI class property is handled is very error-prone. We say in UAX #9 that all unassigned code points are given the following values
Unfortunately, this is not repeated in UnicodeData-3.2.0.html (where the properties of UnicodeData.txt are documented). Nor are the relevant R and AL code points listed explicitly in DerivedBidiClass-3.2.0.txt. We should address both of these points: document the ranges in the .html file, and add the code points to DerivedBidiClass.txt.
The Joining Type T is also not explicitly listed in ArabicShaping-3.2.0.txt. While in this case, at least the formula for computing T is included in the comments in the file, it would be less error-prone if they were listed explicitly. Those values are already given in DerivedJoiningType-3.2.0.txt.
The data file says:
# - Assigned characters that are not listed explicitly are given the value "N".
It omits telling what the default is for unassigned code points. I assume they are also N, in which case this needs to be changed to:
# - All code points that are not listed explicitly are given the value "N".
If they are not all N, then the ones that aren't should be explicitly listed!
The data file says:
# - Assigned characters that are not listed explicitly are given the value
#   "AL".
# - Unassigned characters are given the value "XX".
The data file actually lists all the characters that are AL, and should. The above should be changed to:
# - All code points that are not listed explicitly are given the value "XX".
UnicodeData.html says: "This field is omitted if the titlecase is the same as field 12."
A user noted that "this is apparently not true, except for 01C5, 01C8, 01CB and 01F2." The data should consistently either omit or include the field (when the same as field 12), and the documentation should match.
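Until the data and documentation agree, a parser has to cope with both conventions. The Python sketch below reads the titlecase mapping (field 14) and falls back to field 12 when it is empty. The function name is hypothetical, the second sample record is a made-up variant with field 14 removed, and the final fallback to the code point itself (when field 12 is also empty) is an assumption of this sketch.

```python
def titlecase_mapping(record: str) -> str:
    """Return the titlecase mapping from a UnicodeData.txt-style record."""
    fields = record.split(";")
    code, upper, title = fields[0], fields[12], fields[14]
    # Field 14 present: use it. Field 14 omitted: documented to equal
    # field 12. Both empty: assume the character maps to itself.
    return title or upper or code


# Record with field 14 given explicitly, although it equals field 12.
explicit = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"

# Hypothetical variant of the same record with field 14 omitted.
omitted = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"
```

Both records yield `0041`, so a parser written this way is unaffected by whichever convention is finally chosen.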
The following text is in the rules:
The text should be clearer that it is reasonable (but not required) to use the regular expression, instead of the approximate rules. (It gives better results than the pairwise approach, and for regular-expression-based linebreak engines, is much easier to implement.)
However, the term "here" is imprecise. The normal interpretation would be that A ^ B iff A × B and A SP* × B. In that case, "here" means just "before the B". However, there are certain cases, such as ZW CL, where ZW SP × CL but ZW does break from CL. If the table is right, then the definition needs to be changed to A ^ B iff A × B and A SP* × B and A × SP* × B. I suggest the following clarification.
LB 12 Break after spaces
SP ÷
LB 13 Don’t break before or after NBSP or WORD JOINER
× GL
GL ×
The main purpose of WORD JOINER is exactly to prevent breaks where they would otherwise occur. The minimal fix is to change the ordering of LB 13, to move it to being LB 11b.
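The effect of the reordering can be seen in a small Python sketch in which rules are tried in order and the first one that applies decides the outcome. Only the two rules quoted above are modeled; everything else in the line breaking algorithm is left out, and the function names are hypothetical.

```python
BREAK, NO_BREAK = "break", "no break"


def lb12(before, after):
    # LB 12: SP ÷  (break after spaces)
    return BREAK if before == "SP" else None


def lb13(before, after):
    # LB 13: × GL and GL ×  (no break before or after GL)
    return NO_BREAK if "GL" in (before, after) else None


def resolve(rules, before, after):
    """Apply rules in order; the first rule that matches wins."""
    for rule in rules:
        result = rule(before, after)
        if result is not None:
            return result
    return None  # neither rule of this two-rule subset applies


current_order = [lb12, lb13]   # LB 12 is tried before LB 13
proposed_order = [lb13, lb12]  # LB 13 moved ahead, as "LB 11b"

before_fix = resolve(current_order, "SP", "GL")
after_fix = resolve(proposed_order, "SP", "GL")
```

With the current order a space followed by a GL character yields a break; with the proposed order the GL rule is reached first and the break is suppressed, while pairs such as SP followed by AL still break.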
SG - Surrogates (XP) - (normative)
All characters with General Category Cs. There is no break between a high surrogate and a low surrogate....
Formally, each stable code point CP fulfills all the following conditions:
Example: In NFC, a-breve might satisfy all but (e), but if you add an ogonek it changes to a-ogonek + breve. So it is not stable. However, a-ogonek is stable in NFC, since it does satisfy (a-e).
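The a-breve example can be replayed with Python's standard `unicodedata` module. This is only a spot check with a single following character, not a test of the formal conditions; the helper name `survives_nfc` is hypothetical.

```python
import unicodedata

A_BREVE = "\u0103"   # LATIN SMALL LETTER A WITH BREVE
A_OGONEK = "\u0105"  # LATIN SMALL LETTER A WITH OGONEK
BREVE = "\u0306"     # COMBINING BREVE
OGONEK = "\u0328"    # COMBINING OGONEK


def survives_nfc(cp: str, suffix: str) -> bool:
    """True if NFC(cp + suffix) still begins with cp itself."""
    return unicodedata.normalize("NFC", cp + suffix).startswith(cp)


# a-breve followed by an ogonek is rewritten to a-ogonek + breve ...
unstable = unicodedata.normalize("NFC", A_BREVE + OGONEK)

# ... whereas a-ogonek followed by a breve is left alone.
stable = unicodedata.normalize("NFC", A_OGONEK + BREVE)
```

Both sequences normalize to U+0105 U+0306, which is why a-breve changes when the ogonek is appended and a-ogonek does not when the breve is appended.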
There are pluses and minuses to adding these properties:
The downside is that the list of characters is rather large, so it bloats the size of the file.
For UTR #29: Text Boundaries, it would be useful to add some new properties once it is finalized (so probably in 4.0 instead of 3.2). That way people can use machine-readable properties instead of digging them out of TR text. The possibilities should be reviewed with the UTC review of the TR. Candidates include:
The tables below match the TR14 ordering, and use the notation described in the TR. The table is extended by also including SP..CB, and the values L, V, and T for Hangul Jamo. If your browser is enabled for tool-tips, then hovering over the cell reveals the Rule number that determines the breaking status in the case in question. Sometimes there are multiple rules, when a case has to be tested with and without intervening spaces. The differences between the two are marked in yellow.
This does not imply that the current layout of the table in the TR should be changed to be as large as the one below. The more complete table below is simply provided to illustrate the effects of the recommended changes.
| | OP | CL | QU | GL | NS | EX | SY | IS | PR | PO | NU | AL | ID | IN | HY | BA | BB | B2 | ZW | CM | SP | BK | CR | LF | CB | L | V | T |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OP | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ |
CL | _ | ^ | % | % | ^ | ^ | ^ | ^ | _ | % | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
QU | ^ | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ | ^ | ^ | ^ | ^ | % | % | ^ | ^ |
GL | % | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ | ^ | ^ | ^ | ^ | % | % | ^ | ^ |
NS | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
EX | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
SY | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
IS | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
PR | % | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | % | % | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | % | ^ | ^ |
PO | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
NU | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | % | % | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
AL | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | % | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
ID | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
IN | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
HY | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
BA | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
BB | % | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ | ^ | ^ | ^ | ^ | % | % | ^ | ^ |
B2 | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
ZW | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | _ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
CM | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
SP | _ | ^ | _ | _ | _ | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
BK | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
CR | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | % | _ | _ | _ | _ |
LF | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
CB | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
L | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | % | ^ | ^ |
V | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
T | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | ^ | ^ |
| | OP | CL | QU | GL | NS | EX | SY | IS | PR | PO | NU | AL | ID | IN | HY | BA | BB | B2 | ZW | CM | SP | BK | CR | LF | CB | L | V | T |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OP | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ |
CL | _ | ^ | % | ^ | ^ | ^ | ^ | ^ | _ | % | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
QU | ^ | ^ | % | ^ | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ | ^ | ^ | ^ | ^ | % | % | % | % |
GL | % | ^ | % | ^ | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ | ^ | ^ | ^ | ^ | % | % | % | % |
NS | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
EX | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
SY | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
IS | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
PR | % | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | % | % | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | % | % | % |
PO | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
NU | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | % | % | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
AL | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | % | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
ID | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
IN | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
HY | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
BA | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
BB | % | ^ | % | ^ | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ | ^ | ^ | ^ | ^ | _ | % | % | % |
B2 | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
ZW | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | _ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
CM | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
SP | _ | ^ | _ | ^ | _ | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
BK | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
CR | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | % | _ | _ | _ | _ |
LF | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
CB | _ | ^ | % | ^ | _ | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | _ |
L | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | % | % | _ |
V | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | % | % |
T | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ | ^ | ^ | ^ | ^ | _ | _ | _ | % |
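
Since the yellow highlighting does not survive in plain text, the differences between the two tables can be recovered mechanically. The Python sketch below parses tables in the layout used above (row label first, cells separated by `|`) and lists the cells that differ. The function names are hypothetical, and the two inputs are only a four-column excerpt of the first three rows of each table.

```python
def parse_pair_table(text: str) -> dict:
    """Map (row class, column class) -> cell symbol for a pipe-separated table."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Header: keep only the non-empty cells, so a blank corner cell is ignored.
    columns = [c.strip() for c in lines[0].split("|") if c.strip()]
    table = {}
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        if cells[0].startswith("-"):
            continue  # delimiter row such as ---|---|---
        values = [c for c in cells[1:] if c]
        for col, val in zip(columns, values):
            table[(cells[0], col)] = val
    return table


def diff_tables(a: dict, b: dict) -> list:
    """Return sorted (pair, value_in_a, value_in_b) for cells that differ."""
    return sorted(
        (pair, a[pair], b[pair])
        for pair in a.keys() & b.keys()
        if a[pair] != b[pair]
    )


# Excerpt only: columns OP..GL of rows OP, CL, QU from each table above.
FIRST = """
   | OP | CL | QU | GL
OP | ^  | ^  | ^  | ^
CL | _  | ^  | %  | %
QU | ^  | ^  | %  | %
"""
SECOND = """
   | OP | CL | QU | GL
OP | ^  | ^  | ^  | ^
CL | _  | ^  | %  | ^
QU | ^  | ^  | %  | ^
"""

changes = diff_tables(parse_pair_table(FIRST), parse_pair_table(SECOND))
```

For this excerpt the only changed cells are CL followed by GL and QU followed by GL, both going from `%` to `^`. Feeding in the full tables would list every highlighted cell.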