Mark Davis, 2004-07-23
Latest Version: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/utc/technical_report_recommendations.html
The following are recommended changes for certain technical reports in the U4.1 timeframe. The proposal is to authorize the posting of Proposed Updates incorporating the following changes between now and November, to allow more time for public feedback.
Needs updating for U4.1 to incorporate the corrigendum and to move the identifier section to TR31 (a stub will be left to point to it), plus editing to make the definitions more consistent with #23 and #30. The following have been reported as problems; they need to be reviewed and, if confirmed, fixed.
This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) respects canonical equivalence when canonically equivalent inputs always produce canonically equivalent outputs. For functions that map strings to strings, this is often called preserving canonical equivalence. There are a number of important aspects to this concept:
1. We should modify word selection so that it has the same 'escape hatch' as line break, for Thai/Lao. It would thus be parallel to Line Break's LB 1, and add the character classes that are described there.
LB 1 Assign a line breaking class to each character of the input. Resolve AI, CB, SA, SG, XX into other line breaking classes depending on criteria outside the scope of this algorithm.
2. In a related matter, we need to incorporate the LineBreak corrigendum into LineBreak, and modify the TR to remove LB 7a.
LB 7a In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.
We should also document that NBSP is the preferred base character for showing combining marks in isolation.
3. There are other "special" rules in LineBreak.
LB 6 Don’t break a Korean Syllable Block, and treat it as a single unit of the same LB class as a Hangul Syllable in all the following rules
Treat a Korean Syllable block as if it were ID
LB 7b Don't break a combining character sequence and treat it as if it has the LB class of the base character in all of the following rules.
Treat X CM* as if it were X
These rules are, in general, difficult for both regular-expression implementations and pair tables. They complicate regular expressions because they affect every position where any character could match; they complicate pair tables because they require pre-handling in code, outside of the pair table. If they are present, they should be the top rules, since they have to be 'handled' by changes to every rule down the line. Both of the rules in LineBreak end up (because of other rules) having the same effect as the rule used in TR29: to treat a grapheme cluster as its base. This is because other rules in UAX#14 keep the differences between LineBreak=CM and Grapheme_Extend=true from surfacing. However, I have gotten feedback from our implementers that it would be ideal if the two were unified, that is, if UAX#14 had, instead of these two rules, the one rule used in #29:
Treat a grapheme cluster as if it were a single character: the first character of the cluster.
"I also think that it would help make things simpler if this grapheme cluster rule could be put as close to the top of the list of rules as possible, so that we don't have some rules looking within grapheme clusters, and others only at the boundaries. It would have to be a bug if a line break were to break a grapheme cluster, so it should be possible to say, from the very beginning, that line break rules work on grapheme cluster boundaries, and be done with any further consideration of combining marks (except for unattached ones)"
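As a rough illustration of the "treat a grapheme cluster as its first character" rule, the following sketch groups each base character with the marks that follow it, so that later rules only ever see the base. It is a simplification under stated assumptions: the helper names are hypothetical, Grapheme_Extend is approximated by the general categories Mn and Me, and Hangul syllable blocks and CR LF are not handled.

```python
import unicodedata


def is_extend(ch):
    # Assumption: approximate Grapheme_Extend=true by general category Mn or Me.
    return unicodedata.category(ch) in ("Mn", "Me")


def cluster_bases(text):
    """Split text into simplified clusters and return (first_char, cluster) pairs.

    Later rules would look only at first_char, never inside the cluster.
    An unattached mark at the start of the text forms its own cluster.
    """
    clusters = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return [(cluster[0], cluster) for cluster in clusters]
```

With this in place, a break algorithm can run over the first element of each pair and treat the position between two pairs as the only candidate break opportunity.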
4. Failing adoption of #3 by the committee, LB6 should be replaced by ordinary rules. It only has an effect in 5 rules:
ID × IN
ID × PO
PR × ID
ALL ÷
÷ ALL
We can safely replace it by adding the following rules. They can go anyplace before the ALL rules, and can be put in logical locations. The first three rules correspond to the first three above; the last three disallow breaking in the middle of a Hangul Syllable (as described in Chapter 3).
L | V | T | LV | LVT × IN
L | V | T | LV | LVT × PO
PR × L | V | T | LV | LVT
L × L | V | LV | LVT
V | LV × V | T
T | LVT × T
As Asmus noted off-line, a common tailoring is to change Hangul Syllables to AL, but because sequences of AL don't divide either, it is safe to add the above rules: the tailoring just changes all of L | V | T | LV | LVT to AL.
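The last three replacement rules (no break inside a Hangul syllable) can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, and the jamo ranges used for L, V, and T are an assumption based on the basic Hangul Jamo block rather than on the property data for any particular Unicode version.

```python
def hangul_type(ch):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for a single character."""
    cp = ord(ch)
    # Assumed ranges: basic Hangul Jamo block only.
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th code point has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


# Class before the boundary -> classes that may not be separated from it.
NO_BREAK_AFTER = {
    "L": {"L", "V", "LV", "LVT"},  # L x L | V | LV | LVT
    "V": {"V", "T"},               # V | LV x V | T
    "LV": {"V", "T"},
    "T": {"T"},                    # T | LVT x T
    "LVT": {"T"},
}


def break_prohibited(before, after):
    """True if the three syllable rules forbid a break between the two characters."""
    return hangul_type(after) in NO_BREAK_AFTER.get(hangul_type(before), set())
```

Under the common tailoring mentioned above, all five classes collapse to AL, and the table becomes redundant because AL × AL already prohibits the break.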
5. Line Break Rule LB 18B can be dropped altogether. It has no effect on the results; anything that it would break will also be broken by LB20.
HY ÷
÷ BB
6. Deborah identified some cases that are missed if the regular expression for numbers is used, rather than the list of pairs of rule 18:
Original LB18: PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?
Updated: PR ? ( OP | HY ) ? NU (NU | IS | SY ) * CL ? PO ?
PR × AL
PR × ID
Nothing changes if the rules themselves are being used; the update matters only if the big regular expression is being used as an alternative to the rules.
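To make the difference between the two expressions concrete, here is a minimal sketch that runs them over sequences of line break class names. The encoding is an assumption made for illustration: each character is represented by its class name, and a sequence is written as the names joined by single spaces.

```python
import re

# Original LB18: PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?
ORIGINAL = r"(?:PR )?(?:(?:OP|HY) )?NU(?: (?:NU|IS))*(?: CL)?(?: PO)?"

# Updated: SY is added to the classes allowed inside the number.
UPDATED = r"(?:PR )?(?:(?:OP|HY) )?NU(?: (?:NU|IS|SY))*(?: CL)?(?: PO)?"


def is_number(pattern, classes):
    """True if the whole sequence of class names matches the pattern."""
    return re.fullmatch(pattern, " ".join(classes)) is not None


# A sequence such as digit, solidus, digit has the classes NU SY NU.
fraction = ["NU", "SY", "NU"]
print(is_number(ORIGINAL, fraction))  # False
print(is_number(UPDATED, fraction))   # True
```

A sequence that involves no SY, such as PR OP NU IS NU CL PO, matches both expressions.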
Needs updating for U4.1 to account for changes in foldings, properties, names, scripts, and also the implications of Pattern_Whitespace and Pattern_Syntax. Other items:
I have scanned through a number of the TRs, and found other useful definitions that we should centralize so that they can be used consistently. Asmus may have already incorporated some of these into his draft for the meeting; if so, skip over those. This is not a request that these definitions be added verbatim; they may need wordsmithing and changes for consistency.
isIdentifier(S)
implies isIdentifier(toNFC(S))
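A closure property of this kind can be spot-checked on sample strings. The sketch below uses a hypothetical helper name and takes the predicate as a parameter; in the usage shown, Python's built-in `str.isidentifier` is only a stand-in for isIdentifier, not the definition intended here.

```python
import unicodedata


def nfc_closure_violations(predicate, samples):
    """Return the samples S where predicate(S) holds but predicate(toNFC(S)) does not."""
    return [
        s for s in samples
        if predicate(s) and not predicate(unicodedata.normalize("NFC", s))
    ]


samples = ["A\u0301bc", "x\u0323\u0301", "caf\u00e9", "1abc"]
# str.isidentifier is a stand-in predicate (an assumption for this example).
print(nfc_closure_violations(str.isidentifier, samples))
```

An empty result on a sample set does not prove the property; it only shows that no counterexample was found among the strings tested.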
If the input is guaranteed to be in NFD, then Step 1 is simpler. Additional rules do not have to be generated; instead, the matching part of the rule just needs to be transformed into NFD. Thus instead of generating new rules, one simply replaces a rule like:
<A-acute> <dot_under> -> Z
with:
A <dot_under> <acute> -> Z
Step 2 will be the same; however, it will need to be applied to fewer cases, since fewer rules will result from Step 1.
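The rule transformation for NFD input can be sketched as follows. The representation of rules as a dictionary from match string to result string, and the function name, are assumptions made for the example.

```python
import unicodedata


def rules_to_nfd(rules):
    """Rewrite the matching part of each rule into NFD; results are left as-is."""
    return {
        unicodedata.normalize("NFD", match): result
        for match, result in rules.items()
    }


# <A-acute> <dot_under> -> Z
rules = {"\u00C1\u0323": "Z"}

# Canonical ordering places dot_under (combining class 220) before
# acute (combining class 230), giving A <dot_under> <acute> -> Z.
nfd_rules = rules_to_nfd(rules)
```

Here `nfd_rules` has the single key `"A\u0323\u0301"`, which is the form the rule must take to match input that is already in NFD.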
The above two steps ensure that the folding preserves canonical equivalence. However, they do not guarantee that the folding preserves normalization. If normalization is required, then it must be applied as an additional step. This is typically an issue whenever the result of a rule contains combining marks. If normalization is to be applied after each rule is applied, [Normalization] describes implementation techniques for optimizing this process. However, if there is any sizable number of changes, it is more efficient, and certainly simpler, to normalize the entire text once all of the rules have been applied.
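The simpler strategy, normalizing once after all rules have run, might look like the sketch below. The rule shown is hypothetical and chosen only because its result contains a combining mark; plain substring replacement stands in for whatever rule-matching engine is actually used.

```python
import unicodedata


def fold_then_normalize(text, rules, form="NFC"):
    """Apply every folding rule, then normalize the whole text a single time."""
    for match, result in rules.items():
        # Naive substring replacement (assumption: no overlapping rules).
        text = text.replace(match, result)
    return unicodedata.normalize(form, text)


# Hypothetical rule whose result is 'e' followed by a combining acute.
rules = {"E": "e\u0301"}
folded = fold_then_normalize("cafE", rules)
```

Without the final step the output would end in the two-character sequence `e` plus U+0301, which is not in NFC; after the single normalization pass it ends in the precomposed U+00E9.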
These are copied from email on property topics: