Re: Recommended changes to UAX #29, #14
From: Mark Davis, Andy Heninger
Date: $Date: 2005/10/01 17:26:11 $
UAX #14 and #29 describe rules for detecting boundaries between text segments. As with all Unicode algorithms, implementations just need to get the same results; they don't have to follow the architecture. And for efficiency, most implementations use different mechanisms.
In CLDR, we recently introduced structure for tailoring boundaries. In so doing, we followed the rule structure of those UAXs as closely as possible. One advantage is that we now have monkey tests that can compare an implementation against one that follows the rules precisely. That allows implementations to avoid gratuitous deviations.
However, we found a few problems in the rules themselves. This document proposes some changes in response.
In particular, the "treat X as Y" rules turn out to be difficult to interpret precisely, and they exposed a few edge cases. Take the following rules from Word boundaries:
Treat a grapheme cluster as if it were a single character: the first character of the cluster. Do not break within it.
GC → FC (3)
Ignore trailing Format characters. That is, ignore Format characters in all subsequent rules (except the last rule).
X Format* → X (4)
To remind people, Format and Extend characters are:
Format: General_Category = Format (Cf) minus Joiner, Non-Joiner
Extend: Me + Mn + Joiner, Non-Joiner, plus a few exceptional Mc characters: 09BE BENGALI VOWEL SIGN AA, 09D7 BENGALI AU LENGTH MARK, ...
Grapheme clusters do 3 things. They keep (a) CRLF together; they keep (b) Hangul syllables together; and keep (c) non-spacing sequences together with their bases (more exactly, Extend characters with their bases).
If the rules for the boundary conditions would otherwise keep Hangul syllables together (as is the case for our Word, Line, and Sentence breaks), then (3) is equivalent to
3a) CR × LF
3b) X Extend* -> X.
Line break already separates these out (and modifies its version of 3b). After looking at the interactions of these with other rules, we think that is the better approach in #29 also. It tends to expose some edge cases so that we can make the resolution clearer, as we will see below.
Section 6.2 explains how the rule X Extend* -> X is to be interpreted (http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules). That is, this rule means that in every subsequent rule, each instance of X (where X is any element of the rule except the last one after the break) is changed into X Extend*. Thus:
X Y × Z W
becomes
X Extend* Y Extend* × Z Extend* W
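The rewriting step above can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular implementation: the function name `expand` is hypothetical, and it assumes a rule is written as simple space-separated class names with a single × or ÷ operator (parenthesized or negated expressions are not handled).

```python
def expand(rule, ignorable="Extend*"):
    """Append `ignorable` after every element of a rule except the last one.

    Assumes space-separated class names and a single × or ÷ operator.
    """
    tokens = rule.split()
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        is_operator = tok in ("×", "÷")
        is_last = i == len(tokens) - 1
        # Operators and the final element are left untouched.
        if not is_operator and not is_last:
            out.append(ignorable)
    return " ".join(out)


print(expand("X Y × Z W"))
```

Under those assumptions, the call prints the expanded form of the example rule shown above.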
This section needs some revisions for clarity, and to specifically mention the following points.
a. An expression like "¬(OLetter | Upper | Lower | Sep)" needs to be treated as a whole -- negation is limited to expressions that denote a set of characters -- thus it turns into "¬(OLetter | Upper | Lower | Sep) Extend*", not "¬(OLetter Extend* | Upper Extend* | Lower Extend* | Sep Extend*) "
b. The 'treat as' rule also means that the following is required:
× $Extend
That is, that you don't allow breaking within such sequences (in rules after the 'treat as' rule).
In Word Break, we would then have the following rules:
3a) CR × LF
3b) X Extend* -> X
4) X Format* -> X.
In an implementation according to section 6.2, rules 3b and 4 turn into the following rules, and each X in subsequent rules turns into (X Extend* Format*).
3b*) × $Extend
4*) × $Format
However, that means that Extend and Format characters are not completely ignorable. If you insert a Format character between a character and an Extend character, it introduces a break into a word that had none. Likewise, if you insert an Extend character after a Format character, it also introduces a break into a word that had none. This seems counter-intuitive. Our recommendation is to combine these two rules into a new 4':
4') X (Extend | Format)* -> X
which essentially ignores both Extend and Format characters in words, wherever they occur. While this change is not necessary in Sentence break (since the default action is to keep characters together rather than break them apart), for parallelism we should make the change there too (it doesn't hurt anything).
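The difference between the two formulations can be seen in a toy model. The sketch below is a deliberate simplification, not the real Word Break algorithm: it assumes one-letter class codes (A = ALetter, E = Extend, F = Format), a single content rule "ALetter × ALetter", and a break everywhere else. The names `word_breaks`, `OLD`, and `NEW` are made up for the example.

```python
import re

OLD = "E*F*"    # each X becomes X Extend* Format* (rules 3b and 4 applied separately)
NEW = "[EF]*"   # each X becomes X (Extend | Format)* (proposed rule 4')


def word_breaks(classes, skip):
    """Return break positions in a string of one-letter class codes."""
    out = []
    for i in range(1, len(classes)):
        left, right = classes[:i], classes[i]
        if right in "EF":
            continue  # × Extend and × Format: never break before them
        if right == "A" and re.search("A" + skip + "$", left):
            continue  # ALetter × ALetter, with the ignorable run skipped
        out.append(i)  # otherwise break
    return out


print(word_breaks("AEA", OLD), word_breaks("AFEA", OLD), word_breaks("AFEA", NEW))
```

With the old expansion, the sequence ALetter Format Extend ALetter gets a break before the final letter because "Format Extend" does not match "Extend* Format*"; with the combined rule it does not.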
Sentence break has the following rules:
Break after paragraph separators.
Sep ÷ (3)
Treat a grapheme cluster as if it were a single character: the first character of the cluster. Do not break within it.
GC → FC (4)
Ignore trailing Format characters. That is, ignore Format characters in all subsequent rules.
X Format* → X (5)
Now that we have broken down what GC->FC means, it is clear that this is an error, since it would break between CR and LF (both being members of Sep). Since we have separated out the CRLF rule above, now fixing this becomes easy:
3a) CR × LF
3b') Sep ÷
4') X (Extend | Format)* -> X
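The ordering problem can be illustrated with a small first-match-wins rule evaluator. This is a sketch under stated assumptions: the helper names are hypothetical, each rule looks only at the classes on either side of a position, and positions that no rule decides are treated as "no break".

```python
SEP = {"CR", "LF", "Sep"}  # CR and LF are treated as members of Sep


def rule_crlf(left, right):
    # 3a) CR × LF
    return "×" if (left, right) == ("CR", "LF") else None


def rule_sep(left, right):
    # Sep ÷
    return "÷" if left in SEP else None


def boundaries(classes, rules):
    """Apply rules in order at each position; the first rule that matches wins."""
    out = []
    for i in range(1, len(classes)):
        for rule in rules:
            verdict = rule(classes[i - 1], classes[i])
            if verdict is not None:
                if verdict == "÷":
                    out.append(i)
                break
    return out


text = ["Other", "CR", "LF", "Other"]
print(boundaries(text, [rule_sep, rule_crlf]))  # Sep ÷ evaluated first
print(boundaries(text, [rule_crlf, rule_sep]))  # CR × LF evaluated first
```

When "Sep ÷" is evaluated before the CRLF rule, the sketch reports a boundary between CR and LF; with the proposed order it reports only the boundary after LF.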
Sentence break has the following rules.
ATerm Close* Sp* × ( ¬(OLetter | Upper | Lower | Sep) )* Lower (8)
Break after sentence terminators, but include closing punctuation, trailing spaces, and (optionally) a paragraph separator.
( STerm | ATerm ) Close* × ( Close | Sp | Sep ) (9)
( STerm | ATerm ) Close* Sp × ( Sp | Sep ) (10)
( STerm | ATerm ) Close* Sp* ÷ (11)
Rule #8 is part of a set dealing with ambiguous sentence terminators (like "."). There are cases where sequences of STerm or ATerm occur at the end of a sentence, so to cover that we should introduce the following.
8b) (STerm | ATerm) Close* Sp* × (STerm | ATerm)
Rules 9-11 are to capture the rule that you break after (but not within) the expression (( STerm | ATerm ) Close* Sp* Sep?), that is: a terminator, followed by optionally any number of close punctuation characters, followed optionally by any number of space characters, followed optionally by a paragraph separator character. Since the goal is to allow any number of Space characters, rule 10 needs a minor fix:
10) ( STerm | ATerm ) Close* Sp* × ( Sp | Sep )
The editorial committee recommends changing 'user character' to 'user-perceived character' in analogy / contrast to 'user-defined character'. This requires edits to UAX#29 and perhaps other places.
Linebreak has:
LB 1 Assign a line breaking class to each code point of the input. Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.
This is fine and good, but if the implementation doesn't handle these specially, they need to have defaults; they must be resolved to something for the rest of the algorithm to work, since all but CB don't have rules associated with them. Right now, you have to go rooting around in the text to figure that out, and people may make mistakes in doing so. So add to this rule something like the following text:
In the absence of such criteria, by default the classes AI, SA, SG, and XX are resolved to AL.
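In code, the proposed default amounts to a small fallback table consulted when no tailoring is supplied. The following is a minimal sketch; the names `DEFAULT_RESOLUTION` and `resolve_class` and the shape of the `tailoring` argument are assumptions made for the example.

```python
# Proposed defaults for LB 1: classes with no rules of their own resolve to AL.
# CB is not listed because it has rules associated with it.
DEFAULT_RESOLUTION = {"AI": "AL", "SA": "AL", "SG": "AL", "XX": "AL"}


def resolve_class(line_break_class, tailoring=None):
    """Resolve a line breaking class, preferring an optional tailoring map."""
    if tailoring and line_break_class in tailoring:
        return tailoring[line_break_class]
    return DEFAULT_RESOLUTION.get(line_break_class, line_break_class)


print(resolve_class("SA"), resolve_class("CB"), resolve_class("AI", {"AI": "ID"}))
```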
The following is formally editorial, since it does not change the results of the algorithm. However, it makes it much clearer, and easier to see how the customization would work.
Currently, we have the following rule:
In general, lines should not be broken inside numbers of the form described by the following regular expression:
PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?
Examples: $(12.35) 2,1234 (12)¢ 12.54¢
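As a rough check, the expression can be transcribed into a Python regular expression over sequences of line breaking class names. This is a sketch only: the function name is hypothetical, and the class sequences used for the examples are hand-assigned for illustration (for instance, "$(12.35)" is assumed to be PR OP NU NU IS NU NU CL).

```python
import re

# Each class name is followed by one space so the pattern works on whole tokens.
NUMBER = re.compile(r"(PR )?((OP|HY) )?NU ((NU|SY|IS) )*(CL )?(PO )?")


def is_number(classes):
    """True if the whole class sequence matches the number expression."""
    text = "".join(c + " " for c in classes)
    return NUMBER.fullmatch(text) is not None


print(is_number(["PR", "OP", "NU", "NU", "IS", "NU", "NU", "CL"]))  # $(12.35)
print(is_number(["OP", "NU", "NU", "CL", "PO"]))                    # (12)¢
print(is_number(["PR", "AL"]))                                      # not a number
```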
The default line breaking algorithm approximates this with the following rule, together with PR × AL and PR × ID, which handle numeric prefix punctuation. Note that some cases are already handled above, like ‘9,’, ‘[9’. For a tailoring that supports the regular expression directly, see Section 8.2, Examples of Customization.
LB 18 Do not break between the following pairs of classes.
CL × PO
HY × NU
IS × NU
NU × NU
NU × PO
PR × AL
PR × HY
PR × ID
PR × NU
PR × OP
SY × NU
This makes it a bit clumsy to customize, since PR × AL and PR × ID are mixed in, even though they don't really have anything to do with numbers. We recommend removing those from rule 18 and moving them into a new rule 17b:
17b. Do not break prefix signs (such as currency) from letters
In Customization Example 6, include the precise syntax that replaces rule 18:
18* Do not break numbers
PR × ( OP | HY )? NU
( OP | HY ) × NU
NU × (NU | SY | IS)
NU (NU | SY | IS)* × (NU | SY | IS | CL)
NU (NU | SY | IS)* CL? × PO
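The tailored rule 18* can be sketched as a list of (left context, right context) patterns evaluated at each position. This is an illustrative transcription, not the CLDR syntax: the names `RULES_18` and `no_break_positions` are made up, classes are assumed to be two-letter names, and only rule 18* is modeled.

```python
import re

# (left context, right context); each class name is followed by one space.
RULES_18 = [
    (r"PR ", r"((OP|HY) )?NU "),
    (r"(OP|HY) ", r"NU "),
    (r"NU ", r"(NU|SY|IS) "),
    (r"NU ((NU|SY|IS) )*", r"(NU|SY|IS|CL) "),
    (r"NU ((NU|SY|IS) )*(CL )?", r"PO "),
]


def no_break_positions(classes):
    """Return each index i where rule 18* forbids a break before classes[i]."""
    result = []
    for i in range(1, len(classes)):
        left = "".join(c + " " for c in classes[:i])
        right = "".join(c + " " for c in classes[i:])
        for left_pat, right_pat in RULES_18:
            # Left context must end at the position; right context must start there.
            if re.search(r"\b(?:" + left_pat + r")$", left) and re.match(right_pat, right):
                result.append(i)
                break
    return result


# Assumed class sequence for "$(12.35)": PR OP NU NU IS NU NU CL
print(no_break_positions(["PR", "OP", "NU", "NU", "IS", "NU", "NU", "CL"]))
```

Under these assumptions, every interior position of the example number is protected, while a sequence such as AL NU is not affected by this rule.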
To maintain rule numbers over versions, we have introduced notation like 18b, 18c,... The goal is to maintain stability for rules, so that references can be made to, say, Rule 17 of UAX #14, without having to cite the exact version, because the reference wants to be general, not tied to a specific version. If we simply renumbered each time we introduced a rule, that would make such references difficult and clumsy.
However, when we put this into CLDR, it turned out to be easier to use decimal notation to refer to rules. The main advantage is that it allows for arbitrary insertion between existing numbers. Thus, if we had 18.1 and 18.2 (instead of 18b and 18c), then we could later introduce 18.15, whereas we currently have no mechanism for inserting a new rule between 18b and 18c. We might think about that for the text as well.
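A short sketch shows why decimal labels make insertion easy: ordering them numerically places a newly added label between its neighbors with no renumbering. The function name is hypothetical, and the labels are assumed to be plain decimal strings.

```python
from decimal import Decimal


def rule_order(labels):
    """Sort decimal rule labels numerically."""
    return sorted(labels, key=Decimal)


print(rule_order(["18.2", "17", "18.15", "18.1", "18"]))
# ['17', '18', '18.1', '18.15', '18.2']
```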