Submitted for UTC consideration by: Asmus Freytag
September 22, 2009
Issue a corrigendum to UAX#14, applicable from version 3.0.0, changing rule LB8 from
LB8. Break after zero-width space.
ZW ÷
to
LB8. Break before character following a zero-width space, even if one or more spaces intervene.
ZW SP* ÷
Note: the line break classes in this rule are ZW, which is the ZWSP character, and SP, which is the SPACE character; both are single-character line break classes. As usual, the character ÷ means a break is allowed at that location, and * means "zero or more".
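As a minimal sketch of the revised rule (not normative text), the check below takes a list of line break class names and reports whether the pattern ZW SP* immediately precedes a given position. The function name and the list-of-strings representation are assumptions made for illustration. The sketch models LB8 only; a real implementation would apply LB7 (x SP) first, since it has higher priority.

```python
def lb8_allows_break(classes, i):
    """Return True if revised LB8 (ZW SP* ÷) matches just before classes[i].

    `classes` is a list of line break class names such as "ZW", "SP", "CL".
    Illustrative helper only: it ignores every rule other than LB8.
    """
    j = i - 1
    # Skip back over the optional run of spaces (the SP* part of the rule).
    while j >= 0 and classes[j] == "SP":
        j -= 1
    # The run of spaces must itself be preceded by a zero-width space.
    return j >= 0 and classes[j] == "ZW"
```

For example, `lb8_allows_break(["ZW", "SP", "CL"], 2)` is true under the revised rule, whereas the original ZW ÷ would only match at the position directly after the ZW.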
For the effect of the corrigendum and why it is needed, see the following background information.
This has three consequences:
1) As Eric reported:
Consider the input ZW CL, and let's determine if there is a break before the CL. Rule LB8 provides the answer: ZW ÷ CL. Consider the input ZW SP CL, still for the position before the CL. This time, LB13 provides the answer: ZW SP x CL.
The same applies if one replaces CL by EX, IS or SY (the other classes treated equivalently by LB13) or WJ (handled similarly in LB11).
Note that line break rules are always invoked in ascending order. Therefore, rule LB7 (x SP) prevents a break before the SP with higher priority than LB8 (ZW ÷) allows a break after the ZW. (÷ means a break is allowed; x means no break.)
As currently written, these rules would mean that adding a SP to a string would *remove* a line break opportunity in this example. That's completely counterintuitive, and clearly a bug in the specification.
2) As Eric further noted, "that mechanism [i.e. SP removing a line break opportunity] does not exist [in the pair table implementation], hence the example pair table does not implement the rules."
As far as I know, this is the only unreconciled difference between the pair table and the rules, and it's not by design. The intent is to have these two agree with each other, and also to have the ZW always result in a break opportunity.
3) Creating the corrigendum as proposed would allow any implementations that have followed the pair table to claim conformance to the proper version of UAX#14 with the corrigendum. This is especially useful as what they implement, which matches the proposed behavior, is more in line with user expectations.
A simple addition of SP* to LB8 would reconcile the rules with the pair table and at the same time replace the strange, counterintuitive behavior with something more regular and in keeping with the rest of the design.
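The interaction described in consequence 1 can be reproduced with a small sketch that applies a subset of the rules in ascending order, with the first matching rule deciding. Only the rule fragments named in this document are modeled (LB7 as x SP, LB8, LB11 as x WJ, LB13, LB18); the function name, the `revised_lb8` flag, and the data representation are assumptions for illustration, not part of UAX#14.

```python
NO_BREAK, BREAK = "x", "÷"
LB13_CLASSES = {"CL", "EX", "IS", "SY"}  # classes named in this document

def decide(classes, i, revised_lb8):
    """Break status before classes[i] (i >= 1); first matching rule wins.

    Returns BREAK, NO_BREAK, or None if this rule subset does not decide.
    """
    before, after = classes[i - 1], classes[i]
    # LB7: x SP
    if after == "SP":
        return NO_BREAK
    # LB8: either the original ZW ÷ or the proposed ZW SP* ÷
    if revised_lb8:
        j = i - 1
        while j >= 0 and classes[j] == "SP":
            j -= 1
        if j >= 0 and classes[j] == "ZW":
            return BREAK
    elif before == "ZW":
        return BREAK
    # LB11: x WJ
    if after == "WJ":
        return NO_BREAK
    # LB13: x CL, x EX, x IS, x SY
    if after in LB13_CLASSES:
        return NO_BREAK
    # LB18: SP ÷
    if before == "SP":
        return BREAK
    return None

# Original LB8: adding a SP removes the break opportunity before CL.
print(decide(["ZW", "CL"], 1, revised_lb8=False))
print(decide(["ZW", "SP", "CL"], 2, revised_lb8=False))
# Revised LB8: the break opportunity is kept.
print(decide(["ZW", "SP", "CL"], 2, revised_lb8=True))
```

With the original rule, the second call falls through LB8 and is decided by LB13; with the revised rule, LB8 decides before LB13 is reached.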
Change LB8 from ZW ÷ to ZW SP* ÷
There are conformance and practical implications.
The practical implications are limited, because the current rules, applied literally, exhibit counterintuitive behavior that, furthermore, occurs only in rather contrived contexts. Such behavior is unlikely to be an outcome deliberately desired by a document author. ZW is typically applied between letters of some sort, not in front of space characters that are followed by closing or terminal punctuation. Whenever it is applied, the intent is to cause a break, which the current rules don't allow in these contexts.
The formal conformance implication is that all existing implementations, from 3.0, that were based on the pair table, while doing the "right thing", are formally non-conformant. Further, such implementations can't be cheaply made conformant, because the pair table can't express the concept of "add a space to remove a line break opportunity" without redesign of the driver code and table architecture.
A corrigendum gives these existing implementations a formal conformance target.
For rule-based and regex-based implementations, implementing the proposed fix means a localized change.
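To illustrate how localized the change can be in a regex-based implementation, the sketch below encodes the classes preceding a position as a space-separated string and expresses LB8 as a single pattern. The encoding, the pattern names, and the helper function are hypothetical; only the pattern itself differs between the two versions of the rule. As elsewhere, LB7 (x SP) is assumed to have been applied before LB8 is consulted.

```python
import re

# Hypothetical encoding: class names joined by single spaces, e.g. "ZW SP".
OLD_LB8 = re.compile(r"(?:^| )ZW$")           # ZW ÷
NEW_LB8 = re.compile(r"(?:^| )ZW(?: SP)*$")   # ZW SP* ÷

def lb8_matches(pattern, classes, i):
    """True if `pattern` matches the classes that precede position i."""
    return bool(pattern.search(" ".join(classes[:i])))

seq = ["ZW", "SP", "CL"]
print(lb8_matches(OLD_LB8, seq, 2))  # position before CL, original rule
print(lb8_matches(NEW_LB8, seq, 2))  # position before CL, revised rule
```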
The proposed corrigendum changes LB8 from ZW ÷ to ZW SP* ÷, so it's necessary to investigate the interactions with all the other rules up to LB18, which is the one that handles all other breaks after SP. That rule (LB18) is SP ÷, so there's no interaction with either new or old LB8. (See the appendix below on "how to verify interaction between rules".)
The rules where there are interactions are the ones cited by Eric, LB11 and LB13. Those are the only ones with higher priority than LB18 where there is a leading "x" in the rule (for example x CL or x WJ). Those two rules, LB13 and LB11, in interaction with LB7 describe the contexts that should be affected by this change, so that interaction is by design.
All other rules are unaffected. Either they occur below SP ÷, or they don't start with 'x'.
Move LB8 before LB7 (that is, renumber to LB6a).
This option is inferior on three important counts.
First: it would allow a break opportunity before a SP character. This would add a new design element, because spaces are otherwise elided when they occur at a line break opportunity. The only way to break a line before a space is by using a hard line break. However, hard line breaks are not break opportunities, but mandatory breaks, which break a line no matter whether it would fit the width of the margins.
With ZW, you do not get a hard break (which always takes effect), but a break opportunity (which only manifests itself when you need to wrap the line there).
With the alternative option, there would be instances where lines wrap and where the second line inexplicably starts with a space, or run of spaces, just because there's a ZW. This sounds like a cool "feature" but it really goes against the whole tradition and rationale for line wrapping.
Lines are broken, because they don't fit the margins. When a line has spaces at the line break point, the spaces are elided (as if they were removed, or left hanging over the margin invisibly). The new line starts with the first non-space character. The alternative option would introduce fundamentally new behavior, not because it's needed, but merely to fix an arcane bug in the rules.
Second: It violates the bug fixing equivalent of Occam's razor. The bug is that "adding space, removes a line break opportunity" in a few, limited contexts. There's no need to suddenly support entirely new break opportunities, as in sequences like:
ZW ÷ SP ÷ ZW ÷ SP ÷ ZW
Third: As proposed at the top of this document, the new rule can be implemented by the pair table. In fact, it has been implemented by the pair table since 3.0. Changing LB8 so it becomes ZW SP* ÷ and issuing a corrigendum would bring both specifications (rules and table) into alignment. In contrast, the alternative cannot be realized with the pair table without making some substantial addition to the pair table architecture.
The reason for that limitation is that the pair table is based on the underlying design concept of always eliding spaces at the line breaks. (It is more than likely that any other implementation architecture that handles SP explicitly as a special case, would be adversely affected by any reordering of LB7 and LB8.)
Conclusion
After investigating the reported bug and alternative options, including a detailed discussion with Eric Muller, Andy Heninger, and Mark Davis, I recommend the corrigendum proposed above.
Appendix: How to verify interaction between rules
A rule ending in ÷ overrides any later rule starting with x, but can't affect earlier rules, or later rules that are of the form B x A or B ÷ A, or of the form B ÷. Among rules LB9 - LB17 there are two that start with x. Those are the ones Eric gave in his bug report. They are LB11 (x WJ), and rule LB13, which is
x CL
x EX
etc. For any class C in either of these rules, we now have (in 5.2.0) ZW ÷ C but also ZW x SP x C. The latter is the part that is counterintuitive and should be fixed. All other rules and character classes are unaffected by the proposal.