UAX #29: Ambiguities in WB4, and contributing back testcases
richard.wordingham at ntlworld.com
Thu Dec 22 15:08:22 CST 2016
On Wed, 21 Dec 2016 15:24:21 -0800
Manish Goregaokar <manish at mozilla.com> wrote:
> Aside from that, WB4's greediness is underspecified. In previous
> versions, the rule was
> However, now the rule is
> > X (Extend | Format | ZWJ)* → X
> The problem here is that ZWJ also appears in the preceding rule, WB3c,
> > ZWJ × (Glue_After_Zwj | EBG)
> which says that we should not break between a ZWJ and a following GAZ
> ("Glue After ZWJ") or EBG character.
> WB3c has precedence over WB4, which means that a sequence like
> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ×EBG` *first*, before the
> ZWJ is collapsed into the Emoji_Base. This is fine.
> However, more complicated sequences depend on the greediness of the
> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
> since we have an Extend/ZWJ sequence.
> WB4 can apply in multiple ways. If it is applied greedily, we get
> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
> characters). This does break, since no rule prevents a break between
> Emoji_Base and EBG.
> However, we can apply it conservatively instead. We can get
> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
From your terminology, I think you have an error in your transformation
to a 'regular' expression. Why don't you have the same problem when
you determine word breaks in
CR Extend LF?
I'm guessing that you have some mechanism that makes WB3 (CR × LF)
redundant. Rule WB3c does *not* transform to
ZWJ(...) × (Glue_After_Zwj | EBG)
Naively, I would say that WB4 can be reapplied to `Emoji_Base(..)
ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word break.
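To make that reading concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the UAX #29 algorithm: it models only WB3, WB3a/WB3b, WB3c, WB4 and the final "break everywhere else" rule, the function name `breaks` and the string labels are made up for the example, and the rules are tried in order on each boundary. The point it shows is that WB3 and WB3c look at the two characters actually adjacent to the boundary, so an intervening Extend defeats them, and WB4 then attaches the whole Extend/ZWJ run to the preceding character.

```python
# Hypothetical labels standing in for Word_Break property values.
IGNORABLE = {"Extend", "Format", "ZWJ"}   # skipped by WB4
HARD = {"CR", "LF", "Newline"}            # WB3a / WB3b

def breaks(props):
    """result[i] is True if a break is allowed between props[i] and props[i + 1].

    Only the rules discussed in this thread are modelled; every other
    boundary falls through to "break".
    """
    result = []
    for left, right in zip(props, props[1:]):
        if left == "CR" and right == "LF":
            # WB3: CR × LF, tested on the adjacent pair only.
            result.append(False)
        elif left in HARD or right in HARD:
            # WB3a / WB3b: break after and before CR, LF, Newline.
            result.append(True)
        elif left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:
            # WB3c: ZWJ × (Glue_After_Zwj | EBG), adjacent pair only.
            result.append(False)
        elif right in IGNORABLE:
            # WB4: Extend/Format/ZWJ attaches to whatever precedes it.
            result.append(False)
        else:
            # No modelled rule forbids a break here.
            result.append(True)
    return result

# WB3c applies: the ZWJ is directly followed by EBG.
print(breaks(["Emoji_Base", "ZWJ", "EBG"]))                      # [False, False]
# WB3c does not apply: an Extend sits between ZWJ and EBG.
print(breaks(["Emoji_Base", "Extend", "ZWJ", "Extend", "EBG"]))  # [False, False, False, True]
# Same shape as the CR Extend LF case: WB3 does not apply.
print(breaks(["CR", "Extend", "LF"]))                            # [True, True]
```

On this reading there is no greediness question: the run `Extend ZWJ Extend` is absorbed into Emoji_Base as a whole, and the break before EBG follows.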