Re: UAX #29: Ambiguities in WB4, and contributing back testcases

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 22 Dec 2016 21:08:22 +0000

On Wed, 21 Dec 2016 15:24:21 -0800
Manish Goregaokar <manish_at_mozilla.com> wrote:

> Aside from that, WB4's[6] greediness is underspecified. In previous
> versions, the rule was
<snip>

> However, now the rule is
>
> > X (Extend | Format | ZWJ)* → X
>
> The problem here is that ZWJ appears in the previous rule as well,
> WB3c[7]:
>
> > ZWJ × (Glue_After_Zwj | EBG)
>
> which says that we should not break between a ZWJ and a GAZ ("Glue
> After ZWJ") character.
>
> WB3c has precedence over WB4, which means that a sequence like
> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ×EBG` *first*, before the
> ZWJ is collapsed into the Emoji_Base. This is fine.
>
> However, more complicated sequences depend on the greediness of the
> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
> since we have an Extend/ZWJ sequence.
>
> WB4 can apply in multiple ways. If it is applied greedily, we get
> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
> characters). This does break, since no rule prevents a break between
> Emoji_Base and EBG.
>
> However, we can apply it conservatively instead. We can get
> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
> collapse.

From your terminology, I think you have an error in your transformation
to a 'regular' expression. Why don't you have the same problem when
you determine word breaks in

CR Extend LF?

I'm guessing that you have some mechanism that makes WB3 (CR × LF)
redundant. Like WB3, rule WB3c comes before WB4, so it applies only to
directly adjacent characters; it does *not* transform to

ZWJ(...) × (Glue_After_Zwj | EBG)

Naively, I would say that WB4 can be reapplied to `Emoji_Base(..)
ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word break.
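
To make that naive reading concrete, here is a minimal sketch in Python.
It is an illustration only, not a conforming UAX #29 implementation: it
models just WB3, WB3a, WB3b, WB3c, WB4 and the final "break everywhere
else" rule, and it takes word-break classes as plain strings. The
function names and the class spellings (`Emoji_Base`, `EBG`, ...) are
assumptions chosen to match this thread.

```python
# Hypothetical helper names; classes are plain strings, not real property lookups.
NEWLINES = {"CR", "LF", "Newline"}
IGNORED = {"Extend", "Format", "ZWJ"}


def collapse_wb4(classes):
    """Apply WB4 (X (Extend | Format | ZWJ)* -> X) to a list of classes.

    An ignorable character is absorbed unless it is at the start of the
    text or directly follows CR, LF or Newline.
    """
    out = []
    prev = None  # raw class of the previous character; None means start of text
    for c in classes:
        absorbed = c in IGNORED and prev is not None and prev not in NEWLINES
        if not absorbed:
            out.append(c)
        prev = c
    return out


def break_opportunities(classes):
    """Return one bool per boundary between adjacent characters (True = break).

    WB3 and WB3c are tested on the raw, directly adjacent pair, because
    they are listed before WB4.
    """
    result = []
    for left, right in zip(classes, classes[1:]):
        if left == "CR" and right == "LF":                      # WB3
            result.append(False)
        elif left in NEWLINES or right in NEWLINES:             # WB3a, WB3b
            result.append(True)
        elif left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:  # WB3c
            result.append(False)
        elif right in IGNORED:                                  # WB4: stays attached
            result.append(False)
        else:
            # Every later rule (WB5 onwards) is omitted from this sketch,
            # so anything reaching this point is treated as a break.
            result.append(True)
    return result


seq = ["Emoji_Base", "Extend", "ZWJ", "Extend", "EBG"]
print(collapse_wb4(seq))         # ['Emoji_Base', 'EBG']
print(break_opportunities(seq))  # [False, False, False, True]
print(break_opportunities(["CR", "Extend", "LF"]))  # [True, True]
```

Under these assumptions the only break in the five-character sequence
falls immediately before EBG, and `CR Extend LF` breaks on both sides of
the Extend because WB3 only sees a CR that is directly followed by LF.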

Richard.
Received on Thu Dec 22 2016 - 15:08:22 CST
