UAX #29: Ambiguities in WB4, and contributing back testcases from Manish Goregaokar on 2016-12-21 (Unicode Mail List Archive)

From: Manish Goregaokar <manish_at_mozilla.com>
Date: Wed, 21 Dec 2016 15:24:21 -0800

Hi,

We've been implementing[1] the Unicode 9 version of UAX #29[2] in
Rust, and came across some ambiguities and issues.

One issue is that the tests[3] are a bit lacking. They don't handle
cases with multiple flag emoji, for example (the handling of which
changed since Unicode 8). We have a couple of test cases[4][5] for these
things (and may create more). Is there any way to contribute them
back?
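
To make the flag case concrete, here is a minimal sketch of the kind of
check such a test would make. It is not the code from [4] or [5], and
`split_flags` is a hypothetical helper that models only the Unicode 9
pairing of Regional_Indicator characters (WB15/WB16); every other
character is simply treated as its own piece, which is not real word
segmentation.

```rust
/// Regional indicator symbols occupy U+1F1E6..=U+1F1FF.
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Hypothetical helper: groups consecutive regional indicators into
/// pairs, breaking between pairs. Any other character becomes its own
/// piece (a simplification for illustration only).
fn split_flags(s: &str) -> Vec<String> {
    let mut pieces: Vec<String> = Vec::new();
    let mut ri_run = 0usize; // regional indicators seen in the current run
    for c in s.chars() {
        if is_regional_indicator(c) {
            if ri_run % 2 == 1 {
                // Second half of a pair: attach to the piece just started.
                pieces.last_mut().expect("a pair was started").push(c);
            } else {
                pieces.push(c.to_string());
            }
            ri_run += 1;
        } else {
            ri_run = 0;
            pieces.push(c.to_string());
        }
    }
    pieces
}
```

Under this sketch, four regional indicators in a row come out as two
flags, and a trailing odd one is left on its own.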

Aside from that, WB4's[6] greediness is underspecified. In previous
versions, the rule was

> X (Extend | Format)* → X

which means that you can "collapse" the following Extend/Format
characters into the character itself, without changing the state you're
in. This just worked because Extend/Format characters only appeared
in this rule.

However, now the rule is

> X (Extend | Format | ZWJ)* → X

The problem here is that ZWJ also appears in an earlier rule, WB3c[7]:

> ZWJ × (Glue_After_Zwj | EBG)

which says that we should not break between a ZWJ and a GAZ ("Glue
After ZWJ") or EBG character.

WB3c has precedence over WB4, which means that a sequence like
`Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ×EBG` *first*, before the
ZWJ is collapsed into the Emoji_Base. This is fine.

However, more complicated sequences depend on the greediness of the
Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
ZWJ Extend EBG`. WB3c does not apply here directly, since the ZWJ is
not immediately followed by the EBG. However, WB4 can apply since we
have an Extend/ZWJ sequence.

WB4 can apply in multiple ways. If it is applied greedily, we get
`Emoji_Base(..) EBG` (where the ellipsis denotes WB4-collapsed
characters). This does break, since no rule prevents a break between
Emoji_Base and EBG.

However, we can apply it conservatively instead. We then get
`Emoji_Base(..) ZWJ(..) EBG`, which does match WB3c, so there is no
break before the EBG.
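
The two readings can be put side by side in a small sketch. This is an
illustration of the ambiguity only, not the spec's algorithm and not
the unicode-segmentation implementation: the `Wb` enum, both collapse
functions, and `breaks_before_last` are hypothetical names, and the
only rules modelled are WB4 (in its two readings) and WB3c, with every
other boundary treated as a break.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Wb { EBase, Ebg, Gaz, Extend, Format, Zwj }

/// Greedy reading of WB4: X absorbs the whole following run of
/// Extend/Format/ZWJ, so no ZWJ survives for WB3c to see.
fn collapse_greedy(seq: &[Wb]) -> Vec<Wb> {
    let mut out = Vec::new();
    for &p in seq {
        let ignorable = matches!(p, Wb::Extend | Wb::Format | Wb::Zwj);
        if ignorable && !out.is_empty() { continue; }
        out.push(p);
    }
    out
}

/// Conservative reading (an assumption about what "conservative" means):
/// a ZWJ is kept as its own X when, skipping the Extend/Format after it,
/// the next character is GAZ or EBG; otherwise it is absorbed.
fn collapse_conservative(seq: &[Wb]) -> Vec<Wb> {
    let mut out = Vec::new();
    for (i, &p) in seq.iter().enumerate() {
        if out.is_empty() { out.push(p); continue; }
        match p {
            Wb::Extend | Wb::Format => continue,
            Wb::Zwj => {
                let next = seq[i + 1..].iter().copied()
                    .find(|&q| !matches!(q, Wb::Extend | Wb::Format));
                if matches!(next, Some(Wb::Gaz) | Some(Wb::Ebg)) { out.push(p); }
            }
            _ => out.push(p),
        }
    }
    out
}

/// WB3c on the collapsed sequence: no break between ZWJ and GAZ/EBG.
/// Every other boundary is treated as a break in this sketch.
fn breaks_before_last(collapsed: &[Wb]) -> bool {
    match collapsed {
        [.., Wb::Zwj, Wb::Gaz] | [.., Wb::Zwj, Wb::Ebg] => false,
        _ => true,
    }
}
```

For `Emoji_Base Extend ZWJ Extend EBG`, the greedy function leaves
`Emoji_Base EBG` and reports a break before the EBG, while the
conservative one leaves `Emoji_Base ZWJ EBG` and reports none.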

These aren't really sequences that will occur in practice (I think?),
but I think it's important that implementations don't differ in their
behavior and segment things differently. If we don't actually care
about this, I think this ambiguity should at least be called out
explicitly in the spec.

The new WB4 makes the word break steps loop in on themselves. Previously
you just had to pattern match each boundary between code points against
the rules in order; the boundaries could be handled in any order and
produce the same result. Now that the replacement rule can change
the structure of the string that the earlier rules see, the algorithm is
suddenly dependent on the order and fashion in which WB4 is applied.

Could this be clarified?

Thanks,

-Manish Goregaokar

[1]: https://github.com/unicode-rs/unicode-segmentation/pull/10
[2]: http://www.unicode.org/reports/tr29/ (permalink:
http://www.unicode.org/reports/tr29/tr29-29.html)
[3]: http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt
[4]: https://github.com/unicode-rs/unicode-segmentation/blob/8bac7c72ddd70426acfe1e58545cdd1694c61d88/src/test.rs#L94
[5]: https://github.com/unicode-rs/unicode-segmentation/blob/8bac7c72ddd70426acfe1e58545cdd1694c61d88/src/test.rs#L19
[6]: http://www.unicode.org/reports/tr29/#WB4
[7]: http://www.unicode.org/reports/tr29/#WB3c

Received on Wed Dec 21 2016 - 18:08:30 CST
