UAX #29 and WB4 from Daniel Bünzli via Unicode on 2020-03-04 (Unicode Mail List Archive)

From: Daniel Bünzli via Unicode <unicode_at_unicode.org>
Date: Wed, 4 Mar 2020 18:01:25 +0100

Hello,

My implementation of word break chokes only on the following test case from the file [1]:

÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]

I find:

÷ 0020 × 0308 × 0020 ÷

Basically, my implementation uses WB4 to rewrite the first two characters to WSegSpace and then applies WB3d, resulting in the non-break between 0308 and 0020.

Re-reading the text, I suspect that when a WB4 rewrite occurs I should not restart from the first rule but only apply the rules that follow WB4. Is that correct?
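
To make the two readings concrete, here is a minimal sketch in Python. It is a toy model, not a conforming UAX #29 implementation: it only knows the classes needed for the test case above (WSegSpace, Extend, and a catch-all Other), it ignores the sot/CR/LF/Newline exceptions of WB4, and the names word_break_class, boundaries, render and the restart_after_wb4 flag are all made up for illustration.

```python
# Toy model of the two readings of WB4. Only WB1, WB2, WB3d, WB4 and WB999
# are modelled; everything else in UAX #29 is deliberately left out.
WSEG, EXTEND, OTHER = "WSegSpace", "Extend", "Other"


def word_break_class(cp):
    # Hypothetical helper: a real implementation would consult the
    # Word_Break property data instead of hard-coding two code points.
    if cp == 0x0020:
        return WSEG
    if cp == 0x0308:
        return EXTEND
    return OTHER


def boundaries(cps, restart_after_wb4):
    """Return one boolean per boundary position (True means break)."""
    if not cps:
        return []
    classes = [word_break_class(cp) for cp in cps]
    breaks = [True]  # WB1: break at start of text
    for i in range(1, len(classes)):
        left, right = classes[i - 1], classes[i]
        # WB3d precedes WB4, so it looks at the raw neighbours.
        if left == WSEG and right == WSEG:
            breaks.append(False)
            continue
        # WB4: never break before an Extend character.
        if right == EXTEND:
            breaks.append(False)
            continue
        # WB4 rewrite: skip back over Extend to find the effective left class.
        j = i - 1
        while j > 0 and classes[j] == EXTEND:
            j -= 1
        effective_left = classes[j]
        # First reading: rerun WB3d on the rewritten sequence.
        if restart_after_wb4 and effective_left == WSEG and right == WSEG:
            breaks.append(False)
            continue
        # Second reading: only rules after WB4 see the rewrite; in this toy
        # model the only such rule is WB999 (break everywhere else).
        breaks.append(True)
    breaks.append(True)  # WB2: break at end of text
    return breaks


def render(cps, breaks):
    # Format the result in the notation used by WordBreakTest.txt.
    parts = []
    for cp, is_break in zip(cps, breaks):
        parts.append("÷" if is_break else "×")
        parts.append(f"{cp:04X}")
    parts.append("÷" if breaks[-1] else "×")
    return " ".join(parts)


case = [0x0020, 0x0308, 0x0020]
print(render(case, boundaries(case, restart_after_wb4=True)))
print(render(case, boundaries(case, restart_after_wb4=False)))
```

With restart_after_wb4=True the sketch reproduces what I find (÷ 0020 × 0308 × 0020 ÷); with restart_after_wb4=False it reproduces the line expected by the test file (÷ 0020 × 0308 ÷ 0020 ÷).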

Best,

Daniel

[1]: https://unicode.org/Public/13.0.0/ucd/auxiliary/WordBreakTest.txt
Received on Wed Mar 04 2020 - 11:27:12 CST