You said:
> So ignore it and test whether the last symbol glues with ZWJ (it should,
> so there's no break in the reference implementation).
This makes me think you misread the example I quoted. There is a break
in the reference implementation, though I argue (as you just did) that
there shouldn't be. So I think you agree with me that it's broken.
Otherwise, I'm not sure I fully understand what you are saying, but if
it is correct, then by the same logic other test cases would fail too,
specifically:
÷ 0061 × 2060 × 0030 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALetter) ×
[4.0] WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]
After the Format character (Format_FE) there's no break here, because rule 4 gives:
ALetter Format Numeric -> ALetter Numeric
which, following rule 9.0, is a no-break.
This is exactly rule 4 as described in my previous email, just with a
different follow-up rule (9 instead of 3c). I don't see how rule
precedence would matter here, as there is no case in which two rules apply.
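To make my reading concrete, here is a minimal Python sketch. It is not
the reference implementation and not libunibreak's code; every name in it
is made up for illustration, and it only models rules 3c, 4, 9 and 999 on
lists of property names as they appear in the test data comments. It
assumes rule 4 is applied first, as a rewrite of the class sequence, and
that the remaining rules are then tested on the reduced sequence.

```python
# Sketch only: models rules 3c, 4, 9 and 999 of UAX #29 as I read them.
# All names are hypothetical; this is not libunibreak's API.

IGNORABLE = {"Extend", "Format", "ZWJ"}
NO_SKIP_AFTER = {"CR", "LF", "Newline"}  # rule 4 does not apply after these


def collapse_rule4(classes):
    """Rule 4: X (Extend | Format | ZWJ)* -> X, for X not CR/LF/Newline."""
    out = []
    for cls in classes:
        if out and out[-1] not in NO_SKIP_AFTER and cls in IGNORABLE:
            continue  # absorbed into the preceding X
        out.append(cls)
    return out


def no_break(left, right):
    """True if rule 3c or rule 9 forbids a break between left and right."""
    if left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:  # rule 3c
        return True
    if left in {"ALetter", "Hebrew_Letter"} and right == "Numeric":  # rule 9
        return True
    return False  # rule 999: break everywhere else


def decisions(classes):
    """Return (left, right, mark) per boundary of the reduced sequence."""
    reduced = collapse_rule4(classes)
    return [(a, b, "×" if no_break(a, b) else "÷")
            for a, b in zip(reduced, reduced[1:])]


print(decisions(["ALetter", "Format", "Numeric"]))
print(decisions(["ZWJ", "Extend", "Glue_After_Zwj"]))
print(decisions(["ZWJ", "Extend", "EBG"]))
```

Under this reading each of the three inputs reduces to a single pair with
a × between its two members, which is why I expect rule 3c rather than
rule 999 to decide the two failing tests.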
--
Tom.

On 23/11/16 02:49, Philippe Verdy wrote:
> IMHO, the ZWJ should glue with the last symbol following your examples.
> But the combining diaeresis following the ZWJ extends it (even if in my
> opinion it is "defective" and would likely display on a dotted circle
> in renderers, but not defective for the string definition of combining
> sequences).
> So ignore it and test whether the last symbol glues with ZWJ (it should,
> so there's no break in the reference implementation).
>
> WB4: X (Extend | Format | ZWJ)* → X
>
> Extend: [ExtendGrapheme_Extend=Yes] This includes:
> General_Category = Nonspacing_Mark (this includes the combining diaeresis)
> General_Category = Enclosing_Mark
> U+200C ZERO WIDTH NON-JOINER
> plus a few General_Category = Spacing_Mark needed for canonical
> equivalence.
>
> So yes we have: ZWJ "COMBINING DIAERESIS" (EBG|Glue_After_Zwj) → ZWJ
> (EBG|Glue_After_Zwj) from rule WB4 eliminate the combining mark from the
> input queue
>
> But rule WB3c comes before and prohibits it:
>
> WB3c: ZWJ × (Glue_After_Zwj | EBG)
>
> This means that you have first:
>
> ZWJ "COMBINING DIAERESIS" GAZ → ZWJ × "COMBINING DIAERESIS" EBG
>
> and this does not match the rule WB4 which is not matching for:
>
> X × (Extend | Format | ZWJ)* → X
>
> (it cannot remove the extenders if there's a no-break before them, it is
> valid only when the break opportunity is still unspecified. As soon as a
> rule has produced a "break here" or "nobreak here" at a given position,
> you must advance after this position (the rules are based on a small
> finite state machine).
> So after:
>
> ZWJ "COMBINING DIAERESIS" GAZ → ZWJ × "COMBINING DIAERESIS" EBG
>
> it just remains in your input queue:
>
> "COMBINING DIAERESIS" EBG (because "ZWJ ×" is already processed, and so
> ZWJ is eliminated)
>
> Now comes WB4: X (Extend | Format | ZWJ)* → X
>
> There's no more any "X" to match before the combining diaeresis: your
> input queue starts by the combining diaeresis matching "X", the
> following character (EBG) does not match within "(Extend | Format |
> ZWJ)*" (which matches an empty string and does not contain the combining
> diaeresis already matched in "X"), rule WB4 has then no replacement
> effect and preserves the initial "X" (i.e. the combining diaeresis).
>
> 2016-11-22 13:07 GMT+01:00 Tom Hacohen <tom_at_osg.samsung.com>:
>
>     Dear,
>
>     I recently updated libunibreak[1] according to unicode 9.0.0. I
>     thought I implemented it correctly, however it fails against two of
>     the tests in the reference test data:
>
>     ÷ 200D × 0308 ÷ 2764 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
>     COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
>     (Glue_After_Zwj) ÷ [0.3]
>
>     and
>
>     ÷ 200D × 0308 ÷ 1F466 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) ×
>     [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
>
>     More specifically, it fails in both after the "combining diaeresis".
>     My implementation marks it as a break, whereas the test data as not.
>     The reference implementation, as expected, agrees with the test data.
>
>     However, looking at the test case and the UAX[2], this does not look
>     correct. More specifically, because of rule 4:
>     ZWJ Extended GAZ -> ZWJ GAZ
>     And then according to rule 3c, there should be no break opportunity
>     between them. The reference implementation, however, uses rule 999
>     here, which I believe is incorrect.
>
>     Am I missing anything, or is this an issue with the reference test
>     data and reference implementation?
>
>     Thanks,
>     Tom.
>
>     [1]: https://github.com/adah1972/libunibreak
>     [2]: http://www.unicode.org/reports/tr29/#WB1

Received on Wed Nov 23 2016 - 03:13:51 CST