Additional Word Break Questions
Cameron Dutro via CLDR-Users
cldr-users at unicode.org
Thu Aug 17 11:21:13 CDT 2017
Just wanted to bump this thread since I haven't received any responses yet.
For the time being I've deleted the test cases in question from my test
suite, but I'd like to understand more, since I'm not convinced my
implementation is correct.
On Tue, Aug 8, 2017 at 6:06 PM Cameron Dutro <cameron at lumoslabs.com> wrote:
> Dear CLDR users,
> As you may recall, I emailed this list a few months ago with a question
> about the word break rules, and today I've run into several more of what I
> think are disagreements between the word break rules and the published word
> break test cases.
> *First Issue*
> This is the word break test case in question: ÷ 200D ÷ 261D ÷
> It would appear that rule 3.3 matches at index 1, i.e. the index between
> the two characters. Rule 3.3 is: $ZWJ × ($Extended_Pict | $EmojiNRK)
> Character 200D has word break property values of Extend and ZWJ, while
> character 261D has a word break property value of E_Base. Therefore, the
> left-hand side of rule 3.3 matches 200D and the right-hand side matches
> 261D. Since the rule indicates no break, I'm confused by the presence of
> this test case, which expects a break there. What am I doing wrong here?
> *Second Issue*
> The other test cases my implementation is failing to pass are these:
> ÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 1F1E7 × 200D ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 × 1F1E9 ÷ 0062 ÷
> In all cases, the issue lies with the expected non-break between the
> second and third characters, e.g. 1F1E6 and 1F1E7. The word break property
> value of both these characters is Regional_Indicator. The only rule that
> looks like it might match is 15: ^$Regional_Indicator ×
> $Regional_Indicator. However, rule 15 does not match.
> Thanks for your help in advance!
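
The pairing that the first and fourth test cases in the second issue expect can be sketched as below. This is only an illustration, not the CLDR rule engine. It assumes the UAX #29 wording of WB15 and WB16 (sot (RI RI)* RI × RI and [^RI] (RI RI)* RI × RI), where the run of regional indicators may start either at the start of text or after any non-RI character. The names is_ri and ri_no_break are hypothetical. The sketch ignores WB4 (Extend/Format/ZWJ skipping), so it does not cover the two test cases that contain 200D.

```python
# Regional Indicator Symbol Letters A..Z (Word_Break=Regional_Indicator).
RI_START, RI_END = 0x1F1E6, 0x1F1FF


def is_ri(cp):
    """Return True if the code point is a regional indicator."""
    return RI_START <= cp <= RI_END


def ri_no_break(cps, i):
    """Return True if RI pairing forbids a break between cps[i-1] and cps[i].

    Assumption: a break is forbidden only when the contiguous run of
    regional indicators ending at cps[i-1] has odd length, i.e. cps[i]
    completes a pair. WB4 is deliberately not handled here.
    """
    if not (0 < i < len(cps) and is_ri(cps[i - 1]) and is_ri(cps[i])):
        return False
    run = 0
    j = i - 1
    while j >= 0 and is_ri(cps[j]):
        run += 1
        j -= 1
    return run % 2 == 1


# Code points from the first test case of the second issue.
case = [0x0061, 0x1F1E6, 0x1F1E7, 0x1F1E8, 0x0062]
for i in range(1, len(case)):
    mark = "×" if ri_no_break(case, i) else "÷"
    print(f"{case[i - 1]:04X} {mark} {case[i]:04X}")
```

Under that assumption the leading 0061 does not prevent the pairing, because the run of regional indicators is counted from the nearest non-RI character rather than only from the start of the text.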