UAX #9 (Bidirectional algorithm) reference implementations from Fabian Giesen on 2016-12-08 (Unicode Mail List Archive)

From: Fabian Giesen <fabiang_at_radgametools.com>
Date: Thu, 8 Dec 2016 18:41:35 -0800

I'm currently implementing the bidirectional algorithm and, while
testing my version, ran into some issues with the provided reference
implementations (http://www.unicode.org/Public/PROGRAMS/).

1. BidiReferenceJava supports Unicode 6.3.0, but has not been updated
for later versions.

In particular, the changes from revision 33 of UAX#9 (corresponding to
Unicode 8.0.0; most notably, limitation of maximum depth for nested
brackets in the PBA, and the rules for handling NSMs following brackets
in rule N0) are missing.

Now the README of BidiReferenceJava mentions that it implements Unicode
6.3.0 (and hasn't been updated since), but this should probably be made
more explicit. Maybe move the current implementation to a "6.3.0"
subdirectory, similar to BidiReferenceC?

---
2. I am reasonably certain I found a bug in BidiReferenceC (version 9.0.0).
Consider these two test cases (in the same format as
BidiCharacterTest.txt):
0061 0028 0062 0029 0300 05D0;1;1;2 2 2 2 2 1;5 0 1 2 3 4
0061 0028 0062 0029 001B 0300 05D0;1;1;2 2 2 2 x 2 1;6 0 1 2 3 5
This concerns runs of NSMs following a paired bracket, and how they 
interact with BNs (or, in the right circumstances, other types removed 
by Rule X9).
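
For anyone who wants to replay these two lines outside the reference
programs, here is a minimal Python sketch of a parser for the
semicolon-separated format. The class and function names are made up for
illustration and are not part of any reference implementation; the field
meanings (code points; paragraph direction with 0 = LTR, 1 = RTL,
2 = auto; resolved paragraph level; resolved levels with "x" for removed
characters; visual order) follow the header comments of
BidiCharacterTest.txt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BidiCharacterTestCase:  # hypothetical container, for illustration only
    code_points: List[int]
    paragraph_direction: int       # 0 = LTR, 1 = RTL, 2 = auto
    paragraph_level: int
    levels: List[Optional[int]]    # None where the file has "x" (removed by X9)
    visual_order: List[int]        # indices of the retained characters


def parse_bidi_character_test_line(line: str) -> BidiCharacterTestCase:
    # Drop a trailing "#" comment, then split the five fields.
    fields = line.split("#", 1)[0].strip().split(";")
    if len(fields) != 5:
        raise ValueError("expected 5 semicolon-separated fields")
    code_points = [int(cp, 16) for cp in fields[0].split()]
    levels = [None if tok == "x" else int(tok) for tok in fields[3].split()]
    visual_order = [int(tok) for tok in fields[4].split()]
    return BidiCharacterTestCase(
        code_points=code_points,
        paragraph_direction=int(fields[1]),
        paragraph_level=int(fields[2]),
        levels=levels,
        visual_order=visual_order,
    )


case = parse_bidi_character_test_line(
    "0061 0028 0062 0029 001B 0300 05D0;1;1;2 2 2 2 x 2 1;6 0 1 2 3 5"
)
```

Note that the removed character (index 4 here) has no level and does not
appear in the visual order, which is why the order field has one entry
fewer than the code point field.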
The first is "a(b)<NSM>A" (A denoting an R-class character) in an RTL
embedding. This test, when run through BidiReferenceC, produces the
expected result.
The key steps are as follows:
1. Classification before the weak types phase is
    L ON L ON NSM R
2. Weak types phase produces
    L ON L ON ON R
3. Rule N0 resolves bracket pair (2,4) (1-based positions) to L; the
    original NSM following the closing bracket gets set to L as well
    (as per the last clause of rule N0):
    L L L L L R
4. Level assignment produces the given expected result.
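
The four steps above can be reproduced with a heavily simplified Python
sketch. This is not the reference implementation: it assumes a single
level run containing exactly one bracket pair, only the types L, R, ON
and NSM, no isolates, and an sos equal to the embedding direction. The
function name is hypothetical, and N1 is omitted (remaining neutrals
simply fall back to the embedding direction as in N2).

```python
from typing import List, Tuple

STRONG = ("L", "R")


def resolve_single_pair(types: List[str], open_idx: int, close_idx: int,
                        embedding_level: int) -> Tuple[List[str], List[int]]:
    """Simplified W1 + N0 + I1/I2 for one bracket pair (0-based indices)."""
    original = list(types)          # types before W1, needed by N0's NSM clause
    resolved = list(types)
    embedding_dir = "R" if embedding_level % 2 else "L"

    # W1 (simplified): an NSM takes the type of the previous character,
    # or sos (assumed equal to the embedding direction) at the start.
    for i, bc in enumerate(resolved):
        if bc == "NSM":
            resolved[i] = resolved[i - 1] if i > 0 else embedding_dir

    # N0: inspect the strong types enclosed by the bracket pair.
    inside = {bc for bc in resolved[open_idx + 1:close_idx] if bc in STRONG}
    bracket_type = None             # None means "leave the brackets alone"
    if embedding_dir in inside:
        bracket_type = embedding_dir
    elif inside:
        # Only the opposite direction occurs inside: the brackets take the
        # preceding strong type (or sos), which covers both sub-cases.
        bracket_type = next(
            (bc for bc in reversed(resolved[:open_idx]) if bc in STRONG),
            embedding_dir)

    if bracket_type is not None:
        for idx in (open_idx, close_idx):
            resolved[idx] = bracket_type
            # Last clause of N0: original NSMs right after a changed bracket
            # follow the bracket.
            j = idx + 1
            while j < len(resolved) and original[j] == "NSM":
                resolved[j] = bracket_type
                j += 1

    # N2 fallback only (assumption: N1 is not modelled in this sketch).
    resolved = [bc if bc in STRONG else embedding_dir for bc in resolved]

    # I1/I2 for L and R: a type opposite to the embedding direction is
    # raised by one level.
    levels = [embedding_level + (1 if bc != embedding_dir else 0)
              for bc in resolved]
    return resolved, levels


resolved_types, resolved_levels = resolve_single_pair(
    ["L", "ON", "L", "ON", "NSM", "R"], open_idx=1, close_idx=3,
    embedding_level=1)
```

For the first test case this yields the types L L L L L R and the levels
2 2 2 2 2 1, matching the expected levels in the test line.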
The second test simply adds an ASCII escape character (class BN) before 
the NSM. Here, BidiReferenceC produces this result:
   Text:        0061 0028 0062 0029 001B 0300 05D0
   Bidi_Class:     L    L    L    L   BN    R    R
   Levels:         2    2    2    2    x    1    1
   Exp Levels:     2    2    2    2    x    2    1
   Mismatches:                              ^
   Runs:        <R------------------------------R>
   Order:      [6 5 0 1 2 3]
   Exp Order:  [6 0 1 2 3 5]
I believe this result to be incorrect. The only difference from the
previous run is the presence of the BN-type character before the NSM,
which should not matter, since it is supposed to be removed by Rule X9
before we ever enter the weak types phase.
The problem appears to be around brrule.c:4376, in the function 
"br_SetBracketPairBC". The code is written to detect a run of NSMs 
following the brackets, but does not skip over deleted characters (which 
are denoted by having "level == NOLEVEL").
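
To illustrate the behavior I would expect, here is a small Python sketch
of the scan with the skip added. It is not the code from brrule.c; the
function name, the parallel-list layout and the value chosen for NOLEVEL
are assumptions made for illustration only.

```python
from typing import List

NOLEVEL = -1  # hypothetical marker value for characters removed by X9


def set_trailing_nsms(original_types: List[str], levels: List[int],
                      resolved_types: List[str], bracket_idx: int,
                      new_type: str) -> None:
    """Retype the run of original NSMs that follows a resolved bracket.

    Characters whose level is NOLEVEL are treated as absent, so they
    neither end the NSM run nor get retyped themselves.
    """
    i = bracket_idx + 1
    while i < len(original_types):
        if levels[i] == NOLEVEL:
            i += 1                  # skip deleted character, keep scanning
            continue
        if original_types[i] != "NSM":
            break                   # first retained non-NSM ends the run
        resolved_types[i] = new_type
        i += 1


# Second test case: a ( b ) ESC NSM A, after the weak types phase.
original = ["L", "ON", "L", "ON", "BN", "NSM", "R"]
levels = [1, 1, 1, 1, NOLEVEL, 1, 1]
resolved = ["L", "L", "L", "L", "BN", "ON", "R"]
set_trailing_nsms(original, levels, resolved, bracket_idx=3, new_type="L")
```

With the skip in place, the NSM at index 5 is set to L, which then leads
to the expected level 2 instead of the level 1 shown in the output above.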
Can anyone confirm whether my interpretation of the rules is correct and 
this is an actual bug in BidiReferenceC?
Thanks,
-Fabian

Received on Thu Dec 08 2016 - 23:12:23 CST