I'm currently implementing the bidirectional algorithm and, while
testing my version, ran into some issues with the provided reference
implementations. (http://www.unicode.org/Public/PROGRAMS/)
1. BidiReferenceJava supports Unicode 6.3.0, but has not been updated
for later versions.
In particular, the changes from revision 33 of UAX#9 (corresponding to
Unicode 8.0.0; most notably, limitation of maximum depth for nested
brackets in the PBA, and the rules for handling NSMs following brackets
in rule N0) are missing.
The README of BidiReferenceJava does mention that it implements Unicode
6.3.0 (and it hasn't been updated since), but this should probably be
made more explicit. Maybe move the current implementation to a "6.3.0"
subdirectory, similar to BidiReferenceC?
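To make the N0 clause on NSMs concrete, here is a minimal sketch in C of
how I read it. The types and names (TextUnit, NO_LEVEL,
set_trailing_nsms) are made up for illustration and are not taken from
BidiReferenceC or BidiReferenceJava. The sketch treats the whole array
as a single isolating run sequence, and it assumes that characters
removed by Rule X9 are simply stepped over, so they do not end the run
of NSMs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; NOT the data structures of BidiReferenceC. */
typedef enum { BC_L, BC_R, BC_ON, BC_NSM, BC_BN } BidiClass;

#define NO_LEVEL (-1) /* hypothetical marker: character was removed by Rule X9 */

typedef struct {
    BidiClass orig_bc; /* class before the weak types phase */
    BidiClass bc;      /* current (resolved) class */
    int level;         /* embedding level, or NO_LEVEL if removed by X9 */
} TextUnit;

/*
 * Call after N0 has resolved the bracket at index `bracket` to `resolved`
 * (L or R). Every following character whose original class was NSM gets
 * the same class. Characters removed by X9 are skipped and left untouched
 * (assumption: they must not terminate the run). The first remaining
 * character that was not originally an NSM ends the run.
 * Returns the number of characters whose class was changed.
 */
size_t set_trailing_nsms(TextUnit *text, size_t len, size_t bracket,
                         BidiClass resolved)
{
    size_t changed = 0;

    assert(bracket < len);
    assert(resolved == BC_L || resolved == BC_R);

    for (size_t i = bracket + 1; i < len; i++) {
        if (text[i].level == NO_LEVEL)
            continue;              /* removed by X9: step over it */
        if (text[i].orig_bc != BC_NSM)
            break;                 /* end of the NSM run */
        text[i].bc = resolved;
        changed++;
    }
    return changed;
}
```

With a closing bracket that N0 resolved to L, followed by a BN and then
an NSM, this sketch changes the NSM to L and leaves the BN alone.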
2. I am reasonably certain I found a bug in BidiReferenceC (version
9.0.0). Consider these two test cases (in the same format as
BidiCharacterTest.txt):

  0061 0028 0062 0029 0300 05D0;1;1;2 2 2 2 2 1;5 0 1 2 3 4
  0061 0028 0062 0029 001B 0300 05D0;1;1;2 2 2 2 x 2 1;6 0 1 2 3 5

This concerns runs of NSMs following a paired bracket, and how they
interact with BNs (or, in the right circumstances, other types removed
by Rule X9).

The first is "a(b)<NSM>A" (A denoting an R-class character) in an RTL
embedding. This test, when run through BidiReferenceC, produces the
expected result. The key steps are as follows:

  1. Classification before the weak types phase is
       L ON L ON NSM R
  2. Weak types phase produces
       L ON L ON ON R
  3. Rule N0 resolves bracket pair (2,4) to L; the original NSM
     following the closing bracket gets set to L as well (as per the
     last clause of rule N0):
       L L L L L R
  4. Level assignment produces the given expected result.

The second test simply adds an ASCII escape character (class BN) before
the NSM. Here, BidiReferenceC produces this result:

  Text:        0061 0028 0062 0029 001B 0300 05D0
  Bidi_Class:  L L L L BN R R
  Levels:      2 2 2 2 x 1 1
  Exp Levels:  2 2 2 2 x 2 1
  Mismatches:            ^
  Runs:        <R------------------------------R>
  Order:       [6 5 0 1 2 3]
  Exp Order:   [6 0 1 2 3 5]

which I believe to be incorrect. The only difference from the previous
run is the presence of the BN-type character before the NSM (which
should not matter, since it is supposed to be removed by Rule X9 before
we ever enter the weak types phase).

The problem appears to be around brrule.c:4376, in the function
"br_SetBracketPairBC". The code is written to detect a run of NSMs
following the brackets, but does not skip over deleted characters (which
are denoted by having "level == NOLEVEL").

Can anyone confirm whether my interpretation of the rules is correct and
this is an actual bug in BidiReferenceC?

Thanks,
-Fabian

Received on Thu Dec 08 2016 - 23:12:23 CST