From: Michael D. Adams (mdmkolbe@gmail.com)
Date: Fri Sep 10 2010 - 18:07:18 CDT
Rules W2 through W7 for the Bidi Algorithm
[http://www.unicode.org/reports/tr9/] are rather confusing to read.
What to do is clear enough; what is unclear is why each step is done
and how to implement the rules efficiently. After many hours puzzling
over them I think I've found a simpler way to define them. Is the
following definition equivalent to the specification's rules? If so,
why isn't the Bidi Algorithm defined using this simpler specification?
My simpler specification is as follows:
Assume standard regular expression syntax where infix "|" is
alternation, suffix "+" is one or more repetitions, and suffix "*" is
zero or more repetitions. Let "X sep-by Y" be a shorthand for one or
more "X" separated by "Y" (i.e. "X (Y X)*"). Let "X bracket-by Y" be a
shorthand for one or more "X" separated and surrounded by "Y" (i.e. "Y
(X sep-by Y) Y" or "Y (X Y)+"). Upper case names represent the
bidi_class of a character.
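As a minimal sketch, the two shorthands can be expanded mechanically
into ordinary regular expressions. The helper names sep_by and
bracket_by are hypothetical and exist only for this illustration; both
operands are wrapped in non-capturing groups so that an alternation
such as "ES|CS" can be passed in safely.

```python
import re

def sep_by(x: str, y: str) -> str:
    """One or more X separated by Y, i.e. X (Y X)*."""
    return f"(?:(?:{x})(?:(?:{y})(?:{x}))*)"

def bracket_by(x: str, y: str) -> str:
    """One or more X separated and surrounded by Y, i.e. Y (X Y)+."""
    return f"(?:(?:{y})(?:(?:{x})(?:{y}))+)"

# "a sep-by b" accepts "a", "aba", "ababa", ... but not a trailing "b".
assert re.fullmatch(sep_by("a", "b"), "ababa")
assert not re.fullmatch(sep_by("a", "b"), "abab")

# "a bracket-by b" additionally requires a "b" on both ends.
assert re.fullmatch(bracket_by("a", "b"), "babab")
assert not re.fullmatch(bracket_by("a", "b"), "aba")
```

Note that when Y can match the empty string (as with "ET*"),
bracket-by simply means "one or more X, with optional Y around and
between them".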
Define a SequenceOfEuropeanNumbers to be a maximally long contiguous
sequence of characters that match "((EN NSM)+ sep-by (ES|CS))
bracket-by ET*".
Define an ArabicNumber to be a maximally long contiguous sequence of
characters that match "AN+ sep-by CS".
Define a EuroArabicNumber to be a maximally long contiguous sequence
of characters that match "(AN|EN)+ sep-by CS".
Between each strong character (AL,L,R,sor) and the next strong
character (AL,L,R,eor):
  If the leading strong character is L then:
    (1) change the class of all characters in a SequenceOfEuropeanNumbers to L
    (2) change the class of all characters in an ArabicNumber to AN
    (3) change all other characters to ON.
  If the leading strong character is R then:
    (1) change the class of all characters in a SequenceOfEuropeanNumbers to EN
    (2) change the class of all characters in an ArabicNumber to AN
    (3) change all other characters to ON.
  If the leading strong character is AL then:
    (1) change the class of all characters in a EuroArabicNumber to AN
    (2) change all other characters to ON.
At this point all AL characters can be changed to R, and processing
can resume with the normal N1 and N2 rules.
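The proposed rules can be sketched in Python as follows. This is only
an illustration under stated assumptions, not a tested replacement for
W2 through W7: the one-character codes, the function names
resolve_segment and resolve_weak, and the convention of passing sor as
"L" or "R" are all hypothetical. It also reads "EN NSM" in the first
definition as "EN NSM*" (an assumption), and it follows the text above
literally in turning every other non-strong class into ON.

```python
import re

# Hypothetical one-character codes for the bidi classes the patterns mention;
# every other non-strong class is encoded as "o".
CODE = {"EN": "e", "NSM": "m", "ES": "s", "CS": "c", "ET": "t", "AN": "a"}
STRONG = {"L", "R", "AL"}

# ((EN NSM*)+ sep-by (ES|CS)) bracket-by ET*   (NSM* is an assumed reading)
EURO_SEQ = re.compile(r"t*(?:(?:em*)+(?:[sc](?:em*)+)*t*)+")
# AN+ sep-by CS
ARABIC = re.compile(r"a+(?:ca+)*")
# (AN|EN)+ sep-by CS
EURO_ARABIC = re.compile(r"[ae]+(?:c[ae]+)*")

def resolve_segment(lead, classes):
    """Resolve the non-strong classes that follow the strong class `lead`."""
    text = "".join(CODE.get(c, "o") for c in classes)
    out = ["ON"] * len(classes)  # default: all other characters become ON
    if lead == "AL":
        rules = [(EURO_ARABIC, "AN")]
    else:
        rules = [(EURO_SEQ, "L" if lead == "L" else "EN"), (ARABIC, "AN")]
    for pattern, new_class in rules:
        # finditer yields leftmost, greedy, non-overlapping matches, which
        # stands in for "maximally long contiguous sequence" here.
        for m in pattern.finditer(text):
            out[m.start():m.end()] = [new_class] * (m.end() - m.start())
    return out

def resolve_weak(classes, sor):
    """Apply the proposed rules to one run of bidi classes; sor is "L" or "R"."""
    out, lead, segment = [], sor, []
    for c in classes:
        if c in STRONG:
            out += resolve_segment(lead, segment)
            out.append("R" if c == "AL" else c)  # AL finally becomes R
            lead, segment = c, []
        else:
            segment.append(c)
    return out + resolve_segment(lead, segment)
```

For example, resolve_weak(["AL", "EN", "CS", "AN", "ES", "EN"], "R")
returns ["R", "AN", "AN", "AN", "ON", "AN"]. Each segment is scanned a
small, fixed number of times, which is the efficiency argument made
above.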
I believe specifying things this way is more intuitive than the
existing way and will make the rules easier for implementers to
implement properly and efficiently. Am I wrong? Is there a good reason
W2 through W7 are the way they are? If not, can they be changed to
this simpler specification?
Michael D. Adams