I am working on an implementation of the bidirectionality algorithm from the
Unicode Standard, version 2.0. I find that I have many questions about the
algorithm. Any answers or hints would be appreciated.
1. In step T6, why are implicit directional formatting codes (RLM and LRM,
according to the description on p. 3-15) removed? If the purpose of RLM and LRM
is to affect the direction of weak types (as shown in the example on p. 3-23),
then it would be more useful to leave them in until after the neutrals are
resolved. In fact, the example on p. 3-23 does not work unless RLM is left in
until after step N1.
2. Do sot/eot refer to start and end of an embedding level, or only the start
and end of the entire block being processed? If only to the start and end of the
block, what is the behavior of the algorithm in steps P0 and N1-N3 when a level
change is detected?
3. In step P0, if a character type changes from EN to AN, should the character
also change to Arabic-Indic or remain European? For example, should U+0031 DIGIT
ONE change to U+0661 ARABIC-INDIC DIGIT ONE in this step?
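To make the distinction in question 3 concrete, here is a small Python sketch that inspects the two characters with the standard `unicodedata` module. That module reflects whatever Unicode version ships with the interpreter, not version 2.0, so it only illustrates that the bidirectional type is a property looked up for a code point and that U+0031 and U+0661 are distinct characters. It does not settle what step P0 intends. The helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, bidi class, digit value) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),  # e.g. "EN" or "AN"
        unicodedata.digit(ch),          # numeric value of the digit
    )

european_one = "\u0031"      # DIGIT ONE
arabic_indic_one = "\u0661"  # ARABIC-INDIC DIGIT ONE

for ch in (european_one, arabic_indic_one):
    print(describe(ch))
```

Both characters carry the digit value 1 but have different bidirectional classes, so retyping an EN as AN inside the algorithm and replacing the character in the text are separate operations.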
4. In step I1, should this read, "Numeric text (EN) goes up two levels unless
preceded by left-to-right text AT THE SAME EMBEDDING LEVEL"? If so, then what
happens to an EN at the beginning of an embedding level? What if the first
character after sot is EN?
5. Do quote marks and parentheses affect the embedding level in a special way?
There is no mention of these characters in the algorithm, and their character
type is "Other neutral" in UNIDATA2, but ALL of the examples containing quote
marks on pp 3-20 and 3-21 seem to require either special handling of single and
double quotes, or use of explicit embedding codes.
For example, on page 3-21:
    Memory:          he said "car MEANS CAR."
    Resolved levels: 000000000222111111111100
Why is car at level 2? If quote has no special meaning, car should be at level
0. On the other hand, if quote is interpreted as pushing a level, then why is
the period at level 0 instead of level 1? The only way I can duplicate this
result is to insert a LRE between quote and car, and a PDF between CAR and
period.
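To see exactly which characters sit at which level in the p. 3-21 example, the following Python sketch pairs each memory-order character with its digit in the level string and groups them into runs. It only restates the two lines quoted above; the function name `level_runs` is hypothetical and nothing here implements the algorithm itself.

```python
from itertools import groupby

def level_runs(memory, levels):
    """Split memory-order text into (level, substring) runs."""
    if len(memory) != len(levels):
        raise ValueError("text and level string must have the same length")
    runs = []
    pairs = zip(memory, levels)
    # Consecutive characters with the same level digit form one run.
    for level, group in groupby(pairs, key=lambda pair: pair[1]):
        runs.append((int(level), "".join(ch for ch, _ in group)))
    return runs

memory = 'he said "car MEANS CAR."'
levels = "000000000222111111111100"

for level, text in level_runs(memory, levels):
    print(level, repr(text))
```

The runs come out as `he said "` at level 0, `car` at level 2, ` MEANS CAR` at level 1, and `."` at level 0, which is the pattern the question is about: the opening quote and the closing period and quote stay at level 0 while everything between them is raised.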
6. In the example
    Memory:          car MEANS CAR.
    Resolved levels: 22211111111111
shouldn't this be
    Memory:          car MEANS CAR.
    Resolved levels: 00001111111110
unless the whole example is embedded in LRE/PDF?
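The difference between the two level strings in question 6 can be listed position by position with a short Python sketch. The names `level_differences`, `given`, and `proposed` are labels introduced here for illustration; the strings themselves are copied from the example above.

```python
def level_differences(memory, given, proposed):
    """List (index, char, given level, proposed level) where the strings differ."""
    if not (len(memory) == len(given) == len(proposed)):
        raise ValueError("all three strings must have the same length")
    return [
        (i, ch, int(g), int(p))
        for i, (ch, g, p) in enumerate(zip(memory, given, proposed))
        if g != p
    ]

memory = "car MEANS CAR."
given = "22211111111111"     # levels as printed in the example
proposed = "00001111111110"  # levels suggested in the question

for diff in level_differences(memory, given, proposed):
    print(diff)
```

Only `car`, the space after it, and the final period differ. The levels for `MEANS CAR` agree in both versions.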
7. An observation: In step N3, the sequences R N EN N R and R N EN N L should
never occur because in step P0 they resolve to R N AN N R and R N AN N L
respectively.
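A minimal Python sketch of the reasoning in observation 7, under a loudly stated assumption: that step P0 changes an EN to AN whenever the nearest preceding strong type is R. That rule is the reading implied by the observation, not a quotation of the standard, and the function name `apply_p0` and the list-of-type-names representation are hypothetical.

```python
def apply_p0(types):
    """Retype EN as AN when the nearest preceding strong type is R.

    ASSUMPTION: this is only the reading of step P0 implied by the
    observation above, not the wording of the standard.
    """
    result = []
    last_strong = None  # None stands for start of text (no strong type yet)
    for t in types:
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "R":
            t = "AN"
        result.append(t)
    return result

print(apply_p0(["R", "N", "EN", "N", "R"]))
print(apply_p0(["R", "N", "EN", "N", "L"]))
```

Under that assumption, any EN that has an R as its nearest preceding strong type has already become AN before the neutral steps run, so step N3 would never see the two sequences named in the observation.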
Thanks for any help/insights,
Kent Johnson kjohnson@transparent.com
Transparent Language, Inc. http://www.transparent.com