Re: 3 big bidi bugs

From: Mark Davis
Date: Thu May 30 2002 - 00:10:49 EDT

The text of L2 is as follows:

L2. From the highest level found in the text to the lowest odd level
on each line, reverse any contiguous sequence of characters that are
at that level or higher.

It is not the following:

L2'. From the highest level found in the text to the lowest odd level
on each line -- skipping levels for which there is no character with
that level -- reverse any contiguous sequence of characters that are
at that level or higher.

However, at least one other person has also misinterpreted L2 as L2',
so the UTC had already authorized clarifying the text (though that
won't happen until 3.2).
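
To make the difference concrete, here is a minimal Python sketch of the
two readings. The names reverse_runs, reorder and skip_absent_levels are
hypothetical, chosen only for this illustration, and the levels are
assumed to be already resolved for a single line. As a simplification
the loop runs down to level 1 rather than stopping at the lowest odd
level on the line; as far as I can tell the extra passes come in pairs
over identical runs and cancel out, but treat that as an assumption of
the sketch rather than wording from UAX #9.

```python
def reverse_runs(chars, levels, threshold):
    """Reverse every maximal run of positions whose level is >= threshold."""
    chars, levels = list(chars), list(levels)
    i, n = 0, len(chars)
    while i < n:
        if levels[i] >= threshold:
            j = i
            while j < n and levels[j] >= threshold:
                j += 1
            # Reverse the characters and keep their levels attached to them.
            chars[i:j] = chars[i:j][::-1]
            levels[i:j] = levels[i:j][::-1]
            i = j
        else:
            i += 1
    return chars, levels


def reorder(text, levels, skip_absent_levels=False):
    """Reorder one line from logical to visual order.

    skip_absent_levels=False: L2 as written (every level is visited).
    skip_absent_levels=True:  the L2' misreading (levels that no
    character on the line has are skipped).
    """
    chars, lv = list(text), list(levels)
    if not lv:
        return ""
    present = set(lv)
    # Assumption of this sketch: go all the way down to level 1.
    for level in range(max(lv), 0, -1):
        if skip_absent_levels and level not in present:
            continue
        chars, lv = reverse_runs(chars, lv, level)
    return "".join(chars)


if __name__ == "__main__":
    for text, levels in [("abc", [2, 2, 2]), ("abcX", [3, 3, 3, 1])]:
        print(levels,
              "L2:", reorder(text, levels),
              "L2':", reorder(text, levels, skip_absent_levels=True))
```

With logical text "abcX" at levels 3 3 3 1, the L2 reading yields
"Xcba" (the level-3 run reads right to left), while the L2' reading
yields "Xabc". For a line entirely at level 2, L2 leaves "abc"
unchanged, while L2' reverses it to "cba".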

I believe other people addressed the other two items you thought were


 "Eppur si muove"
----- Original Message -----
From: "Bernard Miller" <>
To: <>
Sent: Wednesday, May 29, 2002 08:57
Subject: 3 big bidi bugs

> This letter describes 3 major technical problems with the current
> bidirectional algorithm as described in UAX #9, version 3.20.
> Problems 1 and 3 have security implications. Other problems with the
> whole Unicode bidirectional encoding approach, and their solutions,
> are discussed in the recently updated Bytext FAQ and documentation (
> (1) Line width dependent mangling, general case:
> Step L2 of UAX #9 indicates that a line that resolves into a sequence
> of characters with homogeneous embedding levels will ALWAYS be
> displayed right to left, regardless of what the embedding level is.
> So, for example, a line with the L1 resolved embedding levels
> 2222222222222222222222222 will display right to left
> 3333333333333333333333333 will display right to left
> 4444444444444444444444444 will display right to left
> etc
> Likewise:
> in 3333333333333333333333331, the 3's will display left to right
> in 5555555555555555555555551, the 5's will display left to right
> etc
> It directly contradicts the writer's intentions. It means that
> Unicode compliant applications will display the same characters in a
> different order (depending on available line width). Examples of how
> this is bad are given in question 12 of the Bytext FAQ.
> This can be fixed by rewording step L2 such that a reversal happens
> from the highest embedding level to each lower contiguous embedding
> level, if the embedding level is represented by a character on the
> line, until the embedding level of 1 is reached (or, as an
> optimization, until the first odd embedding level equal to or lower
> than the lowest embedding level represented by a character on the
> line).
> (2) Line width dependent mangling, spelling conventions for quotes:
> What is the purpose of step X10 if not to allow something like LEFT
> QUOTATION MARK to be used as if it was an OPEN DOUBLE QUOTATION
> simply puts an embedding inside a quotation, such as
> The problem with this is that it only works if the quotation begins
> and ends on the same line. Examples of how the text is mangled when
> the quotation spans multiple lines are given in question 13 of the
> Bytext FAQ
> (
> This cannot really be fixed with minor changes other than to notify
> that the whole left=open, right=closed idea may not work as such when
> the default automatic line breaking is used. Users should not rely on
> spelling conventions that do not bypass the effects of step X10 and
> mirroring -- how this can be done is described in the Bytext
> (3) Mirroring ambiguities:
> What if eor = sor?
> text: R RLO whatever PDF N LRO whatever PDF
> embedding level at step X9: 1 3 3 1 2 2
> directional type at step X10: R R R ? L L
> The above example should be in a monospace font. The original is at
> Step X10 is ambiguous about whether the "N" should be L or R. This
> means that if N has the mirrored property, some implementations might
> display the mirrored form, others the non-mirrored form, and others
> might result in an error.
> This can be fixed by deciding on a single form for such cases. Also,
> the statement: "for two adjacent runs, the eor of the first run is
> the same as the sor of the second" needs to be removed because it is
> not true.
> Bernard
> ---
> Bernard Rafael Miller, email:
> Format enabling simplified 8 bit regexes of UCS characters:
> ---
