Re: 3 big bidi bugs

From: Mark Davis (mark@macchiato.com)
Date: Thu May 30 2002 - 13:01:53 EDT

Previous message: Michael (michka) Kaplan: "Re: Towards some more Private Use Area code points for ligatures."
In reply to: Bernard Miller: "RE: 3 big bidi bugs"
Next in thread: Bernard Miller: "from 4 to null (was: 3 big bidi bugs)"
Reply: Bernard Miller: "from 4 to null (was: 3 big bidi bugs)"

1. One needs to be rather pragmatic about determining whether
something is unclear or not. If people misinterpret the text, then it
needs improvement. So clearly the text should be improved (and as I
mentioned, the UTC already decided to do this).

2. Now, about your issue. Let's take some examples:
If you have 4444444, no action is taken.
If you have 111111, then the text is processed once (1).
If you have 4444441111114444, then the text is processed four times
(4,3,2,1).
If you have 222255552222, then the text is processed three times
(5, 4, 3).
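
One reading that reproduces all four examples is: run one pass per
level, from the highest level on the line down to the smallest odd
level that is not below the lowest level on the line. The sketch below
encodes that reading only; the function names are made up for
illustration and the stopping rule is an assumption inferred from the
examples, not wording taken from the standard.

```python
def levels_processed(levels):
    """Return the levels at which a reversal pass runs, highest first."""
    if not levels:
        return []
    highest = max(levels)
    lowest = min(levels)
    # Assumed stopping point: the smallest odd level >= the lowest level.
    lowest_odd = lowest if lowest % 2 == 1 else lowest + 1
    # The range is empty when lowest_odd is above the highest level.
    return list(range(highest, lowest_odd - 1, -1))


def parse(digits):
    """Turn a string such as '2225' into a list of integer levels."""
    return [int(d) for d in digits]


if __name__ == "__main__":
    for line in ("4444444", "111111", "4444441111114444", "222255552222"):
        print(line, levels_processed(parse(line)))
```

For a line of all 4s the stopping point is 5, which is above the
highest level, so the list of passes is empty.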

Now as a matter of fact, you could always continue processing down to
any odd level that is at or below the lowest level found; you could
always process down to 1 if you wanted. That would give the same
results, since any pair of reversals of *all* the text results in the
same answer. So it would be simplest to just state the algorithm
that way to prevent confusion; implementations can always optimize as
long as they give the same results.

So probably the simplest restatement of the text would be:

L2b. From the highest level found in the text to level 1 on each line
(including levels that do not occur on the line), reverse any
contiguous sequence of characters that are at that level or higher.
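
A minimal sketch of the restated rule, assuming the embedding levels
for the line have already been resolved and are supplied as a list of
integers. The name reorder_line and its lowest parameter are
illustrative only. The lowest parameter lets you compare "stop at the
lowest odd level" with "always go down to 1".

```python
def reorder_line(chars, levels, lowest=1):
    """Reverse runs at each level from the highest down to `lowest`."""
    if not levels:
        return ""
    # Keep each character paired with its level so both move together.
    cells = list(zip(chars, levels))
    n = len(cells)
    for level in range(max(levels), lowest - 1, -1):
        i = 0
        while i < n:
            if cells[i][1] >= level:
                # Find the end of the contiguous run at this level or higher.
                j = i
                while j < n and cells[j][1] >= level:
                    j += 1
                cells[i:j] = cells[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(ch for ch, _ in cells)


if __name__ == "__main__":
    levels = [2, 2, 2, 2, 5, 5, 5, 5, 2, 2, 2, 2]
    print(reorder_line("abcdEFGHijkl", levels, lowest=3))
    print(reorder_line("abcdEFGHijkl", levels, lowest=1))
```

Both calls give the same string: the two extra passes at levels 2 and
1 each reverse the whole line, so they cancel.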

3. The quotes should go on the outside of the text, since they are
logically part of the enclosing text, not the enclosed text. If the
adjacent outside text is of the wrong flavor, you'll have to insert a
mark. e.g.

RLE ... L " LRE X X X PDF " ... PDF
=>
RLE ... L RLM " LRE X X X PDF " ... PDF
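
To see the ingredients of that fix, the sketch below builds the marked
sequence from its code points and prints each character's
bidirectional class using Python's standard unicodedata module. The
letter "a" standing in for the strong L character and the variable
names are placeholders chosen for illustration. The quotation mark is
a neutral (ON), while RLM is a strong R, which is what lets it change
how the adjacent quote resolves.

```python
import unicodedata

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def bidi_classes(text):
    """Return the bidirectional class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# "a" is a placeholder for the strong L character before the quotation.
without_mark = RLE + "a" + '"' + LRE + "XXX" + PDF + '"' + PDF
with_mark = RLE + "a" + RLM + '"' + LRE + "XXX" + PDF + '"' + PDF

if __name__ == "__main__":
    print(bidi_classes(without_mark))
    print(bidi_classes(with_mark))
```

This only inspects character classes; it does not run the bidi
algorithm itself.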

4. As to the complexity of BIDI: the algorithm was worked over by quite
a number of people over many years to get the results that it has. One
of the principal goals is interoperability: that everyone gets the same
results on any machine. It has always been recognized that one would
need to add marks and/or embeddings in some cases to get the desired
results.

One could wish for a simpler algorithm (for that matter, one could
wish that people had uniform writing directions, or that Brits would
drive on the right side of the road). As to ByText, you are on your
own (in many ways).

Mark
__________

http://www.macchiato.com

"Eppur si muove"
----- Original Message -----
From: "Bernard Miller" <Bernard_R_Miller@bytext.org>
To: "Mark Davis" <mark@macchiato.com>; <timpart@perdix.demon.co.uk>
Cc: <unicode@unicode.org>
Sent: Thursday, May 30, 2002 08:36
Subject: RE: 3 big bidi bugs

> Mark Davis wrote:
> > [L2] is not the following:
> ...
>
> I'm glad to hear that "bug" 1 is not how L2 is intended to work (this means
> that the answer to FAQ question 12 "Is Bytext bidirectionality compatible
> with Unicode bidirectionality?" is simply yes, instead of a qualified yes).
> I don't wish to give the impression that I care too much about semantic
> errors, but if you can't acknowledge that what was said in L2 was not what
> was intended (instead of just being unclear) I'm going to have to call you
> on that:
>
> Let's say you have a line consisting of characters with all embedding level
> 4... How is "3" considered to be the lowest odd level on that line? It's no
> more the lowest odd level than 5 or 1 is. At best, if you consider a
> character with embedding level 4 to actually consist of 4 and each lower
> embedding level (4, 3, 2, 1, and zero), which is not entirely unreasonable,
> then 1 will always be the lowest odd embedding level on every line except a
> line consisting of all zero's. But since L2 doesn't say "...to 1", it rules
> out this interpretation.
>
> A function implementing L2 might go thru the following steps on each line:
> 1. find the highest level
> 2. find the lowest odd level
> ...
> For a line consisting of all 4's as above, step 1 will return 4 and step 2
> should return null since there are no odd levels on the line. A list
> consisting of "from 4 to null" can only reasonably be interpreted as
> consisting only of 4. Going on with this you get the "bugs" I describe.
>
> If you are familiar with each implementation of the algorithm, it might be
> reassuring to users if you can state that none actually work in the manner
> above. Any other implementations might want to test for this.
>
> > I believe other people addressed the other two items you thought were
> > bugs.
>
> Other people have not addressed "bug" 2 accurately. Here's an impromptu
> shorthand to summarize the issue:
>
> RLE..."LRE...PDF" looks ok on 1 or more lines, unless a strong L character
> precedes or follows the quotation, as in: RLE...L "LRE...PDF"
>
> LRE..."RLE...PDF" looks ok on 1 or more lines, unless a strong R character
> precedes or follows the quotation, as in: LRE...R "RLE...PDF"
>
> LRE...RLE"..."PDF looks ok on 1 line, looks messed up on multiple lines
>
> RLE...LRE"..."PDF looks ok on 1 line, looks messed up on multiple lines
>
> This "bug" is weaker than I originally thought, but it still belongs in
> question 13 of the Bytext FAQ "How is using bidirectionality in Bytext
> easier than in Unicode?"... even Tim Partridge didn't get it right as to how
> to spell embedded quotations ("Surely if the quotation is meant to be right
> to left the RLE and PDF should be outside the entire thing, including the
> quotes"). These kinds of issues can be summarized as an overdependence on
> character properties, language specific conventions, and formatting
> characters with overlapping functionality that allow multiple spellings for
> the same formatting. In other words, as others have said, the Unicode
> bidirectional algorithm is too complex. The (new) Bytext encoding of
> bidirectionality shifts the complexity to the level of transcoding and to
> input methods. It effectively eliminates multiple encodings that achieve the
> same embedding levels, so like everything else in Bytext it is more regular
> expression friendly.
>
> Bernard
> ---
> Bernard Rafael Miller, email: bernard_r_miller@bytext.org
> Format enabling simplified 8 bit regexes of UCS characters: www.bytext.org
> ---
> "We believed that the cybernetic approach to consciousness, whipped up
> frothy, would carry us to a plateau overlooking a pleasant mirror, but
> instead left us blathering in the dressed up solitude of mannequin
> planets." --Steven Jesse Bernstein
>
>

This archive was generated by hypermail 2.1.2 : Thu May 30 2002 - 11:12:27 EDT