From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 23 2005 - 16:10:30 CDT
From: "Peter Constable" <petercon@microsoft.com>
>> From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
>
>
>> Isn't that too simplistic? Actually, text also contains characters with
>> weak directionality like punctuation, or with directionality that does not
>> depend on the script used for text, like digits, and symbols...
>> The only accurate thing to do is to parse the line to see where the Bidi
>> algorithm reverses the visual order
>
> In the general case, you're right. My experience has been with Biblical
> Hebrew text that isn't mixed with other weakly-directional or neutral
> characters, so was narrowly thinking of only that.
My opinion is that the requester was speaking about existing documents,
probably in modern Hebrew, that mix Hebrew letters with standard
punctuation. In that case, Bidi mirroring will cause characters like
parentheses to be incorrectly oriented if the original document encoded
the ASCII open parenthesis before the reversed Hebrew word, and the ASCII
close parenthesis after it.
And there will be other strange effects on ending punctuation encoded
visually *before* Hebrew words (for example full stops, commas), which
the Bidi algorithm may reverse as well, depending on the context where the
punctuation initially appeared.
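As a rough illustration of why these characters behave differently, Python's
standard unicodedata module exposes the Bidi class and the mirrored property
of each character. The helper name describe is only illustrative:

```python
import unicodedata

def describe(ch):
    """Return (Bidi class, mirrored flag) for one character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

# Hebrew letters are strongly RTL; parentheses are neutral and mirrored;
# full stop and comma are weak separators; ASCII digits are weak numbers.
for ch in "\u05D0().,1":
    print(f"U+{ord(ch):04X}", *describe(ch))
```

Only the Hebrew letter has a strong direction of its own. The direction of
the others is resolved from their context, which is why reversing text
script by script is not enough.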
So the only way is effectively, after mapping characters from the legacy
overridden code positions to Unicode code points:
1) to parse the text with the complete Bidi algorithm, to determine the
effective RTL or LTR directionality inferred for each character, and see
which sequences are effectively visually reversed by it.
2) Then, for each RTL sub-sequence determined above, you must reorder it
from the legacy visual order to the Unicode logical order, by reversing the
order of all characters in that sequence.
3) Then you must make sure that combining sequences are correctly ordered:
to restore the normal order, with the base letter first and the diacritics
after, this requires a second scan of the reversed sequence to find the base
characters and reverse each split combining sequence.
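Steps 2 and 3 could be sketched as follows in Python. This is only a
minimal sketch under stated assumptions: step 1 has already isolated the RTL
run (the standard library has no full Bidi implementation), the legacy
visual encoding stored each diacritic immediately after its base letter, and
marks are detected by general category (M*), which is a simplification. The
function names are illustrative only.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def visual_run_to_logical(run):
    """Convert one RTL run from legacy visual order to logical order."""
    # Step 2: reverse the whole run. The marks now precede their base.
    reversed_run = run[::-1]

    # Step 3: second scan. Collect marks until their base letter shows up,
    # then emit the base first, followed by the marks in original order.
    out = []
    pending = []
    for ch in reversed_run:
        if is_mark(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(reversed(pending))
            pending.clear()
    # Assumption: marks with no base at all are kept rather than dropped.
    out.extend(reversed(pending))
    return "".join(out)

# Visual storage of shin+qamats, lamed, vav+holam, final mem,
# written left to right as it would appear on screen.
visual = "\u05DD\u05D5\u05B9\u05DC\u05E9\u05B8"
logical = visual_run_to_logical(visual)
```

Reversing the run and then reversing each split combining sequence is
equivalent to reversing the order of whole clusters, so a real converter
could also split into clusters first and reverse that list.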
With such an algorithm, you can avoid using the LRO/PDF tweak, and the
converted document will be fully indexable and searchable the normal way,
because it will now be fully encoded in logical order.
For converting HTML or rich-text documents that use legacy overridden LTR
code positions with legacy "visual" fonts, it is more complex, because it
requires first scanning the syntax of the HTML or rich-text format, and then
applying the algorithm above only, and separately, to the fragments
identified as plain text, but not to syntactic characters (in HTML or XML:
< or > or <! or " around attribute values or = between attribute names and
values). One can use the DOM services, for example, to perform this scan and
then apply the necessary transformations.
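A minimal sketch of that separation in Python, using the event-based
html.parser module from the standard library instead of a DOM. The class
name, the function name and the convert callback are all hypothetical;
convert stands for whatever visual-to-logical routine is applied to a text
fragment. Markup, comments, character references and the content of script
and style elements are passed through untouched.

```python
from html.parser import HTMLParser

class TextOnlyConverter(HTMLParser):
    """Re-emit HTML, applying `convert` to plain-text fragments only."""

    def __init__(self, convert):
        super().__init__(convert_charrefs=False)
        self.convert = convert
        self.out = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in ("script", "style"):
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1

    def handle_data(self, data):
        # Only real text is converted; script and style content is kept.
        self.out.append(data if self.raw_depth else self.convert(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def convert_html_text(html, convert):
    parser = TextOnlyConverter(convert)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that a text fragment is split wherever a character reference occurs, so
a real converter would probably need to resolve references and handle
attribute values such as title or alt before running the Bidi analysis.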
For Word documents, this scanning to isolate the text fragments is easy to
perform with a macro in Basic, by using the document class model and its
enumerators. The same applies to Excel or other Office-compatible documents.