Re: hebrew font conversion

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 23 2005 - 16:10:30 CDT


    From: "Peter Constable" <petercon@microsoft.com>
    >> From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    >
    >
    >> Isn't that too simplistic? Actually, text also contains characters
    >> with weak directionality like punctuation, or with directionality
    >> that does not depend on the script used for the text, like digits
    >> and symbols...
    >> The only accurate thing to do is to parse the line to see where the
    >> Bidi algorithm reverses the visual order
    >
    > In the general case, you're right. My experience has been with Biblical
    > Hebrew text that isn't mixed with other weakly-directional or neutral
    > characters, so I was narrowly thinking of only that.

    My opinion is that the requester was speaking about existing documents,
    probably in modern Hebrew, that mix Hebrew letters with standard
    punctuation. In that case, Bidi mirroring will cause characters like
    parentheses to be incorrectly oriented if the original document coded the
    ASCII opening parenthesis before the reversed Hebrew word, and then the
    ASCII closing parenthesis.

    And there will be other strange effects with trailing punctuation encoded
    visually *before* Hebrew words (for example full stops and commas), which
    the Bidi algorithm may reverse as well, depending on the context where the
    punctuation initially appeared.
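    To make the parenthesis problem concrete, here is a small illustration in
    Python (mine, not from the original thread), using the word "shalom":

        # Logical order: SHIN, LAMED, VAV, FINAL MEM (displayed right to left).
        logical = "\u05E9\u05DC\u05D5\u05DD"
        # A legacy visual-order file stores the word left to right as displayed,
        # i.e. reversed, with plain ASCII parentheses around the visual run:
        visual = "(" + logical[::-1] + ")"
        # Merely relabelling these code points keeps the visual order, so the
        # Bidi algorithm re-reverses the word at display time and, in an RTL
        # context, mirrors the parentheses so that '(' renders as ')'.
        # Reordering the run to logical order avoids both problems:
        assert visual[1:-1][::-1] == logical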

    So effectively the only way, after mapping characters from the legacy
    overridden code positions to Unicode code points, is:

    1) to parse the text with the complete Bidi algorithm, to determine the
    effective RTL or LTR directionality inferred for each character, and to
    see which sequences are effectively visually reversed by it;

    2) then, for each RTL sub-sequence determined above, to reorder it from
    the legacy visual order to the Unicode logical order, by reversing all
    characters in that sequence;

    3) then to make sure that combining sequences are correctly ordered. To
    restore the normal order, with the base letter first and the diacritics
    after, this requires a second scan of the reversed sequence to find the
    base characters and re-reverse each split combining sequence.

    With such an algorithm, you can avoid the LRO...PDF (left-to-right
    override) tweak, and the converted document will be fully indexable and
    searchable the normal way, because it will now be fully encoded in
    logical order.
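
    A minimal sketch of steps 1) to 3) in Python, assuming the legacy code
    positions have already been mapped to Unicode code points, and
    approximating the full UAX #9 run detection of step 1) with the
    characters' Bidi classes (the function names are mine, for illustration):

        import unicodedata

        def is_rtl(ch):
            # Strong RTL letters, plus combining marks (NSM) so that a
            # mark stays inside the run of its base letter.
            return unicodedata.bidirectional(ch) in ("R", "AL", "NSM")

        def fix_combining(run):
            # Step 3: after reversing the run, each combining mark
            # precedes its base letter; move the marks back after it.
            result, pending = [], []
            for ch in run:
                if unicodedata.combining(ch):
                    pending.append(ch)        # marks seen before their base
                else:
                    result.append(ch)         # the base letter
                    result.extend(reversed(pending))
                    pending.clear()
            result.extend(reversed(pending))  # stray marks with no base
            return "".join(result)

        def visual_to_logical(line):
            # Steps 1 and 2: locate the sequences that would have been
            # displayed right to left, and reverse each one.
            out, i = [], 0
            while i < len(line):
                if is_rtl(line[i]):
                    j = i
                    while j < len(line) and is_rtl(line[j]):
                        j += 1
                    out.append(fix_combining(line[i:j][::-1]))
                    i = j
                else:
                    out.append(line[i])
                    i += 1
            return "".join(out)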

    Converting HTML or rich-text documents that use legacy overridden LTR
    code positions with legacy "visual" fonts is more complex, because it
    requires first scanning the syntax of the HTML or rich-text format, and
    then applying the algorithm above only, and in isolation, to the
    fragments identified as plain text, but not to syntactic characters (in
    HTML or XML: < or > or <!, the quotation marks around attribute values,
    or the = between attribute names and values). One can use the DOM
    services, for example, to perform this scan and then apply the necessary
    transformations.
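
    For instance, a minimal sketch using Python's bundled DOM (it assumes
    well-formed XHTML rather than arbitrary tag soup, and reuses the
    visual_to_logical sketch above):

        from xml.dom import minidom

        def convert_dom_text_nodes(path_in, path_out):
            # Walk the DOM and apply the conversion only to text nodes,
            # leaving tags, attribute names and markup syntax untouched.
            doc = minidom.parse(path_in)
            stack = [doc.documentElement]
            while stack:
                node = stack.pop()
                for child in node.childNodes:
                    if child.nodeType == child.TEXT_NODE:
                        child.data = visual_to_logical(child.data)
                    elif child.nodeType == child.ELEMENT_NODE:
                        stack.append(child)
            with open(path_out, "w", encoding="utf-8") as f:
                doc.writexml(f)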

    For Word documents, this scanning to isolate the text fragments is easy
    to perform with a macro in Basic, by using the document object model and
    its enumerators. The same applies to Excel and other Office-compatible
    documents.
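
    The same idea, driven from Python through Word's COM object model for
    consistency with the sketches above (it requires the pywin32 package on
    Windows; the file paths are placeholders, and replacing Range.Text
    wholesale discards character-level formatting, so a real macro would
    work on smaller ranges):

        import win32com.client

        def convert_word_document(path_in, path_out):
            word = win32com.client.Dispatch("Word.Application")
            try:
                doc = word.Documents.Open(path_in)
                # Enumerate the text fragments through the object model.
                for para in doc.Paragraphs:
                    rng = para.Range
                    rng.Text = visual_to_logical(rng.Text)
                doc.SaveAs(path_out)
                doc.Close()
            finally:
                word.Quit()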


