Serious problems with Arabic

From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Tue Nov 21 2000 - 07:15:51 EST

Next message: Asmus Freytag: "Re: Greek Prosgegrammeni"
Previous message: Michael \(michka\) Kaplan: "Re: Data entry in chosen language"

Dear All,

I have serious problems with Unicode Arabic. The main problem is with the
Arabic shaping rules in TUS 3.0, pages 192--197. I think these should be
changed in the ways suggested below. Would someone please guide me on how
I should prepare an official suggestion?

1. "Bidi vs Cursive Joining"

Page 192 mentions:

        "An implementation may choose to restate the following rules
         according to logical order so as to apply before the bidirectional
         algorithm's reordering phase. In this case, the words right and
         left as used in this section would become preceding and
         following."

But the effect is not the same! Consider the sequence

        U+0628 U+202D U+0627 U+0631 U+202C
        BEH LRO ALEF REH PDF

If you apply bidi to this, you'll obtain

        ALEF REH BEH

which will then become

        ALEF<isolated> REH<final> BEH<initial>

after cursive joining. But now try to reverse the order: first apply joining
and then bidi. Keep in mind that LRO is transparent with regard to joining
(page 192, table 8-2 lists all format marks as transparent; RLM is
included as an example, so we can deduce that by format marks, TUS means the
characters in the character class Cf, "Other, Format").
First you'll have

        BEH<initial> LRO ALEF<final> REH<isolated> PDF

and after bidi,

        ALEF<final> REH<isolated> BEH<initial>

The former case is unacceptable because BEH and REH, which are not adjacent
in logical order (the order in which one reads the text aloud), have joined
together, and one can no longer tell that they were not adjacent. The latter
form is also unacceptable, since you have a final ALEF that joins to nothing
(and you have not requested this, because you have not put any ZWJ
in the text). This case may well occur in Arabic-enabled editors when a
user is playing with the text, and it seems that both solutions are
problematic. UAX #9, in Reordering Resolved Levels, recommends the latter
case.
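
To make the two orders concrete, here is a minimal Python sketch. It is not
a real shaping engine or a full UAX #9 implementation: the joining classes
are hard-coded as described above, `reorder` is a toy that only handles an
RTL paragraph containing one LRO...PDF run, and all function names are my
own.

```python
# Joining classes as used above (D = dual-joining, R = right-joining,
# T = transparent). Hard-coded for this example, not read from Unicode data.
JOINING = {"BEH": "D", "ALEF": "R", "REH": "R", "LRO": "T", "PDF": "T"}

FORMS = {(False, False): "isolated", (False, True): "initial",
         (True, False): "final", (True, True): "medial"}


def shape_logical(names):
    """Attach a contextual form to each letter; `names` is in logical order."""
    letters = [i for i, n in enumerate(names) if JOINING[n] != "T"]
    out = list(names)
    for pos, i in enumerate(letters):
        me = JOINING[names[i]]
        prev = JOINING[names[letters[pos - 1]]] if pos > 0 else None
        nxt = JOINING[names[letters[pos + 1]]] if pos + 1 < len(letters) else None
        joins_prev = me in ("D", "R") and prev == "D"
        joins_next = me == "D" and nxt in ("D", "R")
        out[i] = f"{names[i]}<{FORMS[joins_prev, joins_next]}>"
    return out


def shape_visual(names):
    """Same rules on left-to-right visual order: the right neighbour precedes."""
    return shape_logical(names[::-1])[::-1]


def reorder(items):
    """Toy bidi: RTL paragraph (level 1), LRO...PDF content at level 2.
    Returns left-to-right visual order with the format codes dropped."""
    leveled, level = [], 1
    for item in items:
        name = item.split("<")[0]
        if name == "LRO":
            level = 2
        elif name == "PDF":
            level = 1
        else:
            leveled.append((item, level))
    leveled.reverse()                      # reverse everything at level >= 1
    out, i = [], 0
    while i < len(leveled):                # then reverse each level-2 run
        j = i
        while j < len(leveled) and leveled[j][1] == leveled[i][1]:
            j += 1
        run = [it for it, _ in leveled[i:j]]
        out.extend(run[::-1] if leveled[i][1] == 2 else run)
        i = j
    return out


text = ["BEH", "LRO", "ALEF", "REH", "PDF"]
bidi_then_join = shape_visual(reorder(text))
join_then_bidi = reorder(shape_logical(text))
print(bidi_then_join)
print(join_then_bidi)
```

Under these assumptions the two printed lists differ in the forms chosen for
ALEF and REH, which is exactly the discrepancy described above.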

My suggestion is to make the five controls RLE, LRE, RLO, LRO, and PDF
non-joining rather than transparent, and also to specifically ask
applications to do the joining after bidi, which will solve the problem.
When someone uses the explicit marks, he wants to render the text
at different levels, so he does not expect the text across them to be joined.
BTW, Arabic-aware applications should take the Retaining Format
Codes part of UAX #9 into account now.

2. "Transparency of Canonical Decomposition"

The standard claims transparency with respect to canonical decomposition:
text should behave the same if it is decomposed. But this is
not true for the shaping of U+06C0, ARABIC LETTER HEH WITH YEH ABOVE. It
decomposes to U+06D5 U+0654, which is ARABIC LETTER AE + ARABIC HAMZA
ABOVE, while HEH WITH YEH ABOVE is in the right-joining class and AE is in
the non-joining class. This creates problems, for example, with normal
Persian texts using HEH WITH YEH ABOVE. If one has the very common

   <KHAH> <ALEF> <NOON> <HEH WITH YEH ABOVE>

(I'll follow the logical order, bidi is of no importance here), and then
shapes that, he will get

   <KHAH-initial> <ALEF-final> <NOON-initial> <HEH WITH YEH ABOVE-final>

but if he decomposes that and then applies the shaping, he will get

   <KHAH> <ALEF> <NOON> <AE> <HAMZA ABOVE>

and then

   <KHAH-initial> <ALEF-final> <NOON-isolated> <AE-isolated> <HAMZA ABOVE>

The last two are visually equal to <HEH WITH YEH ABOVE-isolated>. Compare
this with the composed case: NOON has gone from initial to isolated, and the
last letter from final to isolated. This is unbearable.
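
The mismatch can be reproduced with a short Python sketch. The decomposition
comes from the standard `unicodedata` module; the joining classes are
hard-coded exactly as stated in this message for TUS 3.0 (an assumption of
the example, not read from any data file), and `shape` is a simplified
stand-in for a real shaping engine.

```python
import unicodedata

# Joining classes as stated in this message: D = dual-joining,
# R = right-joining, U = non-joining, T = transparent.
JOINING = {
    "\u062E": "D",  # KHAH
    "\u0627": "R",  # ALEF
    "\u0646": "D",  # NOON
    "\u06C0": "R",  # HEH WITH YEH ABOVE
    "\u06D5": "U",  # AE
    "\u0654": "T",  # HAMZA ABOVE (combining mark)
}

FORMS = {(False, False): "isolated", (False, True): "initial",
         (True, False): "final", (True, True): "medial"}


def shape(text):
    """Return one form per character in logical order (None for marks)."""
    letters = [i for i, ch in enumerate(text) if JOINING[ch] != "T"]
    forms = [None] * len(text)
    for pos, i in enumerate(letters):
        me = JOINING[text[i]]
        prev = JOINING[text[letters[pos - 1]]] if pos > 0 else None
        nxt = JOINING[text[letters[pos + 1]]] if pos + 1 < len(letters) else None
        joins_prev = me in ("D", "R") and prev == "D"
        joins_next = me == "D" and nxt in ("D", "R")
        forms[i] = FORMS[joins_prev, joins_next]
    return forms


composed = "\u062E\u0627\u0646\u06C0"          # KHAH ALEF NOON HEH-WITH-YEH-ABOVE
decomposed = unicodedata.normalize("NFD", composed)

print(shape(composed))    # forms for the precomposed spelling
print(shape(decomposed))  # forms after canonical decomposition
```

With these classes, NOON and the last base letter receive different forms in
the two spellings, even though the strings are canonically equivalent.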

My suggestion would be decomposing U+06C0 to

        U+0647 U+0654 U+200C
        <ARABIC LETTER HEH> <ARABIC HAMZA ABOVE> <ZERO WIDTH NON-JOINER>

which seems to be the only solution for this. I stress again that this
case appears very frequently in Persian, where HEH WITH YEH ABOVE is
very common.
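
As a rough check of the suggested decomposition, the Python sketch below
shapes the precomposed spelling and the suggested three-character sequence,
each followed by a hypothetical BEH to see what happens on the following
side. The joining classes are hard-coded assumptions (HEH dual-joining, ZWNJ
non-joining, HAMZA ABOVE transparent), and `shape` is a simplified model,
not a real shaping engine.

```python
# D = dual-joining, R = right-joining, U = non-joining, T = transparent.
JOINING = {
    "KHAH": "D", "ALEF": "R", "NOON": "D", "BEH": "D",
    "HEH WITH YEH ABOVE": "R",
    "HEH": "D", "HAMZA ABOVE": "T", "ZWNJ": "U",
}

FORMS = {(False, False): "isolated", (False, True): "initial",
         (True, False): "final", (True, True): "medial"}


def shape(names):
    """Label each joining letter with its form; `names` is in logical order."""
    visible = [i for i, n in enumerate(names) if JOINING[n] != "T"]
    out = list(names)
    for pos, i in enumerate(visible):
        me = JOINING[names[i]]
        if me == "U":          # ZWNJ has no glyph form; leave it unlabeled
            continue
        prev = JOINING[names[visible[pos - 1]]] if pos > 0 else None
        nxt = JOINING[names[visible[pos + 1]]] if pos + 1 < len(visible) else None
        joins_prev = me in ("D", "R") and prev == "D"
        joins_next = me == "D" and nxt in ("D", "R")
        out[i] = f"{names[i]}-{FORMS[joins_prev, joins_next]}"
    return out


# BEH is a hypothetical following letter, added only to test the far side.
composed = ["KHAH", "ALEF", "NOON", "HEH WITH YEH ABOVE", "BEH"]
suggested = ["KHAH", "ALEF", "NOON", "HEH", "HAMZA ABOVE", "ZWNJ", "BEH"]
without_zwnj = ["KHAH", "ALEF", "NOON", "HEH", "HAMZA ABOVE", "BEH"]

print(shape(composed))
print(shape(suggested))
print(shape(without_zwnj))
```

In this model the suggested sequence keeps NOON initial and HEH final, as in
the composed case, while dropping the ZWNJ lets HEH join the following
letter, which is why the ZWNJ is part of the suggestion.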

--roozbeh

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:15 EDT