RE: Why Arabic shaping?

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Aug 13 2001 - 06:01:19 EDT


Peter Constable wrote:
>Philipp Reichmuth wrote:
>>David Starner wrote:
>>>ensuring that what he
>>>typed would be saved in exact-mode (what-you-see-is-what-you-store --
>>>WYSIWYS :-)
>>
>>Except that this is not what Unicode is about: Unicode is about
>>what-you-store-is-what-you-mean.
>
>I agree (to this and pretty much everything else in Philipp's response).
>If you want what-you-store-is-what-you-see, use PDF.

I too agree with Philipp, but I must note that he mostly explained why it is
not wise to encode Arabic *ligatures*.

But I think that David's question was more about encoding the contextual
forms of *single* Arabic letters. After all, it is easy to see Arabic
contextual forms as something very similar to European case variants.

So a devil's advocate may ask: if the Arabic shaping forms of Kaaf have been
unified in the same code point, then why haven't Latin uppercase and
lowercase K been unified as well? And, conversely, if Latin case variants
have been assigned to different code points, why not Arabic shape variants?

The pros and cons of the two problems are relatively similar: disunifying
Latin case variants makes search and sort slightly more complicated;
unifying them makes search and sort simpler but complicates the display
process, and requires the introduction of "zero width uppercase" and "zero
width lowercase" controls.

Similarly: disunifying Arabic shape variants makes search and sort slightly
more complicated; unifying them makes search and sort simpler but
complicates the display process, and requires the introduction of "zero
width joiner" and "zero width non joiner" controls.
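
To make the search side of this trade-off concrete, here is a minimal
Python sketch. It assumes only the standard unicodedata module; the names
latin_match, fold_shapes and KAF_FORMS are made up for illustration. The
per-shape Kaf code points used below are the compatibility characters from
the Arabic Presentation Forms-B block, which stand in here for what a
"disunified" Arabic encoding would look like.

```python
import unicodedata

# Latin: upper and lower case are separate code points, so a
# case-insensitive comparison has to fold case first.
def latin_match(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

# Arabic: Kaf is a single code point whatever shape it is displayed in.
KAF = "\u0643"

# Per-shape code points for Kaf (Arabic Presentation Forms-B). They exist
# for compatibility with older encodings and illustrate the disunified case.
KAF_FORMS = {
    "isolated": "\uFED9",
    "final": "\uFEDA",
    "initial": "\uFEDB",
    "medial": "\uFEDC",
}

# If text were stored shape by shape, a search would need a folding step
# similar to case folding; NFKC normalization plays that role here.
def fold_shapes(text: str) -> str:
    return unicodedata.normalize("NFKC", text)
```

With unified storage a plain comparison against KAF is enough; with
shape-by-shape storage every comparison would first go through something
like fold_shapes, just as Latin comparisons go through case folding.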

Of course, I think I know the short answer: both the Latin and Arabic parts
of Unicode descend from ISO-8859, a pre-existing standard, which encoded the
two scripts this way.

However, this may be an unsatisfying answer, especially outside
standardization circles, so someone may come up with more philosophical
answers.

Now let me go back to acting as an angel's advocate and try to give two
possible justifications for Unicode:

1) While the difference between upper and lower case is very clear, how to
count Arabic shape variants is not as clear. Traditionally, "dual linking"
letters are considered to have four shapes (initial, medial, final,
isolated), while "right linking" letters have two (final, isolated).
However, there is another way of counting which ignores the tiny differences
(on the right side of the letter) that differentiate initial from medial and
final from isolated forms. With this method, most "dual linking" letters
have two shapes (non-final, final) and "right linking" letters have a single
shape. Which one of these systems should be the basis of a hypothetical
shape encoding?

2) In the majority of cases, the choice of Arabic shapes is determined by
simple language-independent rules based only on the two neighboring
characters. The rules are simple enough to be incorporated in a software
component that handles rendering. The exceptions to these rules are rare
enough that it makes sense to handle them with an escape mechanism (the ZWJ
and ZWNJ controls). On the other hand, choosing whether to use a capital or
a small letter derives from complicated grammatical rules. These rules
change considerably from language to language, and are also influenced by
stylistic choices. In practice, the only capitalization rule that can be
automated is the capital letter at the beginning of a sentence. However, it
is not so easy to automatically determine the beginning of sentences!
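
To show how small the neighbor-based rule in point 2 can be, here is a
rough Python sketch. It is only an illustration under simplifying
assumptions: the DUAL and RIGHT tables are hand-written for a handful of
letters (a real renderer would take joining types from the Unicode
character data), combining marks are ignored, and the function names are
made up. It uses the traditional four-shape count from point 1.

```python
# Hand-written joining classes for a few letters (illustration only).
DUAL = set("\u0628\u0643\u0644\u0645")   # beh, kaf, lam, meem: link on both sides
RIGHT = set("\u0627\u062F\u0631\u0648")  # alef, dal, reh, waw: link only to the preceding letter
ZWJ, ZWNJ = "\u200D", "\u200C"           # the two escape controls

def joins_to_next(ch: str) -> bool:
    """Can ch connect to the character that follows it?"""
    return ch in DUAL or ch == ZWJ

def joins_to_prev(ch: str) -> bool:
    """Can ch connect to the character that precedes it?"""
    return ch in DUAL or ch in RIGHT or ch == ZWJ

def shapes(text: str) -> list:
    """Return the contextual form of each letter, looking only at its
    two immediate neighbors. Controls and unknown characters get no
    entry of their own; ZWNJ is in neither table, so it simply breaks
    the connection between the letters around it."""
    result = []
    for i, ch in enumerate(text):
        if ch not in DUAL and ch not in RIGHT:
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        link_prev = joins_to_next(prev)
        link_next = ch in DUAL and joins_to_prev(nxt)
        if link_prev and link_next:
            result.append("medial")
        elif link_prev:
            result.append("final")
        elif link_next:
            result.append("initial")
        else:
            result.append("isolated")
    return result
```

For example, shapes("\u0643\u0644\u0628") (kaf, lam, beh) gives initial,
medial, final, while inserting ZWNJ before the beh turns the lam into a
final form and the beh into an isolated one. Nothing comparable could be
written for capitalization without knowing the language and the sentence
structure.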

_ Marco


