Re: Why Arabic shaping?

From: Philipp Reichmuth (uzsv2k@uni-bonn.de)
Date: Sat Aug 11 2001 - 09:41:51 EDT


Hi David,

Sorry for the long reply; it's a bit huge, but I hope it helps :-) and
BTW by "you" I'm mainly addressing your contact person :-)

>> David Starner wrote:
>>> Arabic Presentation Form A and B shouldn't be used in files; use characters
>>> in the 0600-06FF block and the application should take the responsibility
>>> for using glyphs from Presentation Forms A & B if necessary.

>>Well, it _always_ will be necessary and that's my point (its not even almost
>>always, its "always" :-). 0600-06FF presents a flavor of the entire Arabic
>>alphabet (each letter is represented in _a_ particular form - initial, medial,
>>final and isolated), it also includes all the Arabic numbers and punctuation,
>>but 0600-06FF, by all means, is not complete since it doesn't include all the
>>various character permutations (forms).

Unicode is, however, not primarily concerned with what characters look
like. Unicode does not encode glyphs or visual appearances; it encodes
characters.

The Arabic script has a very high degree of variation in the appearance
of the individual letters. This also depends heavily on the style in
which the particular text is written; the Nasta'liq, Naskhi, Shekaste
and Hijazi script styles, for example, have wholly different ligature
sets which are by no means completely covered in Unicode. However, they
don't _need_ to be; if, say, Shekaste has a ligature for the letters
XYZ, it is entirely the rendering system's (i.e. mainly the font's) job
to display this in pretty Shekaste form. Unicode encodes the underlying
characters, that is, XYZ.

>>Why do it this way :-D ? Are there some hidden advantage that I'm not
>>thinking of (beside saving font space) ?

Yes:

- Searching and comparison are easier: if you want to search for or
compare the letters XYZ, you only have to look for the letters
themselves, possibly ignoring vowelization symbols. If you use
presentation forms for encoding, you have to search/compare X + YZ,
XY + Z, X + Y + Z and XYZ separately, which is a real pain, extremely
complicated to implement and prone to errors of all sorts.

- Vowelization of ligatures is impossible with presentation forms. On a
three-letter combination there can, in theory, be vowelization signs,
recitation signs and all sorts of diacritics on each of the three
letters. A Unicode Arabic presentation form for a ligature is a single
character, however, so it is technically impossible to place different
vowels on the different consonants inside it. If you use, say,
OpenType, the font does the vowel placement for you even in ligatures.

- If you want to use a script style that does not contain all of the
ligatures in the Unicode presentation forms (or contains more), you
have a problem. The ligatures in Unicode are based on Naskhi. If you
write a text in Naskhi and later want to reformat it in Shekaste, where
the ligatures are completely different, the program has to go to real
pains, replacing the characters everywhere _in the entire file_. This
process is much more complicated than having the system's built-in
rendering engine render a paragraph.

- What is described as "reverting to the visual re-mapping every time
this file is opened" is not something the programmer has to care
about. It is done by the operating system and by the font. The cost in
speed and memory is negligible on modern computers, and Unicode is not
supported on older platforms anyway.
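
To make the searching point concrete, here is a minimal Python sketch
(not part of the original discussion). It assumes Python's standard
unicodedata module and uses NFKC compatibility normalization as one way
of mapping presentation forms back to plain letters; the sample word and
all variable names are illustrative choices only.

```python
import unicodedata

# The word "salam" stored as plain letters from the 0600-06FF block:
# SEEN, LAM, ALEF, MEEM.
stored_as_letters = "\u0633\u0644\u0627\u0645"

# The same word stored as presentation forms: SEEN INITIAL FORM,
# LIGATURE LAM WITH ALEF FINAL FORM, MEEM ISOLATED FORM.
stored_as_forms = "\uFEB3\uFEFC\uFEE1"

# The user searches for LAM followed by ALEF.
query = "\u0644\u0627"

# A plain substring search works on the letter-by-letter text ...
found_in_letters = query in stored_as_letters
# ... but not on the presentation forms: the ligature hides both letters.
found_in_forms = query in stored_as_forms

# NFKC maps presentation forms back to the underlying letters, so the
# search only succeeds after this extra conversion step.
recovered = unicodedata.normalize("NFKC", stored_as_forms)
found_after_normalizing = query in recovered

print(found_in_letters, found_in_forms, found_after_normalizing)
```

In other words, a program that stores presentation forms has to undo
them again before every search or comparison, which is exactly the work
that storing plain letters avoids.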

>>(currently all that visual conversion would be lost, right ?)

No. The presentation forms are not needed at all for _storage_; they
are needed only for _display_. The next time you open the file, the
rendering engine generates them again.
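
As a rough illustration of "stored as letters, shaped for display",
here is a toy Python sketch. It is only an assumption-laden stand-in for
a real rendering engine: it guesses joining behaviour from the Unicode
character names in the standard unicodedata module, ignores ligatures,
and does not treat vowel marks as transparent. The function names are
made up for this example.

```python
import unicodedata

def presentation_form(ch, position):
    """Return the named presentation form of ch, or None if there is none."""
    try:
        return unicodedata.lookup(f"{unicodedata.name(ch)} {position} FORM")
    except (KeyError, ValueError):
        return None

def joins_forward(ch):
    # Heuristic: letters that join to the following letter have an INITIAL form.
    return presentation_form(ch, "INITIAL") is not None

def joins_backward(ch):
    # Heuristic: letters that join to the preceding letter have a FINAL form.
    return presentation_form(ch, "FINAL") is not None

def toy_shape(text):
    """Pick a contextual form for each letter of a logically ordered string."""
    shaped = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        to_prev = bool(prev_ch) and joins_forward(prev_ch) and joins_backward(ch)
        to_next = bool(next_ch) and joins_forward(ch) and joins_backward(next_ch)
        if to_prev and to_next:
            position = "MEDIAL"
        elif to_prev:
            position = "FINAL"
        elif to_next:
            position = "INITIAL"
        else:
            position = "ISOLATED"
        # Characters without a matching presentation form are kept as they are.
        shaped.append(presentation_form(ch, position) or ch)
    return "".join(shaped)

stored = "\u0633\u0644\u0627\u0645"   # SEEN, LAM, ALEF, MEEM as stored on disk
displayed = toy_shape(stored)         # contextual forms, computed at display time
print([unicodedata.name(ch) for ch in displayed])
```

The stored text never changes; the contextual forms are recomputed from
it whenever they are needed. A real engine would additionally choose the
LAM-ALEF ligature here and would pick whatever ligatures the selected
font offers.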

>>and is stored on disk
>>using only 0600-06FF encodings. Why not preserve all these conversions so
>>that if someone wanted to read my 15MB :-) file they wouldn't have to wait
>>for any more conversions to take place (its a waste of time and processor
>>throughput) ?

No. Conversion is probably done on a per-paragraph basis by the
rendering engine, which in practice does not take that long. If you
want a comparison, try opening a Word document with Arabic text once in
the Simplified Arabic font and once in DecoType Naskh: you probably
won't notice any difference in speed at all, despite all the nice
output DecoType Naskh produces.

The document stored as plain letters, in fact, will probably not even
be much larger than one stored as ligatures. For example, the maximum
ligature length is three letters (not counting, for example, the ALLAH
ligature). It's improbable, however, that your document shrinks by a
factor of three that way, since not every character is in a ligature,
and a word of, say, four letters still has to consist of at least two
ligatures. So let's agree that you can shrink your data by a factor of
two at best if the text does not contain vowels. Now, all the Arabic
ligatures are in the FXXX range (the presentation forms occupy
U+FB50-U+FDFF and U+FE70-U+FEFF), which means that they take three
bytes each in UTF-8, which is what most applications use; the Arabic
characters in 0600-06FF, by contrast, are encoded in two bytes each. So
the size of your document is about the same, and you lose sorting,
searching, comparison, and freedom of font choice. Not really an
advantage, I'd say.
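
The byte counts can be checked with a few lines of Python. This is just
a sketch of the arithmetic, with sample strings chosen for illustration
(the helper name utf8_len is made up here):

```python
def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

lam_alef_letters = "\u0644\u0627"           # LAM, ALEF from the 0600-06FF block
lam_alef_ligature = "\uFEFB"                # LIGATURE LAM WITH ALEF ISOLATED FORM
allah_letters = "\u0627\u0644\u0644\u0647"  # ALEF, LAM, LAM, HEH
allah_ligature = "\uFDF2"                   # LIGATURE ALLAH ISOLATED FORM

sizes = {
    "LAM + ALEF as letters": utf8_len(lam_alef_letters),
    "LAM-ALEF ligature": utf8_len(lam_alef_ligature),
    "ALLAH as letters": utf8_len(allah_letters),
    "ALLAH ligature": utf8_len(allah_ligature),
}
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

A two-letter ligature therefore saves a single byte (three instead of
four), and every letter that has to be stored as a single-letter
presentation form costs one byte more than the plain letter would.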

>>You see what I'm saying ? With that in mind, I was thinking
>>that Form-B is an integral part of any unicode "Arabic" font since it needs
>>to be known (and used) by everyone (well, the converter has to have these
>>glyph from somewhere, right ?).

Yes. However, a font in a fairly modern format like OpenType has more
glyphs than it supports characters anyway. And Form B does not need to
be used by everyone; that would be just like forcing everyone to write
their texts in Naskhi. Take, for example, NOON + KHAH + MEEM in Naskhi
and Nasta'liq: pretty different. :-)

>>It just seems odd to go this way - its certainly cleaner to include all the
>>characters and their various permutations and give the user the ability to
>>decide what he wants to type and how he wants it to look;

That's done in OpenType anyway: you can choose whether you want to use
a ligature. However, if you only support Unicode presentation forms,
the user gets no options beyond those forms at all, which is quite a
limitation in the processing of, say, Persian poetry.

>>ensuring that what he
>>typed would be saved in exact-mode (what-you-see-is-what-you-store -- WYSIWYS
>>:-)

Except that this is not what Unicode is about: Unicode is about
what-you-store-is-what-you-mean. What you see is the font's business.
But I'm repeating myself :-)

>>Granted that the application would still have to do this conversion (or
>>shaping), but its only done once -- upon creation. Moreover, this conversion
>>library would be universal given universal fonts and encodings (no optional
>>anything).

This "optional anything" is what the freedom of choosing a style for
your document is about. "Universal" Arabic fonts are impossible: the
variation of the script is so vast that it is simply impossible to
include all the varieties, ligatures, letter presentation forms and so
on in a single font. In fact, the optional part is quite necessary; in
Unicode, you have, for example, a ligature for ALLAH, but none for
LI-LLAH, and since you want the name of God to look the same in both,
you'll either have to write God as ALEF+LAM+LAM+HEH (which does not
look nice if you don't have optional ligatures in your font) or use an
extra ligature for LI-LLAH; i.e. if you want pretty output, you need
optional, non-Unicode ligatures one way or the other.

Say you have the Qur'an in computer-readable form and you want to
build a computer-generated index on it, like "all words derived from
the root JEEM-YEH-HAMZA". If you store presentation forms, you can
simply forget it, because the extraction engine will have to know which
ligatures contain which letters in which order. If you just store
letter by letter, you can just look for the letters. Pretty display is
done by the font and by the rendering engine; the user does not have to
care about it (though he can still control the output if he uses, for
example, OpenType): he gets pretty display, and the software works much
more easily.
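
A very naive sketch of such a lookup in Python, assuming the text is
stored letter by letter: strip the vowel marks, then check whether the
root letters occur in order. The root KAF-TEH-BEH and the sample words
are my own illustrative choices, picked because the root letters stay
visible in the derived words; real root extraction needs proper
morphological analysis, especially for roots with weak letters such as
the one mentioned above.

```python
import unicodedata

def strip_marks(text):
    """Drop nonspacing marks (category Mn), which covers the Arabic vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def contains_root(word, root):
    """Naive test: do the root letters occur, in order, in the unvowelled word?"""
    letters = iter(strip_marks(word))
    # 'radical in letters' advances the iterator, so the order is respected.
    return all(radical in letters for radical in root)

root_ktb = "\u0643\u062A\u0628"                               # KAF, TEH, BEH
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"               # vowelled with FATHA
maktub = "\u0645\u064E\u0643\u0652\u062A\u064F\u0648\u0628"   # with FATHA, SUKUN, DAMMA
salam = "\u0633\u0644\u0627\u0645"                            # unrelated word

words = [kataba, maktub, salam]
matches = [w for w in words if contains_root(w, root_ktb)]
print(len(matches), "of", len(words), "words contain the root letters")
```

With presentation forms in the file, both helper functions would first
need a table of which ligature contains which letters in which order.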

Hope that helps :-)
 Philipp mailto:uzsv2k@uni-bonn.de
__________________________
The code was willing / It considered your request / But the chips were weak


