Arabic - Alef Maqsurah

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Jul 14 1999 - 09:09:45 EDT


I've been working on a thorough analysis of the encoding of Arabic text,
both semantics and presentation, but events keep conspiring to prevent me
from finishing the blasted thing, so rather than post a nicely typeset and
closely argued paper and send out the URL, I'm going to post messages on a
few topics and see if anybody else is interested. What I'd like to end up
with is a freely available, very thorough reference for encoding and
especially for typesetting Arabic. So this note is the first of a series
directly concerned with Unicode's handling of Arabic. My apologies if this
ground has already been covered; I don't think it has, but I haven't had
the time to monitor the list closely, so I may have missed something.

The perspective I start with is this: if the computer had been invented in
an Arabic-speaking culture, what would character encoding look like? What
would have been expected as ordinary?

The larger theme, aside from specifically Arabic stuff, is that a
superficial, grapheme-level encoding is not sufficient, and that more
explicit metalinguistic encoding is needed. In other words, an encoding
that wants to work across a multitude of languages must support the encoding
of grammatical information, and abstract character semantics is insufficient
for this purpose. Put another way, an encoding should capture the knowledge
that the reader actively supplies but that is absent from the
presentational form. My reading of Unicode in its current form is that it
does this to some extent but needs to do more. Arabic provides some good
examples of where this is needed.

Some notes on alef maqsurah:

Unicode codepoint U+0649 is defined as "alef maqsurah", and the
representative glyph shows a dotless ya. This is not so much incorrect as
misleading. The term "alef maqsurah" denotes a phonological, not a
graphemic, phenomenon. It may be represented by either a dotless ya (ram_A,
he threw) or by a plain old alef (ghazA, he invaded). In both cases, the
naming of the "character" represents a phonological analysis: it's an alef
pronounced short when followed by a consonant.
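
To make the two spellings concrete, here is a minimal sketch using Python's
standard unicodedata module. The variable names and the helper
final_letter_name are made up for illustration; the code points for the two
verbs are my own spelling-out of the examples above.

```python
import unicodedata

# Code points named in the note, plus dotted ya (U+064A) for comparison.
for cp in (0x0649, 0x0627, 0x064A):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")

# The two example verbs, spelled out code point by code point.
rama = "\u0631\u0645\u0649"   # ra, mim, U+0649 ("he threw")
ghaza = "\u063A\u0632\u0627"  # ghain, zay, U+0627 ("he invaded")


def final_letter_name(word: str) -> str:
    """Return the Unicode character name of the last code point in word."""
    return unicodedata.name(word[-1])


# Same phonological phenomenon, two different encoded characters.
print(final_letter_name(rama))
print(final_letter_name(ghaza))
```

The point of the comparison is that nothing in the encoded text ties the
final letters of the two words together, even though a grammarian would
call both of them alef maqsurah.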

This is a case where Unicode could be improved by a sharper distinction
between abstract character semantics and (abstract) presentational
semantics. If the character semantics of U+0649 are to be "alef maqsurah",
then the text should make clear that the presentation of the character may
use either of two forms, "dotless ya" or "alef" (alef=U+0627). In this
case, we need another codepoint for dotless ya. The alternative would be to
change the semantics of U+0649 to "dotless ya" (a purely presentational
semantics) and add a codepoint for alef maqsurah. I would prefer the
former, myself.
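
A rough sketch of what the first option might look like as a data model,
assuming nothing about how Unicode would actually specify it. The names
Form, AbstractChar, GLYPH and present are hypothetical, and because no
separate dotless ya codepoint exists, the sketch borrows U+0649 to stand in
for the dotless ya shape.

```python
from dataclasses import dataclass
from enum import Enum


class Form(Enum):
    """Presentational forms (hypothetical names)."""
    DOTLESS_YA = "dotless ya"
    ALEF = "alef"


@dataclass(frozen=True)
class AbstractChar:
    """An abstract character plus the forms it may be presented with."""
    name: str
    forms: tuple


# Character semantics: alef maqsurah may appear in either form.
ALEF_MAQSURA = AbstractChar("alef maqsurah", (Form.DOTLESS_YA, Form.ALEF))
# Plain alef has only the one form.
ALEF = AbstractChar("alef", (Form.ALEF,))

# Assumption: U+0649 stands in for the dotless ya shape in this sketch.
GLYPH = {Form.DOTLESS_YA: "\u0649", Form.ALEF: "\u0627"}


def present(char: AbstractChar, form: Form) -> str:
    """Return the code point used to display char in the requested form."""
    if form not in char.forms:
        raise ValueError(f"{char.name} has no {form.value} form")
    return GLYPH[form]


print(present(ALEF_MAQSURA, Form.DOTLESS_YA) == "\u0649")
print(present(ALEF_MAQSURA, Form.ALEF) == "\u0627")
```

The character keeps a single identity while the choice of form is recorded
separately, which is the distinction argued for above.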

In either case, both dotless ya and alef maqsurah are needed. Dotless ya
would serve an additional purpose, since it is commonly used in Egypt to
denote the abstract character value "ya"; this is another place where a
sharper distinction between abstract character and presentation would be
useful. It should be possible (IMHO) for an Egyptian writer to choose the
dotless ya form to represent the dotted ya semantics. This is really no
different from supporting upper and lower case, except that dotless ya maps
to more than one abstract character.
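
The one-to-many mapping can be sketched as follows. This is only an
illustration under stated assumptions: egyptian_style and possible_readings
are hypothetical helpers, U+0649 again stands in for the dotless ya shape,
and "word-final" is approximated as "not followed by another Arabic letter",
which ignores vowel marks.

```python
import re

YA = "\u064A"          # dotted ya
DOTLESS_YA = "\u0649"  # the shape Unicode calls ALEF MAKSURA

# A ya that is not followed by another Arabic letter (U+0621 to U+064A).
_FINAL_YA = re.compile(YA + r"(?![\u0621-\u064A])")


def egyptian_style(text: str) -> str:
    """Present word-final ya without dots, as is common in Egypt."""
    return _FINAL_YA.sub(DOTLESS_YA, text)


def possible_readings(ch: str) -> tuple:
    """Abstract characters that a presented letter shape could stand for."""
    if ch == DOTLESS_YA:
        return ("ya", "alef maqsurah")
    if ch == YA:
        return ("ya",)
    return ()


fi = "\u0641\u064A"          # fa, ya ("in")
yawm = "\u064A\u0648\u0645"  # ya, waw, mim ("day"); the ya is not final

print(egyptian_style(fi) == "\u0641\u0649")  # final ya loses its dots
print(egyptian_style(yawm) == yawm)          # initial ya is untouched
print(possible_readings(DOTLESS_YA))         # two candidate characters
```

Going from abstract character to dotless form is mechanical; going back is
not, because the encoded text no longer says which abstract character the
writer meant.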

A minor point: the name should read "maqsura", not "maksura" as in the
Unicode character name, since the word is spelled with U+0642 "qaf", not
U+0643 "kaf".
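
For anyone who wants to check the spelling, a short sketch with Python's
standard unicodedata module. The variable maqsura is my own code point
spelling of the Arabic word and is an assumption, not something taken from
the standard.

```python
import unicodedata

# The Arabic word itself: mim, qaf, sad, waw, ra, ta marbuta.
maqsura = "\u0645\u0642\u0635\u0648\u0631\u0629"

# The second letter is the one the transliteration disagrees about.
second = maqsura[1]
print(f"U+{ord(second):04X}", unicodedata.name(second))

# For comparison: the letter a "k" transliteration would suggest,
# and the character name as Unicode currently spells it.
print(unicodedata.name("\u0643"))
print(unicodedata.name("\u0649"))
```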

The standard reference (in the US) for this sort of stuff is Wright's
Grammar; he goes into the handling of these issues in excruciating detail.
It's over 100 years old, so I assume the copyright has lapsed, and I've
begun entering it into XML form. If anybody wants to see the material
concerning alef maqsurah, etc., let me know and I'll forward it or post it.

Sincerely,

Gregg Reynolds


