RE: Arabic - Alef Maqsurah

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Jul 14 1999 - 20:16:45 EDT


Hi Ken,

> -----Original Message-----
> From: kenw@sybase.com [mailto:kenw@sybase.com]
> Sent: Wednesday, July 14, 1999 5:50 PM
> To: greynolds@datalogics.com
> Cc: unicode@unicode.org; kenw@sybase.com
> Subject: RE: Arabic - Alef Maqsurah
>
>
>
> But engaging in an exercise of how to spell Arabic if you could invent
> Unicode from the ground up might turn out to just be confusing. You should

I hope not; the idea is to clarify. This doesn't exclude complexity; Arabic
writing practice is complex.

> keep this discussion in the context of existing decisions that have been
> made about all this in Arabic implementations that predate the Unicode
> Standard.
...
> itself. So how Arabic is spelled in computer implementations is the
> result of a long history of practice. Unicode didn't just invent that out
> of thin air.
>

No doubt. But personally I've never accepted "that's the way it's always
been done" as sufficient reason to accept the status quo. On the contrary, I
think we're obligated to point out what we perceive to be problems. Who
knows, maybe the powers that be in Unicode will get something useful out of
this discussion. My personal project is to model the working of Arabic
texts, so my loyalties are to the language, not to legacy software. Of
course I understand people with money invested in current solutions also
have an interest in the discussion. But a clear (mathematical) model would
be in everybody's interest. (OTOH, if people on the list find this
excruciating and obnoxious just tell me and I'll go mutter in a corner.)

> > In particular, I would argue that it is a mistake to associate the
> > structure of text with keyboard input, as the Unicode book does.
>
> It does not.
>
> The text on "Logical Order" on p. 2-7, which might be taken as implying
> what you state, is presented in the context of bidirectional rendering,
> where it has long been understood to be in contrast to "Visual Order" --
> the practice of storing text in reversed order in the memory
> representation.

Perhaps; I understood it to mean what it says: "For all scripts Unicode
text is stored in *logical order* in the memory representation,
corresponding to the order in which text is typed on the keyboard". I can
live with this; but in my judgement, anyway, not only should the reference
to keyboarding be stricken, so should any reference to "memory
representation", "backing store", etc. None of it is necessary to a
logical description of text. It's not a matter of huge substance; I just
think it would make the standard easier to read and more
implementation-neutral.
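
A minimal Python sketch of the distinction, offered as an illustration
rather than anything taken from the standard: logical order can be stated
purely as reading order, with no appeal to keyboards or memory layout. The
helper name to_visual_rtl is hypothetical, and the reversal is a toy that
ignores the full bidirectional algorithm, combining marks, and mixed
left-to-right runs.

```python
import unicodedata

# "kitab" (book) in logical order, i.e. the order in which the letters are
# read: kaf, teh, alef, beh.
logical = "\u0643\u062A\u0627\u0628"


def to_visual_rtl(run: str) -> str:
    """Toy conversion of a purely right-to-left run to left-to-right
    display order. Assumption: no digits, Latin text, or combining marks."""
    return run[::-1]


# Character names in logical order; the first one is the first letter read.
names = [unicodedata.name(ch) for ch in logical]

# Every letter here carries the Arabic-letter bidirectional class.
bidi_classes = {unicodedata.bidirectional(ch) for ch in logical}

# What a "visual order" scheme would store instead.
visual = to_visual_rtl(logical)

for name in names:
    print(name)
```

Nothing in the definition of `logical` depends on how the text was entered
or where it is stored, which is the sense in which the description could be
made implementation-neutral.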

>
> I agree that the Unicode Standard could make a more prominent statement
> that input methods are distinct from text representation -- but up until
> now, most people in the field have just assumed that. Maybe we are
> missing stating the obvious more clearly.

Indeed I think some explicit discussion of the relation between input
method, text syntax, and output representation would improve the standard.
Lots of people are looking at this Unicode thing, and in my experience,
people outside of the technical field and even many within it completely
misunderstand it. Much of the confusion (IMHO) is due simply to loose
terminology.

>
> > But literacy in Arabic is rather different than literacy in, say,
> > English (to put it mildly). It requires a much greater degree of
> > theoretical grammatical knowledge. So for a computer to behave
> > intelligently with respect to Arabic texts, the mere recording of
> > visual shapes is insufficient.
>
> But I don't really *see* your point here. For a computer to behave
> intelligently with respect to text in *any* language, the mere
> recording of visual shapes is insufficient. Are we dealing with
> some Arabic essentialism here? Why is this a particular problem for
> the Arabic script that wouldn't equally as well turn up in the Latin
> script or any other?
>

I think it probably does turn up for many languages - remember my concern is
with encoding texts in the language, not the script. It's not a question of
essentialism (whatever that is) but peculiarism. (In two words: clitics
and non-concatenative morphology.) It goes back to the question of what a
reasonably literate reader should be entitled to expect out of digital text. For
example, in Arabic (or any Semitic language for that matter) this means,
among other things, intelligence with respect to word structure. One could
argue that it is unreasonable for an English reader to expect to be able to
search for all words related to "sing", for example, and find "sung",
"song", etc. I mean the encoding of such information in the data could
reasonably be considered beyond the scope of the encoding definition, so the
capability would be a matter of specialized software. But the sequential
nature of text encoding matches up well with the structure of some languages
(e.g. English and presumably most Indo-European languages; also Japanese);
you can do a lot just by manipulating sequential strings of "characters",
and you don't need much in the way of metalinguistic codes that are not
already present as inking characters (e.g. punctuation).

But it doesn't work that way in Arabic and many other languages. It's not
only perfectly natural to think in terms of word roots abstracted away from
particular word forms, it would be positively un-Arabic to think about the
language in any other way. The argument I will make (eventually; it's
quitting time just now) is that such structural information is rightfully
part of the standard encoding; the intelligence should be moved from
specialized logic in software and embedded in the text. (I hope you'll
forgive me if this has all been done to death previously.) But of course
this is the kind of argument that must be backed up by numerous detailed
examples, and right now I've got to run.
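
As a placeholder for those examples, here is a small Python sketch of why
sequential string matching falls short for root-based word structure. The
word list, the names substring_hits, has_root_in_order and root_hits, and
the in-order consonant check are all assumptions made for illustration; the
check is a crude stand-in for real morphological analysis (it will also
match unrelated words that happen to contain the same letters in order),
and it is not the encoding proposal itself.

```python
# Toy data: five unvocalized words built on the root k-t-b, plus one
# unrelated word. Keys are rough transliterations used only as labels.
ROOT_KTB = "\u0643\u062A\u0628"  # kaf, teh, beh
WORDS = {
    "kataba": "\u0643\u062A\u0628",              # he wrote
    "kitab": "\u0643\u062A\u0627\u0628",         # book
    "katib": "\u0643\u0627\u062A\u0628",         # writer
    "maktab": "\u0645\u0643\u062A\u0628",        # office, desk
    "maktub": "\u0645\u0643\u062A\u0648\u0628",  # written
    "darasa": "\u062F\u0631\u0633",              # he studied (root d-r-s)
}


def substring_hits(root, words):
    """Plain sequential search: the root must appear as a contiguous run."""
    return sorted(label for label, word in words.items() if root in word)


def has_root_in_order(word, root):
    """True if the root letters occur in the word in order, with any
    letters allowed in between (a subsequence test)."""
    remaining = iter(word)
    return all(ch in remaining for ch in root)


def root_hits(root, words):
    """Search by root letters in order instead of by contiguous substring."""
    return sorted(
        label for label, word in words.items() if has_root_in_order(word, root)
    )


print(substring_hits(ROOT_KTB, WORDS))  # ['kataba', 'maktab']
print(root_hits(ROOT_KTB, WORDS))
# ['kataba', 'katib', 'kitab', 'maktab', 'maktub']
```

The contiguous search finds only the forms in which no pattern letter
happens to fall between the root consonants; the other three forms are
invisible to it even though any reader would group all five together.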

g'night,

gregg



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT