Missing Arabic and Syriac characters in Unicode

From: Miikka-Markus Alhonen (Miikka-Markus.Alhonen@tigatieto.com)
Date: Sun Sep 23 2001 - 17:27:45 EDT

Previous message: From Net Link: "Re: [OT] Roman numeral arithmetic (was: Re: [lojban] (from lojban-beginners) pi'e)"
Next in thread: Charlie Jolly: "[OT] What happened to the OpenType list?"
Reply: Charlie Jolly: "[OT] What happened to the OpenType list?"
Reply: Michael Everson: "Re: Missing Arabic and Syriac characters in Unicode"
Reply: Majid Bhurgri: "Re: Missing Arabic and Syriac characters in Unicode"

Hello!

I saw some beautiful Arabic calligraphy a while ago and began wondering about
all those dots and lines which appear frequently in works of art but which
aren't vowels or parts of the base letters themselves (such as a diacritic
resembling "v" and an Arabic comma used as a non-spacing diacritic above
letters). I asked my Arabic teacher and he said they're just ornaments which
are drawn according to certain rules and carry no linguistic meaning by
themselves. I guess this is the reason why they haven't been included in
Unicode. Still, there are many other characters already in Unicode whose
meaningfulness could be questioned, e.g. many characters in Miscellaneous
Symbols U+2600-U+26FF and Dingbats U+2700-U+27BF. And the number of these
Arabic ornamental symbols appears to be rather small; I doubt there are even
10 of them. Does any of you know whether any existing computer standards deal
with these, or whether they could otherwise be considered for inclusion in
Unicode later on?

Then some Arabic diacritics which are missing for sure...

Quote from a book titled "Colloquial Urdu: The Complete Course for Beginners"
by Tej K. Bhatia and Ashok Koul (Routledge 2000, ISBN 0-415-13540-0), page 195:

"vāo [= U+0648] is a non-connector and has no separate positional shapes. In
its initial form and after a vowel ā it represents only v/w sound. It may
represent three vowels, o, ū and au. But to distinguish vowels ū from au, it
may occur with the signs ulTā pesh ([an upside-down damma U+064F]) zabar (ـَ)
respectively"

This "ulTā pesh" (‮الٹا پیش‬) is considered separate from damma ("pesh - ‮پیش‬"
in Urdu), as they represent a long and a short "u", respectively. This can also
be seen on page 10, where a chart shows the two diacritics with different
semantics.

On a website teaching how to read the Arabic in the Qur'an
http://www.as-sidq.org/durusulQuran/articles/upright.htm
this diacritic is called "inverted Damma", and it's an alternative way to
represent the long vowel "ū" (usually represented with the letter waw U+0648,
not a diacritic). This parallels alif U+0627 sometimes being represented by a
superscript alif U+0670 (= "dagger alif, upright fathah") and ya U+064A by "an
upright kasra", i.e. a subscript alif. The latter is also missing from Unicode.

On page 25 of the User Manual for ArabTeX 3.09 (TeX package for typesetting
texts written in Arabic script, be it Arabic, Persian, Urdu, Sindhi etc.)
ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/doc/arabdoc.pdf
these diacritics are also mentioned as "defective notation of ā, ī, ū". So,
semantically they could perhaps be considered variants of the letters alif,
ya and waw, but as the former are non-spacing diacritics and the latter full
letters, this would be an extremely odd approach technically. And since one of
these, "defective ā", is already encoded (as it's used in modern texts, too),
why not encode the other two vowels, too?
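The diacritic-versus-letter distinction can be checked against the Unicode
character properties. What follows is only a minimal sketch using Python's
standard unicodedata module (my choice of tool, not something mentioned in the
sources above); the helper name describe is made up for illustration, and the
values printed reflect whichever Unicode version the Python build bundles.

```python
import unicodedata

# Code points named in the message: the three long-vowel letters and two
# vowel signs that are already encoded.
CODE_POINTS = {
    "alif": "\u0627",
    "waw": "\u0648",
    "ya": "\u064A",
    "superscript alif": "\u0670",
    "damma": "\u064F",
}


def describe(ch):
    """Return (general category, canonical combining class) for one character.

    Hypothetical helper for illustration only.
    """
    return unicodedata.category(ch), unicodedata.combining(ch)


if __name__ == "__main__":
    for label, ch in CODE_POINTS.items():
        category, ccc = describe(ch)
        print(f"U+{ord(ch):04X} {label}: category={category}, ccc={ccc}")
```

Letters report the category "Lo" with combining class 0, while the vowel signs
report "Mn" (non-spacing mark) with a non-zero combining class, which is the
technical gap that makes treating them as variants of each other awkward.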

Also on the website I referred to, there's something about the Koranic
annotation marks (not all but some):
http://www.as-sidq.org/durusulQuran/Lesson30/30-1.htm
It says (or rather, it doesn't actually say so, but this is what I concluded
the author meant) that when e.g. a word ends with a kasratan U+064D and the
next word begins with two consonants (i.e. in practice, a word beginning with
the definite article ‮ال‬ where the lam is assimilated to the next consonant),
the binding vowel between the /n/ of the ending and the next word is to be
written under a small letter nun written below the alif of the next word. Is
this small nun missing from Unicode, or is it to be considered a glyph variant
of U+06E8, ARABIC SMALL HIGH NOON? The former sits below the alif while the
latter sits above a letter (I'm not very familiar with these annotation marks
in any case; I don't even know whether the semantics differ).
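One way to see how U+06E8 is positioned according to the character properties
is to look at its canonical combining class. This is a rough sketch with
Python's standard unicodedata module; the function name mark_position is
hypothetical, and it only distinguishes the two generic classes 230 (above)
and 220 (below).

```python
import unicodedata


def mark_position(ch):
    """Classify a combining mark by its canonical combining class.

    Hypothetical helper: 230 means the mark attaches above its base,
    220 means below; everything else is reported as "other".
    """
    ccc = unicodedata.combining(ch)
    if ccc == 230:
        return "above"
    if ccc == 220:
        return "below"
    return "other"


if __name__ == "__main__":
    ch = "\u06E8"
    print(unicodedata.name(ch), "->", mark_position(ch))
```

Since U+06E8 is classed as an above mark, a nun drawn below the alif would not
simply fall out of that character's properties; whether it counts as a glyph
variant is still the open question.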

Then something about writing in the Syriac script... Yesterday I was looking for
a TeX package which could be used to write Syriac. I didn't find anything
non-commercial (is there anything?), but I did find a manual to a package called
Sabra, it being a part of a larger system called ScholarTeX. You can see it
yourself at
http://tex.loria.fr/fontes/syriac.ps.gz
There on pages 14-16 the input method for writing Garshuni (= Arabic written in
Syriac script) is described. Most of this causes no problems, as the letters
themselves are all encoded in Unicode and the diacritics are taken from the
Arabic block. There's, however, one thing which can't be encoded in Unicode:
Syriac alaph with an Arabic wasla. This is quite odd, because wasla is a
relatively common diacritic in Arabic, and it could well be used in Garshuni
documents, as well. At the moment the normal combination, Arabic alef + wasla,
can be written, as it's encoded as a separate code point at U+0671. But for
some strange reason, this can't be decomposed! There's no other occurrence of
wasla in Unicode, except for the two presentation forms of this alef wasla in
Arabic Presentation Forms-A (U+FB50, U+FB51). What should be done about the
problem? Encode a separate character "Syriac alaph wasla", or add a
non-spacing diacritic "Arabic wasla" and make decomposing U+0671 possible? A
further reason for the latter solution would be textbooks teaching the Arabic
script. There one might easily find a wasla written on its own, without a base
letter (actually I have a grammar of Urdu in Finnish where the author has done
just this, but due to the lack of a computer standard, he has drawn it by
hand!).
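The missing decomposition of U+0671 can be confirmed with a short check. This
is a sketch only, again assuming Python's standard unicodedata module; the
helper canonical_parts is an illustrative name, and U+0622 (alef with madda
above) is my own choice of comparison, not a character discussed above.

```python
import unicodedata


def canonical_parts(ch):
    """Return the code points of the canonical decomposition (NFD) of ch.

    Hypothetical helper: a character with no canonical decomposition
    comes back as a single-element list containing itself.
    """
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


if __name__ == "__main__":
    # Alef wasla: stays a single code point under NFD.
    print("U+0671 ->", canonical_parts("\u0671"))
    # Alef with madda above, for contrast: splits into base + mark.
    print("U+0622 ->", canonical_parts("\u0622"))
```

Alef with madda splits into alef plus a combining mark, whereas alef wasla
stays whole, so there is no combining wasla that could be placed on a Syriac
alaph.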

In the same manual for the TeX package Sabra, there's a mention of a vowel
system of Syriac which isn't considered at all in Unicode. In the document
the system is called "Jacob of Edessa vowels", but it's different from the one
mentioned in the Unicode 3.0 book. Its input method in this TeX package is
described on page 12, and there are samples written in it on pages 18-20. Even
though the "Greek vowel system" (e.g. U+0730, U+0731, U+0733 etc.) seems to have
been developed from it, it should be considered separate, I think, as the vowels
in it are not diacritics but full letters. Actually this problem is very similar
to the one mentioned above concerning old vs. new representations of the long
vowels in Arabic. Any comments?

Best regards!

----------------------------------
E-Mail: Miikka-Markus Alhonen <Miikka-Markus.Alhonen@tigatieto.com>
Date: 23-Sep-01
Time: 00:25:30

This message was sent by XFMail
----------------------------------

This archive was generated by hypermail 2.1.2 : Sun Sep 23 2001 - 16:15:12 EDT