From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jul 22 2007 - 13:08:23 CDT
Are you making a proposal here to encode the Arabic
archaeographemes/archeographemes (or “archigraphemes” as you call them, but
I’m not sure this is a correct term in English, as “archi-” is another
prefix with another meaning, marking emphasis, stronger than “super-” and
quite similar to “hyper-”), i.e. the skeletons (without the normally
required markers), and possibly also the markers themselves, separately?
If these were encoded in some extended Arabic block, I’m not sure it would
cause severe havoc. Even for searches over the Internet or in plain-text
documents, the morphological similarities between otherwise unrelated modern
letters can be analyzed by some custom “decomposition” using PUAs (for now,
because these units are not encoded separately), or using a tailored
collation…
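
As a rough illustration of such a custom “decomposition” for searching, here is a minimal Python sketch. The folding table is hypothetical and deliberately tiny, and using U+066E (ARABIC LETTER DOTLESS BEH) as the skeleton key is only my assumption for the example; a real tailoring would need a complete, reviewed table.

```python
import unicodedata

# Hypothetical, partial folding table: dotted letter -> skeleton stand-in.
# None of these letters has a Unicode decomposition, so the mapping
# must be supplied privately.
SKELETON = {
    "\u0628": "\u066E",  # BEH   -> DOTLESS BEH
    "\u062A": "\u066E",  # TEH   -> DOTLESS BEH
    "\u062B": "\u066E",  # THEH  -> DOTLESS BEH
    "\u062C": "\u062D",  # JEEM  -> HAH
    "\u062E": "\u062D",  # KHAH  -> HAH
    "\u0630": "\u062F",  # THAL  -> DAL
    "\u0632": "\u0631",  # ZAIN  -> REH
    "\u0634": "\u0633",  # SHEEN -> SEEN
}


def skeleton_key(text: str) -> str:
    """Return a search key: combining marks dropped, dotted letters folded."""
    decomposed = unicodedata.normalize("NFD", text)
    # Vowel signs and similar marks have a non-zero combining class.
    base_letters = (c for c in decomposed if not unicodedata.combining(c))
    return "".join(SKELETON.get(c, c) for c in base_letters)
```

Two words would then match in a skeleton search when their keys are equal, while the stored text itself stays untouched.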
As this will be needed for palaeographic studies, most existing texts
will not have to be re-encoded or changed, even if their letters appear to
be really composite. Anyway, Unicode’s stability policy prohibits
“decomposing” them using any normalized decomposed forms; this can still be
done privately or through local collation algorithms, built specifically for
palaeographers. There should be no change to existing Arabic texts, and the
letters should not be decomposed in standard texts.
Anyway, the issue is quite similar with other letters in alphabetic scripts:
the ae and oe ligatures in Latin can be decomposed in some languages, and
they still should be decomposed when doing morphological analysis, even in
today’s modern texts (at least in French), even if they should not be
decomposed this way in standard texts (it’s true that Unicode provided
compatibility decompositions for some other Latin ligatures such as ĳ and ﬁ,
though not for æ and œ themselves; nothing of the kind was done for
Arabic letters with markers, and it can’t be done now)…
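
For the Latin case, a tailored folding applied only at analysis time might look like the following Python sketch. The table and the function name are hypothetical, chosen just for this example; the point is that the stored text keeps its ligatures and only the analysis key is decomposed.

```python
# Hypothetical tailoring for morphological analysis (e.g. French).
LIGATURE_FOLD = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"})


def analysis_key(word: str) -> str:
    """Fold the ae/oe ligatures and case for analysis only."""
    return word.translate(LIGATURE_FOLD).casefold()
```

With this, “cœur” and “coeur” produce the same key, without changing either spelling in the source text.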
_____
From: Thomas Milo [mailto:t.milo@chello.nl]
Sent: Thursday, 19 July 2007 22:09
To: Simon Montagu; verdy_p@wanadoo.fr
Cc: 'John Hudson'; unicode@unicode.org; 'Hebrew List'
All these observations about asynchronic text notation (text recorded in
phases) using independent character subsets (archigraphemic skeleton,
disambiguation dots, vowel marks) even across nominally different writing
systems also pertain to Arabic. This is particularly relevant to the textual
transmission of the Holy Qur'an.
HQ Codices of the first few centuries were written without consonant markers
(originally not points but small nib imprints) and vowel disambiguation
marks (which were points in the earliest Arabic script). Editors
(contemporary or later) added the consonant disambiguation markers and vowel
signs (personal communication from Yasin Dutton during the Corpus Coranicum
Workshop organized by the European Science Foundation in 2005, Berlin).
http://www.esf.org/activities/exploratory-workshops/humanities-sch/2005/corpus-coranicum-exploring-the-textual-beginnings-of-the-quran.html
To this day, this horizontal segmentation remains the deep structure of
Arabic. Understanding it helps to deal with its generative power to combine
any marker with any basic letter (i.e., archigrapheme). Hebrew, Aramaic and
Arabic do occur in various mixes along this horizontal segmentation, which
provides an additional argument for dealing with the horizontal segmentation
of Arabic and related scripts.
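
To make the horizontal segmentation concrete, here is a small Python sketch that models a letter as an (archigrapheme, marker) pair. The marker labels are plain strings invented for this example, not Unicode characters, and the table covers only four letters that share one skeleton; it is an illustration of the idea, not a proposed encoding.

```python
DOTLESS_BEH = "\u066E"  # used here as the stand-in for the shared skeleton

# Hypothetical table: (skeleton, marker label) -> precomposed letter.
ENCODED = {
    (DOTLESS_BEH, "1 dot below"): "\u0628",   # BEH
    (DOTLESS_BEH, "2 dots above"): "\u062A",  # TEH
    (DOTLESS_BEH, "3 dots above"): "\u062B",  # THEH
    (DOTLESS_BEH, "3 dots below"): "\u067E",  # PEH
}
# Reverse view: precomposed letter -> (skeleton, marker label).
SEGMENTS = {letter: pair for pair, letter in ENCODED.items()}


def compose(skeleton: str, marker: str):
    """Return the precomposed letter for a pair, or None if this toy
    table has no entry (which says nothing about Unicode itself)."""
    return ENCODED.get((skeleton, marker))


def decompose(letter: str):
    """Return the (skeleton, marker) pair, or (letter, None) if unknown."""
    return SEGMENTS.get(letter, (letter, None))
```

Every new marker on a given skeleton needs a new row in such a table, which is one way to picture how combinations multiply when each pair is treated as a separate letter.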
Unicode's present fixation with vertical segmentation (leading to the
irrelevant concept of ligatures) in Arabic and national subsets leads to
1. uneconomical proliferation of Arabic code points consisting of generic
archigraphemes and generic markers
2. serious problems in digitizing historical and even contemporary texts.
See my Unicode Tutorial: page 7 for examples of Unicode-induced ambiguity
in encoding exactly identical Arabic character groups, and page 12 for
examples of the resulting everyday chaos:
www.decotype.com/publications/unicode-tutorial.pdf