RE: Help with some Arabic letters

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Dec 15 1999 - 11:20:16 EST


Hello Patrick,

> -----Original Message-----
> From: Patrick Andries [mailto:pandries@iti.qc.ca]
> Sent: Wednesday, December 15, 1999 9:25 AM
>
> But my question is slightly different: how is the Uyghur letter
> represented by the should-not-be-pronounced Unicode letter U+06C7
> pronounced? Do I thus escape the stake of Unicode orthodoxy?
>
> > 2. Unicode is especially weak in the area of arabiform "characters".
> > Your query illustrates this weakness quite nicely. U+06C7, "ARABIC
> > LETTER U", as Unicode 2.x calls it, looks suspiciously like U+0648,
> > U+064F.
>
> True.
>
> > In "Arabic", this is perfectly acceptable and indeed occurs
> > frequently. A user would be perfectly justified in spelling a word
> > like "wujuwd" with U+06C7 as the first codepoint.
>
> Hmm. Interesting problem. I suppose in practice this will not happen,
> since at input an Arabic user will key in a waw followed by damma,
> while a Uyghur will use U+06C7 (a single key on a traditional keyboard?).
>

I'm a little less sanguine about that. It depends on the software, and on
how (and whether) local SW industries develop. Suppose a Uighur speaker is
using Arabic software, or that a software vendor Uighurizes an Arabic
package and decides it's good enough to allow waw+damma as U+06C7, following
the same logic that says it's up to higher levels to decipher Slovak 'ch'.
If the world of Arabiform w-languages were as software-rich as the world of
Latinate w-languages, I think we would see lots of different ways of using
Unicode to encode the same thing. No matter how the data is entered, we
still have the problem of how to interpret it. Since Unicode doesn't want
to get involved in language-specific encodings (at least not outside of a
core set of languages), what is really needed is another standard (or a set
of them; at least one per language) for which Unicode is a resource, which
would indicate how various Unicode strings are to be interpreted in various
languages.
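
To make that concrete, here is a rough Python sketch of what such a
per-language layer might look like. The table contents, the language tags,
and the function name interpret() are all made up for illustration; nothing
like this is defined by Unicode. The only real library calls are the two
unicodedata lookups at the end, which show that normalization by itself
never relates U+06C7 to waw+damma.

```python
import unicodedata

# Hypothetical per-language equivalences; not part of any standard.
# Each language tag maps a codepoint sequence to the form that
# language is assumed to prefer.
LANGUAGE_FOLDS = {
    "ug": {"\u0648\u064F": "\u06C7"},  # Uighur: waw+damma read as U+06C7
    "ar": {"\u06C7": "\u0648\u064F"},  # Arabic: U+06C7 read as waw+damma
}

def interpret(text, lang):
    """Rewrite text into the assumed preferred form for lang."""
    for source, target in LANGUAGE_FOLDS.get(lang, {}).items():
        text = text.replace(source, target)
    return text

# Unicode gives U+06C7 no decomposition, so normalization alone
# leaves the two spellings unrelated.
assert unicodedata.decomposition("\u06C7") == ""
assert unicodedata.normalize("NFD", "\u06C7") == "\u06C7"
```

With a table like this, the same stored string can be read differently
depending on the language the application believes it is handling, which is
the point: the interpretation lives outside Unicode.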

>
> > 3. So U+06C7, in written Arabic, would denote something like "woo".
>
> But this is then a glyph representation, since there is no single
> letter in Arabic resembling U+06C7, while it looks like in Uyghur this
> represents a single sound and letter.
>
> Except that U+06C8 as [y] (ü in German) will seem to fill a need in
> Turkic languages: a way to write "ü", a sound nonexistent in Arabic
> or Farsi.

Right, that makes sense. But Unicode is quite inconsistent in this area -
witness the Slovak 'ch' brouhaha. Similar considerations arise with various
aspirated consonants in Arabiform w-languages. I believe that in Urdu,
spelling such "letters" as consonant+soft h is acceptable, but in other
languages it looks to me like consonant+soft h should be treated as a
distinct single character, like 'ch' in Slovak.

> > So a search for, e.g., kitAbu, should find kitAb(un), where
> > (un) symbolizes U+064C; that is, 'u' modified by tanween (noonation).
> > And a search for kitAb(un) should arguably find kitAbuN, where 'N'
> > symbolizes the as-yet undefined tanween codepoint, as well as kitAbuu,
> > where two consecutive damma marks make a damma+tanween. But unless
> > I've misread the standard (entirely possible), there is nothing in
> > Unicode that provides for this.
>
> I'm not sure how this could be solved at the level of Unicode.
>

I think compositional mappings would do it. Then the semantics would be
there, in the data. Of course, it might be language specific, since such
mappings may not apply outside of the Arabic w-language. But extra stuff in
Unicode seems like a better idea than not enough stuff.
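
As a rough sketch of the idea in Python: the mapping below is an assumption
for illustration, not anything Unicode defines. TANWEEN is a private-use
value standing in for the undefined tanween codepoint ('N' in the examples
quoted above), and decompose() and matches() are made-up names.

```python
DAMMA = "\u064F"     # ARABIC DAMMA
DAMMATAN = "\u064C"  # ARABIC DAMMATAN, written (un) in the examples
# Stand-in for the as-yet undefined tanween codepoint; a private-use
# value chosen only for this sketch.
TANWEEN = "\uE000"

def decompose(text):
    """Map each spelling of damma+tanween to one sequence (illustrative)."""
    text = text.replace(DAMMATAN, DAMMA + TANWEEN)
    text = text.replace(DAMMA + DAMMA, DAMMA + TANWEEN)
    return text

def matches(query, target):
    """True if the decomposed query occurs in the decomposed target."""
    return decompose(query) in decompose(target)

# kitAb: kaf, kasra, teh, fatha, alef, beh
KITAB = "\u0643\u0650\u062A\u064E\u0627\u0628"

# A search for kitAbu finds kitAb(un).
assert matches(KITAB + DAMMA, KITAB + DAMMATAN)
# A search for kitAb(un) finds kitAbuN and kitAbuu.
assert matches(KITAB + DAMMATAN, KITAB + DAMMA + TANWEEN)
assert matches(KITAB + DAMMATAN, KITAB + DAMMA + DAMMA)
```

Because the mapping is applied to both sides before comparing, the search
never needs to know which spelling the data was entered in.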

-gregg


