From: Gregg Reynolds (unicode@arabink.com)
Date: Thu Jun 30 2005 - 10:13:36 CDT
N. Ganesan wrote:
> Gregg Reynolds (unicode@arabink.com) wrote
>
>>You are not alone in thinking Unicode does not
>>serve your language community, but don't forget
>>it was never Unicode's intention to serve
>>language communities. It's just a character
>>encoding, not a language encoding. Unicode
>>happens to also do serious damage to the entire
>>world of right-to-left languages such as Arabic (IMO),
>>but it had no choice, given that it was constrained
>>to adopt legacy encodings. No point in whining
>>about that. And it is probably better than what
>>we had before. Still, it is up to the language
>>community to decide to do something better.
>
>
> Because of resource and other practical difficulties,
> I think Unicode will be the only 16-bit
> encoding for Tamil for a long time. Haven't even
> heard of anyone coming up with competition.
> But since Tamil is a script without
> conjuncts (unlike e.g. Devanagari or
> Tamil Grantha scripts), many 8-bit glyph-based
> encodings still exist on the web. But
> they are not searchable via Google and so on.
> So, some 500+ blogs operate exclusively
> in Unicode.
>
> What about Arabic script? The Middle East is
> awash with funds and resources, and the script is used
> over a wide area by lots of people. If
> "Unicode happens to also do serious damage
> to the entire world of right-to-left languages",
> is there any competition? Any 16-bit encodings
> for Arabic script other than Unicode?
Hi,
I'm not aware of any 16-bit encodings for Arabic other than Unicode.
There are plenty of 7- or 8-bit encodings and transliteration schemes,
but most of them use more or less the same character repertoire as
Unicode. (Note that ASCII-based transliteration schemes don't bother
with bidirectionality of number strings but have been quite useful, at
least to the scholarly community, for a long time.) 256 code points are
adequate to cover Arabic completely, so 8 bits is enough.
The reason (or my reason, anyway) for experimenting with alternative
encoding designs is not that Unicode is incapable of encoding the
graphic forms of text, but that it rules out some kinds of
"grammatical" semantics (for lack of a better term) that can easily be
associated with characters, and that allow for much more powerful text
processing. For example, traditional Arabic grammar distinguishes many
different "kinds" of alef. They all use the alef letterform encoded by
Unicode, but they have different functions, some graphotactic, some
phonological, maybe some others. Obviously they could all be encoded
with different codepoints that use the same glyph; just as obviously
this would be outside the scope of Unicode. However, there are other
cases where the dividing line is not so clear. The fun thing about
Arabic is that various kinds of grammatical semantics can be attached to
single characters; you can't really do that in English.
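As a rough illustration of the alef case, here is a minimal sketch in
Python. The code values and the two function labels are invented for
the example; they do not come from any real encoding or from a proper
inventory of the grammatical categories.

```python
# Hypothetical experimental codes: several "kinds" of alef, one shared glyph.
ALEF = "\u0627"  # ARABIC LETTER ALEF, the single letterform Unicode encodes

# code value -> (glyph, grammatical label); values and labels are invented
ALEF_KINDS = {
    0xC0: (ALEF, "graphotactic"),
    0xC1: (ALEF, "phonological"),
}

def render(codes):
    """Collapse experimental codes to the plain Unicode letterform."""
    return "".join(ALEF_KINDS[c][0] for c in codes)

def functions(codes):
    """Recover the grammatical label carried by each code."""
    return [ALEF_KINDS[c][1] for c in codes]
```

Rendering throws the distinction away (both codes come out as the same
alef), while functions() shows what a text-processing tool could still
see in the encoded form.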
(By the way, that is the real opposite of plain text: character codes
that denote grammatical semantics rather than just graphemic semantics.)
In any case, by piggy-backing on a widely implemented encoding like
Latin-1, you can encode text using an experimental design and use
existing tools to work with it in various ways, make it available to
others, etc., so you can find out what really works and is useful to
others, rather than speculating. That way, unproductive polemics on
the Unicode list can be avoided. ;)
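The piggy-backing idea might look something like the following sketch.
It assumes an experimental 8-bit design whose byte values are simply
carried as Latin-1 characters; the three byte assignments and all
function names are hypothetical, chosen only to make the example run.

```python
import re

# Hypothetical byte values: two kinds of alef and one other letter (beh).
ALEF_GRAPHOTACTIC, ALEF_PHONOLOGICAL, BEH = 0xC0, 0xC1, 0xC8

# Both alef kinds share one Unicode letterform when displayed.
TO_UNICODE = {
    ALEF_GRAPHOTACTIC: "\u0627",
    ALEF_PHONOLOGICAL: "\u0627",
    BEH: "\u0628",
}

def to_carrier(codes):
    """Pack experimental codes into a Latin-1 string that stock tools accept."""
    return bytes(codes).decode("latin-1")

def from_carrier(text):
    """Unpack a Latin-1 carrier string back into experimental codes."""
    return list(text.encode("latin-1"))

def to_unicode(codes):
    """Convert to ordinary Unicode for display, losing the alef distinction."""
    return "".join(TO_UNICODE[c] for c in codes)

def find_kind(text, code):
    """Positions of one code in the carrier string, using an ordinary regex."""
    return [m.start() for m in re.finditer(re.escape(chr(code)), text)]
```

Because every byte value from 0 to 255 round-trips through Latin-1, the
carrier string can be saved, mailed, and searched with tools that know
nothing about the experimental design.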
Thanks,
gregg