Re: discontent about Indic scripts and Unicode

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Tue Sep 18 2001 - 17:02:45 EDT


This is the same problem that was discussed extensively for Tamil at TI2001
in Kuala Lumpur last month. Basically, it boils down to three problems:

1) Most of the people involved do not understand Unicode or how it works.
2) Most of the people involved expect natural language processing to be a
feature that any solution ought to support (thus making Unicode inadequate
for a valid purpose).
3) The people who do not have problems with #1 or #2 are not as loud as the
people who do -- which contributes to the inertia.

These two pages conveniently take issues #1 and #2 and handle them
separately. Very thoughtful of the author....

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Hietaniemi Jarkko (NRC/Boston)" <jarkko.hietaniemi@nokia.com>
To: <unicode@unicode.org>
Sent: Tuesday, September 18, 2001 1:03 PM
Subject: discontent about Indic scripts and Unicode

> I happened across these links:
>
> http://acharya.iitm.ac.in/multi_sys/exist_codes.html
> http://acharya.iitm.ac.in/multi_sys/uni_iscii.html
>
> which do contain a nice discussion about ISCII but then they
> discuss Unicode in, ummm, somewhat negative terms.
>
> Myself knowing next to nothing about Indic scripts it would be nice
> to hear comments from someone who does know.
>
> I do notice some misunderstanding about Unicode in the above links,
> quoting from the first one:
>
> > Unicode, besides permitting an 8 bit representation for each language,
> > adds an 8 bit identifier as a most significant byte to make the code 16
> > bits. Data processing software using Unicode will be able to identify the
> > Language of the text for each character and use appropriate fonts to
> > display them.
> >
> > Technically, Unicode can handle 256 different languages but in practice,
> > this number is significantly smaller. Unicode has allowed nearly 24000
> > characters of Chinese, Japanese and Korean scripts to be included in a
> > single set. Currently fewer than a hundred languages are included in the
> > Unicode.
> >
> > Even though it is a sixteen bit code, Unicode usually provides for about
> > 128 characters for each language.
>
> A messy conflation of "languages" and "characters" and "fonts". Not to
> forget "sixteen bit code".
>
> The web site was updated in July.
>
>
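
As an illustration of the conflation noted in the quoted message, here is a minimal Python sketch (not from either poster) showing that the high byte of a code point is not a language identifier, and that "sixteen bit code" does not describe the full range. The sample characters and the helper name `describe` are picked here purely for illustration.

```python
import unicodedata

def describe(ch):
    """Return (U+XXXX label, high byte of the code point, character name)."""
    cp = ord(ch)
    return (f"U+{cp:04X}", cp >> 8, unicodedata.name(ch))

# Characters are grouped by script block, not tagged with a language:
# "A" and "e with acute" share high byte 0x00 yet serve many languages,
# and Devanagari KA serves Hindi, Marathi, Sanskrit, Nepali and others.
samples = ["A", "\u00E9", "\u0915", "\u0B95"]
for ch in samples:
    print(describe(ch))

# Code points are not limited to 16 bits. U+20000 lies outside the
# 16-bit range and takes two 16-bit units (a surrogate pair) in UTF-16.
beyond_bmp = "\U00020000"
print(hex(ord(beyond_bmp)), len(beyond_bmp.encode("utf-16-be")), "bytes in UTF-16")

# The same character also has different byte lengths per encoding form.
ka = "\u0915"
print(len(ka.encode("utf-8")), "bytes in UTF-8,",
      len(ka.encode("utf-16-be")), "bytes in UTF-16")
```

The point of the sketch is only that a code point identifies a character, while language and font are decided at other layers.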


