Re: What constitutes "character"? New Problem

From: James Kass (jameskass@worldnet.att.net)
Date: Thu Nov 22 2001 - 20:21:37 EST

Previous message: Michael (michka) Kaplan: "Re: Surrogates Question"
Maybe in reply to: James Kass: "Re: What constitutes "character"? New Problem"

As Mark Davis indicates, some of the Indic script transliterations
show exceptional behavior.

Assamese, for instance, uses a couple of unique glyphs. Tamil has
a smaller consonant repertoire than the rest of the Indic scripts,
and so forth.

I tried to pick an example for the test that wouldn't be impacted
by this too much. (Except for the Tamil, of course.)
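
For anyone who wants to repeat the test quoted below, here is a minimal
sketch in Python. It assumes only what the earlier message describes: the
Indic blocks sit 128 code points apart starting at U+0900, so each
Devanagari character is shifted by a multiple of 128. The names BLOCKS,
shift_script and unassigned are made up for this illustration, and the
check for missing letters relies on the Unicode data shipped with Python.

```python
import unicodedata

# Scripts in block order; consecutive blocks start 128 code points apart.
BLOCKS = ["Devanagari", "Bengali", "Gurmukhi", "Gujarati", "Oriya",
          "Tamil", "Telugu", "Kannada", "Malayalam", "Sinhala"]
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 128


def shift_script(text, k):
    """Move each Devanagari-block character k blocks up; leave the rest alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp < DEVANAGARI_START + BLOCK_SIZE:
            cp += k * BLOCK_SIZE
        out.append(chr(cp))
    return "".join(out)


def unassigned(text):
    """Return the characters that have no Unicode name (unassigned code points)."""
    return [ch for ch in text if unicodedata.name(ch, None) is None]


# The test name, written as escapes so the code points are explicit.
name = ("\u092e\u093e\u0927\u0941\u0930\u0940 "
        "\u0926\u093f\u091b\u093f\u0924")

for k, script in enumerate(BLOCKS):
    shifted = shift_script(name, k)
    gaps = ["U+%04X" % ord(c) for c in unassigned(shifted)]
    print(script, shifted, gaps)
```

The list printed at the end of each line shows where the shifted code point
has no character assigned, which is where the naive shift breaks down (the
Tamil row is the obvious case). The Sinhala block is not laid out on the
ISCII pattern, so that row is not a meaningful transliteration.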

If anyone is interested in ISCII, here is a page with some background
(and useful graphics):
http://www.cwi.nl/~dik/english/codes/indic.html

Quoting from the page linked above,

     "Because the structure of these scripts is so similar a single
      coding can be applied to all of them, immediately providing
      transliteration between the scripts (see however below)."

and

     "Contrary to most codes given here, the Indian codes do not
      map directly to displayed glyphs, but rather give a structural
      coding. For instance, the vowel marks (given in the chart
      together with the position were the consonants to which they
      are applied go) in the code always follow the consonant code;
      in display this is not always true."
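
The Devanagari block in Unicode stores text the same way. As a small
illustration (a sketch only; the helper name describe is hypothetical),
the syllable "di" from the test name is stored as the consonant followed
by the vowel sign, even though the vowel sign is drawn to the left of the
consonant:

```python
import unicodedata


def describe(text):
    """Return the Unicode character names of text, in stored (logical) order."""
    return [unicodedata.name(ch) for ch in text]


# U+0926 is the consonant, U+093F the dependent vowel sign.
syllable = "\u0926\u093f"
for ch, label in zip(syllable, describe(syllable)):
    print("U+%04X" % ord(ch), label)
```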

Here is a page exploring certain issues with ISCII/Unicode encoding,
which also has some good background information about ISCII and
Unicode for Indic scripts:

http://acharya.iitm.ac.in/multi_sys/uni_iscii.html

Looking forward to ICU 2.0.

Best regards,

James Kass.

----- Original Message -----
From: "Mark Davis" <mark.davis@macchiato.com>
To: "James Kass" <jameskass@worldnet.att.net>; "Unicode List" <unicode@unicode.org>
Cc: "Arjun Aggarwal" <mrasool@sancharnet.in>
Sent: Wednesday, November 21, 2001 7:15 AM
Subject: Re: What constitutes "character"? New Problem

> It's not quite that easy. We are coming out soon with indic transliterators
> in ICU 2.0, so you can see how we did it. Some of the issues are:
>
> - there are many letters that don't have correspondences with other scripts
> - edge cases that transliterate differently
> - ISCII romanization is not fully reversible (we had to augment it by adding
> extra accents in certain cases to distinguish them).
>
> Mark
> —————
>
> Ὀλίγοι ἔμφονες πολλῶν ἀφρόνων φοβερώτεροι — Πλάτωνος
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
> ----- Original Message -----
> From: "James Kass" <jameskass@worldnet.att.net>
> To: "Unicode List" <unicode@unicode.org>
> Cc: "Arjun Aggarwal" <mrasool@sancharnet.in>
> Sent: Wednesday, November 21, 2001 00:39
> Subject: Re: What constitutes "character"? New Problem
>
>
> > Hello,
> >
> > Tex Texin has a demo page up to display names of celebrities
> > from around the world in their native scripts.
> >
> > (Please see:
> > http://www.geocities.com/i18nguy/unicode-example.html )
> >
> > I wonder if you're familiar with ISCII, the Indian national
> > computer encoding standard upon which the Indic script
> > encoding in Unicode is based.
> >
> > One of the advantages of following this scheme is supposed
> > to be the ability to easily transliterate between various
> > Indic scripts.
> >
> > Just to see how easy this was and if it works, I took the name
> > "Madhuri Dixit" from Tex Texin's page as submitted by Yaap
> > Raaf.
> >
> > Got the decimal code points for each of the Devanagari characters
> > and put them in a database. Made nine copies of that database,
> > each time adding the number 128 to the code point values.
> > Then merged the ten databases into one and generated a
> > text HTML file.
> >
> > The results follow in UTF-8:
> >
> > माधुरी दिछित
> > মাধুরী দিছিত
> > ਮਾਧੁਰੀ ਦਿਛਿਤ
> > માધુરી દિછિત
> > ମାଧୁରୀ ଦିଛିତ
> > மா஧ுரீ ஦ி஛ித
> > మాధురీ దిఛిత
> > ಮಾಧುರೀ ದಿಛಿತ
> > മാധുരീ ദിഛിത
> > ථ඾ටශධව ඦ඿ඛ඿ඤ
> >
> > Well, I don't have all the fonts needed here, but, except for
> > the Tamil (which lacks some consonants) and the Sinhala (which
> > I can't see at all), it seems to work and it's pretty easy to do.
> >
> > The Indian committees responsible for the ISCII standard
> > obviously put a great deal of thought and effort into the job.
> >
> > If half letters were encoded separately for Devanagari, people
> > have noted on this list that existing applications would be broken.
> > This ability to easily transliterate would be the first to go away.
> > Searching and indexing would probably be the next.
> >
> > Hoping this is helpful.
> >
> > Best regards,
> >
> > James Kass.
> >
> > ----- Original Message -----
> > From: "Arjun Aggarwal" <mrasool@sancharnet.in>
> > To: <jameskass@worldnet.att.net>
> > Cc: <unicode@unicode.org>
> > Sent: Sunday, November 18, 2001 7:54 AM
> > Subject: Re: What constitutes "character"? New Problem
> >

This archive was generated by hypermail 2.1.2 : Thu Nov 22 2001 - 20:07:19 EST