Transcoding Tamil in the presence of markup

From: Peter Jacobi (peter_jacobi@gmx.net)
Date: Sat Dec 06 2003 - 13:39:29 EST

Next message: Peter Kirk: "Re: Compression through normalization"

Previous message: Doug Ewell: "Re: Compression through normalization"
Next in thread: Doug Ewell: "Re: Transcoding Tamil in the presence of markup"
Reply: Doug Ewell: "Re: Transcoding Tamil in the presence of markup"
Reply: Christopher John Fynn: "Re: Transcoding Tamil in the presence of markup"
Reply: John Delacour: "Re: Transcoding Tamil in the presence of markup"
Maybe reply: Peter Jacobi: "RE: Transcoding Tamil in the presence of markup"
Maybe reply: jcowan@reutershealth.com: "Re: Transcoding Tamil in the presence of markup"
Maybe reply: Peter Constable: "RE: Transcoding Tamil in the presence of markup"

Dear All,

I am attempting to transcode Tamil text (in legacy 8-bit encodings, which
are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
(which uses the 'logical' order invented by ISCII):
http://www.jodelpeter.de/i18n/tamil/xref-uc.htm

Just when I thought my converter was ready, I had a severe collision
with reality when I tried it on some web pages.

The problem: in the legacy encoding you can style individual characters,
which not only breaks my simple converter, but which may have no
good equivalent in Unicode anyway. See this example:
(all legacy-encoded Tamil is shown using C-style escapes, Unicode Tamil as
NCRs)

Converting unstyled text
from TSCII
lA \xC4\xA1
le \xA7\xC4
lo \xA7\xC4\xA1
to Unicode
lA லா
le லெ
lo லொ
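The reordering behind these three cases can be sketched in a few lines of Python. This is only an illustration, not my actual converter: the table holds just the three byte values from the example above (a real TSCII table is far larger), and the function name tscii_to_unicode is made up for this sketch. A prefixed vowel sign is held back until its consonant has been emitted, and NFC normalization then joins the two halves of the 'o' sign into a single code point.

```python
import unicodedata

# Only the three bytes used in the example; everything else is out of scope here.
CONSONANTS = {0xC4: "\u0BB2"}    # la  -> TAMIL LETTER LA
LEFT_SIGNS = {0xA7: "\u0BC6"}    # vowel sign e, typed BEFORE the consonant in TSCII
RIGHT_SIGNS = {0xA1: "\u0BBE"}   # vowel sign aa, typed AFTER the consonant


def tscii_to_unicode(data: bytes) -> str:
    """Convert visual-order TSCII bytes to logical-order Unicode (sketch)."""
    out = []
    pending_left = None  # prefixed vowel sign waiting for its consonant
    for b in data:
        if b in LEFT_SIGNS:
            pending_left = LEFT_SIGNS[b]
        elif b in CONSONANTS:
            out.append(CONSONANTS[b])
            if pending_left is not None:
                # Logical order: the vowel sign follows the consonant.
                out.append(pending_left)
                pending_left = None
        elif b in RIGHT_SIGNS:
            out.append(RIGHT_SIGNS[b])
        else:
            raise ValueError(f"byte 0x{b:02X} is not in this toy table")
    if pending_left is not None:
        raise ValueError("prefixed vowel sign without a following consonant")
    # NFC composes U+0BC6 + U+0BBE into the single vowel sign o (U+0BCA).
    return unicodedata.normalize("NFC", "".join(out))


for name, raw in [("lA", b"\xC4\xA1"), ("le", b"\xA7\xC4"), ("lo", b"\xA7\xC4\xA1")]:
    print(name, [f"U+{ord(c):04X}" for c in tscii_to_unicode(raw)])
```

As long as the input is an uninterrupted byte string, this kind of lookahead is all that is needed.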

Now the consonant l should get a distinct color:
In TSCII:
lA \xC4\xA1
le \xA7\xC4
lo \xA7\xC4\xA1

In Unicode:
lA லா
le லெ
lo லொ

It is easy to see that a simple n:m mapping cannot perform this conversion.
It is not so easy to judge whether this is the desired conversion at all,
or what the receiving software should do with it.
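To make the difficulty concrete, here is a minimal Python sketch that carries a style along with every legacy byte while reordering. The table again covers only the three bytes from the example, and reorder_styled and the style label "red" are hypothetical names chosen for this illustration. The sketch deliberately does not compose the two halves of the 'o' sign, since they could in principle carry different styles.

```python
from itertools import groupby

# byte -> (Unicode character, position class); three bytes from the example only.
TABLE = {
    0xC4: ("\u0BB2", "base"),   # consonant la
    0xA7: ("\u0BC6", "left"),   # vowel sign e, visually left of the consonant
    0xA1: ("\u0BBE", "right"),  # vowel sign aa, visually right of the consonant
}


def reorder_styled(cells):
    """cells: (tscii_byte, style) pairs in visual order.

    Returns (text, style) runs in Unicode logical order.
    """
    logical = []
    pending = None  # styled prefixed vowel sign waiting for its consonant
    for byte, style in cells:
        char, kind = TABLE[byte]
        if kind == "left":
            pending = (char, style)
            continue
        logical.append((char, style))
        if kind == "base" and pending is not None:
            logical.append(pending)
            pending = None
    if pending is not None:
        raise ValueError("prefixed vowel sign without a following consonant")
    # Merge neighbouring characters that share a style into one run.
    return [
        ("".join(ch for ch, _ in group), style)
        for style, group in groupby(logical, key=lambda pair: pair[1])
    ]


# "lo" with only the consonant coloured: sign-e, RED la, sign-aa in visual order.
print(reorder_styled([(0xA7, None), (0xC4, "red"), (0xA1, None)]))
```

For 'lo' this yields a red run containing only the consonant, followed by an unstyled run holding both vowel sign parts, although one of them is drawn to the left of the consonant. The markup boundary now falls inside the orthographic syllable, and it is left to the renderer to decide what to do with it.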

Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
style expands to the entire orthographic syllable.
Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm

After seeing this effect at its source, it's now clear why you can't style
individual Tamil characters in a word processor when using Unicode (whereas
you can do so in legacy encodings).

It's hard to promote Unicode when things that worked in the past
stop working.

Any insights?

Regards,
Peter Jacobi

This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 14:28:39 EST