Re: Transcoding Tamil in the presence of markup

From: Christopher John Fynn ([email protected])
Date: Sat Dec 06 2003 - 16:55:01 EST

Next message: Philippe Verdy: "RE: Compression through normalization"

Previous message: Mark Davis: "Re: Compression through normalization"
In reply to: Peter Jacobi: "Transcoding Tamil in the presence of markup"
Next in thread: Philippe Verdy: "RE: Transcoding Tamil in the presence of markup"
Reply: Philippe Verdy: "RE: Transcoding Tamil in the presence of markup"
Reply: Peter Jacobi: "Re: Transcoding Tamil in the presence of markup"

In Unicode, U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs.
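
This can be checked against the Unicode Character Database. A minimal sketch,
assuming Python 3 and its standard unicodedata module (the helper name
describe is only illustrative):

```python
import unicodedata

def describe(cp):
    """Return (name, general category, canonical decomposition) for a code point."""
    ch = chr(cp)
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.decomposition(ch))

# The three vowel signs from the example in this thread.
for cp in (0x0BBE, 0x0BC6, 0x0BCA):
    print(f"U+{cp:04X}", *describe(cp))
```

All three should be reported with general category Mc (spacing combining
mark). U+0BCA also has a canonical decomposition into U+0BC6 followed by
U+0BBE, i.e. the part drawn to the left of the consonant plus the part drawn
to the right.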

IE is probably treating a base character and any dependent vowels as a single
unit. Since in some fonts a base character + combining vowel mark might be
displayed by a single ligature glyph, it makes sense to apply the formatting of
a base character to any dependent combining characters as well.
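
To illustrate the "single unit" idea, here is a rough sketch of grouping each
base character with the combining marks that follow it and styling the whole
group. This is my own simplification for illustration, not a description of
what IE actually does; it ignores virama-joined conjuncts, and the names
clusters and style_unit are hypothetical.

```python
import unicodedata

def clusters(text):
    """Split text into units: a base character plus any following combining marks."""
    units = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def style_unit(text, index, color):
    """Wrap the whole unit at position index in a coloured span."""
    parts = clusters(text)
    parts[index] = f"<span style='color:{color}'>{parts[index]}</span>"
    return "".join(parts)

# la + vowel sign aa is treated as one unit, so the span covers both characters.
print(style_unit("\u0bb2\u0bbe", 0, "#00f"))
```

With this grouping the style never splits a consonant from its vowel sign,
which matches the behaviour Peter reports for IE5.5.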

In Mozilla you may be completely breaking the font lookups by separately
formatting the different parts of a conjunct.

In legacy glyph-based Tamil encodings there was a simple one-to-one
correspondence between characters and glyphs, so it is straightforward to
apply different formatting to different characters.
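
For the three unstyled examples in Peter's message below, the step from glyph
(visual) order to Unicode logical order can be sketched as follows. The byte
values are taken only from those examples, so this is not a complete TSCII
table, and the function name visual_to_logical is hypothetical.

```python
# Byte values copied from the three examples in the quoted message only.
BASE = {0xC4: "\u0bb2"}      # consonant l
PREFIX = {0xA7: "\u0bc6"}    # vowel part written to the left of the consonant
SUFFIX = {0xA1: "\u0bbe"}    # vowel part written to the right of the consonant
# Left part + right part together form a single two-part vowel sign.
COMPOSE = {("\u0bc6", "\u0bbe"): "\u0bca"}

def visual_to_logical(data: bytes) -> str:
    """Reorder glyph-order bytes into consonant-then-vowel-sign order."""
    out = []
    pending = None  # left-side vowel part waiting for its consonant
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b in PREFIX:
            pending = PREFIX[b]
        elif b in BASE:
            out.append(BASE[b])
            vowel, pending = pending, None
            if i < len(data) and data[i] in SUFFIX:
                right = SUFFIX[data[i]]
                i += 1
                vowel = COMPOSE.get((vowel, right), (vowel or "") + right)
            if vowel:
                out.append(vowel)
        else:
            raise ValueError(f"byte 0x{b:02X} is not in this toy table")
    return "".join(out)

for raw in (b"\xC4\xA1", b"\xA7\xC4", b"\xA7\xC4\xA1"):
    print(" ".join(f"U+{ord(c):04X}" for c in visual_to_logical(raw)))
```

Note that the last case turns three bytes into two code points, and that the
byte for the consonant is in the middle of the input but first in the output.
That is why markup wrapped around a single legacy byte has no obvious place
to go after conversion.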

--
Christopher J. Fynn
----- Original Message ----- 
From: "Peter Jacobi" <[email protected]>
To: <[email protected]>
Sent: Saturday, December 06, 2003 6:39 PM
Subject: Transcoding Tamil in the presence of markup
> Dear All,
>
> I am attempting to transcode Tamil text (in legacy 8-bit encodings, which
> are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
> (which uses the 'logical' order invented by ISCII):
> http://www.jodelpeter.de/i18n/tamil/xref-uc.htm
>
> When I thought my converter was ready, I had a severe collision
> with reality as I tried it on some webpages.
>
> The problem: in the legacy encoding you can style individual characters,
> which not only breaks my simple converter, but which may have no
> good equivalent in Unicode anyway. See this example:
> (all legacy encoded Tamil is shown using C-style escape, Unicode Tamil as
> NCR)
>
> Converting unstyled text
> from TSCII
>  lA \xC4\xA1
>  le \xA7\xC4
>  lo \xA7\xC4\xA1
> to Unicode
>  lA &#x0BB2;&#x0BBE;
>  le &#x0BB2;&#x0BC6;
>  lo &#x0BB2;&#x0BCA;
>
> Now the consonant l should get a distinct color:
> In TSCII:
>  lA <span style='color:#00f'>\xC4</span>\xA1
>  le \xA7<span style='color:#00f'>\xC4</span>
>  lo \xA7<span style='color:#00f'>\xC4</span>\xA1
>
> In Unicode:
>  lA <span style='color:#00f'>&#x0BB2;</span>&#x0BBE;
>  le <span style='color:#00f'>&#x0BB2;</span>&#x0BC6;
>  lo <span style='color:#00f'>&#x0BB2;</span>&#x0BCA;
>
> It is easy to see that a simple n:m mapping cannot make this conversion.
> It is not that easy to judge whether this is the desired conversion at all,
> or what the receiving software should do with it.
> Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
> style expands to the entire orthographic syllable.
> Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> After seeing this effect at its source, it's now clear why you can't style
> individual Tamil characters in a word processor when using Unicode (whereas
> you can do so in legacy encodings).
>
> It's hard to promote Unicode, when things that have worked in the past,
> stop working.
>
> Any insights?
>
> Regards,
> Peter Jacobi


This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 17:48:18 EST