From: Christopher John Fynn (cfynn@gmx.net)
Date: Sat Dec 06 2003 - 16:55:01 EST
In Unicode, U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs.
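
For anyone who wants to check this, the properties can be read from the Unicode Character Database. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe is only illustrative.

```python
import unicodedata

# The three code points named above.
SIGNS = [0x0BBE, 0x0BC6, 0x0BCA]

def describe(cp):
    """Return (code point label, character name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))

for cp in SIGNS:
    print(describe(cp))

# U+0BCA has a canonical decomposition into U+0BC6 followed by U+0BBE,
# so NFD normalization splits it into those two signs.
decomposed = unicodedata.normalize("NFD", "\u0BCA")
print([f"U+{ord(c):04X}" for c in decomposed])
```

All three report general category Mc (spacing combining mark), which is why a renderer has reason to keep them attached to the preceding consonant.
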
IE is probably treating a base character and any dependent vowels as a single
unit. Since in some fonts a base character + combining vowel mark might be
displayed by a single ligature glyph, it makes sense to apply the formatting of
a base character to any dependent combining characters as well.
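
As a rough illustration of that behaviour (not a description of what IE actually does internally), the sketch below groups each base character with the combining marks that follow it and returns the whole unit that a styled character belongs to. The function names clusters and style_whole_cluster are hypothetical, and the grouping rule is a simplification: it ignores conjuncts formed with a virama and is not full grapheme cluster segmentation.

```python
import unicodedata

def clusters(text):
    """Split text into units of one base character plus any following marks."""
    out = []
    for ch in text:
        # General categories Mn, Mc and Me are combining marks;
        # attach them to the preceding base character.
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def style_whole_cluster(text, index):
    """Return the unit containing text[index], i.e. the span styling would widen to."""
    pos = 0
    for unit in clusters(text):
        if pos <= index < pos + len(unit):
            return unit
        pos += len(unit)
    raise IndexError(index)

# Styling only the consonant (index 0) of la + vowel sign o selects both characters.
print(style_whole_cluster("\u0BB2\u0BCA", 0))
```
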
In Mozilla you may be completely breaking the font lookups by separately
formatting the different parts of a conjunct.
In legacy glyph-based Tamil encodings there was a simple one-to-one
correspondence between characters and glyphs, so it is straightforward to apply
different formatting to different characters.
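
To make the difference concrete, here is a minimal sketch of the reordering that a visual-order to logical-order converter has to do. It assumes only the three TSCII byte values that appear in the examples in the quoted message below (0xC4, 0xA1, 0xA7); the table names and the function tscii_to_unicode are hypothetical, and a real converter needs the full TSCII table.

```python
# Byte values taken from the examples in the quoted message;
# a real table would cover the whole TSCII range.
CONSONANTS = {0xC4: "\u0BB2"}      # la
PREFIX_SIGNS = {0xA7: "\u0BC6"}    # vowel sign e, stored before the consonant
SUFFIX_SIGNS = {0xA1: "\u0BBE"}    # vowel sign aa, stored after the consonant
# A prefix part and a suffix part around one consonant form a two-part vowel sign.
TWO_PART = {("\u0BC6", "\u0BBE"): "\u0BCA"}  # e + aa = vowel sign o

def tscii_to_unicode(data: bytes) -> str:
    out = []
    pending = None  # prefix vowel sign waiting for its consonant
    i = 0
    while i < len(data):
        b = data[i]
        if b in PREFIX_SIGNS:
            pending = PREFIX_SIGNS[b]
        elif b in CONSONANTS:
            out.append(CONSONANTS[b])
            if pending is not None:
                nxt = data[i + 1] if i + 1 < len(data) else None
                suffix = SUFFIX_SIGNS.get(nxt)
                if (pending, suffix) in TWO_PART:
                    # Merge both halves into one sign and skip the suffix byte.
                    out.append(TWO_PART[(pending, suffix)])
                    i += 1
                else:
                    # Move the prefix sign to its logical place after the consonant.
                    out.append(pending)
                pending = None
        elif b in SUFFIX_SIGNS:
            out.append(SUFFIX_SIGNS[b])
        else:
            raise ValueError(f"byte 0x{b:02X} is not in this sketch's table")
        i += 1
    if pending is not None:
        raise ValueError("prefix vowel sign without a following consonant")
    return "".join(out)

for raw in (b"\xC4\xA1", b"\xA7\xC4", b"\xA7\xC4\xA1"):
    print(tscii_to_unicode(raw))
```

Because the converter has to look both behind and ahead of the consonant, any markup sitting between those bytes no longer has an obvious place in the output.
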
-- Christopher J. Fynn

----- Original Message -----
From: "Peter Jacobi" <peter_jacobi@gmx.net>
To: <unicode@unicode.org>
Sent: Saturday, December 06, 2003 6:39 PM
Subject: Transcoding Tamil in the presence of markup

> Dear All,
>
> I am attempting transcoding Tamil text (in legacy 8-bit encodings, which
> are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
> (which uses 'logical' order invented by ISCII):
> http://www.jodelpeter.de/i18n/tamil/xref-uc.htm
>
> When I thought, my converter was ready, I had a severe collision
> with reality, as I tried it on some webpages.
>
> The problem: in the legacy encoding you can style individual characters,
> which not only breaks my simple converter, but which may have no
> good equivalent in Unicode anyway. See this example:
> (all legacy encoded Tamil is shown using C-style escape, Unicode Tamil as
> NCR)
>
> Converting unstyled text
> from TSCII
> lA \xC4\xA1
> le \xA7\xC4
> lo \xA7\xC4\xA1
> to Unicode
> lA லா
> le லெ
> lo லொ
>
> Now the consonant l should get a distinct color:
> In TSCII:
> lA <span style='color:#00f'>\xC4</span>\xA1
> le \xA7<span style='color:#00f'>\xC4</span>
> lo \xA7<span style='color:#00f'>\xC4</span>\xA1
>
> In Unicode:
> lA <span style='color:#00f'>ல</span>ா
> le <span style='color:#00f'>ல</span>ெ
> lo <span style='color:#00f'>ல</span>ொ
>
> It is easy to see, that simple n:m mapping cannot make this conversion.
> It is not that easy to judge whether this is the desired conversion at all.
> And what should the receiving software should do with it.
> Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
> style expands to the entire orthographic syllable.
>
> Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> After seeing this effect at its source, it's now clear why you can't style
> individual Tamil characters in a word processor, when using Unicode (whereas
> you can do so, in legacy encodings).
>
> It's hard to promote Unicode, when things that have worked in the past,
> stop working.
>
> Any insights?
>
> Regards,
> Peter Jacobi