Re: Transcoding Tamil in the presence of markup

From: Christopher John Fynn (cfynn@gmx.net)
Date: Sat Dec 06 2003 - 16:55:01 EST

    In Unicode, U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs.

    IE is probably treating a base character and any dependent vowels as a single
    unit. Since in some fonts a base character + combining vowel mark might be
    displayed by a single ligature glyph, it makes sense to apply the formatting of
    a base character to any dependent combining characters as well.
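
    To illustrate what treating a base character and its dependent vowels as
    one unit could mean, here is a minimal Python sketch. It is not how IE
    works internally; it simply attaches every character whose general
    category is a combining mark (Mn, Mc, Me) to the preceding character. The
    function name clusters is made up for this example, and conjuncts formed
    with a virama are not handled.

```python
import unicodedata

def clusters(text):
    """Split text into units: a base character plus any following combining marks.

    Simplified assumption: a unit ends wherever the next character is not a
    combining mark. This is not full Unicode grapheme cluster segmentation.
    """
    units = []
    for ch in text:
        # Tamil dependent vowel signs such as U+0BBE, U+0BC6 and U+0BCA have
        # a general category starting with "M" (combining mark).
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# lA, le, lo from the quoted message, as one string
sample = "\u0BB2\u0BBE\u0BB2\u0BC6\u0BB2\u0BCA"
for unit in clusters(sample):
    print([f"U+{ord(c):04X}" for c in unit])
```

    Each printed unit pairs U+0BB2 with exactly one vowel sign, which is the
    granularity at which IE appears to apply the style.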

    In Mozilla you may be completely breaking the font lookups by separately
    formatting the different parts of a conjunct.
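
    One way a converter or editor could avoid splitting such a unit is to
    widen any style range until it starts on a base character and ends after
    the last combining mark. The Python sketch below shows that idea only; it
    is an assumption about a possible workaround, not a description of what
    Mozilla or IE do. The name snap_to_clusters is hypothetical, and a real
    implementation would also have to keep virama-joined conjuncts together.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def snap_to_clusters(text, start, end):
    """Widen the half-open range [start, end) so that it never separates a
    base character from the combining marks that follow it."""
    # If the range starts on a mark, move back to its base character.
    while 0 < start < len(text) and is_mark(text[start]):
        start -= 1
    # If marks follow the end of the range, include them as well.
    while end < len(text) and is_mark(text[end]):
        end += 1
    return start, end

lo = "\u0BB2\u0BCA"                  # "lo": consonant la + vowel sign o
print(snap_to_clusters(lo, 0, 1))    # styling only the consonant widens the range
```

    Styling only the consonant (range 0 to 1) is widened to cover the vowel
    sign too, which matches the "style expands to the entire orthographic
    syllable" behaviour reported for IE5.5 in the quoted message.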

    In legacy glyph-based Tamil encodings there was a simple one-to-one
    correspondence between characters and glyphs, so it was straightforward to
    apply different formatting to different characters.
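
    The difference between glyph order and logical order can be shown with a
    small Python sketch that converts only the three TSCII byte values used in
    the quoted message (\xC4, \xA1, \xA7). It is not a complete TSCII
    converter: the table is limited to those bytes, the function name
    tscii_to_unicode is made up for this example, and markup between bytes is
    deliberately not handled.

```python
# Byte values taken from the examples in the quoted message only.
CONSONANTS = {0xC4: "\u0BB2"}   # TSCII glyph for the consonant la
AA_GLYPH = 0xA1                  # glyph written after the consonant
E_GLYPH = 0xA7                   # glyph written before the consonant

def tscii_to_unicode(data: bytes) -> str:
    """Convert the tiny TSCII subset above from visual to logical order."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == E_GLYPH and i + 1 < len(data) and data[i + 1] in CONSONANTS:
            # Prefix glyph: the consonant comes first in Unicode.
            cons = CONSONANTS[data[i + 1]]
            i += 2
            if i < len(data) and data[i] == AA_GLYPH:
                # e-glyph + consonant + aa-glyph is the single vowel sign O.
                out.append(cons + "\u0BCA")
                i += 1
            else:
                out.append(cons + "\u0BC6")
        elif b in CONSONANTS:
            out.append(CONSONANTS[b])
            i += 1
        elif b == AA_GLYPH:
            out.append("\u0BBE")
            i += 1
        else:
            raise ValueError(f"byte 0x{b:02X} is outside this sketch's table")
    return "".join(out)

for raw in (b"\xC4\xA1", b"\xA7\xC4", b"\xA7\xC4\xA1"):
    print(raw, [f"U+{ord(c):04X}" for c in tscii_to_unicode(raw)])
```

    In the "lo" case three glyph bytes collapse into two code points, and the
    bytes on either side of the consonant both end up in the one vowel sign
    after it, so a tag wrapped around \xC4 alone has no exact counterpart in
    the output.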

    --
    Christopher J. Fynn
    ----- Original Message ----- 
    From: "Peter Jacobi" <peter_jacobi@gmx.net>
    To: <unicode@unicode.org>
    Sent: Saturday, December 06, 2003 6:39 PM
    Subject: Transcoding Tamil in the presence of markup
    > Dear All,
    >
    > I am attempting to transcode Tamil text (in legacy 8-bit encodings, which
    > are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
    > (which uses the 'logical' order invented by ISCII):
    > http://www.jodelpeter.de/i18n/tamil/xref-uc.htm
    >
    > When I thought my converter was ready, I had a severe collision
    > with reality, as I tried it on some webpages.
    >
    > The problem: in the legacy encoding you can style individual characters,
    > which not only breaks my simple converter, but which may have no
    > good equivalent in Unicode anyway. See this example:
    > (all legacy encoded Tamil is shown using C-style escape, Unicode Tamil as
    > NCR)
    >
    > Converting unstyled text
    > from TSCII
    >  lA \xC4\xA1
    >  le \xA7\xC4
    >  lo \xA7\xC4\xA1
    > to Unicode
    >  lA &#x0BB2;&#x0BBE;
    >  le &#x0BB2;&#x0BC6;
    >  lo &#x0BB2;&#x0BCA;
    >
    > Now the consonant l should get a distinct color:
    > In TSCII:
    >  lA <span style='color:#00f'>\xC4</span>\xA1
    >  le \xA7<span style='color:#00f'>\xC4</span>
    >  lo \xA7<span style='color:#00f'>\xC4</span>\xA1
    >
    > In Unicode:
    >  lA <span style='color:#00f'>&#x0BB2;</span>&#x0BBE;
    >  le <span style='color:#00f'>&#x0BB2;</span>&#x0BC6;
    >  lo <span style='color:#00f'>&#x0BB2;</span>&#x0BCA;
    >
    > It is easy to see that a simple n:m mapping cannot make this conversion.
    > It is not that easy to judge whether this is the desired conversion at all,
    > and what the receiving software should do with it.
    > Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
    > style expands to the entire orthographic syllable.
    > Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
    > TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
    >
    > After seeing this effect at its source, it's now clear why you can't style
    > individual Tamil characters in a word processor when using Unicode (whereas
    > you can do so in legacy encodings).
    >
    > It's hard to promote Unicode, when things that have worked in the past,
    > stop working.
    >
    > Any insights?
    >
    > Regards,
    > Peter Jacobi
    

