Transcoding Tamil in the presence of markup

From: Peter Jacobi (peter_jacobi@gmx.net)
Date: Sat Dec 06 2003 - 13:39:29 EST

    Dear All,

    I am attempting to transcode Tamil text (in legacy 8-bit encodings, which
    are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
    (which uses the 'logical' order invented by ISCII):
    http://www.jodelpeter.de/i18n/tamil/xref-uc.htm

    Just when I thought my converter was ready, I had a severe collision
    with reality when I tried it on some web pages.

    The problem: in the legacy encoding you can style individual characters,
    which not only breaks my simple converter, but may also have no
    good equivalent in Unicode anyway. See this example
    (all legacy-encoded Tamil is shown using C-style escapes, Unicode Tamil as
    NCRs):

    Converting unstyled text
    from TSCII
     lA \xC4\xA1
     le \xA7\xC4
     lo \xA7\xC4\xA1
    to Unicode
     lA &#x0BB2;&#x0BBE;
     le &#x0BB2;&#x0BC6;
     lo &#x0BB2;&#x0BCA;
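
    The reordering behind these three conversions can be sketched roughly as
    follows. This is only an illustration, not my actual converter: the
    function name and tables are hypothetical, and the tables cover only the
    three TSCII bytes used in the example above.

```python
# Partial, illustrative tables: only the TSCII bytes from the example above.
CONSONANTS = {0xC4: "\u0BB2"}      # TAMIL LETTER LA
PREFIX_VOWELS = {0xA7: "\u0BC6"}   # vowel sign E, stored BEFORE the consonant
SUFFIX_VOWELS = {0xA1: "\u0BBE"}   # vowel sign AA, stored AFTER the consonant
# A prefix glyph and a suffix glyph around one consonant make one Unicode sign.
TWO_PART = {("\u0BC6", "\u0BBE"): "\u0BCA"}  # E + AA -> vowel sign O


def tscii_to_unicode(data: bytes) -> str:
    """Convert visual-order bytes to logical-order Unicode (sketch only)."""
    out = []
    prefix = None  # prefix vowel sign waiting for its consonant
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b in PREFIX_VOWELS:
            if prefix is not None:
                raise ValueError("two prefix vowel signs in a row")
            prefix = PREFIX_VOWELS[b]
        elif b in CONSONANTS:
            out.append(CONSONANTS[b])
            suffix = None
            if i < len(data) and data[i] in SUFFIX_VOWELS:
                suffix = SUFFIX_VOWELS[data[i]]
                i += 1
            # Emit the vowel sign AFTER the consonant, merging two-part signs.
            if prefix is not None and suffix is not None:
                out.append(TWO_PART[(prefix, suffix)])
            elif prefix is not None:
                out.append(prefix)
            elif suffix is not None:
                out.append(suffix)
            prefix = None
        elif b < 0x80 and prefix is None:
            out.append(chr(b))  # plain ASCII passes through unchanged
        else:
            raise ValueError("byte 0x%02X not handled by this sketch" % b)
    if prefix is not None:
        raise ValueError("prefix vowel sign without a consonant")
    return "".join(out)
```

    Because the prefix vowel sign has to wait for its consonant, and a
    two-part sign is only recognised once the byte after the consonant has
    been seen, the conversion needs some lookahead even for unstyled text.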

    Now the consonant l should get a distinct color:
    In TSCII:
     lA <span style='color:#00f'>\xC4</span>\xA1
     le \xA7<span style='color:#00f'>\xC4</span>
     lo \xA7<span style='color:#00f'>\xC4</span>\xA1

    In Unicode:
     lA <span style='color:#00f'>&#x0BB2;</span>&#x0BBE;
     le <span style='color:#00f'>&#x0BB2;</span>&#x0BC6;
     lo <span style='color:#00f'>&#x0BB2;</span>&#x0BCA;

    It is easy to see that a simple n:m mapping cannot perform this conversion.
    It is not as easy to judge whether this is the desired conversion at all,
    or what the receiving software should do with it.
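
    To make the difficulty concrete, here is a rough sketch of one possible
    policy that reproduces the styled output shown above: tags are passed
    through where they stand, the consonant is emitted in place, and the
    vowel sign is emitted after any tags that directly follow the consonant.
    Everything here is an assumption for illustration (the names, the
    three-byte tables, ASCII-only tags, a naive tag pattern); whether this
    output is actually desirable is exactly the open question.

```python
import re

# Partial, illustrative tables: only the TSCII bytes from the example above.
CONSONANTS = {0xC4: 0x0BB2}     # TAMIL LETTER LA
PREFIX_VOWELS = {0xA7: 0x0BC6}  # vowel sign E, stored before the consonant
SUFFIX_VOWELS = {0xA1: 0x0BBE}  # vowel sign AA, stored after the consonant
TWO_PART = {(0x0BC6, 0x0BBE): 0x0BCA}  # E + AA -> vowel sign O

# Naive tokenizer: either a whole tag or a single byte.
TOKEN = re.compile(rb"<[^>]*>|.", re.DOTALL)


def ncr(cp: int) -> str:
    """Format a code point as a hexadecimal numeric character reference."""
    return "&#x%04X;" % cp


def convert_marked_up(data: bytes) -> str:
    tokens = TOKEN.findall(data)
    out = []
    prefix = None  # prefix vowel sign waiting for its consonant
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        i += 1
        if len(tok) > 1:  # a tag: copy it through unchanged
            out.append(tok.decode("ascii"))
            continue
        b = tok[0]
        if b in PREFIX_VOWELS:
            prefix = PREFIX_VOWELS[b]  # held back, possibly across a tag
        elif b in CONSONANTS:
            out.append(ncr(CONSONANTS[b]))
            # Copy tags that directly follow the consonant (e.g. </span>).
            while i < len(tokens) and len(tokens[i]) > 1:
                out.append(tokens[i].decode("ascii"))
                i += 1
            suffix = None
            if i < len(tokens) and tokens[i][0] in SUFFIX_VOWELS:
                suffix = SUFFIX_VOWELS[tokens[i][0]]
                i += 1
            # The vowel sign lands outside the span that styled the consonant.
            if prefix is not None and suffix is not None:
                out.append(ncr(TWO_PART[(prefix, suffix)]))
            elif prefix is not None:
                out.append(ncr(prefix))
            elif suffix is not None:
                out.append(ncr(suffix))
            prefix = None
        elif b < 0x80 and prefix is None:
            out.append(chr(b))
        else:
            raise ValueError("byte 0x%02X not handled by this sketch" % b)
    if prefix is not None:
        raise ValueError("prefix vowel sign without a consonant")
    return "".join(out)
```

    Note that for "lo" the two glyphs on either side of the styled consonant
    collapse into a single code point, so the styling boundaries of the input
    cannot all be preserved in the output.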

    Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
    style expands to the entire orthographic syllable.
    Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
    TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm

    After seeing this effect at its source, it's now clear why you can't style
    individual Tamil characters in a word processor when using Unicode (whereas
    you can do so in legacy encodings).

    It's hard to promote Unicode when things that have worked in the past
    stop working.

    Any insights?

    Regards,
    Peter Jacobi


