Re: Transcoding Tamil in the presence of markup

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun Dec 07 2003 - 08:16:38 EST


    On Sat, 6 Dec 2003, Doug Ewell wrote:

    > Peter Jacobi <peter underscore jacobi at gmx dot net> wrote:
    >
    > > Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5
    > > the style expands to the entire orthographic syllable.
    > > Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
    > > TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
    >
    > BTW, your "Unicode test page" is marked:
    >
    > <meta http-equiv="Content-Type"
    > content="text/html; charset=ISO-8859-1">

      Peter uses NCRs, so the charset declaration doesn't matter, does it?
    (Even so, I'd prefer to tag the page as 'UTF-8'.) Anyway, he should
    have used the 'lang' attribute to help browsers pick fonts. In the two
    pages above, simply adding 'lang="ta"' to <table ....> would suffice.
    In xref-uc.htm, if he wants finer-grained control, he can just globally
    replace '<span class="glyph">&#....</span>' with '<span lang="ta"
    class="glyph">&#....</span>'.
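
A minimal sketch of the two points above, in Python. It illustrates why a page written entirely in NCRs reads the same under either charset label, and what the global replacement might look like. The helper names `to_ncr` and `add_lang` and the sample word are assumptions made for illustration; they are not taken from Peter's pages.

```python
import re


def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR (&#NNNN;)."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def add_lang(html: str, lang: str = "ta") -> str:
    """Add a lang attribute to each <span class="glyph"> that lacks one."""
    return re.sub(r'<span class="glyph">',
                  f'<span lang="{lang}" class="glyph">', html)


# Hypothetical sample: the word "Tamil" written in Tamil script.
tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
ncr = to_ncr(tamil)

# The page is now pure ASCII, so both charset labels decode it identically.
page = f'<span class="glyph">{ncr}</span>'
raw = page.encode("ascii")
same_text = raw.decode("iso-8859-1") == raw.decode("utf-8")

tagged = add_lang(page)
print(ncr)
print(same_text)
print(tagged)
```

Running the replacement a second time leaves the markup unchanged, because a span that already carries `lang` no longer matches the literal pattern.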

    > while your TSCII test page is marked "x-user-defined". I'm not sure
    > what either of those declarations accomplishes.

       TSCII is not recognized by most browsers (it's not registered with
    IANA)[1]. 'x-user-defined' means that to view the page one has
    to configure one's browser to use a Tamil 'custom encoded' [2] font
    (in TSCII/TAM? encoding) when rendering an 'x-user-defined' page.
    Most browsers have an option to set fonts for 'x-user-defined'. It's
    certainly better than tagging it as 'iso-8859-1' or 'windows-1252'.

    > > After seeing this effect at its source, it's now clear why you can't
    > > style individual Tamil characters in a word processor, when using
    > > Unicode (whereas you can do so, in legacy encodings).
    >
    > This is browser behavior, not word processor behavior, and certainly not
    > an inherent defect in the Unicode logical-order model. Display engines
    > need to do a better job of applying style to individual reordrant
    > glyphs, that's all.

      You're right. Anyway, this is an interesting challenge for
    layout/rendering engines. In the case of Korean Hangul (as Philippe wrote),
    it's even more so because, unlike Indic scripts[3], it has multiple
    canonically equivalent representations (and others that are not
    canonically equivalent in the Unicode sense but are nonetheless
    'equivalent' in a certain sense).
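
To make the equivalence point concrete, here is a small Python sketch using the standard `unicodedata` module. The sample strings are assumptions chosen for illustration: a Tamil syllable with the two-part vowel sign O, the Hangul syllable GA, and the same syllable spelled with compatibility jamo as one possible case of "equivalent but not canonically equivalent".

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# Tamil KA + vowel sign O: precomposed U+0BCA vs. U+0BC6 followed by U+0BBE.
tamil_single = "\u0b95\u0bca"
tamil_split = "\u0b95\u0bc6\u0bbe"

# Hangul GA: precomposed syllable vs. conjoining jamo vs. compatibility jamo.
hangul_syllable = "\uac00"
hangul_conjoining = "\u1100\u1161"
hangul_compat = "\u3131\u314f"

print(canonically_equivalent(tamil_single, tamil_split))
print(canonically_equivalent(hangul_syllable, hangul_conjoining))
print(canonically_equivalent(hangul_syllable, hangul_compat))
print(compatibility_equivalent(hangul_syllable, hangul_compat))
```

A renderer that applies style per character sees different code point sequences for text that users regard as the same, which is part of what makes the styling problem harder for Hangul.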

       Jungshik

    [1] http://bugzilla.mozilla.org/show_bug.cgi?id=186463

    [2] 'Custom' (or 'hack') encoded: a Windows-1252, Symbol or MacRoman cmap
        is used to store Tamil glyphs (or glyphs for other Indic scripts).
        Needless to say, we want to leave these fonts behind and move on.

    [3] As is well known, there are a few letters for which there are two
       canonically equivalent representations in Indic scripts.


