Re: Transcoding Tamil in the presence of markup

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun Dec 07 2003 - 08:16:38 EST

Next message: Peter Jacobi: "Fwd: Re: Fwd: Re: Transcoding Tamil in the presence of markup"

Previous message: Michael Everson: "RE: Transcoding Tamil in the presence of markup"
In reply to: Doug Ewell: "Re: Transcoding Tamil in the presence of markup"
Next in thread: Peter Jacobi: "Re: Transcoding Tamil in the presence of markup"
Reply: Peter Jacobi: "Re: Transcoding Tamil in the presence of markup"

On Sat, 6 Dec 2003, Doug Ewell wrote:

> Peter Jacobi <peter underscore jacobi at gmx dot net> wrote:
>
> > Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5
> > the style expands to the entire orthographic syllable.
> > Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> > TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> BTW, your "Unicode test page" is marked:
>
> <meta http-equiv="Content-Type"
> content="text/html; charset=ISO-8859-1">

Peter uses NCRs, so the declared charset doesn't matter, does it? (Even
in that case I'd prefer to tag the page as 'UTF-8'.) Anyway, he should
have used the 'lang' attribute to help browsers pick fonts. In the two
pages above, simply adding 'lang="ta"' to <table ....> would suffice.
In xref-uc.htm, if he wants finer-grained control, he can just globally
replace '<span class="glyph">&#....</span>' with '<span lang="ta"
class="glyph">&#....</span>'.
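
A minimal sketch of that global replacement, assuming the glyph spans
look exactly like the pattern quoted above. The function name and the
sample string are hypothetical, and a real page with other attributes
or different attribute order would need a proper HTML parser:

```python
import re

# Matches the opening tag of a glyph span only when an NCR follows it.
GLYPH_SPAN = re.compile(r'<span class="glyph">(?=&#)')

def add_tamil_lang(html: str) -> str:
    """Insert lang="ta" into every glyph span that wraps an NCR."""
    return GLYPH_SPAN.sub('<span lang="ta" class="glyph">', html)

# Hypothetical sample cell; &#2965; is U+0B95 TAMIL LETTER KA.
sample = '<td><span class="glyph">&#2965;</span></td>'
print(add_tamil_lang(sample))
```

Spans that already carry lang="ta" no longer match the pattern, so
running the replacement twice is harmless.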

> while your TSCII test page is marked "x-user-defined". I'm not sure
> what either of those declarations accomplishes.

TSCII is not recognized by most browsers (it's not registered with
IANA)[1]. 'x-user-defined' means that to view the page one has to
configure one's browser to use a Tamil 'custom encoded' [2] font (in
TSCII/TAM? encoding) when rendering an 'x-user-defined' page. Most
browsers have an option to set fonts for 'x-user-defined'. It's
certainly better than tagging it as 'iso-8859-1' or 'windows-1252'.

> > After seeing this effect at its source, it's now clear why you can't
> > style individual Tamil characters in a word processor, when using
> > Unicode (whereas you can do so, in legacy encodings).
>
> This is browser behavior, not word processor behavior, and certainly not
> an inherent defect in the Unicode logical-order model. Display engines
> need to do a better job of applying style to individual reordrant
> glyphs, that's all.

You're right. Anyway, this is an interesting challenge for
layout/rendering engines. In the case of Korean Hangul (as Philippe
wrote), it's even more so because, unlike Indic scripts[3], it has
multiple canonically equivalent representations (and others that are
not canonically equivalent in the Unicode sense but are nonetheless
'equivalent' in a certain sense).
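
To make the Hangul point concrete, here is a small sketch using
Python's standard unicodedata module. The first pair is a precomposed
syllable versus its conjoining-jamo sequence, which are canonically
equivalent. The second pair (two single consonants versus one double
consonant) is only my assumption of what 'equivalent in a certain
sense' might cover; normalization does not unify it:

```python
import unicodedata

precomposed = "\uD55C"        # HANGUL SYLLABLE HAN as one code point
jamo = "\u1112\u1161\u11AB"   # the same syllable as L + V + T conjoining jamo

# Canonically equivalent: normalization maps one form onto the other.
print(unicodedata.normalize("NFD", precomposed) == jamo)   # True
print(unicodedata.normalize("NFC", jamo) == precomposed)   # True

# Assumed illustration of a non-canonical 'equivalence':
# KIYEOK + KIYEOK + A versus SSANGKIYEOK + A.
two_singles = unicodedata.normalize("NFC", "\u1100\u1100\u1161")
one_double = unicodedata.normalize("NFC", "\u1101\u1161")
print(two_singles == one_double)                           # False
```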

Jungshik

[1] http://bugzilla.mozilla.org/show_bug.cgi?id=186463

[2] 'Custom' (or 'hack') encoded: a Windows-1252, Symbol or MacRoman cmap
is used to store Tamil glyphs (or other glyphs for other Indic scripts).
Needless to say, we want to leave these fonts behind and move on.

[3] As is well known, there are a few letters for which there are two
canonically equivalent representations in Indic scripts.
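
For Tamil, the two-part vowel signs are among the letters footnote [3]
refers to. A short sketch with Python's unicodedata module; the helper
name canonically_equivalent is hypothetical, and the table lists only
the Tamil cases I am sure of:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Single code point -> canonically equivalent two-code-point sequence.
TAMIL_PAIRS = {
    "\u0BCA": "\u0BC6\u0BBE",  # VOWEL SIGN O  = SIGN E  + SIGN AA
    "\u0BCB": "\u0BC7\u0BBE",  # VOWEL SIGN OO = SIGN EE + SIGN AA
    "\u0BCC": "\u0BC6\u0BD7",  # VOWEL SIGN AU = SIGN E  + AU LENGTH MARK
}

for single, sequence in TAMIL_PAIRS.items():
    print(f"U+{ord(single):04X}", canonically_equivalent(single, sequence))
```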

This archive was generated by hypermail 2.1.5 : Sun Dec 07 2003 - 09:10:40 EST