Re: Representative glyphs for combining kannada signs

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Mon Mar 20 2006 - 07:05:14 CST


    Mark Davis wrote:
    > The purpose in the standard for describing the rendering of different
    > scripts is not to limit the range of variations available to type
    > designers. It is really to establish how a given string that a user
    > sees is to be represented internally.

    Yes, thanks for reminding us; it matches perfectly the paragraph on page
    5 that I pointed Vinod to earlier in this thread.

    > For complex scripts, the ordering (and to a certain degree the
    > shaping behavior) has to be a fundamental part of the encoding
    > standard; otherwise it is impossible to get interoperability
    > across systems.

    I am not sure what your point is. Since you are answering my post, not
    Vinod's or Philippe's, I assume you are trying to explain how I was wrong.
    Please elaborate, then.

    I agree we have to explain that in Nagari the ि (i) matra stands to the left
    (as depicted in the chart), and that the Kannada vowel signs are combining
    (also as depicted, and which was what Philippe was initially worrying about
    in this thread). That is basic, and I do not deny it.
    You also have to explain a bit how virama behaves, since it results in
    ligaturing. Furthermore, it is clear (to me) that you should give some
    tables (9-1, 9-2, 9-10) of the most common resulting ligatures, to allow
    them to be identified. Some of these tables are still missing or could be
    expanded, yes, of course.

    However, when there are rendering variations (like optional ligatures, or
    different shapes), I believe specifying them is going too far. Particularly
    if you intend to specify them /exhaustively/ (since some people are trying
    to pretend there is no word "minimal" at the end of the first line in the
    Rendering paragraph on page 224.)
    For example, & is well-known to have a wiiiiiiiiide number of possible
    representations (glyphs in this case), and my guess is that Unicode is NOT
    in the business of documenting each and every one of them. Similarly for a
    Q with the tail extending below the neighbouring u, Umlaut shown as strokes
    or dots, a closed-counter a, etc. (Well, the latter had better be signalled
    somehow, since there is the ambiguity with U+0251 ɑ).

    So if you are reading Hindi, and you encounter some ङ्कि cluster more or
    less conjoined, or with the ि matra put in one or other position, /usually/
    there should not be any worry, since there is only one way to understand it
    (ṅki), and so there should be only one way to encode it with Unicode
    codepoints, <U+0919 U+094D U+0915 U+093F>. And once you know the matras are
    stored in phonetic order and there is a virama, it is the _only_ possible
    way to encode it; job well done, felicitations!
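
As a small illustration of the sequence above, here is a minimal Python sketch
that lists the four code points of the ṅki cluster in their stored (phonetic)
order. The helper name `describe` is hypothetical; only the standard
`unicodedata` module is used, and nothing here says anything about how a given
font or engine will draw the cluster.

```python
import unicodedata

# The cluster "nki" in logical (phonetic) order, as given in the text:
# NGA, VIRAMA, KA, then the I matra last, even though it is drawn on the left.
NKI = "\u0919\u094D\u0915\u093F"

def describe(text):
    """Return (code point, character name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(NKI):
    print(code_point, name)

# Normalization leaves the sequence untouched, so the stored order is stable.
assert unicodedata.normalize("NFC", NKI) == NKI
```

The point of the sketch is only that the backing store holds one sequence,
whatever visual arrangement the renderer chooses.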

    I know the use of ZWxJ and similar artefacts allows special behaviours to
    be specified, for example disallowing the full-formed conjunct. However, it
    is *not* normal behaviour, and ordinary users of Hindi should not have to
    resort to it, unless there is a special and _unusual_ need (such as perhaps
    the use by the original author of both forms in the original source, which
    could call for such precision.)
    The position of the ि matra in a (decomposed) cluster is IMHO a matter of a
    similar level.
    Conversely, my take is that Unicode does not need to MANDATE one or another
    of the possible representations for this sequence: as the result is
    unambiguous, any one of them would be valid.

    Similarly, Malayalam has two styles (described as an effect of the
    "reform"). The decision was taken, however, to encode it as only one script
    (I believe to give better interoperability). One should now take note of
    this decision, and this means describing the variation between the styles
    (again, this is only partly done; in the case here, about ര following
    another consonant, the text is presently insufficient: it should
    additionally describe the fact that in "modern"/"reformed" style it stands
    as a single glyph preceding the consonant/cluster).
    This does NOT mean *specifying* one (or the other) style as being the
    "correct" one. Quite the contrary, in fact.

    The third example, with Malayalam ൌ U+0D4C (au), is more interesting,
    because here the specification is not as far advanced. For about half a
    century, Malayalees have been dropping the leading െ (which is not needed
    to be able to read it correctly; the right part is self-significant).
    However, how a െ-less ൗ should be encoded in Unicode is not crystal clear
    (we currently see two usages, U+0D4C and U+0D57).
    Here there is probably the need for an additional rule, to establish a
    canonical choice, which in turn will give the widest interoperability.
    However, this rule won't constrain the rendering of the *other* codepoint
    (the one which will be /deprecated/ for this use): this deprecated
    codepoint will continue to be shown as it is today, that is as െ-less ൗ
    (backward compatibility) by those engines which formerly recognized this
    sequence as correct, and as
    strange/defective/clear-indication-of-wrongness by those engines which did
    not. This is similar to what is now occurring with Thai: there is a defined
    ordering (vowels before tone indicators), but the "defective" orderings
    could be rendered too... or could be visually flagged as incorrect; at any
    rate, they are valid Unicode codepoints and sequences, but there are no
    MANDATORY rules for their rendering.
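
The U+0D4C versus U+0D57 ambiguity can be sketched in Python with the standard
`unicodedata` module. The variable names are hypothetical. The sketch shows
that U+0D4C decomposes canonically into the െ part plus U+0D57, while a bare
U+0D57 stays as it is under normalization. So the two usages seen in practice
are not canonically equivalent, which is one way to see why normalization alone
does not settle the question.

```python
import unicodedata

AU_SIGN = "\u0D4C"         # two-part au matra, as depicted in the charts
AU_LENGTH_MARK = "\u0D57"  # the right-hand part on its own
E_SIGN = "\u0D46"          # the leading (left-hand) part

for ch in (AU_SIGN, AU_LENGTH_MARK, E_SIGN):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Canonical decomposition of the two-part sign: left part, then right part.
decomposed = unicodedata.normalize("NFD", AU_SIGN)
print([f"U+{ord(ch):04X}" for ch in decomposed])

# A bare U+0D57 is unchanged by normalization, so it never becomes U+0D4C.
same_after_nfc = unicodedata.normalize("NFC", AU_LENGTH_MARK) == AU_LENGTH_MARK
equivalent = (unicodedata.normalize("NFD", AU_SIGN)
              == unicodedata.normalize("NFD", AU_LENGTH_MARK))
print(same_after_nfc, equivalent)
```

Any rule picking one of the two as the preferred encoding for the െ-less form
would therefore be a convention on top of normalization, not a consequence of
it.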

    Antoine

    > Antoine Leca wrote:
    >> Philippe Verdy wrote:
    >>> Mandatory things like visual inversions (of vowel sign letters
    >>> before the consonant cluster, or of a leading dead RA after the
    >>> consonant letter) MUST still be specified in TUS.
    >>
    >> Thanks for lending me a hand!
    >>
    >> You can specify when there is an unambiguous behaviour to be
    >> followed. Unfortunately, this is not always the case.
    >>
    >> Example 1, Hindi: should the I matra precede the whole cluster, or
    >> only the last freestanding consonant, in the case of a cluster
    >> constituted from two or more visually distinct components?
    >>
    >> Example 2, Malayalam: dead RA can come either _before_ the (last
    >> part of the) consonant, or _below_ it.
    >> Not _quite_ the same thing, particularly if you contrast it with the
    >> fact that Uniscribe (and several similar rendering tools) will
    >> reorder the leading RA before the consonant in the backing store,
    >> but it will not do this reorder for a subjoined RA...
    >>
    >> Example 3, Malayalam again: the matra for AU U+0D4C can be shown
    >> either as two parts (as depicted in the tables), or only as the
    >> right part.
    >>
    >>
    >> Now what exactly can the Standard *mandate* (if it were in charge
    >> of rendering, which it is not, as I recalled earlier)?


