Re: Visarga, ardhavisarga and anusvara -- combining marks or not?

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Aug 25 2009 - 13:52:06 CDT

    Peter,

    I think that this discussion shows a rather general problem
    with the Mc classification. The majority of these characters
    are intended to be rendered in a way that is indistinguishable
    from ordinary characters (they simply follow the preceding
    character).
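
    For concreteness, here is a small Python sketch (just an
    illustration, using the standard unicodedata module) of the
    categories involved: the visarga is classified Mc but is
    simply drawn as a spacing sign after its base, while the
    anusvara is a genuinely nonspacing (Mn) mark.

        import unicodedata

        # Spacing combining mark (gc=Mc): drawn as an ordinary
        # spacing sign after the preceding character.
        print(unicodedata.name('\u0903'),
              unicodedata.category('\u0903'))
        # DEVANAGARI SIGN VISARGA Mc

        # Nonspacing mark (gc=Mn): genuinely graphically combining.
        print(unicodedata.name('\u0902'),
              unicodedata.category('\u0902'))
        # DEVANAGARI SIGN ANUSVARA Mn

        # Ordinary letter (gc=Lo), for comparison.
        print(unicodedata.name('\u0905'),
              unicodedata.category('\u0905'))
        # DEVANAGARI LETTER A Lo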

    A few years ago, a distinction was introduced into the
    standard between graphically combining characters and
    characters that are combining only by classification. You'll
    find the details in the appropriate section of Chapter 3.

    Yet (too many) renderers continue to implement the Mc
    characters using the same machinery as that for graphically
    combining characters, including placing limits on the allowed
    "base" characters that may precede them, introducing a dotted
    circle in any context not previously foreseen, and other
    such things.
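
    A caricature of that logic, as a Python sketch (the function
    is hypothetical, purely for illustration):

        import unicodedata

        DOTTED_CIRCLE = '\u25CC'

        def naive_fallback(prev, ch):
            # Problematic: spacing marks (Mc) get exactly the same
            # treatment as the truly nonspacing marks (Mn, Me).
            if unicodedata.category(ch) not in ('Mn', 'Mc', 'Me'):
                return ch
            if prev is None or unicodedata.category(prev) not in ('Lo', 'Ll', 'Lu'):
                # fall back to a dotted circle "base"
                return DOTTED_CIRCLE + ch
            return ch

        # Inserts a spurious dotted circle before a visarga that
        # follows a vowel sign, even though KA + AA + VISARGA is
        # perfectly ordinary Devanagari text.
        print(naive_fallback('\u093E', '\u0903'))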

    I for one am convinced that the way the Mc classification
    was applied was either poorly thought out or altogether
    a mistake. However, it now exists in the standard.

    In principle, you can take three courses of action. One
    is a modified form of 'do nothing': document the problems
    with isolated cases of implementing Mc characters. The
    implementers of rendering systems might notice these
    nuggets of information and may make corrections on a
    case-by-case basis.

    The second is the radical solution: reclassify every single
    character from Mc to Lo where there isn't any compelling
    reason (in rendering or processing) to consider that
    character actually "combining" in function, not just in name.
    The advantage of this approach is that it would be very
    visible and direct. Treating an "Lo" character by using the
    support for graphically combining characters in a
    renderer is obviously wrong, so you might expect
    pressure on *all* implementations to get that corrected.

    The downside, of course, is that it's impossible to predict
    what uses the gc=Mc classification has been put to by
    actual implementations, outside of simple rendering issues.
    You are correct in calling such an approach destabilizing,
    no matter how appealing it might otherwise be. For
    the same reason, the UTC is correct to continue to be
    consistent with past practice in assigning Mc to any new
    characters that are analogues of existing Mc characters.
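
    To illustrate the kind of dependency that makes a wholesale
    reclassification risky, here is a grossly simplified sketch
    of segmentation into user-perceived characters (the real
    rules are in UAX #29; this is only meant to show the shape
    of the dependency): a process like this attaches Mc
    characters to the preceding cluster, so quietly turning
    them into Lo would change where text gets segmented.

        import unicodedata

        def clusters(text):
            # Grossly simplified: any mark (Mn, Mc, Me) extends
            # the preceding cluster. Reclassifying a character
            # from Mc to Lo would split what used to be one
            # cluster.
            out = []
            for ch in text:
                if out and unicodedata.category(ch) in ('Mn', 'Mc', 'Me'):
                    out[-1] += ch
                else:
                    out.append(ch)
            return out

        # KA + VOWEL SIGN AA (gc=Mc) comes out as one cluster.
        print(clusters('\u0915\u093E'))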

    The third approach would leave the actual assignments in
    place, but would achieve the same effect by a highly visible
    effort to document the improved understanding of what it
    means for a character to have classification Mc.

    Unlike the first option, this would not be a case-by-case
    annotation of a few problematic characters in diverse
    script chapters, but would have to be more up-front.

    Wherever combining marks are discussed in the
    standard, the distinction between true "graphically
    combining" characters and merely notional combining
    marks needs to be highlighted, and clear implementation
    guidelines given (such as "don't use special rendering
    for most Mc characters; render them like Lo characters").
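
    A sketch of what such a guideline might look like in code
    (the function name is made up; assume the renderer already
    has an ordinary inline path for Lo letters):

        import unicodedata

        def needs_mark_positioning(ch):
            # Only genuinely graphically combining characters get
            # the special mark-attachment machinery; spacing marks
            # (gc=Mc) are laid out inline, like gc=Lo letters.
            return unicodedata.category(ch) in ('Mn', 'Me')

    The handful of Mc characters that really do reorder or split
    around their base would still need script-specific treatment,
    but that should be the exception, not the default.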

    A similar, high-profile discussion of this belongs in
    the FAQ on Indic scripts, and in any other publications
    likely to be consulted by people implementing fonts
    and renderers.

    A./


