Re: Generic base characters (was: Hebrew generic base)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 12 2007 - 18:57:10 CDT

    John Hudson suggested:

    > The sense that matters to me is that layout engines should include the
    > characters that may be used as generic bases in the same text runs as following combining
    > marks, regardless of script or language. That's why the bases are *generic*.

    That seems like a noble goal. And it seems completely consistent with the
    intent of the standard in not constraining what generic combining marks
    could be applied to what generic symbols.

    Already, since generic symbols (as well as letters) are all base characters,
    following them by a combining mark creates a well-formed combining character
    sequence. And if those marks are non-spacing to boot, the sequences form default
    grapheme clusters and for most processing purposes should not be separated.

    The problem comes when you have a script identity mismatch between base
    and combining mark, so that your layout engine gives up and goes back to
    fallback behavior, because it doesn't know how to apply, say, Devanagari
    matras to Tibetan consonants or Greek letters or Arabic letters, for example,
    and because to display at all you may end up needing to use one font for the
    glyph for the base and a different font for the glyph for the combining mark.
    That is when a layout engine ends up splitting a combining character sequence
    into two text runs and inventing ways of displaying the parts separately
    (with or without introducing a dotted circle glyph, for example).
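
    To make the run-splitting problem concrete, here is a rough Python sketch.
    Everything in it is an assumption for illustration: toy_script() is a
    hypothetical block-range lookup (not the real Script property, which the
    Python standard library does not expose), and split_runs() is a toy
    segmenter, not the behavior of any actual layout engine.

```python
import unicodedata


def toy_script(ch):
    """Hypothetical script lookup by block range (an assumption, not real Script data)."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        return "Arabic"
    if 0x0900 <= cp <= 0x097F:
        return "Devanagari"
    if 0x0E80 <= cp <= 0x0EFF:
        return "Lao"
    return "Common"


def split_runs(text, keep_marks_with_base):
    """Split text into (script, substring) runs.

    With keep_marks_with_base=False, every script change starts a new run,
    so a base and a following mark from another script get separated.
    With keep_marks_with_base=True, a non-spacing mark (gc=Mn) always stays
    in the run of whatever precedes it.
    """
    runs = []
    for ch in text:
        script = toy_script(ch)
        is_mark = unicodedata.category(ch) == "Mn"
        if runs and (runs[-1][0] == script or (keep_marks_with_base and is_mark)):
            # Extend the current run instead of opening a new one.
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


# U+25CC DOTTED CIRCLE followed by U+0941 DEVANAGARI VOWEL SIGN U
sample = "\u25CC\u0941"
naive = split_runs(sample, keep_marks_with_base=False)
kept = split_runs(sample, keep_marks_with_base=True)
```

    In this sketch the naive pass yields two runs for the sample and the second
    pass yields one, which is the distinction at issue here.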

    > So what is
    > the easiest way to implement this? Define a set of characters that may be used as generic
    > bases, based on documentation of existing conventions, and specify that these should all
    > be treated in the same way as the dotted circle base.

    Well, accumulating information about actual usage and existing conventions
    strikes me as a useful exercise, particularly for font designers who may
    end up having to include behavior in fonts to account for them. But how
    would this end up being something defined *in Unicode*?

    The standing way to "define a set of characters" in the Unicode Standard
    is to invent a new property that defines that set. What property are
    we talking about here? A binary property, Generic_Base? How would the UTC
    maintain that property? Would it be guaranteed to be a proper subset of
    the derived character property Grapheme_Base? (One would think so.) But
    what constraints would there be on characters that could be Generic_Base=True?
    I would think, given the considerations that go into separating text runs
    in the first place, and not wanting to have to figure out how to apply
    DEVANAGARI VOWEL SIGN U to ARABIC LETTER GHAIN, that you would want to
    say that a Generic_Base character could not belong to any particular script,
    such as Script=Arabic.

    But if you are heading in that direction, why not at least investigate the
    notion that the starting point should be more generically defined, at
    least from the point of view of the Unicode Standard? What about just
    looking at the generic problem as the sequence:

      < [:Script=Common:] & [:Grapheme_Base=True:], [:gc=Mn:] >
      
    That is, if you have a base character that is Common script, and you follow
    it by a non-spacing mark, a layout engine ought to render it, even if
    not necessarily very well, regardless of the script of the non-spacing mark.
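
    As a minimal sketch of testing for that sequence in Python: the standard
    unicodedata module exposes General_Category but not Script or
    Grapheme_Base, so both are stand-ins here. COMMON_SAMPLE is a hand-picked
    set of Common-script code points (an assumption, not the full property),
    and NOT_BASE only approximates Grapheme_Base by general category. The
    function names are hypothetical.

```python
import unicodedata

# Assumption: a small hand-picked stand-in for [:Script=Common:], covering a few
# punctuation/symbol characters plus the Geometric Shapes block 25A0..25FF.
COMMON_SAMPLE = {0x002D, 0x005F, 0x00A0, 0x00D7, 0x2639} | set(range(0x25A0, 0x2600))

# Assumption: approximate Grapheme_Base=True by excluding these general
# categories; the real derived property also excludes a few other characters.
NOT_BASE = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Mn", "Me"}


def is_common_base(ch):
    """True if ch is in the sample Common set and looks like a grapheme base."""
    return ord(ch) in COMMON_SAMPLE and unicodedata.category(ch) not in NOT_BASE


def matches_generic_sequence(text):
    """True if text is exactly <Common-script base, non-spacing mark (gc=Mn)>."""
    if len(text) != 2:
        return False
    base, mark = text
    return is_common_base(base) and unicodedata.category(mark) == "Mn"


# DOTTED CIRCLE + DEVANAGARI VOWEL SIGN U (Mn): matches the formula.
dotted_circle_u = matches_generic_sequence("\u25CC\u0941")
# ARABIC LETTER GHAIN + DEVANAGARI VOWEL SIGN U: base is not Common script.
ghain_u = matches_generic_sequence("\u063A\u0941")
# DOTTED CIRCLE + DEVANAGARI VOWEL SIGN AA (Mc): a spacing mark, so no match.
dotted_circle_aa = matches_generic_sequence("\u25CC\u093E")
```

    A real implementation would take Script and Grapheme_Base from the Unicode
    Character Database (or a library that ships them) rather than from these
    stand-ins.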

    That formula, by the way, would pick up all the instances folks have been
    talking about so far as generic base characters, including
    U+002D HYPHEN-MINUS, U+005F LOW LINE, U+00A0 NO-BREAK SPACE, U+00D7
    MULTIPLICATION SIGN, as well as U+25CC DOTTED CIRCLE. It also gets
    *all* of the geometric shapes in the 25A0..25FF block, for example,
    some of which are other obvious candidates for serving as a generic base.
    And why not allow U+2639, the frownie face, to serve as the generic base
    for display of Devanagari non-spacing marks? I'm sure *somebody* will
    eventually think of doing that. ;-)

    That is the level of generic display behavior that I think is already the
    intent of the Unicode Standard.

    Individual layout engine developers could choose to
    go further, based on particular conventions relevant to the
    particular scripts they are concerned about, and, for example, support display of
    *all* combining marks from an applicable subset, including spacing
    combining marks (even the ones that reorder), with respect to a particular
    generic base (or a small, defined list of such bases). That is what John seems
    to be talking about when saying that a font for Devanagari, for example,
    will include the dotted circle as a generic base for display of all the
    matras in isolation.

    But I don't see the UTC wanting to head into that territory, defining what
    layout engines can and should support for that kind of extended display
    behavior in script-specific cases.

    > If the UTC are interested in this idea, I can start defining such a set and gather
    > feedback and requested additions from publishers, lexicographers, scholars, etc.

    It is just my opinion, but it seems to me that the UTC would be interested
    in the general problem of ensuring that layout engines aren't doing
    unreasonable and counterintuitive things in displaying non-spacing marks.

    Also, I don't see any problem with accumulating statements to publish
    about particular, notable orthographic practices, such as "By convention,
    Lao non-spacing vowels and tone marks, when displayed in isolation, are
    often shown with an x-shaped generic base." That might help developers
    of layout engines and fonts do the right thing, or at least put them on
    notice about some behavior of relevance.

    I'm less sure that the UTC would be interested in trying to formally
    define and maintain a Generic_Base property and try to determine
    which particular small set of characters could correctly be given
    that property.

    --Ken


