Re: Generic base characters (was: Hebrew generic base)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 12 2007 - 18:57:10 CDT

Next message: António Martins-Tuválkin: "Re: Phetsarat font, Lao unicode"

Previous message: Kenneth Whistler: "RE: Hebrew generic base"
Maybe in reply to: John Hudson: "Generic base characters (was: Hebrew generic base)"
Next in thread: John Hudson: "Re: Generic base characters (was: Hebrew generic base)"
Reply: John Hudson: "Re: Generic base characters (was: Hebrew generic base)"
Reply: Philippe Verdy: "RE: Generic base characters (was: Hebrew generic base)"
Reply: Christopher Fynn: "Re: Generic base characters"

John Hudson suggested:

> The sense that matters to me is that layout engines should include the
> characters that may be used as generic bases in the same text runs as following combining
> marks, regardless of script or language. That's why the bases are *generic*.

That seems like a noble goal. And it seems completely consistent with the
intent of the standard in not constraining what generic combining marks
could be applied to what generic symbols.

Already, since generic symbols (as well as letters) are all base characters,
following one of them by a combining mark creates a well-formed combining
character sequence. And if the mark is a non-spacing mark to boot, the sequence
will form a default grapheme cluster and for most processing purposes should
not be separated.
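
As a rough illustration of the point above, here is a minimal Python sketch
that groups a string into combining character sequences using only the
General_Category values exposed by the standard unicodedata module. The
function name combining_sequences is made up for this example, and the
grouping is a simplification: it does not implement the full default grapheme
cluster rules of UAX #29.

```python
import unicodedata

def combining_sequences(text):
    """Group text into a base followed by its combining marks (gc = Mn, Mc, Me)."""
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            # A mark attaches to whatever sequence precedes it, whatever its script.
            sequences[-1] += ch
        else:
            # Anything else (or a mark at the very start) opens a new sequence.
            sequences.append(ch)
    return sequences

# U+25CC DOTTED CIRCLE + U+0941 DEVANAGARI VOWEL SIGN U, then a plain "a"
parts = combining_sequences("\u25CC\u0941a")
print([[f"U+{ord(c):04X}" for c in p] for p in parts])
```

Nothing in this grouping looks at script identity, which is the sense in which
the sequence is already well-formed regardless of what the base is.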

The problem comes when you have a script identity mismatch between base
and combining mark, so that your layout engine gives up and goes back to
fallback behavior, because it doesn't know how to apply, say, Devanagari
matras to Tibetan consonants or Greek letters or Arabic letters, for example,
and because to display at all you may end up needing to use one font for the
glyph for the base and a different font for the glyph for the combining mark.
That is when a layout engine ends up splitting a combining character sequence
into two text runs and inventing ways of displaying the parts separately
(with or without a dotted circle glyph introduction, for example).
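
To make the failure mode concrete, here is a toy Python sketch of run
splitting. Everything in it is an assumption for illustration: toy_script
guesses a script from the first word of the character name (the standard
library does not expose the Script property), and neither function is meant
to describe what any actual layout engine does.

```python
import unicodedata

# Only a handful of scripts, enough for the examples in this thread.
KNOWN_SCRIPTS = {"ARABIC", "DEVANAGARI", "GREEK", "TIBETAN"}

def toy_script(ch):
    """Crude stand-in for the Script property: first word of the character name."""
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word if first_word in KNOWN_SCRIPTS else "COMMON"

def naive_runs(text):
    """Start a new run at every script change, even in the middle of a sequence."""
    runs = []
    for ch in text:
        script = toy_script(ch)
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs

def cluster_aware_runs(text):
    """Same as naive_runs, except a combining mark always stays with its base."""
    runs = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        script = toy_script(ch)
        if runs and (is_mark or runs[-1][0] == script):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs

# DOTTED CIRCLE followed by DEVANAGARI VOWEL SIGN U
print(naive_runs("\u25CC\u0941"))
print(cluster_aware_runs("\u25CC\u0941"))
```

The first function breaks the combining character sequence into two runs; the
second keeps it together but leaves open which font and shaping rules the run
should then use, which is the hard part.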

> So what is
> the easiest way to implement this? Define a set of characters that may be used as generic
> bases, based on documentation of existing conventions, and specify that these should all
> be treated in the same way as the dotted circle base.

Well, accumulating information about actual usage and existing conventions
strikes me as a useful exercise, particularly for font designers who may
end up having to include behavior in fonts to account for them. But how
would this end up being something defined *in Unicode*?

The standing way to "define a set of characters" in the Unicode Standard
is to invent a new property that defines that set. What property are
we talking about here? A binary property, Generic_Base? How would the UTC
maintain that property? Would it be guaranteed to be a proper subset of
the derived character property Grapheme_Base? (One would think so.) But
what constraints would there be on characters that could be Generic_Base=True?
I would think, given the considerations that go into separating text runs
in the first place, and not wanting to have to figure out how to apply
DEVANAGARI VOWEL SIGN U to ARABIC LETTER GHAIN, that you would want to
say that a Generic_Base character could not belong to any particular script,
such as Script=Arabic.

But if you are heading in that direction, why not at least investigate the
notion that the starting point should be more generically defined, at
least from the point of view of the Unicode Standard? What about just
looking at the generic problem as the sequence:

< [:Script=Common:] & [:Grapheme_Base=True:], [:gc=Mn:] >

That is, if you have a base character that is Common script, and you follow
it by a non-spacing mark, a layout engine ought to render it, even if
not necessarily very well, regardless of the script of the non-spacing mark.
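
A minimal Python sketch of that test follows. It rests on two labeled
assumptions: COMMON_SAMPLE is a hand-picked stand-in for Script=Common
(a real implementation would read Scripts.txt from the UCD), and
Grapheme_Base is approximated from General_Category alone, ignoring
Other_Grapheme_Extend. The function names are invented for the example.

```python
import unicodedata

# Assumption: a small sample of Script=Common code points, not the full set.
# U+002D, U+005F, U+00A0, U+00D7, U+2639, plus Geometric Shapes 25A0..25FF.
COMMON_SAMPLE = {0x002D, 0x005F, 0x00A0, 0x00D7, 0x2639} | set(range(0x25A0, 0x2600))

# Approximation: general categories that are never Grapheme_Base.
NOT_BASE = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Mn", "Me"}

def is_common_base(ch):
    """Approximate [:Script=Common:] & [:Grapheme_Base=True:] for one character."""
    return ord(ch) in COMMON_SAMPLE and unicodedata.category(ch) not in NOT_BASE

def matches_generic_sequence(seq):
    """True if seq is exactly <Common-script base, non-spacing mark (gc=Mn)>."""
    return (
        len(seq) == 2
        and is_common_base(seq[0])
        and unicodedata.category(seq[1]) == "Mn"
    )

# DOTTED CIRCLE + DEVANAGARI VOWEL SIGN U (gc=Mn)
print(matches_generic_sequence("\u25CC\u0941"))
# DOTTED CIRCLE + DEVANAGARI VOWEL SIGN AA (gc=Mc, a spacing mark)
print(matches_generic_sequence("\u25CC\u093E"))
```

Note that the test deliberately says nothing about spacing combining marks
(gc=Mc), so a sequence with a spacing matra falls outside it.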

That formula, by the way, would pick up all the instances folks have been
talking about so far as generic base characters, including
U+002D HYPHEN-MINUS, U+005F LOW LINE, U+00A0 NO-BREAK SPACE, U+00D7
MULTIPLICATION SIGN, as well as U+25CC DOTTED CIRCLE. It also gets
*all* of the geometric shapes in the 25A0..25FF block, for example,
some of which are other obvious candidates for serving as a generic base.
And why not allow U+2639, the frownie face, to serve as the generic base
for display of Devanagari non-spacing marks? I'm sure *somebody* will
eventually think of doing that. ;-)

That is the level of generic display behavior that I think is already the
intent of the Unicode Standard.

Individual layout engine developers could choose to
go further, based on particular conventions relevant to the
particular scripts they are concerned about, and, for example, support display of
*all* combining marks from an applicable subset, including spacing
combining marks (even the ones that reorder), with respect to a particular
generic base (or a small, defined list of such bases). That is what John seems
to be talking about when saying that a font for Devanagari, for example,
will include the dotted circle as a generic base for display of all the
matras in isolation.

But I don't see the UTC wanting to head into that territory, defining what
layout engines can and should support for that kind of extended display
behavior in script-specific cases.

> If the UTC are interested in this idea, I can start defining such a set and gather
> feedback and requested additions from publishers, lexicographers, scholars, etc.

It is just my opinion, but it seems to me that the UTC would be interested
in the general problem of ensuring that layout engines aren't doing
unreasonable and counterintuitive things in displaying non-spacing marks.

Also, I don't see any problem with accumulating statements to publish
about particular, notable orthographic practices, such as "By convention,
Lao non-spacing vowels and tone marks, when displayed in isolation, are
often shown with an x-shaped generic base." That might help developers
of layout engines and fonts do the right thing, or at least put them on
notice about some behavior of relevance.

I'm less sure that the UTC would be interested in trying to formally
define and maintain a Generic_Base property and try to determine
which particular small set of characters could correctly be given
that property.

--Ken

This archive was generated by hypermail 2.1.5 : Thu Jul 12 2007 - 18:59:53 CDT