Re: How to make "oo" with combining breve/macron over pair?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 05 2002 - 21:00:58 EST


David Hopwood said:

> Kenneth Whistler wrote:
> > Kent Karlsson's suggestion:
> >
> > > I vaguely suggested adding
> > > an enclosing (in some sense) invisible combining character to
> > > solve this: <o, CGJ, o, invisible-enclosing, combining breve>.
> > > No character has been designated for such use, though. And I
> > > haven't made a formal proposal yet.
> >
> > (i.e. create a generic way to make a non-enclosing combining mark
> > apply to a grapheme cluster, by encoding an invisible enclosing
> > combining mark)
>
> For this approach to work, <invisible-enclosing> must have combining
> class 0, and be in Grapheme_Extend and general category Mn.

Actually, it must be general category Me, since that is what indicates
a combining *enclosing* mark.
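
The Mn/Me distinction can be inspected directly. The following is a minimal sketch using Python's standard unicodedata module. The choice of U+20DD COMBINING ENCLOSING CIRCLE as a comparison point is an assumption made for illustration, not something from this thread, and the values printed depend on the Unicode data bundled with the Python build in use.

```python
import unicodedata

# Characters discussed in the thread, plus one existing enclosing mark
# (U+20DD) added here purely for comparison.
CHARS = {
    "COMBINING GRAPHEME JOINER": "\u034f",
    "COMBINING BREVE": "\u0306",
    "COMBINING MACRON": "\u0304",
    "COMBINING ENCLOSING CIRCLE": "\u20dd",
}

def describe(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

for name, ch in CHARS.items():
    gc, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc} ccc={ccc}")
```

An existing enclosing mark reports Me with combining class 0, while the breve and macron report Mn with a non-zero combining class.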

> Because it
> involves a new character, it can't be included in the standard until
> Unicode 3.3,
  ^^^^^^^^^^^

Aghhh! Don't even introduce that nasty concept. The UTC and the
editorial committee are already working on the Unicode 4.0 book
draft, and many people would be sorely tempted to quit in disgust
if we had to produce yet another UAX for Unicode 3.3 before
4.0 was finished!

>
> An alternative is to use CGJ itself for <invisible-enclosing>, i.e.
> <o, CGJ, o, CGJ, combining breve>. This works because:
>
> - CGJ has combining class 0, so it prevents the breve from composing
> with the second o.
> - CGJ has general category Mn and is invisible, as required.

It currently has general category Mn, but would have to be changed
to Me to make this work.

> - it is straightforward to modify the grapheme breaking rules to
> treat this as a single cluster, by adding the rule "Link × Extend".
> (This assumes the corrections to the other rules that I described
> in my comments.)
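
The first point in the quoted list (CGJ has combining class 0, so the breve does not compose with the second o) can be checked against normalization. This is a sketch only, using Python's unicodedata module; the helper names nfc and codepoints are made up for the example.

```python
import unicodedata

CGJ = "\u034f"    # COMBINING GRAPHEME JOINER
BREVE = "\u0306"  # COMBINING BREVE

plain = "o" + BREVE                      # breve directly on a single o
one_cgj = "o" + CGJ + "o" + BREVE        # no CGJ between second o and breve
paired = "o" + CGJ + "o" + CGJ + BREVE   # the sequence proposed above

def nfc(s):
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", s)

def codepoints(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

for label, s in (("plain", plain), ("one_cgj", one_cgj), ("paired", paired)):
    print(f"{label}: {codepoints(s)} -> {codepoints(nfc(s))}")
```

Without the second CGJ, the breve composes with the second o into U+014F. With it, the sequence passes through NFC unchanged.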

Actually, I am finding myself attracted to the parsimony of this
approach. In answer to Rick's suggestion to just encode the two we
know about and be done with it, and his concern that we are headed
for terminal Markupville here, note the following:

1. Rendering applications already have to deal with combining
   enclosing marks (well, at least if they choose to support them).
   That means identifying what they enclose, and then adjusting any
   following combining mark to apply to the enclosure. (cf. TUS 3.0,
   p. 50). If the CGJ is just an invisible combining enclosing mark,
   then effectively it encloses the (invisible) bounding box of
   the preceding characters in its scope, and any following
   combining marks are adjusted to apply to that bounding box, which
   is the enclosure. A simple generalization without any new architectural
   implications.

2. Applications concerned with grapheme cluster boundaries already
   (as of Unicode 3.2, at least) have to deal with the function
   of CGJ in creating grapheme clusters. That is, they will have
   to cope with the modified rules in Unicode 3.2 for grapheme
   cluster boundaries, and the new Grapheme_XXX properties that
   take the CGJ into account.

So no new characters and no new architectural implications. Simply
two minor tweaks:

   a. Modify the grapheme cluster boundary rules to account for
      X CGJ NSM as a grapheme cluster.

   b. Change CGJ from Mn to Me.
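
To make tweak (a) concrete, here is a toy segmenter in Python. It is a sketch under stated assumptions, not the Unicode 3.2 grapheme cluster rules: it approximates Grapheme_Extend as "general category Mn or Me", treats CGJ as the only Link character, and the function names are invented for the example.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, the only "Link" in this sketch

def is_extend(ch):
    # Assumption: approximate Grapheme_Extend by general category Mn or Me.
    return unicodedata.category(ch) in ("Mn", "Me")

def clusters(text):
    """Split text into clusters using toy rules:
       1. never break before an extending mark (x Extend);
       2. never break after CGJ (Link x);
       3. otherwise break between characters."""
    out = []
    for ch in text:
        if out and (is_extend(ch) or out[-1].endswith(CGJ)):
            out[-1] += ch   # attach to the current cluster
        else:
            out.append(ch)  # start a new cluster
    return out

seq = "o" + CGJ + "o" + CGJ + "\u0306"
print(len(clusters(seq)), len(clusters("oo\u0306")))
```

Under these toy rules the whole of <o, CGJ, o, CGJ, combining breve> comes out as one cluster, whereas <o, o, combining breve> splits into two.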

That appears to be it, and in principle it should solve the
missing double (or treble) diacritic representation problem permanently.
On the downside, it might be a while before rendering engines
and font definitions really catch up to it. That is, the whole
notion of "adjusting" a diacritic to apply to an enclosure is
fairly sophisticated, since it may involve context-dependent
rules and arbitrary shape modifications -- not merely moving
a glyph origin point based on a preceding glyph's metrics.

On the other hand, hacked-up fonts for limited dictionary
usage could be rather quick and easy. For the old Webster's
pronunciation guides, the entities are really the oomacr
and oobreve shown in the examples that started this thread.
Simply preform those entities as glyphs in a font, and map them
to <o, CGJ, o, CGJ, combining_macron> and
to <o, CGJ, o, CGJ, combining_breve> respectively. Presto,
you have a Unicode representation for the text, and a
reliable font rendering for them, without any fancy-dancing
about dynamic positional adjustments. The fallback rendering,
in applications and fonts not wise to the CGJ rules, would
be {o o-macron} and {o o-breve}, which while not exact,
is at least comprehensible and close enough for gummint work.
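
The preformed-glyph mapping and its fallback can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: the table, the render function, and the font_knows_cgj flag stand in for whatever a real font or rendering engine would do, and only the entity names oomacr and oobreve come from the thread.

```python
import unicodedata

CGJ = "\u034f"     # COMBINING GRAPHEME JOINER
MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE

# Hypothetical lookup from the proposed sequences to preformed glyph names.
PREFORMED = {
    "o" + CGJ + "o" + CGJ + MACRON: "oomacr",
    "o" + CGJ + "o" + CGJ + BREVE: "oobreve",
}

def render(seq, font_knows_cgj):
    """Return a preformed glyph name when the font has one, else fallback text."""
    if font_knows_cgj and seq in PREFORMED:
        return PREFORMED[seq]
    # Fallback: ignore the invisible CGJ, so the mark lands on the second o.
    return unicodedata.normalize("NFC", seq.replace(CGJ, ""))

macron_seq = "o" + CGJ + "o" + CGJ + MACRON
print(render(macron_seq, True))
print(render(macron_seq, False))
```

A font with the table returns the single glyph; one without it degrades to plain o followed by o-macron (or o-breve), matching the fallback described above.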

I think this might be the way to go, but it is too late to
sneak into Unicode 3.2, as any such changes clearly would
require UTC debate and agreement. But it is simple enough that
it might be accomplished fairly quickly after Unicode 3.2.

--Ken


