RE: Generic base characters

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Tue Jul 17 2007 - 06:50:10 CDT


    Kenneth Whistler wrote:
    > I think you are missing the point here. The domain of
    > conjunct ligatures in Devanagari is the aksara. You parse for
    > aksara boundaries and don't attempt to map into ligature space
    > across those boundaries. That is rather different from how
    > ligatures work for Latin or for Arabic, for that matter.

    I do hope that no rendering system PREVENTS ligation across
    aksara boundaries. There just happens not to be a need for
    ligation across aksara boundaries. But preventing such
    ligatures seems unnecessary.

    ...
    > > > {MA + O}, {O}, {O}
    ...
    > > > One option is to display each on a dotted circle.
    > >
    > > No, why?
    > >
    > > > Another option is to display each on a blank.
    > >
    > > No, why?
    >
    > Why not? Reasonable people disagree. And when reasonable people
    > disagree about cases like this, the usual compromise solution
    > is to give them choices, so they can get things to display
    > the way they want to.

    Which leads to incompatible implementations, and we've seen
    that this causes problems in this case too: some implementations
    insert spurious dotted circles, while others are better behaved.
    And voilà, interoperability problems, as well as some systems
    misdisplaying someone's texts.

    > But I certainly don't think that the Unicode Standard is
    > ever going to mandate that such options as displaying dotted circles
    > for combining marks that don't fit into canonical aksara
    > structure must be avoided.

    I think it should. See above. Though perhaps not a conformity issue
    (as far as I can see, it is permissible, conformity-wise, to insert
    dotted circles willy-nilly also in Latin texts), that kind of
    implementation certainly should be frowned upon for any script.

    > > > Another reasonable position to take would be that extra
    > > > matras for an aksara are intentional "misspellings" that
    > > > users might introduce for effect, and the rendering engine
    > > > ought to attempt to render them as part of an aksara,
    > > > either by joining them in sequence, or by default stacking
    > > > rules (depending on their placement, of course). But doing
    > >
    > > Indeed.
    > >
    >
    > > > What if the sequence were, instead:
    > > >
    > > > MA + I + I + I
    > > >
    > > > Then what? In Devanagari, the I-matra reorders to the left
    > > > around the MA (and possibly other units as well, if present).
    > > > So is the "reasonable" position now to treat this for
    > > > display as:
    > > >
    > > > {I + MA} + {I} + {I}
    > >
    > > No, why?
    >
    > Why not?

    Because the {I} referred to here is reordrant, and should reorder
    around at least the display of the preceding combining sequence
    (ignoring virama for the moment).

    > > > and use fallback display for the two extra matras?
    > > >
    > > > Or is the "reasonable" position to require indefinite
    > > > leftward reordering of the layout engine, to get:
    > > >
    > > > {I + I + I + MA}
    > >
    > > Surely. They should work like any combining category 224
    > > character, i.e. stack to the left.
    >
    > Combining category 224 characters *don't* "stack to the left".

    They should.
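
    As a rough illustration of the leftward stacking argued for here,
    the following Python sketch lays out one combining sequence so that
    each extra left-side vowel sign lands further to the left. The set
    LEFT_SIDE_MATRAS and the function visual_order are made up for this
    example, the input is assumed to be a single combining sequence, and
    virama is ignored. It is not a description of what any existing
    rendering engine does.

```python
import unicodedata

# Assumption for illustration: the left-side ("reordrant") vowel signs.
# Only DEVANAGARI VOWEL SIGN I is listed; a real table would be larger.
LEFT_SIDE_MATRAS = {"\u093f"}


def visual_order(cluster):
    """Return (char, logical_index) pairs in left-to-right display order.

    Each left-side matra is placed to the left of everything already laid
    out, so repeated matras stack outward to the left.
    """
    laid_out = []
    for index, char in enumerate(cluster):
        if char in LEFT_SIDE_MATRAS:
            laid_out.insert(0, (char, index))  # goes leftmost
        else:
            laid_out.append((char, index))     # stays in logical order
    return laid_out


MA, I = "\u092e", "\u093f"

# MA + I + I + I: the three vowel signs end up to the left of MA,
# with the last one typed outermost.
print([index for _, index in visual_order(MA + I + I + I)])

# Canonical combining classes: the Devanagari vowel sign has 0, whereas
# HANGUL SINGLE DOT TONE MARK (U+302E) is one of the few with 224.
print(unicodedata.combining(I), unicodedata.combining("\u302e"))
```

    The logical indices are kept alongside the characters only so that
    the stacking order of otherwise identical vowel signs stays visible.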

    ...
    > > As long as this stays within
    > > a line (with some not-too-small preset max), there should
    > > be no problem. (It would have been better to just give
    > > the reordrant vowels cc 224 rather than 0!)
    >
    > Mistaken premise. I'm willing to bet that there is indeed
    > a problem with expecting rendering engines to stack
    > ccc=224 marks indefinitely.

    "Indefinitely" is surely too much. Just as indefinite stacking of
    diacritics above is surely too much. (They start going outside of
    line and page boundaries after a little while.) But maybe you meant
    "to an unspecified amount" rather than "to an unbounded amount".

    > > Unfortunately, some of the later encoded scripts with
    > > two-side vowels lack a decomposition to left and right
    > > side characters for those two-side vowels, so then one
    > > will need some other mechanism to represent the left
    > > and right parts (PUA code points or extra bits somewhere).
    >
    > You're talking about Khmer, presumably. But it shouldn't
    > matter one way or the other whether there is a decomposition.
    > The canonical equivalences in the other Indic cases mean
    > the reordering on display occurs *whether or not* the
    > character backing store is decomposed, and the reordering
    > happens in glyph space, anyway, not in character space.

    Like bidi reordering, I would implement this in "character
    space", just before fully mapping to "glyph space". I know
    that how far the reordering reaches depends on whether
    a conjunct ligature is present in the font or not; so an initial
    mapping to "glyph space" needs to have been done, but otherwise
    this reordering is font independent.

    > > ...
    > > > Nor is the business of faithful rendering of Bengali two-part vowels
    > > > around Tibetan consonant stacks,
    > >
    > > Why should that be a problem in principle? Ignoring ligatures,
    > > which I would think should not happen cross-script, treating
    > > the reordering per se would depend only on the combining category,
    >
    > That is your basic mistake, I think. Reordering depends on
    > the context of script behavior. And if you mix script boundaries

    Script boundaries are basically irrelevant for this. So dividing
    into "script runs", and processing each "script run" separately is
    a mistake. Just like dividing into script runs (separating Hebrew,
    Arabic, and Syriac, as well as other scripts) before bidi processing
    would be a mistake.
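
    To make this position concrete, here is a minimal Python sketch in
    which reordering is driven only by a per-character property and the
    script of the neighbouring characters is never consulted. The table
    LEFT_SIDE and the functions is_mark, reorder_cluster and reorder are
    hypothetical names for this example; combining sequences are
    approximated as a non-mark followed by marks, and ligatures and
    virama are ignored.

```python
import unicodedata

# Assumption for illustration: a small hand-made table of left-side vowel
# signs from two scripts. A real implementation would use a full property.
LEFT_SIDE = {
    "\u093f",  # DEVANAGARI VOWEL SIGN I
    "\u09c7",  # BENGALI VOWEL SIGN E
}


def is_mark(char):
    # General categories Mn, Mc and Me all start with "M".
    return unicodedata.category(char).startswith("M")


def reorder_cluster(cluster):
    # Left-side marks move in front of the rest, last typed outermost.
    left = [c for c in cluster if c in LEFT_SIDE]
    rest = [c for c in cluster if c not in LEFT_SIDE]
    return left[::-1] + rest


def reorder(text):
    """Reorder left-side marks within each combining sequence.

    No division into script runs takes place before reordering.
    """
    clusters = []
    for char in text:
        if clusters and is_mark(char):
            clusters[-1].append(char)
        else:
            clusters.append([char])
    return "".join("".join(reorder_cluster(c)) for c in clusters)


# Tibetan KA with subjoined YA, followed by the Bengali two-part vowel O.
# NFD splits U+09CB into its left part U+09C7 and right part U+09BE.
mixed = unicodedata.normalize("NFD", "\u0f40\u0fb1\u09cb")
print([unicodedata.name(c) for c in reorder(mixed)])
```

    Decomposing first means the left and right parts of the two-part
    vowel are separate characters, so only the left part moves.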

    > across what is otherwise complex rendering, it is perfectly
    > valid for a rendering engine to raise an exception and say,
    > effectively, I can't do that -- just as reasonable as saying
    > it doesn't know how to ligate Bengali to Tibetan.

    A (rather strange) font that has a ligature between a Bengali
    character and a Tibetan character should work just fine...
    Just like a font that has ligation between Arabic and Hebrew.
    (We just saw on this list reference to documents mixing Arabic
    consonants with Hebrew vowels, albeit not ligated IIUC).

            /kent k


