Re: Character Sequences of Uncertain Rendering (was: Version linking?)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Mon, 28 Aug 2017 07:20:17 +0200

Actually, the matras in question in the first message were neither
left-to-right nor right-to-left: they were two-part vowels, repeatedly
encoded after a base letter.
Malayalam itself is left-to-right, but this only makes sense for the order
of base letters. Matras encoded after a base letter are placed around it
according to the script rules, but two-part vowels cause problems if
multiple ones are used. We know how to order the right parts that are
postposed, but there's no clear order for the left parts that are preposed
(including when there are also preposed one-part vowels).

This is similar to the problem of defining the stacking order when
there are multiple diacritics above (or below) that all compete for
the same position. Generally the option is to render them ordered from
the innermost to the outermost position, so successive diacritics normally
positioned above should stack vertically upward. But there are known
exceptions where they will instead stack horizontally, either
left-to-right or right-to-left, and some cases where their order will
also be reversed.

There are only common positions and stacking options, which should be used
by default in the absence of any kind of joiner between the diacritics. For
all other cases, we need additional joiner controls between them. But here,
what is the default for the uncommon case where there are multiple
occurrences of the same two-part matra? In my opinion, their respective
left parts and right parts should still be ordered from innermost to
outermost, so the left parts will be rendered right-to-left and the right
parts will be rendered left-to-right.

Here the problem is that Firefox does this only for a limited
number (2) of preposed one-part vowels, preposed diacritics, or preposed
left parts (of two-part vowels). So after rendering the first two matras,
there's no space left for the third matra, which is then rendered
entirely after the cluster, in a separate cluster (missing a base
consonant, so you see the dotted circle in the middle). IE does seem to do
things correctly by supporting more left-side preposed matras or left-side
preposed "half-matras": it first decomposes each two-part matra into two
pseudo-matras, one for each part, and then orders the first pseudo-matra
like other preposed vowels, all by default right-to-left (i.e. from
innermost to outermost when you place the center of view on the base
letter).

But there are no special joiners encoded in Unicode to override the
placement (direction) or relative order of diacritics competing for the
same position. If one were used, it should be encoded just before that
diacritic, but two-part diacritics are even more challenging as they could
need one or two separate overrides (for the left part, the right part,
or both!).

However, for the case given above, it makes no sense to use what Google
Chrome currently renders for "കോോോ" (U+0D15, followed by 3 occurrences of
U+0D4B).
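
For reference, the sequence can be inspected with Python's standard
unicodedata module. This is only a sketch of what the Unicode character
data says about U+0D4B (its combining class and its canonical
decomposition into a left part and a right part). Whether any browser's
shaper actually relies on that decomposition is an assumption, not
something established in this thread.

```python
import unicodedata

KA = "\u0d15"        # MALAYALAM LETTER KA (base letter)
OO = "\u0d4b"        # MALAYALAM VOWEL SIGN OO (two-part matra)

text = KA + OO * 3   # the sequence discussed above

# Combining class 0: canonical reordering never moves these matras.
ccc = unicodedata.combining(OO)

# NFD splits the two-part matra into its left and right parts.
parts = unicodedata.normalize("NFD", OO)
part_names = [unicodedata.name(c) for c in parts]

# The whole sequence decomposes part by part, in encoded order.
decomposed = unicodedata.normalize("NFD", text)

print(ccc)
print(part_names)
print([f"U+{ord(c):04X}" for c in decomposed])
```

Since every character involved has combining class 0, normalization
gives no hint about the visual order of the repeated left parts: that
choice is left entirely to the renderer.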

To make it clear, I'll use ASCII-only notation: <M> for the base letter
(U+0D15), <CD> for the two-part diacritic U+0D4B (C being its left part and
D its right part), and <o> for the dotted circle.
- When we encode <M,CD>, the rendering should be "CMD". This is OK in all
browsers.
- When we encode <M,CD,CD>, we also see "CCMDD" everywhere, including in
Chrome and Firefox.
- Then comes the encoding <M,CD,CD,CD>, which IE correctly renders as
"CCCMDDD". Chrome and Firefox cannot render this correctly: they first
render <M,CD,CD> as "CCMDD", then comes <CD> left alone without a base
consonant, so a dotted circle is inserted and we see "CoD" as a glued (but
now separate) cluster. The final result is "CCMDDCoD" (which is still not
breakable when trying to select it with keyboard/mouse/touch).
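
The three cases above can be reproduced with a small model. This is a
minimal sketch, not the code of any browser: the function name
visual_order and the max_preposed parameter are hypothetical, and the
handling of leftover matras (each one placed alone on a dotted circle) is
an assumption based only on the "CCMDDCoD" result described above.

```python
def visual_order(base, matras, max_preposed=None):
    """Return the visual glyph order for a base followed by two-part matras.

    Each matra is a (left, right) pair. Left parts are placed outward to
    the left of the base, right parts outward to the right. If max_preposed
    is set, matras beyond that limit each start a new cluster on "o".
    """
    limit = len(matras) if max_preposed is None else max_preposed
    attached, orphans = matras[:limit], matras[limit:]

    # Innermost to outermost: the first matra sits closest to the base, so
    # left parts read right-to-left and right parts read left-to-right.
    lefts = "".join(left for left, _ in reversed(attached))
    rights = "".join(right for _, right in attached)
    out = lefts + base + rights

    # Assumed fallback: each leftover matra is drawn on a dotted circle.
    for left, right in orphans:
        out += left + "o" + right
    return out


CD = ("C", "D")
print(visual_order("M", [CD]))                      # CMD
print(visual_order("M", [CD, CD]))                  # CCMDD
print(visual_order("M", [CD] * 3))                  # CCCMDDD
print(visual_order("M", [CD] * 3, max_preposed=2))  # CCMDDCoD
```

With identical matras the order of the left parts is invisible; passing
distinct pairs such as ("C1", "D1") and ("C2", "D2") makes the
right-to-left ordering of the left parts visible.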

I think this is caused by the algorithm used in the Chrome and Firefox
renderers, which offers at most two positions for preposed parts when
computing the reordered layout of glyphs. IE does this correctly by not
limiting the number of preposed glyphs, or by using a higher limit (I did
not test with arbitrarily long sequences of preposed vowels or two-part
vowels, i.e. four or more of them).

I know that IE/Edge is now capable of rendering very high stacks of
diacritics (this was probably implemented for the Tibetan script, or for
supporting mathematical notations).

But still, overriding the default direction of stacking is unspecified in
Unicode, except for a few documented cases where some joiner controls are
used (for the "liquid" vowels that we consider as consonants in Latin, and
that will be present in words borrowed to Indic languages in their script
using matras) to alter the rendering of stacking (but without complex glyph
reordering).

Consider also the case of the acute accent in Greek, whose default position
is altered when it occurs contextually with capital letters, from
above to the left. So <CAPITAL ALPHA, ACUTE> is reordered as
<PREPOSED-ACUTE, CAPITAL ALPHA>, but most Greek fonts will render it like
the precombined equivalent, using a single assigned glyph, without needing
any reordering. Now consider <CAPITAL ALPHA, MACRON ABOVE, ACUTE>. As the
diacritics have the same combining class, placing them by default above,
they are not freely reorderable in the encoding. But does the ACUTE still
inherit the altered placement after the capital? If so, it would be
reordered too, as <PREPOSED-ACUTE, ALPHA+MACRON>, without stacking; but if
not, where will the ACUTE be? It will likely not be preposed but will stack
vertically, centered above the macron. And there's no way to indicate that
it should stack vertically in the other direction, except by using some
joiner and encoding <CAPITAL ALPHA,CGJ,ACUTE,MACRON>, the CGJ before ACUTE
blocking the reordering that would render it in the preposed above-left
position.
The same CGJ could be used to prohibit the default altered placement (and
changed glyph form) of the CEDILLA, which occurs for some Latin letters.

We had the case in Latin of the "double acute" accent, for which the
solution was not to encode a second acute accent preceded by a CGJ, but
to encode a separate double acute instead, so that the two accents won't
stack vertically on top of each other. But we have no clear solution to
indicate the correct placement of ACUTE+GRAVE or GRAVE+ACUTE diacritics
(should they stack vertically or horizontally?). Here again we are in a
borderline case where standard orthographies do not provide a "default"
best solution, so we don't know if we can use joiner controls between
diacritics, and which ones. If these diacritics are used in romanizations
to mark tones, we could have multiple tones over the same (long) vowel,
which could play a long "melody".

Another problem came later with the proliferation of letters converted to
diacritics (and possibly needing themselves their own diacritics!). The
question remains open: are the encoded diacritics sufficient to represent
complex layouts? Is the Unicode "standard character model" really correct
and sufficient for all cases?

I'd like to see these problems find a clean solution: it's probably more
important than the active encoding of many emojis (now with very long
sequences for groups of people, which also include their own complex
placement rules).

2017-08-28 4:40 GMT+02:00 Richard Wordingham via Unicode <
unicode_at_unicode.org>:

> On Sun, 27 Aug 2017 19:55:31 +0200
> Philippe Verdy via Unicode <unicode_at_unicode.org> wrote:
>
> > 2017-08-27 6:06 GMT+02:00 Richard Wordingham via Unicode <
> > unicode_at_unicode.org>:
>
> > Canonical reordering is unambiguously referring to the canonical
> > equivalences in TUS. These are automated and can occur at any time,
> > and the only way to avoid them is to insert joiners. But they should
> > never be needed for normal texts, except to split clusters or
> > introduce semantic differences where they are relevant (and in that
> > case the renderers will also try to distinguish them, otherwise they
> > can freely reorder every sequence of diacritics with distinct
> > non-zero combining classes and will represent all canonically
> > equivalent sequences exactly the same way without distinguishing them).
>
> This wasn't the sort of problem I was talking about. The Indic
> example with undefined rendering has two left matras with ccc=0. The
> question was whether they should be displayed from left to right (as in
> MS Edge) or right to left (as in Firefox).
>
> The problem of diacritics below having different combining classes has
> been raised for minority languages in Thai. There seems a definite
> prospect that the rendering order has to depend on the writing system -
> and the other order would simply be wrong. Standardisation occurs
> outside the purview of the UTC. The order may be forced by CGJ,
> which is a joiner in name only when it occurs before combining marks.
>
> Richard.
>
Received on Mon Aug 28 2017 - 00:21:03 CDT
