From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jul 24 2010 - 13:25:15 CDT
"Kent Karlsson" <kent.karlsson14@telia.com> wrote:
> Den 2010-07-24 10.07, skrev "Philippe Verdy" <verdy_p@wanadoo.fr>:
>
> > Double diacritics have a combining property equal to zero, so they
>
> No, they don't. The above ones have combining class 234 and the below
> ones have combining class 233 (other characters with the word DOUBLE
> in them are 'double' in some other way):
>
> 035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;;;;;N;;;;;
> ...
Aren't they using the maximum value of the combining class? If so,
you can still use double diacritics between two sequences containing a
base character and any "simple" diacritic, and be sure that the double
diacritic will be rendered above them, as it will remain in the last
position of the normalized form.
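
The classes quoted above can be checked against the UCD. Here is a
minimal Python sketch using the standard `unicodedata` module; the
sample string and the variable names are illustrative only, and it
assumes the Python build ships a UCD version containing these
characters. It shows an acute accent (class 230) typed after a double
inverted breve (class 234) being moved in front of it by canonical
reordering, so the double diacritic stays last.

```python
import unicodedata

DOUBLE_BREVE_BELOW = "\u035C"     # COMBINING DOUBLE BREVE BELOW
DOUBLE_INVERTED_BREVE = "\u0361"  # COMBINING DOUBLE INVERTED BREVE
ACUTE = "\u0301"                  # COMBINING ACUTE ACCENT

ccc_below = unicodedata.combining(DOUBLE_BREVE_BELOW)     # 233
ccc_above = unicodedata.combining(DOUBLE_INVERTED_BREVE)  # 234
ccc_acute = unicodedata.combining(ACUTE)                  # 230

# Illustrative input: the double diacritic is typed before the simple one.
typed = "t" + DOUBLE_INVERTED_BREVE + ACUTE + "s"
normalized = unicodedata.normalize("NFD", typed)

# Canonical ordering sorts the marks after "t" by ascending class,
# so the double diacritic ends up last, just before the second base.
print([f"U+{ord(c):04X}" for c in normalized])
```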
Anyway I also said that a character with combining class 0 was needed
to add other diacritics on top of double diacritics, after encoding
the two sequences joined with the double diacritic.
Why you assigned such bogus non-zero combining classes to double
diacritics is a mystery to me, as they were really not needed for
compatibility with legacy encodings.
These combining classes 233 and 234 serve absolutely no purpose; they
only complicate things for no benefit (including the fact that an
additional character with combining class 0, such as CGJ or another,
is now always needed to stack anything else on top of double
diacritics).
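
As a rough illustration of that last point, the sketch below (Python,
standard `unicodedata` only; the strings are made up for the example)
compares a diaeresis placed after a double macron with and without an
intervening CGJ. It only checks normalization and says nothing about
how a renderer would stack the marks.

```python
import unicodedata

CGJ = "\u034F"            # COMBINING GRAPHEME JOINER, combining class 0
DOUBLE_MACRON = "\u035E"  # COMBINING DOUBLE MACRON, combining class 234
DIAERESIS = "\u0308"      # COMBINING DIAERESIS, combining class 230

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

without_cgj = "a" + DOUBLE_MACRON + DIAERESIS + "b"
with_cgj = "a" + DOUBLE_MACRON + CGJ + DIAERESIS + "b"

# Without CGJ the diaeresis (230) is sorted before the double macron (234).
order_lost = nfd(without_cgj) != without_cgj
# With CGJ (class 0) between them, the typed order survives normalization.
order_kept = nfd(with_cgj) == with_cgj

print(order_lost, order_kept)
```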
I did not realize that before (yes I should have looked in the UCD to
verify). And given their existing behavior, this has prevented other
simpler encodings of texts.
Also I have NEVER found any occurrence where the fact that they
have combining class 233/234 instead of 0 makes any difference,
because double diacritics were ALWAYS encoded between the two base
graphemes encoded separately, and the canonical order preserves this
encoding position in all cases between the two base graphemes encoded
completely.
Note that I'm not even sure that CGJ is the right choice for stacking
more diacritics on top of double diacritics, because it would mean
that the additional diacritic will need to be encoded just after the
double diacritic and CGJ, but before the second grapheme, and this
does not really match with double diacritics used between triplets of
graphemes: where do the additional diacritics need to be placed, on
the first or on the second double diacritic?
For me the logical ordering would require encoding first the base
graphemes, separated by the double diacritic, then encode the
additional diacritics applicable to the whole previous group (and so
it requires adding a new virtual base to block the reordering).
(1) If using CGJ at end of the sequence containing the two bases and
the double diacritic, it will still attach logically and visually the
additional diacritics to the last base grapheme, and so they will
still stack on them, below the double macron for example, even if
their relative order is preserved.
It's needless (or logically wrong), in this order, to use CGJ instead
of ZWJ, in a sequence like:
<base-1, double-diacritic, base-2, CGJ, additional-diacritics>
because in that position CGJ does nothing to block the reordering of
additional-diacritics, as they are already blocked by base-2, so it
would still be interpreted as:
<base-1, double-diacritic, base-2, additional-diacritics>
and so the additional diacritic will be linked to base-2, and the
double diacritic will cover the full group containing <base-1> and
<base-2, additional diacritics>
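
The redundancy of CGJ in that position can be sketched as follows
(Python, standard `unicodedata`; the sample strings are illustrative).
Because base-2 is itself a starter, canonical reordering never moves
the trailing diacritic across it, with or without CGJ.

```python
import unicodedata

CGJ = "\u034F"            # COMBINING GRAPHEME JOINER
DOUBLE_MACRON = "\u035E"  # COMBINING DOUBLE MACRON
ACUTE = "\u0301"          # COMBINING ACUTE ACCENT

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

# <base-1, double-diacritic, base-2, additional-diacritic>
plain = "t" + DOUBLE_MACRON + "s" + ACUTE
# <base-1, double-diacritic, base-2, CGJ, additional-diacritic>
joined = "t" + DOUBLE_MACRON + "s" + CGJ + ACUTE

# Both sequences are already in canonical order: the acute stays
# after base-2 in either case, so CGJ changes nothing here.
plain_stable = nfd(plain) == plain
joined_stable = nfd(joined) == joined

print(plain_stable, joined_stable)
```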
(2) The only way left is to encode the additional diacritics in the
middle of the group, linked by CGJ, in this order:
<base-1, double-diacritic, CGJ, additional diacritics..., base-2>
and it will be impossible to have longer groups applying the double
diacritic to more than 2 bases. This encoding using CGJ clearly breaks
the logical assumption that the additional diacritic applying to a
group should be all encoded AFTER the full group has been encoded.
Here the additional diacritics need to be inserted at a specific
position in the middle of the sequence (and in practice, input
editors would have to scan back before base-2 through the
additional diacritics and CGJ just to find the double-diacritic and
see that any further diacritics need to be inserted there...)
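
To make that burden on input editors concrete, here is a hypothetical
helper for ordering (2). The function name `insert_over_double` and
its behavior are assumptions made for this sketch, not an existing
editor API; it only handles the last double diacritic in the text.

```python
import unicodedata

CGJ = "\u034F"               # COMBINING GRAPHEME JOINER
DOUBLE_CLASSES = {233, 234}  # combining classes of the double diacritics

def insert_over_double(text: str, mark: str) -> str:
    """Hypothetical editor helper: add `mark` after <double, CGJ, marks...>."""
    # Scan back from the end of the text to the last double diacritic.
    i = len(text) - 1
    while i >= 0 and unicodedata.combining(text[i]) not in DOUBLE_CLASSES:
        i -= 1
    if i < 0:
        raise ValueError("no double diacritic to attach to")
    head, tail = text[: i + 1], text[i + 1 :]
    # Drop an existing CGJ; it is re-emitted below either way.
    if tail.startswith(CGJ):
        tail = tail[len(CGJ):]
    # Skip the additional diacritics that are already there.
    k = 0
    while k < len(tail) and unicodedata.combining(tail[k]) != 0:
        k += 1
    # The new mark goes before base-2, not at the end of the text.
    return head + CGJ + tail[:k] + mark + tail[k:]

once = insert_over_double("t\u0361s", "\u0301")
twice = insert_over_double(once, "\u0308")
print([f"U+{ord(c):04X}" for c in twice])
```

Note that the new mark never lands at the end of the text, which is
exactly the awkwardness described above.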
CGJ was not intended to apply to more than one character, but only as
a way to block some normalized reordering of combining characters
occurring after a single base character (which always has combining
class 0). In that position, it should only occur between two
combining characters with non-0 combining class, and only if the
second one has a lower combining class than the first one, and only
if this creates a semantic or visual difference on rendered documents
(for example because of the variable positions of the cedilla, which
the combining class unifies as if there were only one).
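
The condition on combining classes stated above can be written as a
small predicate. This is only a sketch: the name `cgj_changes_order`
is invented here, and it deliberately ignores the third condition
(whether the difference matters semantically or visually), which
cannot be decided from the UCD alone.

```python
import unicodedata

def cgj_changes_order(first: str, second: str) -> bool:
    """True if normalization would swap <first, second> unless CGJ intervenes."""
    a = unicodedata.combining(first)
    b = unicodedata.combining(second)
    # Both must be non-starters, and the second must sort before the first.
    return a != 0 and b != 0 and b < a

ACUTE = "\u0301"      # class 230
DOT_BELOW = "\u0323"  # class 220
CEDILLA = "\u0327"    # class 202
OGONEK = "\u0328"     # class 202

print(cgj_changes_order(ACUTE, DOT_BELOW))  # acute then dot below: swapped
print(cgj_changes_order(DOT_BELOW, ACUTE))  # already in canonical order
print(cgj_changes_order(CEDILLA, OGONEK))   # same class: order is kept
```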
(3) Using ZWJ, this terminates the last base grapheme so you can
safely append other diacritics applying to the whole group joined by
the double diacritic, and this becomes encoded very logically in this
order:
<base-1, double-diacritic, base-2, ZWJ, additional-diacritics>
This will behave more consistently if double diacritics or ZWJ are
not supported by the renderer to create long
groupings. In that position, if the renderer can only draw the
double-diacritic with nothing else on top of it, the additional
diacritics will be drawn after the sequence of the two bases and the
double diacritic, and only the additional diacritics will be drawn
like a defective sequence (by drawing a dotted circle for example).
(4) With ZWJ as the base separator with combining class 0 (just like
CGJ which has a more "local" usage, to force the relative order of
simple diacritics above only one base grapheme, when it has to be
semantically different from the canonical order) between the last base
grapheme and the additional diacritics (which I think is logically
better than CGJ), we could *also* have longer sequences such as:
<base-1, double-diacritic, base-2, double-diacritic, base-3, ZWJ,
additional diacritics...>
without any ambiguity about which double diacritic should "support"
the additional diacritics. Occurrences of double diacritics should
be treated uniformly wherever they occur; by default, in a
simple renderer, they will overlap in the middle except above the
first and last base graphemes, but a smarter engine will avoid this
overlap (when they are identical) and will draw a longer diacritic
covering all the base graphemes on which the double diacritic is
encoded.
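
A sketch of the proposed ordering, in Python with the standard
`unicodedata` module. The `group` helper is hypothetical, and the use
of ZWJ here is only the proposal made in this message, not behavior
defined by the standard. The code checks normalization only: with ZWJ
(class 0) before the additional diacritic, neither NFD nor NFC changes
the sequence, whereas without it NFC fuses the diacritic with base-3.

```python
import unicodedata

ZWJ = "\u200D"                  # ZERO WIDTH JOINER, combining class 0
DOUBLE_MACRON_BELOW = "\u035F"  # COMBINING DOUBLE MACRON BELOW, class 233
ACUTE = "\u0301"                # COMBINING ACUTE ACCENT, class 230

def group(bases: str, double_mark: str, extra: str, separator: str = ZWJ) -> str:
    """Hypothetical builder: <base-1, dd, base-2, dd, ..., separator, extra>."""
    return double_mark.join(bases) + separator + extra

proposed = group("str", DOUBLE_MACRON_BELOW, ACUTE)
bare = group("str", DOUBLE_MACRON_BELOW, ACUTE, separator="")

# With ZWJ, the sequence is stable under both normalization forms.
stable = all(unicodedata.normalize(f, proposed) == proposed
             for f in ("NFD", "NFC"))

# Without it, NFC composes "r" + acute into U+0155, tying the
# additional diacritic to the last base only.
fused = unicodedata.normalize("NFC", bare)

print(stable, [f"U+{ord(c):04X}" for c in fused])
```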
I've still not seen encoded texts needing that, but such groupings
with more than two base graphemes are common in the literature (for
example when emphasizing trigrams like "sch" in German, or even "str"
in English, or finals appended to conjugated verbs or declined nouns,
or in phonetic notations needing longer ties to group complex groups
of consonants or diphthongs).
In some cases, they are acting like interlinear annotations (such as
emphasized trigrams, where it acts like an alternate underlining), but
in others they have a semantic value within the encoded text itself
from which they can't be safely detached (such as in phonetic
notations, or in mathematical notations and other scientific and
technical formulas).
Anyway, I still think that double diacritics are a "hack" inserted in
the UCS and now they clearly appear as an unjustified disunification
of the diacritics: we should be able to encode the NORMAL (non-double)
diacritics (from any Unicode block where it is already encoded) and
apply them to an arbitrarily long group of characters, encoding the
normal diacritics in the logical order after encoding the group,
because:
- most of them were added in the UCS before ZWJ was encoded.
- this is the natural order with which they are perceived and drawn.
- this is the natural way of interpreting the diacritics (and they are
not necessarily "elongated")
- the concept of groupings is inherent to the logical semantic of the
text, and should be preserved by its encoding.
Adding the explicit encoding of semantically significant groupings
(and that are still missing) was certainly more important than adding
these disunified "double" diacritics (that also have their own
distinct combining class). Not only did this encoding of double
diacritics fail to solve the problem completely within a general
character model, but it also added new exceptions and problems for
automated text parsers and renderers.
Philippe.