From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 24 2007 - 11:24:25 CDT
The Unicode normalization algorithm specifies which of these canonically
equivalent sequences is the preferred form. As there's no evidence that any
of them can be distinguished graphically, and as the preferred reading
order would depend on the language using such characters with multiple
diacritics, it should make no difference of interpretation whether you use
one or another of the 3 possible sequences (although they may still end up
rendered differently in practice, due to implementation limits).
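
As a quick check of that equivalence, here is a minimal Python sketch using
the standard unicodedata module. It takes the letter t in the three
spellings a), b) and c) from the question quoted below; the names
"spellings" and "code_points" are mine, and this is only an illustration,
not a recommendation of one normalization form over another.

```python
import unicodedata

# The three spellings of "t with dot below and dot above" from the question.
spellings = {
    "a": "t\u0323\u0307",   # t + combining dot below + combining dot above
    "b": "\u1E6D\u0307",    # t with dot below + combining dot above
    "c": "\u1E6B\u0323",    # t with dot above + combining dot below
}

def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Normalize every spelling to both the decomposed and the composed form.
nfd = {key: unicodedata.normalize("NFD", s) for key, s in spellings.items()}
nfc = {key: unicodedata.normalize("NFC", s) for key, s in spellings.items()}

for key in spellings:
    print(key, "NFD:", code_points(nfd[key]), "| NFC:", code_points(nfc[key]))
```

All three rows should show the same NFD sequence and the same NFC sequence,
so a catalog that normalizes its data consistently will treat the three
spellings as one.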
If there are multiple diacritics, and real reasons why their intended
semantic reading order is important, then you need to use an invisible
combining joiner (the combining grapheme joiner, U+034F) when encoding the
grapheme. If the renderer can exhibit the differences, it will follow the
hints provided by these joiners when selecting the appropriate glyphs, but
some renderers may still default to the same visual appearance even when an
explicit combining joiner specifies the intended order.
But for a dot below and a dot above diacritic, I see absolutely no way in
which they would collide, so their encoding order does not matter, and there
should be no combining joiner encoded between these two except in cases
like:
<BASE letter, combining dot above, enclosing circle, combining dot below>
and
<BASE letter, combining dot below, enclosing circle, combining dot above>,
where the combining enclosing circle (U+20DD, combining class 0) would
block the reordering.
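
A small Python sketch of that blocking effect, assuming the circle meant
here is U+20DD COMBINING ENCLOSING CIRCLE (that identification and the
variable names are mine):

```python
import unicodedata

CIRCLE = "\u20DD"  # COMBINING ENCLOSING CIRCLE, assumed to be the circle meant above

above_first = "t\u0307" + CIRCLE + "\u0323"  # dot above, circle, dot below
below_first = "t\u0323" + CIRCLE + "\u0307"  # dot below, circle, dot above

# Canonical reordering never moves a mark across a character of class 0.
circle_class = unicodedata.combining(CIRCLE)

nfd_above_first = unicodedata.normalize("NFD", above_first)
nfd_below_first = unicodedata.normalize("NFD", below_first)

print("combining class of circle:", circle_class)
print("still distinct after NFD:", nfd_above_first != nfd_below_first)
```

Because the two dots can no longer be reordered past the circle, the two
sequences stay distinct under normalization.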
The relative values of non-zero combining classes don't have any
demonstrated semantic meaning and do not imply any forced reading order,
except where the diacritics may collide graphically in the same position
(this is easily seen in the Hebrew script, where you need joiners to help
disambiguate the semantic order of a sequence of diacritics in order to
generate the correct visual rendering).
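
To illustrate the Hebrew case, here is a hedged Python sketch. The choice
of lamed carrying patah and hiriq is my own example, not one taken from
this thread, and CGJ is U+034F COMBINING GRAPHEME JOINER, which I take to
be the joiner meant above.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

plain = LAMED + PATAH + HIRIQ          # intended order: patah, then hiriq
joined = LAMED + PATAH + CGJ + HIRIQ   # same order, with a joiner in between

# Hebrew points have distinct non-zero classes, so normalization sorts them.
classes = (unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))

plain_nfd = unicodedata.normalize("NFD", plain)
joined_nfd = unicodedata.normalize("NFD", joined)

print("classes (patah, hiriq):", classes)
print("order kept without joiner:", plain_nfd == plain)
print("order kept with joiner:", joined_nfd == joined)
```

Without the joiner, normalization swaps the two points; with it, the
encoded order survives, which is the only thing the joiner guarantees.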
There are however known exceptions for some diacritics used in the Latin
script, like the cedilla, which moves from the usual below position to the
top-left position depending on the letter case of the base letter. For such
extremely rare cases, it could be necessary to add joiners to avoid
ambiguities, if the default order specified by the relative combining
classes of the diacritics is not the correct one.
Note that in all cases, if the multiple diacritics have the same combining
class, then their relative encoding order in texts is significant as
distinct orders of these diacritics are NOT canonically equivalent.
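
A minimal Python sketch of this last point, using the acute accent and the
diaeresis as an example pair of my own choosing (both are "above" marks
with the same combining class):

```python
import unicodedata

ACUTE, DIAERESIS = "\u0301", "\u0308"  # two marks that attach above the base

acute_first = "a" + ACUTE + DIAERESIS
diaeresis_first = "a" + DIAERESIS + ACUTE

# Marks of equal class are never reordered relative to each other.
same_class = unicodedata.combining(ACUTE) == unicodedata.combining(DIAERESIS)
equivalent = (unicodedata.normalize("NFD", acute_first)
              == unicodedata.normalize("NFD", diaeresis_first))

print("same combining class:", same_class)
print("canonically equivalent:", equivalent)
```

Here the encoding order has to be chosen deliberately (usually from the
base letter outwards), because normalization will not repair it.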
Note also that Unicode does not currently specify how multiple diacritics
stack around the base letter; generally above and below diacritics do
implicitly stack vertically for the generic diacritics of alphabetic
scripts, and horizontally for the diacritics of Semitic abjads.
If you want to specify another combining mode, then you'll need to encode
some additional combining joiner. But the currently defined combining joiner
does not specify that; it just helps resolve the semantic reading order in
a sequence of diacritics, and says nothing about their relative layout.
This is still something that Unicode needs to describe more formally, with
additional joining properties for diacritics, and possibly the definition
and encoding of new combining joiners.
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
> Behalf Of Agnieszka Kasprzyk
> Sent: Thursday, May 24, 2007 14:01
> To: unicode@unicode.org
> Subject: various ways of making a specific character
>
> Hello,
>
> I work for the union catalog of Polish libraries. Our contributors use ISO
> transliteration standards.
> Could you explain to me how to deal with characters from transliteration
> standards that do not exist as precomposed characters in Unicode, but can
> be composed from other characters in a number of different ways? Which is
> the correct way?
>
> Example:
> ISO 259:1984 (Transliteration of Hebrew characters into Latin characters)
> requires us to enter letter t with dot below and above and letter s with
> dot below and above.
> Now each of these characters may be built of:
> a) letter t/s (U+0074/U+0073) + combining dot below (U+0323) + combining
>    dot above (U+0307)
> b) letter t/s with dot below (U+1E6D/U+1E63) + combining dot above (U+0307)
> c) letter t/s with dot above (U+1E6B/U+1E61) + combining dot below
>    (U+0323)
>
> Other cases are for instance letters with two diacritics one over the
> other. Should it be base letter + upper character + lower character, or
> base letter + character which is closer + character which is further from
> the base letter, or, if it's possible, base letter with one diacritic as
> one character + the other diacritic as the combining character?
>
> What is the rule to follow in such cases? Is there any document specifying
> what to do?
>
> I would really appreciate your help with this,
>
> thank you,
>
> Agnieszka Kasprzyk
> mail: a.e.kasprzyk@uw.edu.pl
>
> NUKAT Center, Warsaw University Library, Poland
> http://www.nukat.edu.pl
This archive was generated by hypermail 2.1.5 : Thu May 24 2007 - 11:26:43 CDT