Re: Ligatures and Decompositions

From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Aug 17 2009 - 09:37:17 CDT

    "Jukka K. Korpela" wrote:
    > Michael Everson wrote:
    >
    > > On 14 Aug 2009, at 17:08, Andrew Miller wrote:
    > >
    > >> Is there any reason why U+A732 LATIN CAPITAL LETTER AA and other
    > >> characters in the Latin Extended-D block don't have decompositions
    > >> like e.g. " 0041 0041"?
    > >
    > > They're not decomposable nor meant to be.
    >
    > (...)
    >
    > On the other hand, the particular question is probably only relevant to some
    > medievalists. Probably some use of a symbol consisting of two A's (partly
    > overlapping) has been described so that it has been accepted with an
    > identity of its own.

    Many existing ligatures have a compatibility decomposition into separate letters because that is how they are
    actually perceived in some languages (or script styles), which use them in a non-contrasting way.

    For example, the oe and ae ligatures are perceived as pairs of letters in French, where they may be pronounced
    as either one or two vowels, or where only one component is pronounced while the other is kept silent but still
    written, either for etymological reasons or to keep one of its components from being read as part of a digraph.
    This is reflected in the French collation used for sorting words in dictionaries.

    This is not true in every language: some use these ligatures to denote distinct, unbreakable single vowels. For
    this reason the decomposition, where it is justified, cannot be canonical without breaking those languages.

    (1) If a ligature MUST always be interpreted as a single letter, the previous solution with joiners and disjoiners
    is not applicable. You need a new character, but you don't need this decomposition.

    (2) If a ligature MUST always be interpreted as a 2-letter pair, you don't need to encode it: the ligature is a
    script style feature, generally optional, and there are no contrasting interpretations. This is the case for the
    'ff', 'fi', 'fl', 'ffl', 'ffi' or 'long s-t' ligatures, for example. You may still control the formation of
    ligatures with format controls (zero-width joiners and disjoiners, i.e. ZWJ and ZWNJ), used only as hints for
    renderers.

    (2.1) If such a ligature is encoded anyway, it is for compatibility with other encoding standards that lack
    Unicode's rich level of description and rich set of properties (in this case nothing prohibits the addition of a
    canonical decomposition).
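
    As a minimal sketch of case (2.1), the Python snippet below uses the standard unicodedata module to look at
    U+FB01 LATIN SMALL LIGATURE FI, one of the ligatures encoded for compatibility. In the data bundled with Python
    its mapping is tagged <compat>, so only the compatibility normalization forms split it into letters, while the
    canonical forms leave it alone. The variable names are mine, chosen for illustration only.

```python
import unicodedata

fi = "\uFB01"  # LATIN SMALL LIGATURE FI

# Raw decomposition mapping as recorded in the Unicode Character Database.
mapping = unicodedata.decomposition(fi)

# Canonical decomposition (NFD) ignores mappings tagged <compat>...
nfd = unicodedata.normalize("NFD", fi)
# ...while compatibility decomposition (NFKD) applies them.
nfkd = unicodedata.normalize("NFKD", fi)

print(unicodedata.name(fi), "->", mapping)
print("NFD keeps the ligature:", nfd == fi)
print("NFKD yields:", nfkd)
```

    A renderer remains free to form the ligature from the plain letters again, which is why nothing is lost by the
    compatibility mapping in this case.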

    (3) If a ligature MAY or MAY NOT be interpreted as a single letter depending on the context of use, the character
    also needs to be encoded. The interpretation as a 2-letter pair will then be possible only through compatibility
    decompositions. This is the case for the 'ae' and 'oe' ligatures.
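
    To see how the characters named in this thread actually look in the published data, here is a small Python
    sketch, again using only the standard unicodedata module. It prints whatever decomposition mapping the bundled
    Unicode Character Database records for each character; an empty string means none is defined. The labels and the
    helper name decomposition_report are hypothetical, and the result depends on the Unicode version shipped with
    your Python.

```python
import unicodedata

# Characters mentioned in this thread (labels are illustrative only).
chars = {
    "ae": "\u00E6",  # LATIN SMALL LETTER AE
    "oe": "\u0153",  # LATIN SMALL LIGATURE OE
    "AA": "\uA732",  # LATIN CAPITAL LETTER AA
}

def decomposition_report(ch):
    """Return (name, mapping); mapping is '' when the UCD defines none."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)

report = {label: decomposition_report(ch) for label, ch in chars.items()}

for label, (name, mapping) in report.items():
    print(f"{label}: {name}: {mapping or '(no decomposition mapping)'}")
```

    As far as I can tell, with current Python versions all three mappings come back empty, so for these characters
    a 2-letter reading has to come from something other than normalization, for instance from collation.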

    Once ligatures have been encoded, for the reasons given in (1), (2.1) or (3), their canonical or compatibility
    decompositions cannot be changed. If contrasting examples are found that break such principles, there are several
    solutions:
    * adapt the UCA collation tables for the languages that need this exception; no new character is needed.
    * encode a new ligature with:
    ** either the new compatibility decomposition (a priori, this very weak justification would be insufficient for
    deunification, given the power of the UCA and the possibilities of tailoring for specific languages);
    ** or the canonical decomposition removed (this should now occur extremely rarely, but it may be a valid and very
    strong reason for separate encoding/deunification).
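
    The first option (tailoring the collation rather than encoding anything new) can be illustrated with a toy
    Python sketch. This is NOT the UCA and uses no real collation data: FRENCH_LIKE and sort_key are hypothetical
    names, and the table is a hand-written stand-in for a language tailoring that expands a ligature into a letter
    pair for sorting purposes only.

```python
# Hypothetical tailoring: ligatures that should sort as two letters.
# Illustrative only; real tailorings live in UCA/CLDR collation data.
FRENCH_LIKE = {"œ": "oe", "æ": "ae"}

def sort_key(word, expansions):
    """Build a naive sort key by expanding each character per the tailoring."""
    return "".join(expansions.get(ch, ch) for ch in word.lower())

words = ["cot", "cœur", "code", "coq"]

# Without tailoring, plain code point order pushes the ligature past 'z'.
untailored = sorted(words, key=lambda w: sort_key(w, {}))
# With the tailoring, the ligature sorts as the pair 'oe'.
tailored = sorted(words, key=lambda w: sort_key(w, FRENCH_LIKE))

print(untailored)  # ['code', 'coq', 'cot', 'cœur']
print(tailored)    # ['code', 'cœur', 'coq', 'cot']
```

    The stored text is untouched in both cases; only the ordering changes, which is the point of solving this at
    the collation level.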

    Philippe


