Re: Major Defect in Combining Classes of Tibetan Vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jun 25 2003 - 12:07:04 EDT

    On Wednesday, June 25, 2003 4:31 PM, Andrew C. West <andrewcwest@alumni.princeton.edu> wrote:
    > On Wed, 25 Jun 2003 15:05:26 +0400, "Valeriy E. Ushakov" wrote:
    > What I'm suggesting is that although "cui" <0F45, 0F74, 0F72> and
    > "ciu" <0F45, 0F72, 0F74> should be rendered identically, the logical
    > ordering of the codepoints representing the vowels may represent
    > lexical differences that would be lost during the process of
    > normalisation.

    This is an excellent argument, and it is why the Vietnamese use of multiple diacritics was studied so that the logical ordering of accents on Latin letters could be preserved. However, if the rendered text cannot be distinguished, the order of the diacritics matters only in the mind of the reader and does not exist in the written form.

    This would be important if there were a need to create a transliteration rule (for example from Tibetan to Latin script). But even in that case, knowledge of the source language is required, as no transliteration rule works well using only the script information. So transliteration rules are very often context-sensitive.

    What is important is how a native Tibetan reader would read the grapheme cluster. If the reader reads it as "ciu", then it is to be interpreted as "ciu", and the logical order is then more important than the encoding order, because such a difference does not exist in the actual written script.

    If I just take the example of the Latin script, a sequence like <C, COMBINING CEDILLA, COMBINING ACUTE ACCENT> will have a canonical order for the last two diacritics which is not important at the linguistic level when looking at the written script. The canonical order and combining classes exist only BECAUSE the encoding would otherwise allow several *equivalent* sequences that no reader would be able to read distinctly. When confusion is possible, and the distinction does not exist in the original script before its encoding, there should be a way to unify all of these sequences.
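
    To make the Latin case concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module (the variable names are only illustrative). It builds the two possible encodings of C with a cedilla and an acute accent and checks that canonical decomposition (NFD) maps both to a single order.

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT

# Two encodings that no reader could tell apart once rendered.
seq_a = "C" + CEDILLA + ACUTE
seq_b = "C" + ACUTE + CEDILLA

# NFD applies canonical reordering by combining class.
nfd_a = unicodedata.normalize("NFD", seq_a)
nfd_b = unicodedata.normalize("NFD", seq_b)

# Combining classes of the characters in the normalized sequence.
classes = [unicodedata.combining(ch) for ch in nfd_a]
same = nfd_a == nfd_b

print(classes, same)
```

    Both inputs normalize to <C, cedilla, acute> because the cedilla's combining class (202) is lower than that of the acute accent (230).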

    So even if the canonical ordering of Tibetan vowel signs is not logical, as long as it produces the same written text, this is not a problem, and there is no more loss of semantics than in the original script.

    So if the Tibetan script cannot make a distinction between "ciu" and "cui", this is *not* a Unicode defect. This confusion already exists in the original script, and there is no loss of semantics in the Unicode encoding when compared to the actual written script. Let's not create a problem by adding new semantics to the Tibetan language (such as a distinction between "ciu" and "cui") just *because* this seems /possible/ in Unicode. If we respect a script or language, we must not tolerate such artificial distinctions.

    It's true that the canonical ordering should match the logical ordering, but I think that there are a lot of exceptions, notably within Brahmic scripts with disjoint letters, or in Thai (encoded according to a pre-existing standard, TIS-620, which used visual ordering), or even in many Hebrew or Arabic texts (sometimes also encoded in visual order, and requiring tools to reorder the text into a preferred order, because the order actually used in the text cannot be determined without an out-of-band specification)...

    So if one really wants to handle the logical ordering, it's perfectly possible to exchange the "i" and "u" in "cui" without affecting canonical equivalence and without changing the semantics of the original Tibetan text. Canonical ordering is only needed to unify equivalent sequences; it is not intended to sort distinct strings (that is not part of the Unicode encoding, but of a collation algorithm such as the UCA, tailored appropriately for each language on top of the default UCA order for the script).
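
    The exchange of "i" and "u" can be checked directly. The following is a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper canonically_equivalent is a hypothetical name introduced here, not a library function. It compares the two sequences from the quoted message after canonical decomposition.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Hypothetical helper: True if a and b have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

CA = "\u0F45"      # TIBETAN LETTER CA
SIGN_I = "\u0F72"  # TIBETAN VOWEL SIGN I
SIGN_U = "\u0F74"  # TIBETAN VOWEL SIGN U

cui = CA + SIGN_U + SIGN_I  # <0F45, 0F74, 0F72>
ciu = CA + SIGN_I + SIGN_U  # <0F45, 0F72, 0F74>

print(cui == ciu)                        # code point sequences differ
print(canonically_equivalent(cui, ciu))  # but they are canonically equivalent
print(unicodedata.combining(SIGN_I), unicodedata.combining(SIGN_U))
```

    NFD reorders <0F45, 0F74, 0F72> to <0F45, 0F72, 0F74>, since U+0F72 has combining class 130 and U+0F74 has combining class 132.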

    A correct UCA collation for the Tibetan script can perfectly well be created, and then tailored for the Tibetan language to reorder the vowel signs. (This is no more complicated than handling the French reordering of accents.) It just requires a multi-level sort algorithm, where "u" and "i" would have the same collation keys at level N, and could be reordered using a French-style reordering of vowel signs for keywords or grapheme clusters at level N+1 or N+2.
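
    The multi-level idea can be illustrated with a toy two-level sort key. This is only a sketch, assuming Python 3; it is not the UCA and uses no real collation weights, and the names split_clusters and sort_key are hypothetical. Level 1 compares base characters only; level 2 compares the combining marks attached to each base, either front to back or, French-style, back to front.

```python
import unicodedata

def split_clusters(text):
    """Split NFD text into base characters and, per base, its combining marks."""
    bases, marks = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0 or not bases:
            bases.append(ch)   # starts a new cluster
            marks.append(())
        else:
            marks[-1] += (ch,)  # mark attaches to the previous base
    return tuple(bases), tuple(marks)

def sort_key(text, backwards=False):
    """Toy key: (level 1 = bases, level 2 = marks, optionally reversed)."""
    bases, marks = split_clusters(text)
    secondary = marks[::-1] if backwards else marks
    return (bases, secondary)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=sort_key)
french = sorted(words, key=lambda w: sort_key(w, backwards=True))

print(forward)  # cote, coté, côte, côté
print(french)   # cote, côte, coté, côté
```

    All four words share the same level 1 key, so only the level 2 comparison orders them. Tibetan vowel signs have non-zero combining classes, so this toy key would likewise place them in the second level rather than the first.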



    This archive was generated by hypermail 2.1.5 : Wed Jun 25 2003 - 13:03:00 EDT