Re: PRC asking for 956 precomposed Tibetan characters

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Wed Jan 08 2003 - 04:47:37 EST

    ------- Start of forwarded message -------

    From: "Robert R. Chilton" <acip@well.com>
    Date: Wed, 08 Jan 2003 00:16:35 -0500
    Cc: unicode@unicode.org, tibex@unicode.org
    Subject: Re: PRC asking for 956 precomposed Tibetan characters
    To: "Andrew C. West" <andrewcwest@alumni.princeton.edu>

    Andrew C. West wrote:
    >
    > On Tue, 07 Jan 2003 06:16:43 -0800 (PST), "Robert R. Chilton" wrote:
    >
    > > I understand your interest in preserving the semantic or lexical
    > > distinction between an instance of a contracted series of single vowels
    > > and a true usage of the double vowel. However, the procedure of
    > > normalization is designed to collapse all the variant encodings for a
    > > particular presentation form into a single, "normalized" encoding.
    > > ...
    > > Canonical combining classes are defined for combining characters (such
    > > as macron and dot-under, or the vowel signs of Tibetan) in order to
    > > support normalization of identical presentation forms to a single
    > > encoding. So in the cases you cite, of "graphically identical but
    > > semantically different" instances, consistency in searching, sorting,
    > > etc. requires that all "graphically identical" presentation forms be
    > > normalized to a single normalized encoding.
    > >
    >
    > O.K. Your explanation of normalisation makes sense, and I'll change the
    > encoding of double and triple E and O vowel signs accordingly on my web
    > pages. The only query I still have is why a triple E vowel sign should be
    > normalised to <U+0F7B, U+0F7A> rather than <U+0F7A, U+0F7B>? What
    > determines that the former sequence is better than the latter sequence?
    >
    > Andrew

    In the normalization process instances of a sequence of either two E
    vowels or two O vowels may be normalized to double E vowel or double O
    vowel. Thus, in a case of three E or three O vowels in sequence the
    first two would be normalized to the double vowel with the single vowel
    trailing.

    Unfortunately, since the single and double vowel characters are assigned
    the same canonical combining class of 130, a further step of processing
    is required in order that any sequence of e.g., <U+0F7A, U+0F7B> be
    normalized to <U+0F7B, U+0F7A>. So here again is a case where it would
    be desirable to alter some of the canonical combining classes that have
    been assigned to characters in the Tibetan block.
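
To illustrate the point about shared combining classes, here is a minimal
Python sketch using the standard `unicodedata` module. It shows that the four
vowel signs report class 130 and that normalization therefore leaves
<U+0F7A, U+0F7B> in its original order. The helper `reorder_double_first` is a
hypothetical stand-in for the "further step of processing" described above,
not part of any standard; the base letter U+0F40 is chosen only for the
example.

```python
import unicodedata

# Tibetan vowel signs E, EE, O, OO
E, EE, O, OO = "\u0f7a", "\u0f7b", "\u0f7c", "\u0f7d"


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


def reorder_double_first(text):
    """Hypothetical extra step: put a double vowel before a single one.

    Canonical reordering only sorts marks whose classes differ, so marks
    that share a class keep their input order and need a rule like this.
    """
    for single, double in ((E, EE), (O, OO)):
        # Repeat until no single vowel directly precedes its double form.
        while single + double in text:
            text = text.replace(single + double, double + single)
    return text


base = "\u0f40"  # TIBETAN LETTER KA, used here only as an example base
seq = base + E + EE

print(combining_classes(E + EE + O + OO))
print(unicodedata.normalize("NFC", seq) == seq)
print(reorder_double_first(seq) == base + EE + E)
```

Because both marks share one class, neither NFC nor NFD swaps them, so any
preferred order has to come from a rule outside standard normalization.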

    If it is indeed not possible to assign new canonical combining classes
    to U+0F7B and U+0F7D then it may be preferable to specify (or otherwise
    obtain by processing) a canonical or compatibility decomposition for
    these two characters as <U+0F7A, U+0F7A> and <U+0F7C, U+0F7C>,
    respectively, and deprecate the use of these double-vowel characters.
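
The effect of such a decomposition can be sketched in Python. The mapping
`PROPOSED_DECOMPOSITION` and the function `decompose_double_vowels` are
hypothetical and model only the suggestion above; they do not reflect data
published in the Unicode Character Database. Under this assumption, both
orderings of a triple E collapse to the same sequence of three single vowels.

```python
# Hypothetical decompositions modelling the suggestion above:
# EE -> <E, E> and OO -> <O, O>.
PROPOSED_DECOMPOSITION = {
    "\u0f7b": "\u0f7a\u0f7a",  # VOWEL SIGN EE -> E, E
    "\u0f7d": "\u0f7c\u0f7c",  # VOWEL SIGN OO -> O, O
}


def decompose_double_vowels(text):
    """Replace each double vowel sign with two single vowel signs."""
    return "".join(PROPOSED_DECOMPOSITION.get(ch, ch) for ch in text)


base = "\u0f40"  # TIBETAN LETTER KA, used here only as an example base
single_then_double = base + "\u0f7a\u0f7b"
double_then_single = base + "\u0f7b\u0f7a"

# Both orderings yield the same string once the double vowel is decomposed.
print(decompose_double_vowels(single_then_double)
      == decompose_double_vowels(double_then_single))
```

With the double vowels decomposed, the question of which order is preferred
no longer arises, since only one encoding of the triple vowel remains.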

    Kind regards,
    Robert

    ------- End of forwarded message -------


