RE: Normalization in Bengali

From: Peter Constable (petercon@microsoft.com)
Date: Tue Nov 14 2006 - 01:19:06 CST


    Canonical combining classes aren't intended to indicate what character
    sequences are "acceptable" -- strictly speaking, all character sequences
    are valid. All that the canonical combining classes are for is to
    provide a folding of different sequences into equivalence classes. The
    upshot of 09C1 and 0981 having canonical combining class = 0 is that
    each differently-ordered sequence involving these characters is in a
    distinct equivalence class -- i.e., the sequences are not considered
    equivalent.
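
The point about distinct equivalence classes can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the variable names and the helper `canonically_equivalent` are illustrative, and the Latin contrast case (dot below plus acute accent) is an added example that does not come from the message.

```python
import unicodedata

GHA, VOWEL_U, CANDRABINDU = "\u0998", "\u09C1", "\u0981"

# Both marks report canonical combining class 0.
print(unicodedata.combining(VOWEL_U), unicodedata.combining(CANDRABINDU))

u_then_can = GHA + VOWEL_U + CANDRABINDU
can_then_u = GHA + CANDRABINDU + VOWEL_U


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Class-0 marks are never reordered, so the two orders stay distinct.
print(canonically_equivalent(u_then_can, can_then_u))  # False

# Contrast (assumed example): marks with different non-zero classes
# (U+0323 = 220, U+0301 = 230) are reordered, so both orders fold together.
print(canonically_equivalent("a\u0301\u0323", "a\u0323\u0301"))  # True
```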

    Canonical combining classes were set up in a way that, in theory, could
    allow for the satellite signs of an Indic cluster. In practice, though,
    those classes couldn't be applied to characters inherited from ISCII
    because of the presence of "split matra" characters -- characters
    consisting of multiple glyph elements in distinct orientations relative
    to the base: these would require a single character to belong to more
    than one class, which is not possible. The only option is to assign them
    all to class 0.
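
To make the "split matra" point concrete, here is a small sketch using one such character. The choice of U+09CB BENGALI VOWEL SIGN O is an assumption made for illustration; the message itself does not name a specific character.

```python
import unicodedata

SPLIT_MATRA = "\u09CB"  # BENGALI VOWEL SIGN O (illustrative choice)

# The canonical decomposition lists the two pieces as hex code points.
pieces = unicodedata.decomposition(SPLIT_MATRA).split()
print(pieces)

for hex_cp in pieces:
    ch = chr(int(hex_cp, 16))
    # One piece sits to the left of the base and the other to the right,
    # so no single positional class could describe the whole character.
    print(hex_cp, unicodedata.name(ch), unicodedata.combining(ch))

# The split matra itself is assigned class 0.
print(unicodedata.combining(SPLIT_MATRA))
```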

    Peter

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    Behalf Of Michael Maxwell
    Sent: Monday, November 13, 2006 11:54 AM
    To: unicode@unicode.org
    Cc: Michael Maxwell
    Subject: Normalization in Bengali

    I have a (newbie) question on normalization of text in Bengali. This
    doesn't have to do with composition vs. decomposition, rather with the
    correct order of characters.

    The question is on which order of dependent vowel + candrabindu is
    correct (or whether both are "correct"). Either I'm not understanding
    the meaning of the records in UnicodeData.txt, or else this doesn't
    determine an order. Here are the records (and for good measure, a
    consonant record):

    09C1;BENGALI VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;;
    0981;BENGALI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;
    0998;BENGALI LETTER GHA;Lo;0;L;;;;;N;;;;;

    As you can see, all three characters have a Canonical Combining Class of
    0. (Can anyone point me to an explanation of why dependent vowels and
    the candrabindu have this class? Looks odd to me, but I'm sure a ton of
    discussion went into deciding it.) Apart from their code point and
    name, the dependent vowel sign and the candrabindu are identical: both
    have a General Category of 'Mn' = "Mark, Nonspacing", and Bidi class
    values of 'NSM' = "Non-Spacing Mark".
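
The fields discussed above can be pulled out of the records mechanically. The following is a rough sketch that reads only the first five semicolon-separated fields; the `Record` type and `parse_record` helper are hypothetical names, not part of any Unicode tooling.

```python
from typing import NamedTuple


class Record(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str


def parse_record(line: str) -> Record:
    """Parse the first five fields of one UnicodeData.txt line."""
    fields = line.strip().split(";")
    return Record(int(fields[0], 16), fields[1], fields[2],
                  int(fields[3]), fields[4])


RECORDS = """\
09C1;BENGALI VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;;
0981;BENGALI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;
0998;BENGALI LETTER GHA;Lo;0;L;;;;;N;;;;;
"""

records = [parse_record(line) for line in RECORDS.splitlines() if line.strip()]
for r in records:
    print(f"U+{r.code_point:04X} {r.name}: gc={r.general_category} "
          f"ccc={r.combining_class} bidi={r.bidi_class}")
```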

    It looks to me like this is saying either order (GHA-u-CAN or GHA-CAN-u)
    is acceptable. Is this the intended interpretation? (In which case
    there's more work one needs to do in order to be able to compare two
    strings--work in defining a more strict normalization for this and any
    other potentially ambiguous sequences of characters.)
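
One way such a stricter normalization might look is sketched below. It rests on loudly stated assumptions: the convention "vowel sign first, then candrabindu" is chosen here only for illustration, the list of vowel signs is a partial, hand-picked set, and `strict_normalize` is a hypothetical helper rather than anything defined by Unicode.

```python
import re
import unicodedata

CANDRABINDU = "\u0981"
# Assumed, partial set of Bengali dependent vowel signs:
# U+09BE..U+09C4, U+09C7, U+09C8, U+09CB, U+09CC.
VOWEL_SIGNS = "\u09BE-\u09C4\u09C7\u09C8\u09CB\u09CC"

# Matches a candrabindu that precedes one or more vowel signs.
_CAN_BEFORE_VOWEL = re.compile(f"({CANDRABINDU})([{VOWEL_SIGNS}]+)")


def strict_normalize(text: str) -> str:
    """Apply NFC, then move candrabindu after any following vowel signs."""
    text = unicodedata.normalize("NFC", text)
    return _CAN_BEFORE_VOWEL.sub(r"\2\1", text)


GHA, VOWEL_U = "\u0998", "\u09C1"
a = strict_normalize(GHA + VOWEL_U + CANDRABINDU)
b = strict_normalize(GHA + CANDRABINDU + VOWEL_U)
print(a == b)  # both orders now compare equal
```

With a rule like this, two strings that differ only in the order of these marks compare equal after the extra pass, which plain NFC or NFD will not do.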

    (FWIW, in Windows XP SP-2 with the Language Pack for complex scripts
    installed, and in Word 2003, the order GHA-u-CAN renders properly with
    the Vrinda font, but the order GHA-CAN-u does not. Both orders render
    properly with the Arial Unicode font. I don't know whether this is a
    bug in Vrinda, or a buture in the Arial Unicode font.)

       Mike Maxwell
       CASL/ U Md



    This archive was generated by hypermail 2.1.5 : Tue Nov 14 2006 - 01:21:21 CST