From: Peter Constable (petercon@microsoft.com)
Date: Tue Nov 14 2006 - 01:19:06 CST
Canonical combining classes aren't intended to indicate what character
sequences are "acceptable" -- strictly speaking, all character sequences
are valid. All that the canonical combining classes are for is to
provide a folding of different sequences into equivalence classes. The
upshot of 09C1 and 0981 having canonical combining class = 0 is that
each differently-ordered sequence involving these characters is in a
distinct equivalence class -- i.e., the sequences are not considered
equivalent.
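To make this concrete, here is a minimal Python sketch using the standard
unicodedata module. The helper name canonically_equivalent and the Latin
contrast case (U+0301 and U+0323, which have distinct non-zero classes) are
illustrative choices, not something taken from the question.

```python
import unicodedata

GHA, U_SIGN, CANDRABINDU = "\u0998", "\u09C1", "\u0981"

# Both marks have canonical combining class 0, so normalization
# never reorders them relative to each other.
print(unicodedata.combining(U_SIGN), unicodedata.combining(CANDRABINDU))

def canonically_equivalent(s, t):
    """Illustrative helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

order_a = GHA + U_SIGN + CANDRABINDU   # GHA-u-CAN
order_b = GHA + CANDRABINDU + U_SIGN   # GHA-CAN-u
print(canonically_equivalent(order_a, order_b))

# Contrast: two marks with different non-zero classes ARE reordered,
# so both orders fall into the same equivalence class.
ACUTE, DOT_BELOW = "\u0301", "\u0323"
print(canonically_equivalent("a" + ACUTE + DOT_BELOW,
                             "a" + DOT_BELOW + ACUTE))
```

The Bengali pair compares as not equivalent, while the Latin pair compares as
equivalent, which is the distinction the combining classes exist to draw.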
Canonical combining classes were set up in a way that, in theory, could
allow for the satellite signs of an Indic cluster. In practice, though,
those classes couldn't be applied to characters inherited from ISCII
because of the presence of "split matra" characters -- characters
consisting of multiple glyph elements in distinct orientations relative
to the base: these would require a single character to belong to more
than one class, which is not possible. The only option is to assign them
all to class 0.
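As an illustration of a split matra, the Python sketch below inspects U+09CB
BENGALI VOWEL SIGN O. The choice of U+09CB as the example is an assumption
made here for illustration; it is one character whose canonical decomposition
has two parts, and all of the characters involved carry class 0.

```python
import unicodedata

VOWEL_SIGN_O = "\u09CB"  # example two-part vowel sign, chosen for illustration

# The canonical decomposition splits the sign into its two parts.
parts = unicodedata.normalize("NFD", VOWEL_SIGN_O)

for ch in VOWEL_SIGN_O + parts:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"ccc={unicodedata.combining(ch)}")
```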
Peter
-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Michael Maxwell
Sent: Monday, November 13, 2006 11:54 AM
To: unicode@unicode.org
Cc: Michael Maxwell
Subject: Normalization in Bengali
I have a (newbie) question on normalization of text in Bengali. This
doesn't have to do with composition vs. decomposition, rather with the
correct order of characters.
The question is on which order of dependent vowel + candrabindu is
correct (or whether both are "correct"). Either I'm not understanding
the meaning of the records in UnicodeData.txt, or else this doesn't
determine an order. Here are the records (and for good measure, a
consonant record):
09C1;BENGALI VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;;
0981;BENGALI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;
0998;BENGALI LETTER GHA;Lo;0;L;;;;;N;;;;;
As you can see, all three characters have a Canonical Combining Class of
0. (Can anyone point me to an explanation of why dependent vowels and
the candrabindu have this class? Looks odd to me, but I'm sure a ton of
discussion went into deciding it.) Apart from their code point and
name, the dependent vowel sign and the candrabindu are identical: both
have a General Category of 'Mn' = "Mark, Nonspacing", and Bidi class
values of 'NSM' = "Non-Spacing Mark".
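For reference, a minimal Python sketch of how the fields in those records can
be read. The function name parse_record and the dictionary keys are
hypothetical; the field positions follow the semicolon-separated layout of
the records quoted above.

```python
records = """\
09C1;BENGALI VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;;
0981;BENGALI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;
0998;BENGALI LETTER GHA;Lo;0;L;;;;;N;;;;;
"""

def parse_record(line):
    """Hypothetical helper: pull out the first five fields of one record."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],
        "combining_class": int(fields[3]),
        "bidi_class": fields[4],
    }

parsed = [parse_record(line) for line in records.splitlines()]
for rec in parsed:
    print(f"U+{rec['code_point']:04X} gc={rec['general_category']} "
          f"ccc={rec['combining_class']} bidi={rec['bidi_class']}")
```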
It looks to me like this is saying either order (GHA-u-CAN or GHA-CAN-u)
is acceptable. Is this the intended interpretation? (In which case
there's more work one needs to do in order to be able to compare two
strings--work in defining a more strict normalization for this and any
other potentially ambiguous sequences of characters.)
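A rough Python sketch of what such a stricter normalization might look like.
It assumes, purely for illustration, that vowel sign followed by candrabindu
is the preferred order; the function name strict_normalize and the range used
for dependent vowel signs (U+09BE through U+09CC) are my own choices and
would need checking against real data.

```python
import re
import unicodedata

GHA, U_SIGN, CANDRABINDU = "\u0998", "\u09C1", "\u0981"

# Assumption: a candrabindu directly followed by a dependent vowel sign
# should be moved after that vowel sign.
_CAN_BEFORE_VOWEL = re.compile("\u0981([\u09BE-\u09CC])")

def strict_normalize(text):
    """Hypothetical stricter normalization: NFC plus one reordering rule."""
    text = unicodedata.normalize("NFC", text)
    return _CAN_BEFORE_VOWEL.sub("\\1\u0981", text)

a = strict_normalize(GHA + U_SIGN + CANDRABINDU)   # GHA-u-CAN
b = strict_normalize(GHA + CANDRABINDU + U_SIGN)   # GHA-CAN-u
print(a == b)
```

With this extra rule both orders map to the same string, so the two spellings
compare as equal after strict_normalize even though standard normalization
keeps them distinct.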
(FWIW, in Windows XP SP2 with the Language Pack for complex scripts
installed, and in Word 2003, the order GHA-u-CAN renders properly with
the Vrinda font, but the order GHA-CAN-u does not. Both orders render
properly with the Arial Unicode font. I don't know whether this is a
bug in Vrinda, or a feature in the Arial Unicode font.)
Mike Maxwell
CASL/ U Md