From: Michael Maxwell (mmaxwell@casl.umd.edu)
Date: Mon Nov 13 2006 - 13:53:37 CST
I have a (newbie) question on normalization of text in Bengali. This doesn't have to do with composition vs. decomposition, rather with the correct order of characters.
The question is on which order of dependent vowel + candrabindu is correct (or whether both are "correct"). Either I'm not understanding the meaning of the records in UnicodeData.txt, or else this doesn't determine an order. Here are the records (and for good measure, a consonant record):
09C1;BENGALI VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;;
0981;BENGALI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;
0998;BENGALI LETTER GHA;Lo;0;L;;;;;N;;;;;
As you can see, all three characters have a Canonical Combining Class of 0. (Can anyone point me to an explanation of why dependent vowels and the candrabindu have this class? Looks odd to me, but I'm sure a ton of discussion went into deciding it.) Apart from their code point and name, the dependent vowel sign and the candrabindu are identical: both have a General Category of 'Mn' = "Mark, Nonspacing", and a Bidi class of 'NSM' = "Non-Spacing Mark".
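The property values quoted above can be checked against whatever Unicode database a given Python build ships with. This is only a minimal sketch: the helper name `properties` is made up for illustration, and the values returned depend on the Unicode version bundled with the interpreter, though these particular fields are not expected to differ.

```python
import unicodedata

# The three characters from the UnicodeData.txt records quoted above.
VOWEL_SIGN_U = "\u09C1"
CANDRABINDU = "\u0981"
GHA = "\u0998"


def properties(ch):
    """Hypothetical helper: return (name, General Category,
    Canonical Combining Class, Bidi class) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.bidirectional(ch),
    )


for ch in (VOWEL_SIGN_U, CANDRABINDU, GHA):
    print("U+%04X" % ord(ch), properties(ch))
```

Both marks report a combining class of 0, so canonical reordering never moves one past the other.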
It looks to me like this is saying either order (GHA-u-CAN or GHA-CAN-u) is acceptable. Is this the intended interpretation? (In which case there's more work one needs to do in order to be able to compare two strings--work in defining a more strict normalization for this and any other potentially ambiguous sequences of characters.)
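To make the comparison problem concrete, here is a rough Python sketch. It checks that none of the four standard normalization forms brings the two orders together, and then shows what a stricter, application-level normalization might look like. The function names are hypothetical, and the choice of GHA-u-CAN as the preferred order is an assumption made purely for illustration, not something the records above establish.

```python
import unicodedata

GHA = "\u0998"           # BENGALI LETTER GHA
VOWEL_SIGN_U = "\u09C1"  # BENGALI VOWEL SIGN U
CANDRABINDU = "\u0981"   # BENGALI SIGN CANDRABINDU

vowel_first = GHA + VOWEL_SIGN_U + CANDRABINDU        # GHA-u-CAN
candrabindu_first = GHA + CANDRABINDU + VOWEL_SIGN_U  # GHA-CAN-u


def stay_distinct(a, b):
    """True if a and b remain different under NFC, NFD, NFKC and NFKD."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(
        unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
        for form in forms
    )


def reorder_candrabindu(text):
    """Hypothetical stricter normalization: rewrite CAN + u as u + CAN.

    Assumes vowel-sign-first is the preferred order and handles only
    this one vowel sign; a real version would cover every dependent vowel.
    """
    return text.replace(CANDRABINDU + VOWEL_SIGN_U, VOWEL_SIGN_U + CANDRABINDU)


print(stay_distinct(vowel_first, candrabindu_first))
print(reorder_candrabindu(candrabindu_first) == vowel_first)
```

Because both marks have combining class 0, the standard forms leave each sequence untouched, so any unification has to come from an extra step like the one sketched here.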
(FWIW, in Windows XP SP2 with the Language Pack for complex scripts installed, and in Word 2003, the order GHA-u-CAN renders properly with the Vrinda font, but the order GHA-CAN-u does not. Both orders render properly with the Arial Unicode font. I don't know whether this is a bug in Vrinda, or a feature in the Arial Unicode font.)
Mike Maxwell
CASL/ U Md