Combining class for Thai characters

From: Samphan Raruenrom (samphan@thai.com)
Date: Tue May 21 2002 - 11:07:32 EDT


Hi,

I have something to consult with you about the properties of Thai
characters in Unicode.

From the UnicodeData.txt below :-
The (above-attached) vowel signs "MAI HAN-AKAT, SARA I, II, UE, UEE" have combining class 0?
The (below-attached) vowel signs "SARA U, UU" have combining class 103
The (above-attached) tone marks "MAI EK, THO, TRI, CHATTAWA" have combining class 107
The (below-attached) virama "PHINTHU" has combining class 9 (why does the data say it's a vowel sign?)
The (above-attached) marks "MAITAIKHU, THANTHAKHAT, NIKHAHIT and YAMAKKAN" also have combining class 0?
The (split) "SARA AM" is not a combining character but an Lo
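
The classes listed above can be cross-checked with a minimal sketch using Python's
standard unicodedata module. It assumes the Unicode database bundled with the
interpreter agrees with the excerpt at the end of this message; the dictionary
name and the selection of marks are only illustrative.

```python
import unicodedata

# Code points taken from the UnicodeData.txt excerpt in this message.
# The selection of marks is illustrative, not exhaustive.
marks = {
    "MAI HAN-AKAT": "\u0e31",
    "SARA II": "\u0e35",
    "SARA U": "\u0e38",
    "SARA UU": "\u0e39",
    "PHINTHU": "\u0e3a",
    "MAITAIKHU": "\u0e47",
    "MAI EK": "\u0e48",
    "THANTHAKHAT": "\u0e4c",
}

# unicodedata.combining() returns the canonical combining class as an int.
classes = {name: unicodedata.combining(ch) for name, ch in marks.items()}

# SARA AM is not a combining mark at all; its general category is a letter.
sara_am_category = unicodedata.category("\u0e33")

for name, ccc in classes.items():
    print(f"{name:13} ccc={ccc}")
print("SARA AM category:", sara_am_category)
```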

My first question is :-
Why do the above-attached vowel signs/marks all have combining class 0?

This keeps them from being reordered during normalization, right?

Examples :-
The sequences (both of which should look the same on a non-WTT shaping engine) :-
(1) KO KAI + SARA UU + MAI EK -> กู่ -> combining class = 0, 103, 107
(2) KO KAI + MAI EK + SARA UU -> กู่ -> combining class = 0, 107, 103
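
How these two orderings behave under normalization can be checked with a short
sketch, assuming Python's standard unicodedata module (the variable names are
only illustrative). Thai has no canonical compositions involved here, so NFC
amounts to canonical reordering for these strings.

```python
import unicodedata

KO_KAI, SARA_UU, MAI_EK = "\u0e01", "\u0e39", "\u0e48"

seq1 = KO_KAI + SARA_UU + MAI_EK   # combining classes 0, 103, 107
seq2 = KO_KAI + MAI_EK + SARA_UU   # combining classes 0, 107, 103

nfc1 = unicodedata.normalize("NFC", seq1)
nfc2 = unicodedata.normalize("NFC", seq2)

# The raw strings differ, but both normalize to the order of sequence (1),
# because 103 sorts before 107 during canonical reordering.
print(seq1 == seq2, nfc1 == nfc2, nfc2 == seq1)
```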

While Unicode doesn't have the notion of an invalid sequence, Thai has one, defined by a
national standard (WTT) to be (approximately) :
CONSONANT + (above or below) VOWEL SIGN + TONE MARK or THANTHAKHAT
The same concept occurs in, for example, Devanagari: the Unicode 3.0 book, page 219, says
(correct me if I'm wrong) that the memory representation of a syllable 'should' be :-
CONSONANT + NUKTA + VIRAMA + VOWEL SIGNS + BINDU + SAVARA.
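
The approximate Thai ordering stated above can be sketched as a regular
expression. This is only a rough illustration of that one-line pattern, not an
implementation of the WTT rules themselves; the character ranges, the optional
treatment of each mark, and the function name are my own assumptions.

```python
import re

# Rough approximation only: these ranges are assumptions for illustration.
CONSONANT = "\u0e01-\u0e2e"            # KO KAI .. HO NOKHUK (includes RU and LU)
VOWEL_SIGN = "\u0e31\u0e34-\u0e39"     # MAI HAN-AKAT, SARA I .. SARA UU
TONE_OR_THANTHAKHAT = "\u0e48-\u0e4c"  # MAI EK .. MAI CHATTAWA, THANTHAKHAT

# CONSONANT + optional (above or below) VOWEL SIGN + optional TONE MARK/THANTHAKHAT
CELL = re.compile(f"[{CONSONANT}][{VOWEL_SIGN}]?[{TONE_OR_THANTHAKHAT}]?")

def is_well_ordered_cell(s):
    """Return True if s is one cell in the approximate order given above."""
    return CELL.fullmatch(s) is not None

print(is_well_ordered_cell("\u0e01\u0e35\u0e48"))  # KO KAI + SARA II + MAI EK
print(is_well_ordered_cell("\u0e01\u0e48\u0e35"))  # KO KAI + MAI EK + SARA II
```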

So (correct me if I'm wrong) the notion of an invalid sequence in Unicode is script-specific.
And it is (is it?) intended that normalized sequences should (as much as possible?)
be correct for the particular script; otherwise, the normalized text will be rendered
differently from the un-normalized text (does it have to be?).

This works for the above sequences: both (1) and (2) are normalized to (1).
But for the following sequences :-
(3) KO KAI + SARA II + MAI EK -> กี่ -> combining class = 0, 0, 107
(4) KO KAI + MAI EK + SARA II -> ก่ี -> combining class = 0, 107, 0

They should both be normalized to (3), but they are not: class 0 does not participate in
reordering, so both count as already normalized. It's possible to correct this by assigning
the above-attached vowel signs (e.g. SARA II) a combining class greater than 0.
Or, following the Unicode (and Thai) convention of ordering below marks before above
marks, the combining class of the above vowels should be more than 103 (below vowels) and
less than 107 (tone marks, which are always above-attached).
Or, if it's intended that the above vowel and tone mark be stacked according
to the Unicode default inside-out rule, both should have the same combining class 107
to let them interact typographically.
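
To make the alternatives concrete, here is a minimal sketch of canonical
reordering with an optional override table. The function canonical_reorder, the
overrides parameter and the value 105 are all hypothetical, chosen only to
illustrate "more than 103 and less than 107"; they are not actual Unicode
assignments.

```python
import unicodedata

def canonical_reorder(text, overrides=None):
    """Stable-sort each run of non-starters by combining class.

    overrides maps a character to a HYPOTHETICAL combining class,
    used here only to try out the alternatives discussed above.
    """
    overrides = overrides or {}

    def ccc(ch):
        return overrides.get(ch, unicodedata.combining(ch))

    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

KO_KAI, SARA_II, MAI_EK = "\u0e01", "\u0e35", "\u0e48"
seq3 = KO_KAI + SARA_II + MAI_EK
seq4 = KO_KAI + MAI_EK + SARA_II

# Real data: SARA II has class 0, so (4) is left as it is.
actual4 = canonical_reorder(seq4)

# Hypothetical class 105 for SARA II: (4) would be reordered to (3).
between4 = canonical_reorder(seq4, {SARA_II: 105})

# Hypothetical class 107 (same as the tone marks): no reordering, (4) stays distinct.
same4 = canonical_reorder(seq4, {SARA_II: 107})

print(actual4 == seq4, between4 == seq3, same4 == seq4)
```

With the real data the sketch agrees with unicodedata.normalize("NFC", ...) on
these two strings; only the override cases are speculative.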

Am I right?

>8---------------- excerpt from UnicodeData.txt ----------------------8<
0E01;THAI CHARACTER KO KAI;Lo;0;L;;;;;N;THAI LETTER KO KAI;;;;
...
0E30;THAI CHARACTER SARA A;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA A;;;;
0E31;THAI CHARACTER MAI HAN-AKAT;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI HAN-AKAT;;;;
0E32;THAI CHARACTER SARA AA;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA AA;;;;
0E33;THAI CHARACTER SARA AM;Lo;0;L;<compat> 0E4D 0E32;;;;N;THAI VOWEL SIGN SARA AM;;;;
0E34;THAI CHARACTER SARA I;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA I;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E36;THAI CHARACTER SARA UE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UE;;;;
0E37;THAI CHARACTER SARA UEE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UEE;sara uue;;;
0E38;THAI CHARACTER SARA U;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA U;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0E3F;THAI CURRENCY SYMBOL BAHT;Sc;0;ET;;;;;N;THAI BAHT SIGN;;;;
0E40;THAI CHARACTER SARA E;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA E;;;;
0E41;THAI CHARACTER SARA AE;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA AE;;;;
0E42;THAI CHARACTER SARA O;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA O;;;;
0E43;THAI CHARACTER SARA AI MAIMUAN;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA MAI MUAN;sara ai mai muan;;;
0E44;THAI CHARACTER SARA AI MAIMALAI;Lo;0;L;;;;;N;THAI VOWEL SIGN SARA MAI MALAI;sara ai mai malai;;;
0E45;THAI CHARACTER LAKKHANGYAO;Lo;0;L;;;;;N;THAI LAK KHANG YAO;lakkhang yao;;;
0E46;THAI CHARACTER MAIYAMOK;Lm;0;L;;;;;N;THAI MAI YAMOK;mai yamok;;;
0E47;THAI CHARACTER MAITAIKHU;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI TAI KHU;mai taikhu;;;
0E48;THAI CHARACTER MAI EK;Mn;107;NSM;;;;;N;THAI TONE MAI EK;;;;
0E49;THAI CHARACTER MAI THO;Mn;107;NSM;;;;;N;THAI TONE MAI THO;;;;
0E4A;THAI CHARACTER MAI TRI;Mn;107;NSM;;;;;N;THAI TONE MAI TRI;;;;
0E4B;THAI CHARACTER MAI CHATTAWA;Mn;107;NSM;;;;;N;THAI TONE MAI CHATTAWA;;;;
0E4C;THAI CHARACTER THANTHAKHAT;Mn;0;NSM;;;;;N;THAI THANTHAKHAT;;;;
0E4D;THAI CHARACTER NIKHAHIT;Mn;0;NSM;;;;;N;THAI NIKKHAHIT;nikkhahit;;;
0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;;;;;N;THAI YAMAKKAN;;;;
0E4F;THAI CHARACTER FONGMAN;Po;0;L;;;;;N;THAI FONGMAN;;;;
>8--------------------------------------------------------------------8<
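
For anyone who wants to pull the combining classes out of the excerpt above
programmatically, here is a small sketch. It assumes the usual semicolon-separated
UnicodeData.txt layout (code point, name, general category, combining class, ...);
the helper name parse_line and the choice of sample lines are only illustrative.

```python
# A few lines copied from the excerpt above.
SAMPLE = """\
0E33;THAI CHARACTER SARA AM;Lo;0;L;<compat> 0E4D 0E32;;;;N;THAI VOWEL SIGN SARA AM;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0E48;THAI CHARACTER MAI EK;Mn;107;NSM;;;;;N;THAI TONE MAI EK;;;;
0E4C;THAI CHARACTER THANTHAKHAT;Mn;0;NSM;;;;;N;THAI THANTHAKHAT;;;;
"""

def parse_line(line):
    """Split one UnicodeData.txt line into the fields used in this message."""
    fields = line.split(";")
    return {
        "code": int(fields[0], 16),   # code point
        "name": fields[1],            # character name
        "category": fields[2],        # general category, e.g. Mn or Lo
        "ccc": int(fields[3]),        # canonical combining class
    }

records = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Nonspacing marks whose combining class is 0, i.e. the ones asked about.
class_zero_marks = [r["name"] for r in records
                    if r["category"] == "Mn" and r["ccc"] == 0]
print(class_zero_marks)
```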

Regards,
Samphan Raruenrom
Information Research and Development Division
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html



This archive was generated by hypermail 2.1.2 : Tue May 21 2002 - 11:56:17 EDT