From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Jun 27 2005 - 16:34:16 CDT
N. Ganesan wrote:
>Pl. see a collation chart for Tamil:
> http://nganesan.thamizamuthu.com/docs/TamilCollationChart.html
> Or, in pdf form:
> http://www.geocities.com/thamizh@sbcglobal.net/TamilCollationChart.pdf
ie.
http://www.geocities.com/thamizh[AT]sbcglobal.net/TamilCollationChart.pdf
> I'd love to know when will the SHA (u+0bb6) Uniscribe be updated and SHA
> will work in Windows correctly? Fixing Uniscribe to render SHA series in
> Tamil script - is it a job to be done by companies like Microsoft?
Uniscribe belongs to Microsoft, and I haven't heard of anyone offering an
alternative version.
> Like Thai, Tamil also employs in majority, and in a wide class of
> applications (eg., loans from English, the West or Islamic world) "ksh"
> only as non-conjunct. So we at INFITT are discussing a proposal to make
> the non-conjunct KSHA as default, and to create conjugated ksha with ZWJ.
> The majority behaviour of ksha as non-conjunct is in Tamil, but the
> non-conjunct ksha is not known in other Indic scripts. It is a Tamil
> special.
As far as I can make out, and FWIW Uniscribe agrees with me, both ZWJ and
ZWNJ specify the form with visible pulli. Are க்ஷ் and க்ஷ் sorted
differently, as your link implies? If so is க்ஷ் truly sorted differently
to what one might expect of a mere sequence of க் and ஷ்?
Working from
http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html, I thought I
had sorted out the requirement and solution:
1. Tamil standard
Collating order is:
A. ASCII: SP ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
[ \ ] ^ _ { } ~
B. Miscellaneous marks DAY (U+0BF3) to Number sign (U+0BFA)
Current Level 1 weights: *03AD (day) to *03B3 (number sign)
C. Numbers (incl 10, 100 etc)
Current Level 1 weights: 0F62 (0) to 0F6B (9)
but then *0EC9 (10), *0ECA (100), *0ECB (1000)
D. Words:
Anusvara - current Levels 1 and 2 weights: [0000.0120]
Aytham - 194F
Vowel letters - Current Level 1 weights 1950 to 195B
Consonant letters and vowel signs - in binary order, current Level 1
weights 195C to 197D
Pulli - Current Level 1 weight: 197E
Stray length mark - Current Level 1 weight 197F
Solution Approach:
1. Treatment of ASCII must be reserved to full Tamil customisation.
2. Query ignoring of the miscellaneous marks.
3. Query treatment and ordering of powers of 10. Why are they treated as
variable? Why are they sorted before decimal digits if selected as
non-ignorable?
4. Words:
a) Leaving as at present probably does least harm.
b) Assign weights in the following ascending sequence:
(i) For each (NFC) vowel letter in binary order U+0B85 to U+0B94.
(ii) Aytham (U+0B83)
(iii) For each consonant and ligature KSHA, in order
KA, NGA, CA, NYA, TTA, NNA, TA, NA, PA, MA, YA, RA, LA, VA;
(Indian Sprachbund sounds, in standard Indic order)
LLLA, LLA, RRA, NNNA; (specifically Dravidian sounds)
JA, SHA, SSA, SA, HA, KSHA ('Grantha' letters, in standard Indic
order):
(A) Consonant plus virama (i.e. visible pulli)
(B) Consonant
(iv) SHRI ligature (whether spelt with SSA or SHA - possibly make
difference a second level matter)
(v) For each (NFC) dependent vowel sign in binary order U+0BBE to
U+0BCC
(vi) Virama (for irregular spellings only)
(vii) Tamil AU length mark (for irregular spellings only)
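The ordering in 4(b) can be sketched as a toy sort key. This is only an illustration under assumptions, not a UCA implementation: the names (build_order, sort_key, WEIGHT) are made up for the example, the weights are simple ranks rather than real collation weights, only Level 1 is modelled, the SHRI ligature of step (iv) is left out, anything outside the table gets weight 0, and a plain KA + pulli + SSA is always taken as the KSHA ligature.

```python
import unicodedata

PULLI = "\u0BCD"            # TAMIL SIGN VIRAMA
AYTHAM = "\u0B83"           # TAMIL SIGN VISARGA
AU_LENGTH_MARK = "\u0BD7"
KSHA = "\u0B95" + PULLI + "\u0BB7"   # KA + pulli + SSA, treated as one unit

# (i) vowel letters, binary order U+0B85 to U+0B94
VOWELS = [chr(c) for c in (0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,
                           0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94)]
# (iii) consonants in the order given in the proposal, KSHA last
CONSONANTS = [chr(c) for c in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3, 0x0BA4,   # KA .. TA
    0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0, 0x0BB2, 0x0BB5,   # NA .. VA
    0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,                           # LLLA LLA RRA NNNA
    0x0B9C, 0x0BB6, 0x0BB7, 0x0BB8, 0x0BB9)] + [KSHA]         # Grantha letters
# (v) dependent vowel signs, binary order U+0BBE to U+0BCC
VOWEL_SIGNS = [chr(c) for c in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2, 0x0BC6,
                                0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)]

def build_order():
    """Return the collation elements in ascending order of weight."""
    order = list(VOWELS)
    order.append(AYTHAM)                 # (ii)
    for consonant in CONSONANTS:
        order.append(consonant + PULLI)  # (A) consonant plus visible pulli
        order.append(consonant)          # (B) bare consonant
    order.extend(VOWEL_SIGNS)
    order.append(PULLI)                  # (vi) stray virama
    order.append(AU_LENGTH_MARK)         # (vii) stray AU length mark
    return order

WEIGHT = {element: rank for rank, element in enumerate(build_order(), start=1)}
MAX_LEN = max(len(element) for element in WEIGHT)

def sort_key(word):
    """Split an NFC-normalised word into elements, longest match first."""
    text = unicodedata.normalize("NFC", word)
    key, i = [], 0
    while i < len(text):
        for n in range(MAX_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in WEIGHT:
                key.append(WEIGHT[chunk])
                i += len(chunk)
                break
        else:
            key.append(0)   # assumption: unknown characters sort first
            i += 1
    return tuple(key)
```

Something like sorted(words, key=sort_key) would then put vowel-initial words first, then aytham, then each consonant with its pulli form ahead of the bare consonant and its vowel-sign combinations.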
If K-SHA and KSHA are as complicated as implied by
http://nganesan.thamizamuthu.com/docs/TamilCollationChart.html I'll have to
do some thinking. Are the differences at Level 1 or Level 2? It's a shame
that the rendering for the HTML version is broken - the KSHA ligature did
not form! (I'm not totally sold on the idea that Tamil letters are
soft-dotted, that TAMIL VOWEL SIGN A ought to have been an invisible
superscript, and that Tamil vowel signs are all superscript. :) If ZWJ
ought to yield rather than inhibit ligation, the 'contractions' for KSHA
will have to include sequences with ZWJ.
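To make the ZWJ point concrete, here is a small sketch of what "contractions that include ZWJ" might look like to a matcher. It assumes the ZWJ would sit between the pulli and SSA, which is a guess at how the INFITT proposal would spell the ligature and not something confirmed above; the names KSHA_SPELLINGS and match_ksha are invented for the example.

```python
KA, SSA = "\u0B95", "\u0BB7"
PULLI = "\u0BCD"   # TAMIL SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Every spelling that should count as the single KSHA element.
# Assumption: ZWJ follows the pulli; other placements are not covered.
KSHA_SPELLINGS = (
    KA + PULLI + ZWJ + SSA,   # ligature requested explicitly with ZWJ
    KA + PULLI + SSA,         # plain sequence
)

def match_ksha(text, start=0):
    """Return the length of a KSHA spelling found at `start`, or 0 if none."""
    for spelling in KSHA_SPELLINGS:
        if text.startswith(spelling, start):
            return len(spelling)
    return 0
```

If the non-conjunct form became the default, the plain sequence would presumably be dropped from KSHA_SPELLINGS and left to collate as KA + pulli followed by SSA.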
The next step should be to code up and run a revised set of collation
elements (allkeys.txt), but I don't have a Tamil dictionary to test the
collation against.
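As a rough idea of what "coding up" the revised elements could involve, the sketch below turns an ordered list of elements into lines shaped like allkeys.txt entries. The three-weight layout "[.primary.0020.0002]", the starting primary weight 0x2000 and the function name allkeys_lines are all assumptions for illustration; the exact field layout should be checked against the allkeys.txt of the UCA version actually being customised.

```python
import unicodedata

def allkeys_lines(elements, base=0x2000):
    """Emit one allkeys-style line per element, primaries ascending from `base`.

    `elements` is a list of strings in the desired collation order; a string of
    several characters becomes a contraction entry. `base` is a placeholder,
    not a weight taken from any real table.
    """
    lines = []
    for offset, element in enumerate(elements):
        code_points = " ".join(f"{ord(ch):04X}" for ch in element)
        names = " + ".join(unicodedata.name(ch, "UNNAMED") for ch in element)
        primary = base + offset
        # 0020 and 0002 are the usual default secondary and tertiary weights.
        lines.append(f"{code_points} ; [.{primary:04X}.0020.0002] # {names}")
    return lines

if __name__ == "__main__":
    # A, aytham, KA + pulli, KA: the first few elements of the proposed order.
    sample = ["\u0B85", "\u0B83", "\u0B95\u0BCD", "\u0B95"]
    print("\n".join(allkeys_lines(sample)))
```

The output would still need testing against a real Tamil word list before anyone trusted the ordering.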
I can't decide whether it is right to ignore non-decimal numbers in
collation (until Level 4). That rule seems to apply to all but Greek, Roman
and CJK numbers. I don't know enough about Tamil non-positional number
notation to comment.
Richard.