From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Jun 28 2005 - 14:33:43 CDT
The Tamil Nadu state government collation table
http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html
is the sort order we need to achieve (as the primary/default sort order).
If we do not have to think of the future, and do not have to take account of 
infrequent usage,
then there is a very simple solution.
That is:
first sort Independent vowels (அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ)
then sort aytham (ஃ)
then sort pulli  (்)
then sort consonant-a (க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன)
then sort dependent vowel (ா ி ீ ு ூ ெ ே ை ொ ோ ௌ)
Typical results would be as follows. (If you wish to view this in a text file 
with linear display, please use the aAvarangal font (aAvarangal2 is slightly 
different). One does not need to understand, or be concerned about, the fully 
rendered display. A linear display is more than enough for development 
purposes; it is easy to understand and makes the software easy to test.)
sample 1
க்க
ககக
கசக
காக
கிக
sample 2
க்க
ககக
கஙக
கசக
கஞக
காக
கிக
கீக
குக
கூக
கெக
கேக
கைக
கொக
கோக
கௌக
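As a rough illustration only, the five-step order above can be written as a per-character rank table in Python. This is a minimal sketch, not the Tamil Nadu government table: the names RANK and simple_key are made up for this example, and it ignores Grantha letters, ksh and anything else outside the five lists.

```python
import unicodedata

# Code points in the order given in the message.
VOWELS = [0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,
          0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94]
AYTHAM = [0x0B83]
PULLI = [0x0BCD]
CONSONANTS = [0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
              0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
              0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9]
VOWEL_SIGNS = [0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2, 0x0BC6,
               0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC]

# Hypothetical helper table: character -> position in the overall order.
RANK = {chr(cp): i for i, cp in
        enumerate(VOWELS + AYTHAM + PULLI + CONSONANTS + VOWEL_SIGNS)}


def simple_key(word):
    """Sort key: one rank per character, compared left to right."""
    # NFC turns two-part vowel signs (e.g. U+0BC6 U+0BBE) into one code point.
    word = unicodedata.normalize("NFC", word)
    # Characters outside the five lists sort after everything else.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


# Sample 1 from the message, deliberately out of order.
sample1 = ["\u0b95\u0bbf\u0b95", "\u0b95\u0b9a\u0b95", "\u0b95\u0bcd\u0b95",
           "\u0b95\u0bbe\u0b95", "\u0b95\u0b95\u0b95"]
for word in sorted(sample1, key=simple_key):
    print(word)
```

Because pulli is ranked just below the consonants, a consonant followed by pulli sorts before the same consonant followed by another consonant, which is what both samples show.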
However, the following need to be considered.
To be continued ...
Regards
சின்னத்துரை சிறீவாஸ்
----- Original Message ----- 
From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; <unicode@unicode.org>
Sent: Monday, June 27, 2005 12:35 AM
Subject: Re: Tamil Collation
> Sinnathurai Srivas wrote:
>
>> Why punish Tamil for mistakes in Grantham and Unicode?
>>
>>> 0BCA  ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>>
>>> Note that the sorting algorithm will treat them as identical.
>>>
>>> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
>>
>> Tamil can process itself at 16 bit (and 8bit)
>
> This is 16 bit processing!  The part of the key for Level 1 comparison 
> gets 0x197B, the part for Level 2 (basically accent comparison) gets 
> 0x0020, the part for Level 3 (casing etc.) gets 0x0002, and the part for 
> Level 4, which ensures that canonically inequivalent sequences do not 
> compare equal, gets 0x0BCA.
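To make the four levels concrete, here is a small Python sketch that splits the two quoted table lines into their weights. The function name parse_entry and the regular expression are assumptions made for this example; they handle only the simple line shape quoted in this thread, not every form a real collation table may contain.

```python
import re

# One collation element looks like [.197B.0020.0002.0BCA].
ELEMENT = re.compile(r"\[[.*]([0-9A-Fa-f.]+)\]")


def parse_entry(line):
    """Return (code points, [(L1, L2, L3, L4), ...]) for one table line."""
    data = line.split("#", 1)[0]              # drop the trailing comment
    chars, elements = data.split(";", 1)
    code_points = tuple(int(cp, 16) for cp in chars.split())
    weights = [tuple(int(w, 16) for w in body.split("."))
               for body in ELEMENT.findall(elements)]
    return code_points, weights


precomposed = parse_entry("0BCA  ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")
decomposed = parse_entry("0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")

level1, level2, level3, level4 = precomposed[1][0]
print(hex(level1), hex(level2), hex(level3), hex(level4))

# Both spellings of the vowel sign carry the same weights at every level.
print(precomposed[1] == decomposed[1])
```

Each weight fits in 16 bits, and the two entries differ only in the code points on the left of the semicolon, which is why the algorithm treats the two spellings as identical.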
>
>> Why this punishment by Grantham? ksh forces Tamil to go even the 48 bit 
>> way.
>
> It doesn't.  The start of the 'ksh' entry is a sequence of 3 scalar values, 
> those of  KA, VIRAMA, SSA.  The punishment is actually for sharing a 
> planet with Europeans - capitals and accents.  (You can only blame Thais 
> for tone marks, which are treated like accents.  I'm not sure that Thai 
> tone marks weren't based on Vedic accents.)
>
>> Please find ways to stop this nonsense.
>
> Did you try to read the Unicode Collation Algorithm?
>
>> Tamil does not need all this unwanted punishment. We are innocent, please.
>>
>> Let's do 16 bit processing. Let's stop un-technical canonism.
>> Let's stop vastly complex ksh running havoc with Tamil.
>
>>>>> If Tamil sorting can be expressed purely by a sorting order of 
>>>>> consonants
>>>>> and vowels, then the answer for sorting words is simply to rearrange 
>>>>> the
>>>>> weights on vowels and letters in the default UCA to accord with this
>>>>> ordering.
>>>
>>>> 99% yes.
>>>
>>>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham 
>>>> need to be weighted and that's it.
>
> That's not true, as you should know full well.  The usual Indic alphabet 
> ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA.  Tamil 
> needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?) 
> Devanagari has added them in a different order to Tamil.  The default UCA 
> orders the consonants in codepoint order, and then to add to the 
> disagreement Tamil puts the 'Grantha' letters together (so moving JA) and 
> adds 'ksh'.  I believe the basic information may be found in Table 1 at 
> http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html .  Good 
> news is that the ஸ்ரீ ('shri')
> ligature is sorted specially, so collation can reasonably be defined to 
> make the old and new encodings equivalent!
>
> The basic changes needed are to change the weights of the consonants.  We 
> need some extra values - how does one express that in a proposal to change 
> the default algorithm?  For thinking about it, we can use fractional 
> values.
>
> One nasty feature to implement is that consonant plus pulli comes before 
> plain consonant.  The simplest way of capturing this is to change 
> consonant entries in the weighting table such as that for KA from
>
> 0B95  ; [.195C.0020.0002.0B95] # TAMIL LETTER KA
>
> to
>
> 0B95  ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
>
> while retaining
>
> 0BCD  ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA
>
> for pulli used inappropriately.
>
> This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO 
> VIRAMA'.
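The effect of this trick can be tried out with a toy table in Python. This is only a sketch at the primary level: 0x195C and 0x197E are the values from the entries quoted above, while the weights for CA and the two vowel signs are placeholders chosen just to keep consonants below vowel signs below pulli. The names TABLE and primary_key are invented for this example.

```python
KA, CA = "\u0b95", "\u0b9a"
VIRAMA = "\u0bcd"
AA, II = "\u0bbe", "\u0bbf"

# Primary weights only. A plain consonant carries an extra 'no virama'
# weight; <consonant, virama> is a contraction with the bare weight.
TABLE = {
    KA: [0x195C, 0x197E],
    KA + VIRAMA: [0x195C],
    CA: [0x195E, 0x197E],        # 0x195E is a placeholder
    CA + VIRAMA: [0x195E],
    AA: [0x1970],                # placeholder
    II: [0x1971],                # placeholder
    VIRAMA: [0x197E],            # pulli used inappropriately
}


def primary_key(word):
    """Build a primary sort key, preferring two-character contractions."""
    key, i = [], 0
    while i < len(word):
        for length in (2, 1):    # longest match first
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in TABLE:
                key.extend(TABLE[chunk])
                i += length
                break
        else:
            raise KeyError(word[i])
    return key


words = [KA + II + KA, KA + KA + KA, KA + AA + KA, KA + VIRAMA + KA, KA + CA + KA]
for word in sorted(words, key=primary_key):
    print(word, [hex(w) for w in primary_key(word)])
```

In this toy table the word with pulli gets a three-weight key and sorts first, while the word made of three plain consonants gets six weights.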
>
> It's a tad unpleasant in that it lengthens most sort keys.  Another 
> solution is to have an entirely separate weight for consonant plus pulli, 
> e.g.
>
> 0B95  ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
>
> where H means a half.  (I really am hitting notational problems here. 
> Help!)
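One possible way around the notation problem, offered purely as an assumption and not as anything the UCA or this thread prescribes, is to do the thinking with exact fractions and renumber to integers at the end. In the Python sketch below, 0x195C comes from the quoted entry, the CA value is a placeholder, and the name renumber is made up for this example.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Fractional primary weights: consonant plus pulli keeps the whole value,
# the plain consonant sits half a step above it ('195CH').
fractional = {
    "KA+VIRAMA": Fraction(0x195C),
    "KA": Fraction(0x195C) + HALF,
    "CA+VIRAMA": Fraction(0x195E),   # placeholder value
    "CA": Fraction(0x195E) + HALF,
}


def renumber(weights, start):
    """Map fractional weights to consecutive integers, keeping their order."""
    ordered = sorted(set(weights.values()))
    rank = {w: start + i for i, w in enumerate(ordered)}
    return {name: rank[w] for name, w in weights.items()}


integer_weights = renumber(fractional, 0x195C)
for name, weight in sorted(integer_weights.items(), key=lambda item: item[1]):
    print(f"{name:10} {weight:04X}")
```

A real table would need every weight after the inserted ones shifted in the same way, which this sketch does not attempt.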
>
> There are other details to check, but I hope everyone interested 
> understands roughly what needs doing.
>
> Richard.
> 