Tamil Collation - Analysis

From: Sinnathurai Srivas ([email protected])
Date: Tue Jun 28 2005 - 14:33:43 CDT

Next message: Sinnathurai Srivas: "Re: A Tamil-Roman transliterator (Unicode)"

Previous message: Richard Wordingham: "Re: Numbered consonants in Tamil script abugida series"
In reply to: Richard Wordingham: "Re: Tamil Collation"
Next in thread: Sinnathurai Srivas: "Re: Tamil Collation - Analysis"
Maybe reply: Sinnathurai Srivas: "Re: Tamil Collation - Analysis"

The Tamil Nadu state government collation table
http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html
is the sort order we need to achieve (as the primary/default sort order).

If we do not have to think of the future, and do not have to take account of
infrequent usage,
then there is a very simple solution.
That is:

first sort Independent vowels (அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ)
then sort aytham (ஃ)
then sort pulli (்)
then sort consonant-a (க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன)
then sort dependent vowels (ா ி ீ ு ூ ெ ே ை ொ ோ ௌ)

Typical results would be as follows. (If you wish to view them in a text file
with linear display, please use the aAvarangal font (aAvarangal2 is slightly
different). One does not need to understand or be concerned about the fully
rendered display. A linear display is more than enough for development
purposes; it is easy to understand and makes the software easy to test.)

sample 1
க்க
ககக
கசக
காக
கிக

sample 2
க்க
ககக
கஙக
கசக
கஞக
காக
கிக
கீக
குக
கூக
கெக
கேக
கைக
கொக
கோக
கௌக
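
The five-step order above can be sketched in Python. This is only a minimal illustration, not the Tamil Nadu table and not the Unicode Collation Algorithm: it ranks each code point by its position in the list given above and compares words code point by code point. The names RANK and sort_key are made up for this sketch, and the handling of characters outside the list (Grantha letters, digits and so on) is an assumption.

```python
import unicodedata

# Character classes in the order proposed above (Unicode Tamil block code points).
VOWELS = "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
AYTHAM = "\u0b83"
PULLI = "\u0bcd"
CONSONANTS = ("\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa"
              "\u0bae\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9")
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"

# Rank of each listed character = its position in the concatenated order.
RANK = {ch: i for i, ch in
        enumerate(VOWELS + AYTHAM + PULLI + CONSONANTS + VOWEL_SIGNS)}


def sort_key(word):
    """Return a list of ranks, one per code point, for use with sorted()."""
    # NFC makes two-part vowel signs (e.g. U+0BC6 U+0BBE) equal to U+0BCA.
    word = unicodedata.normalize("NFC", word)
    # Assumption: anything not listed sorts after all listed characters,
    # in code point order.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


if __name__ == "__main__":
    ka = "\u0b95"
    middles = PULLI + "\u0b95\u0b9a" + "\u0bbe\u0bbf"   # the five cases of sample 1
    words = [ka + m + ka for m in reversed(middles)]     # deliberately out of order
    for w in sorted(words, key=sort_key):
        print(w)
```

Sorting the words of sample 1 or sample 2 with key=sort_key gives them back in the order listed above, because pulli ranks below every consonant and every consonant ranks below every dependent vowel.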

However, the following need to be considered.
To be continued ...

Regards
சின்னத்துரை சிறீவாஸ்

----- Original Message -----
From: "Richard Wordingham" <[email protected]>
To: "Sinnathurai Srivas" <[email protected]>; <[email protected]>
Sent: Monday, June 27, 2005 12:35 AM
Subject: Re: Tamil Collation

> Sinnathurai Srivas wrote:
>
>> Why punish Tamil for mistakes in Grantham and Unicode?
>>
>>> 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>>
>>> Note that the sorting algorithm will treat them as identical.
>>>
>>> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
>>
>> Tamil can process itself at 16 bit (and 8bit)
>
> This is 16 bit processing! The part of the key for Level 1 comparison
> gets 0x197B, the part for Level 2 (basically accent comparison) gets
> 0x0020, the part for Level 3 (casing etc.) gets 0x0002, and the part for
> Level 4, which ensures that canonically inequivalent sequences do not
> compare equal, gets 0x0BCA.
>
>> Why this punishment by Grantham? ksh forces Tamil to go even the
>> 48 bit way.
>
> It doesn't. The start of the 'ksh' entry is a sequence of 3 scalar values,
> those of KA, VIRAMA, SSA. The punishment is actually for sharing a
> planet with Europeans - capitals and accents. (You can only blame Thais
> for tone marks, which are treated like accents. I'm not sure that Thai
> tone marks weren't based on Vedic accents.)
>
>> Please find ways to stop this nonsense.
>
> Did you try to read the Unicode Collation Algorithm?
>
>> Tamil does not need all this unwanted punishment. We are innocent, please.
>>
>> Let's do 16 bit processing. Let's stop un-technical canonism.
>> Let's stop vastly complex ksh wreaking havoc with Tamil.
>
>>>>> If Tamil sorting can be expressed purely by a sorting order of
>>>>> consonants and vowels, then the answer for sorting words is simply to
>>>>> rearrange the weights on vowels and letters in the default UCA to
>>>>> accord with this ordering.
>>>
>>>> 99% yes.
>>>
>>>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham
>>>> need to be weighted and that's it.
>
> That's not true, as you should know full well. The usual Indic alphabet
> ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA. Tamil
> needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?)
> Devanagari has added them in a different order to Tamil. The default UCA
> orders the consonants in codepoint order, and then to add to the
> disagreement Tamil puts the 'Grantha' letters together (so moving JA) and
> adds 'ksh'. I believe the basic information may be found in Table 1 at
> http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html . The good
> news is that the ஸ்ரீ ('shri') ligature is sorted specially, so collation
> can reasonably be defined to make the old and new encodings equivalent!
>
> The basic changes needed are to change the weights of the consonants. We
> need some extra values - how does one express that in a proposal to change
> the default algorithm? For thinking about it, we can use fractional
> values.
>
> One nasty feature to implement is that consonant plus pulli comes before
> plain consonant. The simplest way of capturing this is to change
> consonant entries in the weighting table such as that for KA from
>
> 0B95 ; [.195C.0020.0002.0B95] # TAMIL LETTER KA
>
> to
>
> 0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
>
> while retaining
>
> 0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA
>
> for pulli used inappropriately.
>
> This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO
> VIRAMA'.
>
> It's a tad unpleasant in that it lengthens most sort keys. Another
> solution is to have an entirely separate weight for consonant plus pulli,
> e.g.
>
> 0B95 ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
>
> where H means a half. (I really am hitting notational problems here.
> Help!)
>
> There are other details to check, but I hope everyone interested
> understands roughly what needs doing.
>
> Richard.
>
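
The first table change quoted above (a trailing 'no virama' weight on the bare consonant, plus a contraction for consonant plus pulli) can be tried out with a small Python sketch. It is a toy under stated assumptions, not an implementation of the Unicode Collation Algorithm: it reads only the three entries quoted above, builds Level 1 (primary) keys only, and the fallback for characters missing from the table is invented for this illustration. The names parse_table and primary_key are hypothetical.

```python
import re

# Entries copied from the quoted proposal, in allkeys.txt-style notation.
TABLE = """
0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA
"""


def parse_table(text):
    """Map each code point sequence to its list of primary weights."""
    table = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()      # drop the trailing comment
        if not line:
            continue
        chars, elements = line.split(";")
        seq = "".join(chr(int(cp, 16)) for cp in chars.split())
        # The first field of each [.pppp.ssss.tttt.qqqq] element is Level 1.
        table[seq] = [int(p, 16)
                      for p in re.findall(r"\[\.([0-9A-F]{4})\.", elements)]
    return table


def primary_key(word, table):
    """Concatenate primary weights, preferring the longest matching sequence."""
    longest = max(map(len, table))
    key, i = [], 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += n
                break
        else:
            # Assumption: characters outside this toy table use their code point.
            key.append(ord(word[i]))
            i += 1
    return key


if __name__ == "__main__":
    table = parse_table(TABLE)
    ka, pulli = "\u0b95", "\u0bcd"
    print([hex(w) for w in primary_key(ka + pulli + ka, table)])
    print([hex(w) for w in primary_key(ka + ka + ka, table)])
```

With these entries KA + pulli + KA yields the primary weights 195C 195C 197E, while KA KA KA yields 195C 197E 195C 197E 195C 197E, so the word with pulli compares lower at the second weight. The longer key for the plain consonants is the lengthening of sort keys that the quoted message mentions.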

This archive was generated by hypermail 2.1.5 : Tue Jun 28 2005 - 16:26:49 CDT