Re: Tamil Collation - Analysis

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Jun 28 2005 - 15:23:26 CDT

Next message: Michael (michka) Kaplan: "Re: Tamil sha (U+0BB6) - deprecate it?"

Previous message: Sinnathurai Srivas: "Re: A Tamil-Roman transliterator (Unicode)"
Maybe in reply to: Sinnathurai Srivas: "Tamil Collation - Analysis"
Next in thread: David Starner: "Re: Tamil Collation vs Transliteration/Transcription Enc Version2"

I'm recalling this message.

Moderator, if you see this, please do not approve this mail or my previous
mail with this heading (the one ending with "Analysis").

Kind Regards
Sinnathurai Srivas

----- Original Message -----
From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
To: <unicode@unicode.org>
Sent: Tuesday, June 28, 2005 8:33 PM
Subject: Tamil Collation - Analysis

> Tamil Nadu state government collation table
> http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html
> is the sort order we need to achieve (as the primary/default sort order).
>
> If we do not have to think of the future, and do not have to take account
> of infrequent usage,
> then there is a very simple solution.
> That is:
>
> first sort Independent vowels (அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ)
> then sort aytham (ஃ)
> then sort pulli (்)
> then sort consonant-a (க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன)
> then sort dependent vowel (ா ி ீ ு ூ ெ ே ை ொ ோ ௌ)
>
> Typical results would be as follows. (If you wish to view them in a text
> file with linear display, please use the aAvarangal font (aAvarangal2 is
> slightly different). One does not need to understand or be concerned about
> the fully rendered display. A linear display is more than enough for
> development purposes: it is easy to understand and makes the software easy
> to test.)
>
> sample 1
> க்க
> ககக
> கசக
> காக
> கிக
>
>
> sample 2
> க்க
> ககக
> கஙக
> கசக
> கஞக
> காக
> கிக
> கீக
> குக
> கூக
> கெக
> கேக
> கைக
> கொக
> கோக
> கௌக
>
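The five-class ordering listed above can be sketched as a per-code-point rank table. This is only an illustration of that simple scheme, not the Tamil Nadu table or the UCA. The names `ORDER`, `RANK` and `simple_key` are hypothetical, and two choices are assumptions: input is normalised to NFC so that two-part vowel signs become single code points, and characters outside the list sort after everything listed.

```python
import unicodedata

# Classes in the order given in the message: independent vowels, aytham,
# pulli, consonants (with inherent a), dependent vowel signs.
INDEPENDENT_VOWELS = ("\u0B85\u0B86\u0B87\u0B88\u0B89\u0B8A"
                      "\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94")
AYTHAM = "\u0B83"
PULLI = "\u0BCD"
CONSONANTS = ("\u0B95\u0B99\u0B9A\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8\u0BAA"
              "\u0BAE\u0BAF\u0BB0\u0BB2\u0BB5\u0BB4\u0BB3\u0BB1\u0BA9")
DEPENDENT_VOWELS = ("\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2"
                    "\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC")

ORDER = INDEPENDENT_VOWELS + AYTHAM + PULLI + CONSONANTS + DEPENDENT_VOWELS
RANK = {ch: i for i, ch in enumerate(ORDER)}


def simple_key(word):
    """Sort key: the rank of each code point, compared left to right."""
    # NFC turns <U+0BC6, U+0BBE> into U+0BCA, and so on (assumption:
    # canonically equivalent spellings should sort identically).
    word = unicodedata.normalize("NFC", word)
    # Unlisted characters (assumption) go after all listed ones.
    return [RANK.get(ch, len(ORDER) + ord(ch)) for ch in word]


# The words of "sample 1", deliberately out of order.
sample_1 = ["கிக", "கசக", "க்க", "காக", "ககக"]
print(sorted(sample_1, key=simple_key))
```

Because pulli ranks below every consonant and the vowel signs rank above them, consonant plus pulli sorts before the plain consonant, which sorts before consonant plus vowel sign, as in the samples.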
> However, the following need to be considered.
> To be continued ...
>
> Regards
> சின்னத்துரை சிறீவாஸ்
>
> ----- Original Message -----
> From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
> To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>;
> <unicode@unicode.org>
> Sent: Monday, June 27, 2005 12:35 AM
> Subject: Re: Tamil Collation
>
>
>> Sinnathurai Srivas wrote:
>>
>>> Why punish Tamil for mistakes in Grantham and Unicode?
>>>
>>>> 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>>> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>>>
>>>> Note that the sorting algorithm will treat them as identical.
>>>>
>>>> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
>>>
>>> Tamil can process itself at 16 bit (and 8bit)
>>
>> This is 16 bit processing! The part of the key for Level 1 comparison
>> gets 0x197B, the part for Level 2 (basically accent comparison) gets
>> 0x0020, the part for Level 3 (casing etc.) gets 0x0002, and the part for
>> Level 4, which ensures that canonically inequivalent sequences do not
>> compare equal, gets 0x0BCA.
>>
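A small sketch of how an entry in this notation splits into its four 16-bit fields may help. It assumes only the bracketed `[.pppp.ssss.tttt.qqqq]` form quoted in this thread. The helper names `parse_elements` and `parse_line` are hypothetical and do not come from any library.

```python
import re

# One collation element: a '.' or '*' marker, then four hex fields
# (primary, secondary, tertiary, fourth level) separated by dots.
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)


def parse_elements(text):
    """Return a list of (level1, level2, level3, level4) integer tuples."""
    return [tuple(int(field, 16) for field in m.groups())
            for m in ELEMENT.finditer(text)]


def parse_line(line):
    """Split 'CODEPOINTS ; ELEMENTS # comment' into (code points, elements)."""
    body = line.split("#", 1)[0]
    chars, _, elements = body.partition(";")
    codepoints = tuple(int(cp, 16) for cp in chars.split())
    return codepoints, parse_elements(elements)


cps, elems = parse_line("0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")
for level, weight in enumerate(elems[0], start=1):
    print(f"Level {level}: 0x{weight:04X}")
```

Each field fits in 16 bits. A sort key is built by concatenating the Level 1 weights of all elements, then the Level 2 weights, and so on, so no single comparison works on a wider unit.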
>>> Why this punishment by Grantham? ksh forces Tamil to go even the 48 bit
>>> way.
>>
>> It doesn't. The start of the 'ksh' entry is a sequence of 3 scalar values,
>> those of KA, VIRAMA, SSA. The punishment is actually for sharing a
>> planet with Europeans - capitals and accents. (You can only blame Thais
>> for tone marks, which are treated like accents. I'm not sure that Thai
>> tone marks weren't based on Vedic accents.)
>>
>>> Please find ways to stop this nonsense.
>>
>> Did you try to read the Unicode Collation Algorithm?
>>
>>> Tamil does not need all this unwanted punishment. We are innocent, please.
>>>
>>> Let's do 16 bit processing. Let's stop un-technical canonism.
>>> Let's stop vastly complex ksh wreaking havoc with Tamil.
>>
>>>>>> If Tamil sorting can be expressed purely by a sorting order of
>>>>>> consonants and vowels, then the answer for sorting words is simply to
>>>>>> rearrange the weights on vowels and letters in the default UCA to
>>>>>> accord with this ordering.
>>>>
>>>>> 99% yes.
>>>>
>>>>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham
>>>>> need to be weighted and that's it.
>>
>> That's not true, as you should know full well. The usual Indic alphabet
>> ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA. Tamil
>> needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?)
>> Devanagari has added them in a different order to Tamil. The default UCA
>> orders the consonants in codepoint order, and then to add to the
>> disagreement Tamil puts the 'Grantha' letters together (so moving JA) and
>> adds 'ksh'. I believe the basic information may be found in Table 1 at
>> http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html . Good
>> news is that the ஸ்ரீ ('shri')
>> ligature is sorted specially, so collation can reasonably be defined to
>> make the old and new encodings equivalent!
>>
>> The basic changes needed are to change the weights of the consonants. We
>> need some extra values - how does one express that in a proposal to
>> change the default algorithm? For thinking about it, we can use
>> fractional values.
>>
>> One nasty feature to implement is that consonant plus pulli comes before
>> plain consonant. The simplest way of capturing this is to change
>> consonant entries in the weighting table such as that for KA from
>>
>> 0B95 ; [.195C.0020.0002.0B95] # TAMIL LETTER KA
>>
>> to
>>
>> 0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
>> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
>>
>> while retaining
>>
>> 0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA
>>
>> for pulli used inappropriately.
>>
>> This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO
>> VIRAMA'.
>>
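The effect of the three table entries above can be checked with a primary-weights-only sketch. It is not a full UCA implementation, since it ignores Levels 2 to 4, normalisation and variable weighting. The names `TABLE` and `primary_key` are hypothetical. The weights are the ones quoted in this thread, and only KA, VIRAMA and VOWEL SIGN O are covered.

```python
# Primary weights only, keyed by code point sequence.
TABLE = {
    (0x0B95,): [0x195C, 0x197E],   # KA -> KA weight + 'NO VIRAMA' weight
    (0x0B95, 0x0BCD): [0x195C],    # <KA, VIRAMA> -> KA weight alone
    (0x0BCD,): [0x197E],           # VIRAMA used on its own
    (0x0BCA,): [0x197B],           # VOWEL SIGN O
}


def primary_key(text):
    """Build a primary sort key, preferring the longest table match."""
    cps = [ord(ch) for ch in text]
    key = []
    i = 0
    while i < len(cps):
        for length in (2, 1):
            chunk = tuple(cps[i:i + length])
            if chunk in TABLE:
                key.extend(TABLE[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no table entry for U+{cps[i]:04X}")
    return tuple(key)


k_pulli_ka = primary_key("\u0B95\u0BCD\u0B95")   # KA VIRAMA KA
ka_ka = primary_key("\u0B95\u0B95")              # KA KA
print([f"{w:04X}" for w in k_pulli_ka])
print([f"{w:04X}" for w in ka_ka])
print(k_pulli_ka < ka_ka)
```

The first key is 195C 195C 197E and the second is 195C 197E 195C 197E, so the comparison is decided at the second weight and the form with pulli sorts first. The cost is the extra 197E on every consonant without pulli.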
>> It's a tad unpleasant in that it lengthens most sort keys. Another
>> solution is to have an entirely separate weight for consonant plus pulli,
>> e.g.
>>
>> 0B95 ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
>> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
>>
>> where H means a half. (I really am hitting notational problems here.
>> Help!)
>>
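For thinking about the "half" notation, exact fractions avoid inventing new hexadecimal values. This sketch covers primary weights only, with hypothetical names `WEIGHTS` and `fractional_key`, and it assumes that "195CH" means 0x195C plus one half.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# One primary weight per table entry; bare KA gets the 'half' weight.
WEIGHTS = {
    (0x0B95, 0x0BCD): Fraction(0x195C),    # <KA, VIRAMA>
    (0x0B95,): Fraction(0x195C) + HALF,    # KA, written 195CH above
    (0x0BCD,): Fraction(0x197E),           # VIRAMA used on its own
}


def fractional_key(text):
    """Primary key built from fractional weights, longest match first."""
    cps = [ord(ch) for ch in text]
    key = []
    i = 0
    while i < len(cps):
        pair = tuple(cps[i:i + 2])
        chunk = pair if len(pair) == 2 and pair in WEIGHTS else (cps[i],)
        key.append(WEIGHTS[chunk])   # KeyError for characters not in the table
        i += len(chunk)
    return tuple(key)


print(fractional_key("\u0B95\u0BCD\u0B95") < fractional_key("\u0B95\u0B95"))
print(len(fractional_key("\u0B95\u0B95")))
```

Each consonant, with or without pulli, now contributes exactly one weight, so keys do not grow. A real table would still have to renumber the fractional values into integers.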
>> There are other details to check, but I hope everyone interested
>> understands roughly what needs doing.
>>
>> Richard.
>>
>


This archive was generated by hypermail 2.1.5 : Tue Jun 28 2005 - 16:32:12 CDT