Re: Tamil Collation - Analysis

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Jun 28 2005 - 15:23:26 CDT

    I'm recalling this message.

    Moderator, if you see this, please do not approve this message or my
    previous mail with the subject ending in "Analysis".

    Kind Regards
    Sinnathurai Srivas

    ----- Original Message -----
    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    To: <unicode@unicode.org>
    Sent: Tuesday, June 28, 2005 8:33 PM
    Subject: Tamil Collation - Analysis

    > Tamil Nadu state government collation table
    > http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html
    > is the sort order we need to achieve (as the primary/default sort order).
    >
    > If we do not have to think of the future, and do not have to take account
    > of infrequent usage, then there is a very simple solution.
    > That is:
    >
    > first sort Independent vowels (அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ)
    > then sort aytham (ஃ)
    > then sort pulli (்)
    > then sort consonant-a (க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன)
    > then sort dependent vowel (ா ி ீ ு ூ ெ ே ை ொ ோ ௌ)
    >
    > Typical results would be as follows. (If you wish to view them in a text
    > file with linear display, please use the aAvarangal font (aAvarangal2 is
    > slightly different). One does not need to understand, or be concerned
    > about, fully rendered display. A linear display is more than enough for
    > development purposes; it is easy to understand and makes the software easy
    > to test.)
    >
    > sample 1
    > க்க
    > ககக
    > கசக
    > காக
    > கிக
    >
    >
    > sample 2
    > க்க
    > ககக
    > கஙக
    > கசக
    > கஞக
    > காக
    > கிக
    > கீக
    > குக
    > கூக
    > கெக
    > கேக
    > கைக
    > கொக
    > கோக
    > கௌக
    >
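The five-class ordering and the two samples above can be checked with a short script. This is only a minimal sketch of the "very simple solution" as stated in the message, not a UCA implementation: it compares words code point by code point, and the names `simple_key` and `sort_words` are made up for illustration. It normalizes to NFC first, on the assumption that the two-part vowel signs should be treated as single units.

```python
import unicodedata

# Classes in the order given in the message. Code points are written as
# escapes so that the combining marks stay visible in any editor.
INDEPENDENT_VOWELS = ("\u0B85\u0B86\u0B87\u0B88\u0B89\u0B8A"
                      "\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94")
AYTHAM = "\u0B83"
PULLI = "\u0BCD"
CONSONANTS = ("\u0B95\u0B99\u0B9A\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8\u0BAA"
              "\u0BAE\u0BAF\u0BB0\u0BB2\u0BB5\u0BB4\u0BB3\u0BB1\u0BA9")
VOWEL_SIGNS = "\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC"

ORDER = INDEPENDENT_VOWELS + AYTHAM + PULLI + CONSONANTS + VOWEL_SIGNS
RANK = {ch: i for i, ch in enumerate(ORDER)}


def simple_key(word):
    """Return a list of ranks, one per code point of the NFC form."""
    composed = unicodedata.normalize("NFC", word)
    # Characters outside the five classes sort after everything listed,
    # in code point order (an arbitrary choice for this sketch).
    return [RANK.get(ch, len(ORDER) + ord(ch)) for ch in composed]


def sort_words(words):
    return sorted(words, key=simple_key)


KA, NGA, CA, NYA = "\u0B95", "\u0B99", "\u0B9A", "\u0B9E"

# Sample 1 from the message, in the expected order.
SAMPLE_1 = [KA + PULLI + KA, KA + KA + KA, KA + CA + KA,
            KA + "\u0BBE" + KA, KA + "\u0BBF" + KA]

# Sample 2: pulli, then the first four consonants, then every vowel sign.
SAMPLE_2 = ([KA + PULLI + KA]
            + [KA + c + KA for c in (KA, NGA, CA, NYA)]
            + [KA + v + KA for v in VOWEL_SIGNS])

print(sort_words(list(reversed(SAMPLE_1))) == SAMPLE_1)
print(sort_words(list(reversed(SAMPLE_2))) == SAMPLE_2)
```

Without the NFC step, a decomposed U+0BC6 U+0BBE would be ranked as two separate signs and would sort before U+0BC7 instead of after U+0BC8.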
    > However, the following need to be considered.
    > To be continued ...
    >
    > Regards
    > சின்னத்துரை சிறீவாஸ்
    >
    > ----- Original Message -----
    > From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>;
    > <unicode@unicode.org>
    > Sent: Monday, June 27, 2005 12:35 AM
    > Subject: Re: Tamil Collation
    >
    >
    >> Sinnathurai Srivas wrote:
    >>
    >>> Why punish Tamil for mistakes in Grantham and Unicode?
    >>>
    >>>> 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >>>> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >>>>
    >>>> Note that the sorting algorithm will treat them as identical.
    >>>>
    >>>> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
    >>>
    >>> Tamil can process itself at 16 bit (and 8bit)
    >>
    >> This is 16 bit processing! The part of the key for Level 1 comparison
    >> gets 0x197B, the part for Level 2 (basically accent comparison) gets
    >> 0x0020, the part for Level 3 (casing etc.) gets 0x0002, and the part for
    >> Level 4, which ensures that canonically inequivalent sequences do not
    >> compare equal, gets 0x0BCA.
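To make the four fields concrete, here is a small sketch that splits the two weighting-table entries quoted above into their code points and per-level 16-bit weights. The function name `parse_entry` is hypothetical, and the parser only handles the simple `[.w1.w2.w3.w4]` shape seen in this thread.

```python
import re

# One collation element: "[", a "." or "*" flag, then dot-separated hex weights.
ELEMENT = re.compile(r"\[[.*]([0-9A-Fa-f.]+)\]")


def parse_entry(line):
    """Split 'CPS ; [elements] # comment' into (code points, weight tuples)."""
    body = line.split("#", 1)[0]            # drop the trailing comment
    chars, elements = body.split(";", 1)
    code_points = tuple(int(cp, 16) for cp in chars.split())
    weights = [tuple(int(w, 16) for w in m.group(1).split("."))
               for m in ELEMENT.finditer(elements)]
    return code_points, weights


composed = parse_entry("0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")
decomposed = parse_entry("0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")

for level, weight in enumerate(composed[1][0], start=1):
    print(f"Level {level}: 0x{weight:04X}")

# Different code point sequences, identical weights at every level.
print(composed[1] == decomposed[1])
```

Each weight fits in 16 bits, and the one-code-point and two-code-point spellings of the vowel sign map to the same collation element, which is why the algorithm treats them as identical.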
    >>
    >>> Why this punishment by Grantham? ksh even forces Tamil to go the
    >>> 48 bit way.
    >>
    >> It doesn't. The start of the 'ksh' entry is a sequence of 3 scalar values,
    >> those of KA, VIRAMA, SSA. The punishment is actually for sharing a
    >> planet with Europeans - capitals and accents. (You can only blame Thais
    >> for tone marks, which are treated like accents. I'm not sure that Thai
    >> tone marks weren't based on Vedic accents.)
    >>
    >>> Please find ways to stop this nonsense.
    >>
    >> Did you try to read the Unicode Collation Algorithm?
    >>
    >>> Tamil does not need all this unwanted punishment. We are innocent, please.
    >>>
    >>> Let's do 16 bit processing. Let's stop un-technical canonism.
    >>> Let's stop the vastly complex ksh wreaking havoc with Tamil.
    >>
    >>>>>> If Tamil sorting can be expressed purely by a sorting order of
    >>>>>> consonants and vowels, then the answer for sorting words is simply to
    >>>>>> rearrange the weights on vowels and letters in the default UCA to
    >>>>>> accord with this ordering.
    >>>>
    >>>>> 99% yes.
    >>>>
    >>>>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham
    >>>>> need to be weighted and that's it.
    >>
    >> That's not true, as you should know full well. The usual Indic alphabet
    >> ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA. Tamil
    >> needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?)
    >> Devanagari has added them in a different order to Tamil. The default UCA
    >> orders the consonants in codepoint order, and then to add to the
    >> disagreement Tamil puts the 'Grantha' letters together (so moving JA) and
    >> adds 'ksh'. I believe the basic information may be found in Table 1 at
    >> http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html . Good
    >> news is that the ஸ்ரீ ('shri')
    >> ligature is sorted specially, so collation can reasonably be defined to
    >> make the old and new encodings equivalent!
    >>
    >> The basic changes needed are to change the weights of the consonants. We
    >> need some extra values - how does one express that in a proposal to
    >> change the default algorithm? For thinking about it, we can use
    >> fractional values.
    >>
    >> One nasty feature to implement is that consonant plus pulli comes before
    >> plain consonant. The simplest way of capturing this is to change
    >> consonant entries in the weighting table such as that for KA from
    >>
    >> 0B95 ; [.195C.0020.0002.0B95] # TAMIL LETTER KA
    >>
    >> to
    >>
    >> 0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
    >> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
    >>
    >> while retaining
    >>
    >> 0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA
    >>
    >> for pulli used inappropriately.
    >>
    >> This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO
    >> VIRAMA'.
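The effect of the three entries above can be sketched at Level 1 only. This is an illustration under simplifying assumptions, not the full Unicode Collation Algorithm: the table holds just the primary weights quoted above, matching is a plain longest-match over at most two code points, and `primary_key` is a hypothetical helper name.

```python
KA, VIRAMA = "\u0B95", "\u0BCD"

# Level 1 weights only, taken from the three entries quoted above.
TABLE = {
    KA: [0x195C, 0x197E],       # plain KA carries a trailing 'no virama' weight
    KA + VIRAMA: [0x195C],      # KA + pulli is the bare consonant weight
    VIRAMA: [0x197E],           # pulli used on its own
}


def primary_key(text):
    """Build a Level 1 key, preferring two-code-point table entries."""
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TABLE:
            key.extend(TABLE[pair])
            i += 2
        else:
            # Raises KeyError for anything outside this tiny table.
            key.extend(TABLE[text[i]])
            i += 1
    return key


with_pulli = primary_key(KA + VIRAMA + KA)   # [0x195C, 0x195C, 0x197E]
without_pulli = primary_key(KA + KA)         # [0x195C, 0x197E, 0x195C, 0x197E]

print(with_pulli < without_pulli)            # consonant plus pulli sorts first
```

The second weight decides the comparison: 0x195C (the next consonant) is lower than 0x197E (the extra weight attached to a plain consonant). The keys also show the cost mentioned in the text, since every plain consonant now contributes two weights.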
    >>
    >> It's a tad unpleasant in that it lengthens most sort keys. Another
    >> solution is to have an entirely separate weight for consonant plus pulli,
    >> e.g.
    >>
    >> 0B95 ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
    >> 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
    >>
    >> where H means a half. (I really am hitting notational problems here.
    >> Help!)
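One possible way to handle the "H" notation while thinking about the table is to keep the weights as exact fractions and renumber them to integers at the end. The sketch below assumes that approach; it is not a proposal format, `renumber` is a hypothetical helper, and in a real table every later weight would shift as well.

```python
from fractions import Fraction

KA, VIRAMA = "\u0B95", "\u0BCD"
HALF = Fraction(1, 2)

# Fractional Level 1 weights: '195CH' is read as 0x195C plus one half.
FRACTIONAL = {
    KA + VIRAMA: Fraction(0x195C),
    KA: Fraction(0x195C) + HALF,
    VIRAMA: Fraction(0x197E),
}


def renumber(table, start):
    """Map fractional weights to consecutive integers, preserving order."""
    ordered = sorted(set(table.values()))
    mapping = {weight: start + i for i, weight in enumerate(ordered)}
    return {chars: mapping[weight] for chars, weight in table.items()}


INTEGRAL = renumber(FRACTIONAL, 0x195C)

for chars, weight in sorted(INTEGRAL.items(), key=lambda item: item[1]):
    print(" ".join(f"{ord(c):04X}" for c in chars), f"-> 0x{weight:04X}")
```

Unlike the 'no virama' trick, this keeps one weight per consonant, so sort keys do not grow; the price is one extra primary weight per consonant in the table.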
    >>
    >> There are other details to check, but I hope everyone interested
    >> understands roughly what needs doing.
    >>
    >> Richard.
    >>
    >


