From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Fri Jun 24 2005 - 14:06:40 CDT
(Draft version)
Tamil Collation vs Transliteration/Transcription Encoding
Though its implementations suffer from numerous problems, Unicode is based on a 
highly sophisticated technical architecture. This article examines how Unicode 
mishandled Tamil collation and analyses the alternative solutions for attaining 
Tamil collation.
Any implementation would initially aim for a natural sort order for a 
language, whereby the default hex order of the codes is the natural sort 
order of that language. The question now is why Unicode decided to deny this 
natural facility to Tamil in its implementation strategy. The answer is that, 
in Unicode's view, another requirement was more important than the sort order 
of Tamil: the transliteration properties of the code order had to be the same 
for all Indian languages, and sort order was considered a minor matter in 
comparison to transliteration. Unicode decided that writing software to 
transliterate between different Indic languages is a more daunting task than 
writing software to collate a language.
However, Devanagari had the upper hand and got its natural sort order 
encoded, while other languages were forced to abandon their natural sort order 
in favour of the transliteration code order. All these other languages now face 
the task of implementing fixes to get collation working.
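The mismatch between code order and alphabetical order can be seen with a few 
lines of Python. This is only a minimal sketch: the TRADITIONAL string is my 
own listing of the 18 Tamil consonants in their usual alphabetical order 
(Grantha letters left out), and traditional_key is a hypothetical helper, not 
part of any library or of the Unicode Collation Algorithm.

```python
# The 18 Tamil consonants in traditional alphabetical order (assumed listing):
# ka nga ca nya tta nna ta na pa ma ya ra la va llla lla rra nnna
TRADITIONAL = (
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa"
    "\u0bae\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"
)

# Rank table: letter -> position in the traditional alphabet.
RANK = {ch: i for i, ch in enumerate(TRADITIONAL)}


def traditional_key(word):
    """Sort key that follows RANK; unknown characters sort last, by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]


letters = list(TRADITIONAL)

# Plain sorted() compares code points, i.e. the "default hex order".
by_code_point = sorted(letters)

# Sorting with the rank table restores the alphabetical order.
by_tradition = sorted(letters, key=traditional_key)

print("code point order :", " ".join(by_code_point))
print("traditional order:", " ".join(by_tradition))
```

In code point order, U+0BA9 (NNNA) lands directly after U+0BA8 (NA) instead of 
at the end of the alphabet, and U+0BB5 (VA) ends up last. A rank table like 
this is the kind of fix each affected language has to add on top of the 
encoding.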
Unlike Latin-based languages, each Indic language uses an alphabet of its 
own. For this reason, abandoning natural sort order in favour of 
transliteration sort order was not a technical but a political decision by 
Unicode. Unicode did understand the damage it did to the suffering 
languages, but decided to go along with its political decision, forcing 
minority languages to obey orders. Writing software routines for 
transliteration is a simple task compared to writing routines to collate a 
scrambled encoding. Unicode still decided to enforce its political agenda 
over a technical requirement.
The Unicode transliteration scheme does not work. The saddest thing of all is 
that transliteration does not work as Unicode hoped it would. There never was 
a simple transliteration mechanism suitable for encoding different 
languages. For example, the Tamil writing system is based on a phonemic-based 
alphabet system, while Devanagari is based on a phonemic-only system. In Tamil, 
k = k, h, g, x, q, c (mahaL, magan, makkan, quil, xavier, etc.). In 
Devanagari, individual glyph shapes represent each of these phonemes. In 
Tamil, aspirated and many other sounds are written using a single modulating 
indicator called Aytham. Yet an unacceptably high number of the code points 
allocated for Tamil are deprecated and made unusable because of this 
transliteration encoding that never works.
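The unusable code points can be made visible with a small Python sketch. It 
assumes the transliteration-by-code-order idea means adding a fixed offset 
between the Devanagari block (starting at U+0900) and the Tamil block (starting 
at U+0B80); devanagari_to_tamil is a hypothetical helper written for this 
illustration, and the "?" placeholder for holes is my own choice.

```python
import unicodedata

# Distance between the start of the Tamil and Devanagari blocks.
OFFSET = 0x0B80 - 0x0900


def devanagari_to_tamil(text):
    """Shift Devanagari code points into the Tamil block.

    Positions that have no assigned Tamil character become "?".
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            candidate = chr(cp + OFFSET)
            # unicodedata.name() returns the default for unassigned code points.
            out.append(candidate if unicodedata.name(candidate, "") else "?")
        else:
            out.append(ch)
    return "".join(out)


# Devanagari KA, KHA, GA, GHA: four letters, four code points.
ka_row = "\u0915\u0916\u0917\u0918"
result = devanagari_to_tamil(ka_row)
print(result)
```

Only the first letter maps to a Tamil letter (U+0B95, KA). The other three 
positions fall on holes in the Tamil block, because Tamil writes all of these 
sounds with the one letter. The offset trick therefore loses information in 
one direction and cannot be reversed in the other.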
It is important to understand that repairing a superior architecture like 
Unicode, made inferior by a misguided political requirement, is not going to 
be an easy task. It is therefore very important that we start work on 
fixing the bug caused by the transliteration-based encoding, so that collation 
works as required. We will analyse the collation techniques available to fix 
the problem caused by this transliteration-based encoding bug.
To be continued.... 