RE: Clarifications on Thamizh Character Set Standardisations

From: Marco.Cimarosti@icl.com
Date: Thu Jun 01 2000 - 05:04:43 EDT


Padmakumar wrote:
> beeing a Thamizhan i am having a feeling that..
> letters of my mother-tongue is not in a
> traditional order in one of the important global standard..
> It is like an english man feeling that english alphabets
> are in the wrong order like A, E ,I, O U, B, P, C, K, G, J,
> H, ....... YES, this english order is in the phonetic order....
> but not in the traditional order...

If it helps, the same is true of my own language, Italian: its letters are
not in the correct order in Unicode either. The accented vowels (à, è, é, ì,
ò, ù) are allocated at U+00E0..U+00F9, more than a hundred positions away
from the regular vowels (a, e, i, o, u), which are at U+0061..U+0075.

So if you sort the words of the phrase "la mia àncora è ancora in acqua" by
code point, you get: <acqua, ancora, in, la, mia, àncora, è>. But the correct
order should be: <acqua, ancora, àncora, è, in, la, mia>.
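
A minimal Python sketch of the problem, assuming the accented letters are the
precomposed characters U+00E0 and U+00E8 (Python compares strings code point
by code point, so a plain sort reproduces the first list):

```python
# The phrase from the example; the accented letters are written as escapes so
# that they are guaranteed to be the precomposed code points U+00E0 and U+00E8.
phrase = "la mia \u00e0ncora \u00e8 ancora in acqua"

# sorted() with no key compares strings code point by code point, so every
# word starting with an accented vowel lands after every plain ASCII word.
by_code_point = sorted(phrase.split())
print(by_code_point)
```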

The bad news is that this could not be fixed even if Unicode agreed to
change the position of all Latin letters just to please Italians! It would
not be enough for a and à, e and è, etc. to be *near* each other: they
would have to be in the *same* positions -- and, of course, this is not
possible.

More bad news is that this problem exists for nearly all languages,
including English. Notice that all capital letters in the "English"
alphabet (U+0041..U+005A) are 32 positions *before* the corresponding small
letters (U+0061..U+007A). So if you sort by these values, the same word
written with different capitalization (e.g. "example", "Example", "EXAMPLE")
ends up in completely different positions.
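
The same effect can be seen in a short Python sketch. The word list is made
up for illustration; only the code point values come from the text above:

```python
# Hypothetical word list: three spellings of one word plus two other words.
words = ["example", "Zebra", "Example", "apple", "EXAMPLE"]

# Distance between a capital letter and its small counterpart in ASCII.
case_offset = ord("a") - ord("A")
print(case_offset)  # 32

# A plain code point sort puts every capitalized word before every
# lowercase word, so the three spellings are scattered across the list.
mixed_case_sorted = sorted(words)
print(mixed_case_sorted)
```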

The good news is that there *is* a solution for this problem: sorting text
in Thamizh, Italian, or any other language *can* be correct if you forget
Unicode code points and use specific "collation keys".

A default definition of such an algorithm is in Unicode Technical Report 10
(http://www.unicode.org/unicode/reports/tr10/; see also
http://www.unicode.org/unicode/reports/tr10/charts/ for a visual
representation of the results).
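
To make the idea of a collation key concrete, here is a toy sketch in Python.
It is NOT the UTR 10 algorithm and uses none of its data tables: the function
name toy_collation_key and its three levels (letters, then accents, then
case) are assumptions made up for illustration, and it is only meant to work
for simple Latin-script words like the ones in this message.

```python
import unicodedata


def toy_collation_key(word):
    """Hypothetical three-level sort key: letters, then accents, then case."""
    # Split accented letters into base letter + combining mark (NFD form).
    decomposed = unicodedata.normalize("NFD", word)
    # Level 1: base letters only, with accents dropped and case ignored.
    primary = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    ).casefold()
    # Level 2: accents. Combining marks have higher code points than ASCII
    # letters, so for these words the unaccented spelling sorts first.
    secondary = decomposed.casefold()
    # Level 3: case. Swapping case makes small letters sort before capitals.
    tertiary = word.swapcase()
    return (primary, secondary, tertiary)


phrase = "la mia \u00e0ncora \u00e8 ancora in acqua"
print(sorted(phrase.split(), key=toy_collation_key))

print(sorted(["EXAMPLE", "apple", "Example", "Zebra", "example"],
             key=toy_collation_key))
# ['apple', 'example', 'Example', 'EXAMPLE', 'Zebra']
```

With this key the Italian phrase comes out in the "correct" order given
earlier, and the three spellings of "example" stay next to each other. A real
implementation would take its weights from the tables in UTR 10 instead of
deriving them from code points like this.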

More good news is that the sorting order in UTR 10 is just a "default"
multilingual order. If you don't like it, you can customize the keys to get
exactly the order that you want for a particular language (or set of
languages).
This is also great for languages that have multiple competing sorting
orders. E.g., many Italians consider i and j to be the same letter, and want
them to be sorted together. But other people don't like this (because it
works badly for words of foreign origin). With collation algorithms, the two
groups of people can both be happy, by using different keys.
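
The i/j case can be sketched with two different keys in Python. The function
names and the word list are hypothetical, and real tailoring would be done by
adjusting collation weights, not by replacing letters in a string:

```python
def key_j_distinct(word):
    """Hypothetical key for people who treat j as a letter of its own."""
    return word.casefold()


def key_j_as_i(word):
    """Hypothetical key for people who treat j as a variant of i."""
    folded = word.casefold()
    # Compare with j replaced by i first; the real spelling breaks ties.
    return (folded.replace("j", "i"), folded)


italian_words = ["jolly", "isola", "jeans", "io", "junior", "ieri"]

print(sorted(italian_words, key=key_j_distinct))
# ['ieri', 'io', 'isola', 'jeans', 'jolly', 'junior']

print(sorted(italian_words, key=key_j_as_i))
# ['jeans', 'ieri', 'io', 'jolly', 'isola', 'junior']
```

The text being sorted is the same in both cases; only the key changes.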

_ Marco


