Issues in Indic Language Collation
Intended Audience: Manager, Software Engineer
Session Level: Intermediate
As the IT market in India grows, so does the need for culturally and
linguistically correct data management for the languages and scripts of
this region. One of the perceived barriers to Unicode implementation in
Indic-script markets is the belief that character encoding order within
Indic scripts (e.g., Devanagari or Tamil) in Unicode is somehow
equivalent to sorting order for the languages using those scripts.
This paper seeks to clarify the distinction between character encoding
order and collation, and to demonstrate that, for Indic languages, a
correct collation implementation does not simply follow encoding order. A very brief
overview of Indic character encoding will be given to set the stage for
comparison with linguistic collation of Indic scripts. The linguistic
structures of Indic languages as they pertain to collation will be
examined. Finally, the implementation of Indic linguistic sorting on
Windows 2000 and Windows XP will be discussed and demonstrated (using
Tamil, Hindi, and other Indic languages), giving examples of the key
differences between encoding order and collation. Attendees of this
presentation will come away with an understanding of the distinction
between Unicode encoding order and linguistically correct sorting order
in Indic scripts.
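
As a small illustration of that distinction, the sketch below sorts a few Tamil consonants twice: once by Unicode code point and once by the traditional order of the 18 native Tamil consonants. The names TRADITIONAL, RANK, by_code_point, and by_traditional_order are hypothetical and exist only for this example. The lookup table is a toy stand-in for a real collation table; it is not how Windows or any other platform implements linguistic sorting, and it handles single consonant letters only.

```python
# Traditional order of the 18 native Tamil consonants, given as code points.
# Note that the code points are NOT ascending toward the end of the list.
TRADITIONAL = "".join(map(chr, [
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,  # க ங ச ஞ ட ண
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,  # த ந ப ம ய ர
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,  # ல வ ழ ள ற ன
]))

# Hypothetical helper table: position of each consonant in the traditional order.
RANK = {ch: i for i, ch in enumerate(TRADITIONAL)}


def by_code_point(letters):
    """Sort by raw Unicode code point (encoding order)."""
    return sorted(letters)


def by_traditional_order(letters):
    """Sort by position in the traditional consonant order (toy collation)."""
    return sorted(letters, key=RANK.__getitem__)


# ன ப ற ல ழ வ, listed here in ascending code point order.
sample = [chr(c) for c in (0x0BA9, 0x0BAA, 0x0BB1, 0x0BB2, 0x0BB4, 0x0BB5)]

print("encoding order:   ", " ".join(by_code_point(sample)))
print("traditional order:", " ".join(by_traditional_order(sample)))
```

In encoding order, ன (U+0BA9) comes first in this sample because it is encoded immediately after ந (U+0BA8); in the traditional order it is the last native consonant and so sorts last. A production system would rely on a full collation implementation rather than a hand-built table like this one.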