Twentieth International Unicode Conference

Issues in Indic Language Collation

Cathy Wissink - Microsoft Corporation

Intended Audience:	Manager, Software Engineer
Session Level:	Intermediate

As the IT market in India grows, so does the need for culturally and linguistically correct data management for the languages and scripts of this region. One of the perceived barriers to Unicode implementation in Indic-script markets is the belief that character encoding order within Indic scripts (e.g., Devanagari or Tamil) in Unicode is somehow equivalent to sorting order for the languages using those scripts.

This paper seeks to clarify the distinction between character encoding and collation, and to demonstrate that, when collation is implemented correctly, the two are not equivalent for Indic languages. A very brief overview of Indic character encoding will be given to set the stage for comparison with linguistic collation of Indic scripts. The linguistic structures of Indic languages will then be examined as they pertain to collation. Finally, the implementation of Indic linguistic sorting on Windows 2000 and Windows XP will be discussed and demonstrated (using Tamil, Hindi, and other Indic languages), with examples of the key differences between encoding order and collation. Attendees of this presentation will come away with an understanding of the distinction between Unicode encoding order and linguistically correct sorting order in Indic scripts.
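As a small illustration of that distinction, the sketch below sorts five Tamil consonants twice: once by Unicode code point and once by the traditional Tamil alphabetical order. It is not the Windows implementation discussed in the session. The names TRADITIONAL_ORDER, RANK, and collation_key are hypothetical, the table covers only the eighteen native consonants (Grantha letters are left out), and real linguistic collation must also handle vowels, vowel signs, and multi-level comparison.

```python
# Tamil native consonants in traditional alphabetical order, written as
# escapes so the code points are visible. Assumption: Grantha letters and
# all vowels are omitted from this simplified table.
TRADITIONAL_ORDER = (
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3"  # ka nga ca nya tta nna
    "\u0ba4\u0ba8\u0baa\u0bae\u0baf\u0bb0"  # ta na pa ma ya ra
    "\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"  # la va llla lla rra nnna
)

# Map each letter to its position in the traditional order.
RANK = {ch: i for i, ch in enumerate(TRADITIONAL_ORDER)}


def collation_key(text):
    """Return a sort key based on traditional order, not code point.

    Characters missing from the table sort after all listed letters,
    in code point order among themselves.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text]


# NNNA (U+0BA9), PA (U+0BAA), RRA (U+0BB1), LA (U+0BB2), NA (U+0BA8)
letters = ["\u0ba9", "\u0baa", "\u0bb1", "\u0bb2", "\u0ba8"]

by_code_point = sorted(letters)                    # encoding order
by_collation = sorted(letters, key=collation_key)  # traditional order

print("code point:", " ".join(by_code_point))
print("collation: ", " ".join(by_collation))
```

In encoding order, NNNA (U+0BA9) falls directly after NA (U+0BA8) and RRA (U+0BB1) falls before LA (U+0BB2). In the traditional order, RRA and NNNA are the last two native consonants, so the two sorts produce different sequences.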

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

30 September 2001, Webmaster