
Tamil Language and Script

Q: How was the encoding of the Tamil script in the Unicode Standard established?

A: The encoding of the Tamil script in the Unicode Standard was originally based on ISCII (1988). That encoding was the culmination of extensive work by many experts, including linguists, programmers, typographers, and experts in standards, although constrained by 8-bit character encodings prevalent in India at that time. Like the rest of Unicode, the encoding of Tamil is identical to that used in the International Standard ISO/IEC 10646.

Q: Are there shipping implementations of Unicode Tamil?

A: Yes. There are many shipping implementations of Unicode Tamil today and there will continue to be more in the future.

Q: Is there any controversy about the encoding of Tamil?

A: Some organizations, including the government of Tamil Nadu, have proposed a new encoding of the Tamil script, called TACE-16, based on Tamil syllables or clusters rather than letters. The Indian Ministry of Information Technology has also been prominent in the discussions and investigations of this encoding. Several technical reasons have been given in support of this proposal, including collation (sorting) order, natural language processing (NLP), and database efficiency considerations.

Q: Are these concerns valid? What about Tamil sorting order?

A: Sorting order cannot be handled by character placement in a code chart. That is true even for English—after all, Unicode code chart order would make "Z" sort before "a". Furthermore, languages using the same script often sort differently, so no single encoding order is appropriate in all contexts.
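
As a minimal Python sketch of the point about code chart order (the word list is made up for illustration): sorting by raw code point puts every uppercase letter ahead of every lowercase one, so the expected order has to come from a separate comparison key. Here `str.casefold` is only a stand-in for a real collation algorithm, not an implementation of one.

```python
# Code point order is not the same thing as sort order, even for English.
words = ["apple", "Zebra", "mango"]

# Plain sorted() compares strings code point by code point,
# so "Z" (U+005A) sorts before "a" (U+0061).
by_code_point = sorted(words)

# A comparison key supplies the ordering instead of the encoding.
# str.casefold is only a stand-in here for a real collation algorithm.
by_key = sorted(words, key=str.casefold)

print(by_code_point)  # ['Zebra', 'apple', 'mango']
print(by_key)         # ['apple', 'mango', 'Zebra']
```

The encoded text is identical in both cases; only the comparison applied to it changes.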

The Unicode Standard has an extensive associated standard, UTS #10, Unicode Collation Algorithm, devoted entirely to specifying mechanisms for collation. The Unicode Common Locale Data Repository (CLDR) then provides specific data for language-specific sorting for different languages.

Whenever there are issues associated with the sorting or matching of characters in Tamil, they should be addressed in the context of the Unicode Collation Algorithm and the CLDR sort order specifications, rather than by proposing a new character encoding for Tamil with a different order of characters in the code chart.

Q: But isn't the encoding important for natural language processing?

A: When considering the requirements of natural language processing, it is important to recognize the main purpose of the Unicode Standard: it is a plain text encoding, aimed at the problem of simple representation of textual content in traditional orthographies. There is of course nothing in the Unicode Standard that prevents researchers from developing higher-level protocols, such as markup schemes, to represent other aspects of textual content, including linguistic structure not directly evident from the ordinary writing system. Nor does the Unicode Standard preclude the development of alternative textual encodings for special-purpose processing such as automated NLP.

However, it would run counter to Unicode encoding principles to attempt to incorporate such higher-level protocols or alternative, special-purpose encodings directly into the Unicode Standard itself. A syllable-based re-encoding of Tamil, if aimed at NLP issues, is therefore essentially out of scope for the Unicode Standard. This is simply a matter of the appropriate level for representing data in a plain text character encoding, over and above the crucial issue of maintaining the stability of the standard for existing implementations.

Q: But isn't the current Unicode encoding less efficient?

A: Efficiency of processing is not a simple matter of running a few test cases aimed at one or two processes. The Unicode encoding of any script, including Tamil, is meant to have a good overall efficiency in many kinds of text processing. Furthermore, efficiency considerations expressed purely in terms of minimizing CPU cycles and buffer sizes have to be balanced against many other considerations, including algorithmic complexity, legacy interoperability, and parallelism in support of multiple scripts and fonts.

In particular, comparisons of raw text size in bytes, under various encoding assumptions, are really only relevant to a limited number of operations involving pure plain text. Most real-world applications with text involve embedding of text in larger contexts of markup, graphics, and other data, and in such cases, efficiency concerns are dominated by the size of the other content, rather than that of the plain text content per se. There is thus no cause for advocating new encodings of scripts already encoded in the Unicode Standard based merely on comparisons of encoded text size. This is particularly true given that the resulting costs and impact of destabilizing the standard would far outweigh any marginal gains in processing speed in some limited contexts.

Moreover, there are compression schemes and other secondary techniques that can be used to achieve greater efficiency in speed of text processing, storage, and data retrieval for specific applications in specific languages. General-purpose compression formats such as ZIP work well, and the Unicode Consortium also publishes other relevant technical reports and standards, such as UTS #6, A Standard Compression Scheme for Unicode.
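
The size and compression points can be made concrete with a small Python sketch. The sample text is an assumption chosen for illustration (the word "Tamil" in Tamil script, repeated), and `zlib` stands in for general-purpose compression of the kind used by ZIP; it is not SCSU. Highly repetitive text like this compresses far better than real prose would, so the numbers show the direction of the effect only.

```python
import zlib

# Illustrative sample only: the word "Tamil" in Tamil script (5 code points).
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
text = (word + " ") * 200

# Raw encoded sizes under two Unicode encoding forms.
utf8 = text.encode("utf-8")        # 3 bytes per Tamil code point, 1 per space
utf16 = text.encode("utf-16-le")   # 2 bytes per code point

# General-purpose compression applied on top of the UTF-8 bytes.
compressed = zlib.compress(utf8)

print("UTF-8 bytes:     ", len(utf8))
print("UTF-16 bytes:    ", len(utf16))
print("zlib(UTF-8) size:", len(compressed))

# Compression is lossless: the original text comes back unchanged.
assert zlib.decompress(compressed).decode("utf-8") == text
```

Once compression is in the pipeline, the difference between the raw encoded sizes largely stops mattering for storage and transfer.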

Q: What about using private-use characters for encoding Tamil pure consonants and syllables?

A: This is a fine solution for internal processing, if an alternate representation is useful for the particular process. For example, a text-to-speech program might use a private-use encoding for English, whereby letters were separated according to pronunciations—the 'o' in 'love', 'rove', and 'move' all getting different private-use characters.

However, such implementations have limited usage. Private-use characters may overlap between different implementations, so general purpose programs cannot assume any particular interpretation of such characters. In general interchange, such as in search engines, private-use characters are typically treated as unknown characters or ignored. As a result private-use characters are inappropriate for open interchange.
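
A short Python sketch shows why general-purpose software cannot interpret private-use characters: the Unicode Character Database assigns them the general category `Co` and no name, so nothing in the standard says what they mean. The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (general category, character name or None) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch, None)

pua = "\ue000"   # first code point of the BMP Private Use Area
ka = "\u0b95"    # TAMIL LETTER KA, an ordinary encoded character

print(describe(pua))  # ('Co', None)
print(describe(ka))   # ('Lo', 'TAMIL LETTER KA')
```

Any meaning for U+E000 exists only by private agreement between the sender and the receiver.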

Q: Why not make this change to Tamil pure consonants and syllables anyway, if it will make some people happy?

A: One of the strengths of the Unicode Standard is sharing large portions of its encoding model between different scripts, while still preserving features that make each script unique. This eases the implementation burden tremendously. Many implementations, produced both by members of the Unicode Consortium and by others, have benefited from being able to support a much wider range of languages and scripts than they ever could have if they had needed dedicated teams of linguists to implement each one.

Another particularly crucial feature of Unicode is its guarantee of stability: the encoding cannot change in ways that significantly break existing implementations (for more information, see Unicode Consortium Policies).

Dual encodings of scripts have proven to be so problematical that no such encodings are considered appropriate for addition, once a script has been encoded in the standard. These stability guarantees are crucial both for current product implementation and for other standards and protocols, such as those for the Internet. Unicode must continue to be a stable platform for implementers and protocol designers to build upon, since the development cycle of products is quite long, and the lifetime of protocols must be much longer.

Q: So how are the Tamil pure consonants and syllables represented in Unicode?

A: Chapter 9 of the Unicode Standard contains a full table for Tamil, documenting all of the pure consonants and all of the syllables, showing exactly how they should appear and the precise sequence of Unicode code points used for each. The table is arranged in traditional Tamil syllable order, which is also important for understanding how Tamil should be sorted. We recommend use of that table as a starting point for discussions about Tamil in Unicode, because it makes it easier to understand how all Tamil consonants and syllables are represented in Unicode. Starting from the Unicode code chart itself can lead to misunderstandings.

Q: Do the Tamil pure consonants and syllables also have Unicode names?

A: Named sequences have been added for all Tamil pure consonants and syllables.
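
As a hedged sketch, named sequences can be looked up from Python, whose `unicodedata.lookup` has accepted named sequences since version 3.3. The two names used below are assumed to match entries in the Unicode named sequences data file as shipped with the Python in use; the helper `sequence_for` is hypothetical and returns `None` if a name is not found.

```python
import unicodedata

def sequence_for(name):
    """Look up a character name or named sequence; None if it is not known."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None

# Assumed names: a syllable (KA + vowel sign AA) and a pure consonant
# (KA + virama, which appears as the pulli in Tamil).
kaa = sequence_for("TAMIL SYLLABLE KAA")
k = sequence_for("TAMIL CONSONANT K")

for label, seq in (("TAMIL SYLLABLE KAA", kaa), ("TAMIL CONSONANT K", k)):
    if seq is not None:
        print(label, [f"U+{ord(c):04X}" for c in seq])
```

A named sequence gives a stable name to a sequence of existing code points; it does not add a new character to the standard.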

Q: If there are missing characters for Tamil, will those be encoded?

A: The Unicode Consortium has been very interested in the feedback it has been receiving in regard to missing characters, usage corrections, and improvements to the Tamil script block description.

Examples of recently encoded characters include:

New in 4.0
New in 4.1
New in 5.1

This critical feedback has markedly improved Tamil's position in Unicode over successive versions, and continued input of this kind will drive further improvements in the future.

The Unicode Standard can also add additional specifications of the behavior of sequences of Tamil characters. Such specifications can encompass many of the perceived advantages of a separate new encoding for Tamil, without requiring a disruptive change in the encoding.

The Government of India is a member of the Unicode Consortium, and the Consortium looks especially to them to help further improve the ability of Unicode to address the worldwide Tamil community.

Q: Doesn't the Tamil syllable [Tamil syllable image] cause a problem in Unicode? This syllable can be encoded in two ways. In one case the characters are out of order, so doesn't this cause problems in text comparison and parsing?

A: The syllable can be represented in two ways:

[Table of code point sequences, Line A and Line B, not reproduced.]

However, Line B above is incorrect. The two correct possibilities are the following:

[Table of corrected code point sequences, Line A and Line B, not reproduced.]

The consonant is always first in either case. The second character in Line A can be decomposed, but the order of occurrence in memory is always the same. In both cases the appearance is [Tamil syllable image]. This is clearly documented in Section 9.6, Tamil in the Unicode Standard.

Note that Line A is in form NFC, which is the preferred form for most applications, including HTML, XML, and text interchanged on the Internet. For more information, see UAX #15, Unicode Normalization Forms.
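
The effect of normalization on such a pair of spellings can be sketched in Python. The syllable used here is an assumption made for illustration: KO, written as KA followed by the two-part vowel sign O, which has a canonical decomposition into vowel sign E plus vowel sign AA. The helper `same_text` is hypothetical.

```python
import unicodedata

# Assumed example syllable: KO.
composed = "\u0b95\u0bca"            # KA, VOWEL SIGN O
decomposed = "\u0b95\u0bc6\u0bbe"    # KA, VOWEL SIGN E, VOWEL SIGN AA

def same_text(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw code point sequences differ...
print(composed == decomposed)           # False
# ...but they are canonically equivalent.
print(same_text(composed, decomposed))  # True

# In both spellings the consonant comes first in memory.
print(composed[0] == decomposed[0] == "\u0b95")  # True
```

Normalizing both sides before comparison is what makes text comparison and parsing indifferent to which of the two spellings was stored.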

Q: What can you tell me about Tamil Digit Zero?

A: "Tamil Digit Zero" is a modern innovation. An encoding for Tamil zero was added as of Unicode 4.1, U+0BE6 TAMIL DIGIT ZERO, for implementations which need to support it. For more information on Tamil digits please see Unicode Technical Note #21: "Tamil Numbers".
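
For implementations that do support it, a brief Python sketch shows that U+0BE6 behaves as an ordinary decimal digit in the character database. The three-digit sample string is made up for illustration.

```python
import unicodedata

zero = "\u0be6"

print(unicodedata.name(zero))      # TAMIL DIGIT ZERO
print(unicodedata.category(zero))  # Nd (decimal digit)
print(unicodedata.decimal(zero))   # 0

# Python's int() accepts any Unicode decimal digits, including Tamil ones.
tamil_number = "\u0be7\u0be6\u0be8"   # TAMIL DIGIT ONE, ZERO, TWO
print(int(tamil_number))              # 102
```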

Q: What is the mapping for TSCII grantha ligature 0x82 SRI?

A: Prior to Unicode 4.1, the best mapping was to the sequence <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Unicode 4.1 added the character U+0BB6 TAMIL LETTER SHA, and as a consequence the mapping should be updated to <U+0BB6, U+0BCD, U+0BB0, U+0BC0>. [EM]
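
A converter might carry both mappings, as in the following Python sketch. Only the byte value 0x82 and the two code point sequences come from the answer above; the helper `map_sri` and its `unicode_version` parameter are hypothetical, and a real TSCII converter would of course need a full table.

```python
import unicodedata

TSCII_SRI = 0x82  # TSCII byte for the grantha ligature SRI

# The two mappings quoted in the answer above.
SRI_BEFORE_4_1 = "\u0bb8\u0bcd\u0bb0\u0bc0"  # SA, VIRAMA, RA, VOWEL SIGN II
SRI_SINCE_4_1 = "\u0bb6\u0bcd\u0bb0\u0bc0"   # SHA, VIRAMA, RA, VOWEL SIGN II

def map_sri(byte, unicode_version=(4, 1)):
    """Hypothetical helper: map the TSCII SRI byte to a Unicode sequence."""
    if byte != TSCII_SRI:
        raise ValueError("only the SRI ligature is handled in this sketch")
    if unicode_version >= (4, 1):
        return SRI_SINCE_4_1
    return SRI_BEFORE_4_1

# Show the code points and names of the current mapping.
for ch in map_sri(0x82):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Data converted with the older mapping starts with U+0BB8 rather than U+0BB6, so a search for the ligature may need to accept both sequences.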