Tamil Language and Script
Q: How was the encoding of the Tamil script in the Unicode Standard established?
The encoding of the Tamil script in the Unicode Standard was originally based on ISCII (1988). That encoding was the culmination of extensive work by many experts, including linguists, programmers, typographers, and experts in standards, although constrained by 8-bit character encodings prevalent in India at that time. Like the rest of Unicode, the encoding of Tamil is identical to that used in the International Standard ISO/IEC 10646.
Q: Are there shipping implementations of Unicode Tamil?
Unicode support for Tamil is implemented in all major desktop and mobile operating systems and browsers.
Q: Are there special issues with sorting Tamil text in Unicode?
Sorting order can almost never be handled by character placement in a code chart. That is true even for English: in the Unicode code charts, "Z" sorts before "a". Furthermore, languages using the same script often sort differently, so handling sorting order is always separate from the encoding.
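The point about "Z" and "a" can be seen in a few lines of Python. This is only a minimal sketch: the word list is made up for illustration, and the case-folding key is a crude stand-in for real collation, which would rely on UCA and CLDR data.

```python
# Illustrative word list (an assumption, not from the Unicode Standard).
words = ["apple", "Zebra", "mango"]

# Default sorting compares code points, so uppercase "Z" (U+005A)
# lands before lowercase "a" (U+0061).
by_code_point = sorted(words)

# A crude case-insensitive key. It is NOT the Unicode Collation Algorithm;
# it only shows that sort order is a layer above the encoding.
by_casefold = sorted(words, key=str.casefold)

print(by_code_point)
print(by_casefold)
```

The same separation applies to Tamil: the order of characters in the code chart is not the order a Tamil reader expects in a sorted list.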
The Unicode Standard has an extensive associated standard, UTS #10, Unicode Collation Algorithm, devoted entirely to specifying mechanisms for collation. The Unicode Common Locale Data Repository (CLDR) then provides specific data for language-specific sorting for different languages.
Issues or improvements associated with the sorting or matching of characters in Tamil should be addressed in the context of the Unicode Collation Algorithm and the CLDR sort order specifications.
Q: Are details of the encoding important for natural language processing?
When considering the requirements of natural language processing, it is important to recognize the main purpose of the Unicode Standard: it is a plain text encoding, aimed at the problem of simple representation of textual content in traditional orthographies. There is of course nothing in the Unicode Standard that prevents researchers from developing higher-level protocols, such as markup schemes, to represent other aspects of textual content, including linguistic structure not directly evident from the ordinary writing system. Nor does the Unicode Standard preclude the development of alternative textual encodings for special-purpose processing such as automated NLP.
However, it would run counter to Unicode encoding principles to attempt to incorporate such higher-level protocols or alternative, special-purpose encodings directly into the Unicode Standard itself. A syllable-based re-encoding of Tamil, if aimed at NLP issues, is therefore essentially out of scope for the Unicode Standard. This is simply a matter of the appropriate level for representing data in a plain text character encoding, over and above the crucial issue of maintaining the stability of the standard for existing implementations.
Q: Is Unicode encoding efficient for Tamil?
Efficiency of processing is not a simple matter of running a few test cases aimed at one or two processes. The Unicode encoding of any script, including Tamil, is meant to have a good overall efficiency in many kinds of text processing. Furthermore, efficiency considerations have to be balanced against many other considerations, including algorithmic complexity, legacy interoperability, and parallelism in support of multiple scripts and fonts.
In particular, comparisons of raw text size in bytes, under various encoding assumptions, are really only relevant to a limited number of operations involving pure plain text. Most real-world applications with text involve embedding of text in larger contexts of markup, graphics, and other data, and in such cases, efficiency concerns are dominated by the size of the other content, rather than that of the plain text content per se. There is thus no cause for advocating new encodings of scripts already encoded in the Unicode Standard based merely on comparisons of encoded text size. This is particularly true given that the resulting costs and impact of destabilizing the standard would far outweigh any marginal gains in processing speed in some limited contexts.
Moreover, there are compression schemes and other secondary techniques that can be used to achieve greater efficiency in speed of text processing, storage, and data retrieval for specific applications in specific languages. General-purpose compression schemes such as ZIP work well.
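As a rough illustration of raw size versus compressed size, the sketch below measures one Tamil word in UTF-8 and UTF-16, then compresses a repetitive sample with Python's zlib module (DEFLATE, the method commonly used inside ZIP files). The sample text is an assumption chosen for simplicity; repeated text compresses far better than real prose would, so the numbers are not a benchmark.

```python
import zlib

# The word "Tamil" written in Tamil script: 5 code points.
tamil_word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

# Tamil code points take 3 bytes each in UTF-8 and 2 bytes each in UTF-16.
utf8_size = len(tamil_word.encode("utf-8"))
utf16_size = len(tamil_word.encode("utf-16-le"))

# A deliberately repetitive sample (hypothetical data, for illustration only).
sample = (tamil_word + " ") * 200
raw = sample.encode("utf-8")
compressed = zlib.compress(raw)

print(utf8_size, utf16_size, len(raw), len(compressed))
```

The difference between the two raw encodings is small next to the effect of general-purpose compression, which is the reason raw byte counts alone are a weak argument for re-encoding a script.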
Q: What about using private-use characters for encoding Tamil pure consonants and syllables?
This is a fine solution for internal processing, if an alternate representation is useful for the particular process. For example, a text-to-speech program might use a private-use encoding for English, whereby letters were separated according to pronunciations—the 'o' in 'love', 'rove', and 'move' all getting different private-use characters.
However, such implementations have limited usage. Private-use characters may overlap between different implementations, so general-purpose programs cannot assume any particular interpretation of such characters. In general interchange, such as in search engines, private-use characters are typically treated as unknown characters or ignored. As a result, private-use characters are inappropriate for open interchange.
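A process that uses private-use characters internally may want to check that none of them leak into text meant for interchange. The following is a minimal sketch of such a check; the function name `private_use_positions` is hypothetical, and it relies only on the General_Category value `Co` reported by Python's `unicodedata` module.

```python
import unicodedata

def private_use_positions(text):
    """Return the indexes of private-use characters (General_Category Co)."""
    return [i for i, ch in enumerate(text) if unicodedata.category(ch) == "Co"]

# U+0B95 TAMIL LETTER KA, a private-use code point, U+0BCD TAMIL SIGN VIRAMA.
internal = "\u0B95\uE000\u0BCD"
clean = "\u0B95\u0BCD"

print(private_use_positions(internal))
print(private_use_positions(clean))
```

A pipeline could run such a check at its output boundary and convert any private-use characters back to standard sequences before the text leaves the process.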
Q: How are the Tamil pure consonants and syllables represented in Unicode?
Section 12.6, Tamil in The Unicode Standard contains a full table for Tamil, documenting all of the pure consonants and all of the syllables, showing exactly how they should appear and the precise sequence of Unicode code points used for each. The table is arranged in traditional Tamil syllable order, which is also important for understanding how Tamil should be sorted. We recommend use of that table as a starting point for discussions about Tamil in Unicode, because it makes it easier to understand how all Tamil consonants and syllables are represented in Unicode. Starting from the Unicode code chart itself can lead to misunderstandings.
Q: Do the Tamil pure consonants and syllables also have Unicode names?
Named sequences have been added for all Tamil pure consonants and syllables.
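Named sequences can be resolved programmatically where a library exposes them. As a sketch, Python's `unicodedata.lookup` accepts named sequences (Python 3.3 and later); the two names used below are assumed to match the entries in NamedSequences.txt of the Unicode Character Database available to the Python build.

```python
import unicodedata

# A pure consonant is a consonant letter followed by the virama (pulli).
pure_k = unicodedata.lookup("TAMIL CONSONANT K")

# A syllable is a consonant letter followed by a vowel sign.
syllable_kaa = unicodedata.lookup("TAMIL SYLLABLE KAA")

# Show the underlying code point sequence for the pure consonant.
code_points = [f"U+{ord(ch):04X}" for ch in pure_k]

print(pure_k, syllable_kaa, code_points)
```

Note that each name resolves to a sequence of code points rather than to a single character, so the result is a string of length two in both cases.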
Q: If there are missing characters for Tamil, will those be encoded?
The Unicode Consortium has been very interested in the feedback it has received regarding missing characters, usage corrections, and improvements to the Tamil script block description. This feedback has markedly improved the coverage for Tamil in successive versions. For example, many historic fractions and other signs for Tamil have been added in the Tamil Supplement block. Continued input will lead to further improvements in the future.
The Unicode Standard can also add additional specifications of the behavior of sequences of Tamil characters. Such specifications can encompass many of the perceived advantages of a separate new encoding for Tamil, without requiring a disruptive change in the encoding.
Both the state of Tamil Nadu and the Government of India have participated as members of the Unicode Consortium, and the Consortium looks especially to them to help further improve the ability of Unicode to address the worldwide Tamil community.
Q: How is the Tamil syllable ஹோ (hō) encoded in Unicode?
This syllable can be encoded in two ways:
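A. <U+0BB9, U+0BCB> (TAMIL LETTER HA, TAMIL VOWEL SIGN OO)
B. <U+0BB9, U+0BC7, U+0BBE> (TAMIL LETTER HA, TAMIL VOWEL SIGN EE, TAMIL VOWEL SIGN AA)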
The consonant is always first in either case. The second character in Line A can be decomposed, but the order of occurrence in memory is always the same. In both cases the appearance is ஹோ. This is clearly documented in Section 12.6, Tamil in The Unicode Standard.
Note that Line A is in Normalization Form C (NFC), which is the preferred form for most applications, including HTML, XML, and text on the web in general. For more information, see UAX #15, Unicode Normalization Forms.
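The equivalence of the two spellings can be checked with a normalization library. The sketch below uses Python's `unicodedata.normalize` and takes the two-character spelling to be <U+0BB9, U+0BCB> and the three-character spelling to be <U+0BB9, U+0BC7, U+0BBE>; the helper name `same_text` is hypothetical.

```python
import unicodedata

LINE_A = "\u0BB9\u0BCB"        # HA + VOWEL SIGN OO
LINE_B = "\u0BB9\u0BC7\u0BBE"  # HA + VOWEL SIGN EE + VOWEL SIGN AA

def same_text(a, b):
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# NFC composes the two vowel signs; NFD decomposes the single vowel sign.
nfc_of_b = unicodedata.normalize("NFC", LINE_B)
nfd_of_a = unicodedata.normalize("NFD", LINE_A)

print(LINE_A == LINE_B, same_text(LINE_A, LINE_B))
```

A plain string comparison treats the two spellings as different, so applications that search or compare Tamil text generally need to normalize both sides first.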
Q: Where can I find out more about Tamil Digit Zero?
"Tamil Digit Zero" is a modern innovation. An encoding for Tamil zero was added as of Unicode 4.1 in 2005, U+0BE6 TAMIL DIGIT ZERO, for implementations which need to support it. For more information on Tamil digits, please see Unicode Technical Note #21, "Tamil Numbers".
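For implementations that do support it, the character's properties can be inspected directly. This is a small sketch using Python's `unicodedata` module; the three-digit string is made-up sample data, and Python's `int()` simply treats Tamil digits as positional decimal digits, which may or may not match the notation used in a given text.

```python
import unicodedata

zero = "\u0BE6"

name = unicodedata.name(zero)          # character name from the UCD
category = unicodedata.category(zero)  # "Nd" means decimal digit
digit_value = unicodedata.digit(zero)

# Tamil digits one, two, zero read as a positional decimal number.
value = int("\u0BE7\u0BE8\u0BE6")

print(name, category, digit_value, value)
```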
Q: What is the correct encoding for Tamil ligature shri?
Prior to Unicode 4.1, the best mapping to represent the ligature shri was to the sequence <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Unicode 4.1 added the character U+0BB6 TAMIL LETTER SHA and as a consequence, the best mapping became <U+0BB6, U+0BCD, U+0BB0, U+0BC0>. Both representations are widespread in existing text. Therefore, treating both representations as equivalent sequences is recommended, particularly in identifiers, such as domain names.
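One way to treat the two representations as equivalent is to fold the older sequence to the newer one before comparing. The following is only a sketch of that idea: the function names are hypothetical, and a real identifier system would apply such folding alongside normalization and its own profile rules rather than in isolation.

```python
# Older mapping: SA + VIRAMA + RA + VOWEL SIGN II
SHRI_WITH_SA = "\u0BB8\u0BCD\u0BB0\u0BC0"
# Mapping since Unicode 4.1: SHA + VIRAMA + RA + VOWEL SIGN II
SHRI_WITH_SHA = "\u0BB6\u0BCD\u0BB0\u0BC0"

def fold_shri(text):
    """Rewrite the older shri sequence to the sequence that uses U+0BB6."""
    return text.replace(SHRI_WITH_SA, SHRI_WITH_SHA)

def same_ignoring_shri_variant(a, b):
    """Compare two strings while treating both shri spellings as equal."""
    return fold_shri(a) == fold_shri(b)

print(same_ignoring_shri_variant(SHRI_WITH_SA, SHRI_WITH_SHA))
```

Folding in one direction only keeps the operation idempotent: applying `fold_shri` twice gives the same result as applying it once.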
Q: Where can I find more about other scripts of India and South Asia?
See Indic Scripts.