
Indic Scripts and Languages

Q: What is ISCII?

A: Indian Standard Code for Information Interchange (ISCII) is the character code for Indian languages that originate from Brahmi script. ISCII was evolved by a standardization committee under the Department of Electronics during 1986-88, and adopted by the Bureau of Indian Standards (BIS) in 1991. Unlike Unicode, ISCII is an 8-bit encoding that uses escape sequences to announce the particular Indic script represented by a following coded character sequence. The ISCII document is IS13194:1991, available from the BIS offices.

The ISCII Standard can be found on the web, for example at Sourceforge.

Q: How does Unicode differ from ISCII?

A: Except for a few minor differences, they correspond directly. Unicode is designed to be a multilingual encoding that requires no escape sequences or switching between scripts. For any given Indic script, the consonant and vowel letter codes of Unicode are based on ISCII. ISCII allowed control over character formation by combining letters with the characters NUKTA, INV, and HALANT. Unicode provides similar control with the ZWJ and ZWNJ characters.

The prototypical example is the "explicit halant":

    ISCII:   Halant + Halant
    Unicode: Halant + ZWNJ

The "soft halant" of ISCII is expressed:

    ISCII:   Halant + Nukta
    Unicode: Halant + ZWJ

The "explicit halant" is discussed in the ISCII standard, section 6.3.1 and "soft halant" is discussed in 6.3.2.
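The halant correspondence can be sketched as a small token-level rewrite. This is only an illustration under stated assumptions: the function name `iscii_halant_to_unicode` and the symbolic token names are hypothetical, Devanagari is used as the example script, and real ISCII data is a byte stream with script-announcing escape sequences that this sketch does not parse.

```python
# Unicode code points used by the mapping (Devanagari as the example script).
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def iscii_halant_to_unicode(tokens):
    """Rewrite symbolic ISCII halant pairs into a Unicode string.

    `tokens` is a list of hypothetical symbolic names ("HALANT", "NUKTA");
    any other token is assumed to be already-converted Unicode text.
    """
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair == ("HALANT", "HALANT"):    # ISCII explicit halant
            out.append(VIRAMA + ZWNJ)
            i += 2
        elif pair == ("HALANT", "NUKTA"):   # ISCII soft halant
            out.append(VIRAMA + ZWJ)
            i += 2
        elif tokens[i] == "HALANT":         # ordinary halant
            out.append(VIRAMA)
            i += 1
        elif tokens[i] == "NUKTA":          # nukta used as a diacritic
            out.append(NUKTA)
            i += 1
        else:                               # already-converted text
            out.append(tokens[i])
            i += 1
    return "".join(out)
```

For example, `iscii_halant_to_unicode(["\u0924", "HALANT", "HALANT", "\u0928"])` yields ta, virama, ZWNJ, na. The other ISCII uses of nukta and the ATR and EXT codes are deliberately left out.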

There are several categories of such differences. See also Chapter 12, South Asian Scripts-I in The Unicode Standard for details. Unicode also includes the right side "pieces" of some two-part vowel signs for compatibility with some software. For more on vowel pieces, see below.

The ISCII Attribute code (ATR) is not represented in the Unicode Standard, which is a plain text standard. The ISCII Attribute code is intended to explicitly define a font attribute applicable to following characters, and thus represents an embedded control for the kinds of font and style information which is not carried in a plain text encoding.

The ISCII Extension code (EXT) is also not represented directly in the Unicode Standard. The Extension code is an escape mechanism, allowing the 8-bit ISCII standard to define an extended repertoire via an escaped reencoding of certain byte values. Such a mechanism is not required in the Unicode Standard, which simply uses additional code points to encode any additional character repertoire.

Q: Unicode doesn't have an "invisible letter" (INV) like ISCII. How can I form the combinations that use INV in ISCII?

A: There are four uses of nukta in ISCII. Unicode only uses the first two. Unicode doesn't use nukta for soft halant and doesn't use it for code extension. Unicode does use nukta to represent the nukta diacritic, either in cases such as "qa" U+0958 or in cases like "nnna" U+0929. Unicode doesn't use nukta for the "om" character (e.g., chandrabindu + nukta in ISCII, which is encoded as a separate character in Unicode). One other use of INV in ISCII is as a base letter. This may be expressed with a space or no-break space in Unicode, depending on whether the result is to be a "word-like" character or not:



    ISCII:   INV + vowel-sign
    Unicode: SPACE + vowel-sign

    ISCII:   INV + vowel-sign
    Unicode: NBSP + vowel-sign
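As a minimal sketch of the two Unicode spellings, the snippet below builds an isolated vowel sign on each base. The helper name `isolated_vowel_sign` is hypothetical, and DEVANAGARI VOWEL SIGN I is chosen only as an example. Which base to use is left to the caller, since it depends on whether the result should behave as a "word-like" character.

```python
import unicodedata

SPACE = "\u0020"         # SPACE
NBSP = "\u00A0"          # NO-BREAK SPACE
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (example vowel sign)


def isolated_vowel_sign(base, vowel_sign):
    """Return the Unicode sequence standing in for ISCII INV + vowel-sign.

    `base` must be SPACE or NBSP; the caller chooses between them.
    """
    if base not in (SPACE, NBSP):
        raise ValueError("base must be SPACE or NO-BREAK SPACE")
    return base + vowel_sign


for base in (SPACE, NBSP):
    seq = isolated_vowel_sign(base, VOWEL_SIGN_I)
    # Print code points and names rather than glyphs, since display
    # depends on the font and rendering engine.
    print([f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq])
```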

Q: Is India involved in Unicode?

A: The Government of India is a member of the Unicode Consortium, and has been engaged in a dialogue with the UTC about additional characters in the Indic blocks and improvements to the textual descriptions and annotations.

Q: How do the Indic scripts work in Unicode?

A: See Chapter 12, South Asian Scripts-I in The Unicode Standard.

Particularly relevant is the section on Devanagari, which not only describes the Devanagari script in detail but also outlines the model used for all similarly structured scripts in the standard. This model is based on the ISCII model.

Information about the OpenType format and Uniscribe can be found in the excellent article Windows Glyph Processing by John Hudson. [AJ]

Q: Does Unicode cover Vedic accents?

A: Yes. Characters used to indicate tone in Vedic Sanskrit appear in the Devanagari Extended block, the Vedic Extensions block, and the Devanagari block. A brief overview is given in the Devanagari Extended and Vedic Extensions block introductions in Chapter 12, South Asian Scripts-I in The Unicode Standard.

Q: What is the difference between Unicode fonts and other fonts?

A: First, for "What is a Unicode Font" see the Font FAQ. The font would need to contain a glyph for each allocated code point of the script. For example, Gujarati would contain glyphs for the allocated code points in the range: U+0A80 - U+0AFF. In addition to these, the font should have: (a) glyphs for conjuncts; (b) variants for vowel signs (matras), vowel modifiers (Chandrabindu, Anuswar), the consonant modifier (Nukta); (c) digits and any appropriate punctuation marks (perhaps some that are appropriate from the Latin ranges).

The contents of (a) and (b) depend not only on the typographical quality the font is intended to achieve but also whether the font has glyphs just in contemporary use or also includes those used in traditional formats.

The contents of (a) and (b) can be accessed by providing a Glyph Substitution table in the font. Such a table is more often than not a necessity for Indic scripts. A Glyph Positioning table is also needed to achieve the minimal required mark positioning in such scripts. More information on these issues is contained in the OpenType Specification.

There is also a specification for Creating and Supporting OpenType Fonts for Indic Scripts. [AJ]
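The first requirement above, a glyph for each allocated code point of the script, presupposes a list of those code points. As a rough sketch, Python's standard `unicodedata` module can enumerate the assigned code points in a range. The helper name `assigned_code_points` is hypothetical, and the result reflects the Unicode version bundled with the Python build rather than any particular font.

```python
import unicodedata


def assigned_code_points(first, last):
    """Return the code points in [first, last] that are assigned.

    Unassigned code points have General Category "Cn".
    """
    return [
        cp for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) != "Cn"
    ]


# Gujarati block, using the range given in the text: U+0A80 - U+0AFF.
gujarati = assigned_code_points(0x0A80, 0x0AFF)

print("Unicode data version:", unicodedata.unidata_version)
print("Assigned Gujarati code points:", len(gujarati))
for cp in gujarati[:5]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Checking whether a given font actually maps each of these code points requires reading its cmap table with a font library, which is outside this sketch.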

Q: Are there separate Unicode fonts?

A: A font that has glyphs mapped as above is a Unicode font. Some tables for such fonts are common and necessary (cmap, name, OS/2, etc.); others depend on the type of glyph outlines (TrueType, PostScript, ...). [AJ]

Q: If yes, where are they available?

A: Microsoft has made several OpenType Indic script fonts with TrueType outlines, such as:

Latha (Tamil)
Mangal (Devanagari)
Raavi (Gurmukhi and Devanagari)
Shruti (Gujarati and Devanagari)
Tunga (Kannada and Devanagari)

These fonts are also available for download from the community site of VOLT (see below). 

The Indic fonts shipped with Apple's OS X and iOS have the proper AAT tables to support Indic languages using the Unicode encoding.

There are also many other small development teams creating Indic fonts. Many of them are listed on Alan Wood's Unicode Fonts page.

Q: Is it possible to convert other fonts to Unicode?

A: Yes, many tools have been released that allow such a conversion. Some of the better known ones are:

Microsoft's Visual OpenType Layout Tool (VOLT)
Apple's Font Tools
Adobe's Font Development Kit
Pyrus' FontLab
FontForge (X11-based, for Linux, Mac OS X, Cygwin, etc.)

Also see the specification for Creating and Supporting OpenType Fonts for Indic Scripts.

Q: Do I need an IME to properly input Indic script languages?

A: Indic languages can be input via a traditional keyboard, with a proper keyboard mapping. The work then falls to the rendering engine to display the characters in their proper order and shape. [CW]

Q: Is the keyboard arrangement in a Unicode system different from that of the regular "TTF" fonts?

A: Keyboarding questions are separate from the questions of encoding. Some of the keyboards provided with Windows can be seen on Microsoft's Windows Keyboard Layout website. [AJ]

Q: I have specific questions about Tamil. Where are the answers?

A: See the Tamil FAQ.

Q: I have specific questions about Bengali (Bangla). Where are the answers?

A: See the Bengali (Bangla) FAQ.

Q: What about collation of Indic language data? Is that just a binary sort?

A: No. Collation order is not the same as code point order. A good treatment of some issues specific to collation in Indic languages can be found in the paper Issues in Indic Language Collation by Cathy Wissink.

Collation in general must proceed at the level of language or language variant, not at the script or code point levels. See also UTS #10: Unicode Collation Algorithm. Some Indic-specific issues are also discussed in that report.
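One way to see why a binary sort is not a collation is to sort the same letters in two canonically equivalent encodings. The sketch below uses only Python's standard library. It is not an implementation of the Unicode Collation Algorithm, and the three sample letters are chosen purely for illustration.

```python
import unicodedata

KA = "\u0915"   # DEVANAGARI LETTER KA
KHA = "\u0916"  # DEVANAGARI LETTER KHA
QA = "\u0958"   # DEVANAGARI LETTER QA (canonically equivalent to KA + NUKTA)

letters = [KHA, QA, KA]

# Python compares strings by code point, so this is a binary sort.
binary_order = sorted(letters)

# NFC turns U+0958 into the sequence U+0915 U+093C, because U+0958 is
# excluded from composition. The same binary sort now gives another order.
nfc_order = sorted(unicodedata.normalize("NFC", s) for s in letters)


def code_points(strings):
    """Format each string as a list of U+XXXX code points."""
    return [" ".join(f"U+{ord(ch):04X}" for ch in s) for s in strings]


print("binary order, as typed:  ", code_points(binary_order))
print("binary order, after NFC: ", code_points(nfc_order))
```

Because the position of QA changes with the encoding form alone, code point order cannot serve as the user-visible sort order. A language-aware collation is needed instead.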

Q: I cannot find the "half forms" of Devanagari letters (or any other Indic script) in the Unicode code charts. These characters are needed to form words such as "patni".

A: Unicode does not encode half or subjoined letters for the scripts of India. As in the ISCII standard, Unicode forms all "consonant clusters" (such as the "tn" in "patni") by inserting the character "virama" (or "halant") between the two relevant consonant letters. For instance, the Devanagari syllable "tna" ("ligature tna") is encoded with the following code points:

    U+0924 DEVANAGARI LETTER TA
    U+094D DEVANAGARI SIGN VIRAMA
    U+0928 DEVANAGARI LETTER NA

These three characters will normally be displayed using the single tna ligature glyph. But it is also possible that they are displayed using a half ta glyph followed by a full na glyph, or even with a full ta glyph combined with a virama glyph and followed by a full na glyph.

Which form will be actually displayed is the decision of an underlying software module called a "display engine", which bases this decision on the availability of glyphs in the font.

If the sequence U+0924, U+094D is not followed by another consonant letter (such as "na"), it is always displayed as a full ta glyph combined with the virama glyph.

Unicode provides a way to force the display engine to show a half letter form. To do this, an invisible character called ZERO WIDTH JOINER should be inserted after the virama:

    U+0924 DEVANAGARI LETTER TA
    U+094D DEVANAGARI SIGN VIRAMA
    U+200D ZERO WIDTH JOINER
    U+0928 DEVANAGARI LETTER NA

This sequence is always displayed as a half ta glyph followed by a full na glyph. Even if the consonant "na" is not present, the sequence U+0924, U+094D, U+200D is displayed as a half ta glyph.

Unicode also provides a way to force the display engine to show the virama glyph. To do this, an invisible character called ZERO WIDTH NON-JOINER should be inserted after the virama:

    U+0924 DEVANAGARI LETTER TA
    U+094D DEVANAGARI SIGN VIRAMA
    U+200C ZERO WIDTH NON-JOINER
    U+0928 DEVANAGARI LETTER NA

This sequence is always displayed as a full ta glyph combined with a virama glyph and followed by a full na glyph.
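The three encodings of "tna" discussed in this answer can be written out as Python string literals. This is a small sketch for inspecting the code points. The labels and the helper name `describe` are illustrative, and how each sequence is actually drawn still depends on the font and display engine.

```python
import unicodedata

TA = "\u0924"      # DEVANAGARI LETTER TA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
NA = "\u0928"      # DEVANAGARI LETTER NA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

sequences = {
    "default (display engine chooses)": TA + VIRAMA + NA,
    "half form requested": TA + VIRAMA + ZWJ + NA,
    "visible virama requested": TA + VIRAMA + ZWNJ + NA,
}


def describe(text):
    """List each character of `text` as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in sequences.items():
    print(label)
    for line in describe(text):
        print("    " + line)
```

In all three cases the consonants and the virama are the same characters. Only the optional joiner after the virama differs.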

For more detailed information, see Chapter 12, South Asian Scripts-I in The Unicode Standard. For related issues, see "Where is My Character?" [MC]

Q: Can you rename the character called VIRAMA in my script to HALANT?

A: In the Unicode Standard, the sign indicating the absence of an inherent vowel in Indic scripts is denoted by the Sanskrit word virama. In particular languages, another designation is often preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used; in Bengali, the word hasant is used; and so on.

The Unicode stability policies prevent character names from being changed. However, the code charts and character descriptions often contain annotations showing the preferred name, such as:

    094D DEVANAGARI SIGN VIRAMA
        = halant (the preferred Hindi name)
        • suppresses inherent vowel

Q: KANNADA VOWEL SIGN I (U+0CBF) and KANNADA VOWEL SIGN E (U+0CC6) seem to have inconsistent character properties. They have General Category Mn and Bidi_Class L. However, UAX #9 says that all Me and Mn category characters are Bidi_Class NSM. Is this right?

A: Yes. This was an explicit decision by the UTC for these characters, made to preserve canonical equivalence under the Unicode Bidirectional Algorithm (UBA) for two vowels that involve these characters as parts of their decompositions.

The UBA is designed to maintain canonical equivalence. Normally all of the combining characters have the Bidi_Class NSM, but when combining characters would cause problems for canonical equivalence, they are given different Bidi_Class values.
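The property values in question can be inspected with Python's standard `unicodedata` module. This is a sketch that reflects the Unicode version bundled with the Python build. The helper names `properties` and `decomposes_to` are hypothetical.

```python
import unicodedata

VOWEL_SIGN_I = "\u0CBF"  # KANNADA VOWEL SIGN I
VOWEL_SIGN_E = "\u0CC6"  # KANNADA VOWEL SIGN E


def properties(ch):
    """Return (name, General Category, Bidi_Class) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


def decomposes_to(first, block=range(0x0C80, 0x0D00)):
    """Find Kannada-block characters whose decomposition starts with `first`."""
    target = f"{ord(first):04X}"
    return [
        cp for cp in block
        if unicodedata.decomposition(chr(cp)).split()[:1] == [target]
    ]


for ch in (VOWEL_SIGN_I, VOWEL_SIGN_E):
    print(properties(ch))
    for cp in decomposes_to(ch):
        print(f"    part of U+{cp:04X}: {unicodedata.decomposition(chr(cp))}")
```

Characters without a decomposition return an empty string from `unicodedata.decomposition`, so they never match the target and are skipped.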

Q: How are the Sindhi implosives represented?

A: The characters U+097B DEVANAGARI LETTER GGA, U+097C DEVANAGARI LETTER JJA, U+097E DEVANAGARI LETTER DDDA, and U+097F DEVANAGARI LETTER BBA are used to write Sindhi implosive consonants. Versions of the Unicode Standard prior to Version 5.0 recommended the representation of Sindhi implosive consonants by sequences of the plain consonant letters followed by anudatta (or by nukta). Such sequences are no longer recommended. [EM]