
Indic Scripts and Languages

Q: What is ISCII?

A: Indian Standard Code for Information Interchange (ISCII) is the character code for Indian languages that originate from Brahmi script. ISCII was evolved by a standardization committee under the Department of Electronics during 1986-88, and adopted by the Bureau of Indian Standards (BIS) in 1991. Unlike Unicode, ISCII is an 8-bit encoding that uses escape sequences to announce the particular Indic script represented by a following coded character sequence. The ISCII document is IS13194:1991, available from the BIS offices.

The ISCII Standard can be found on the web, for example at Sourceforge.

Q: How does Unicode differ from ISCII?

A: Except for a few minor differences, they correspond directly. Unicode is designed to be a multilingual encoding that requires no escape sequences or switching between scripts. For any given Indic script, the consonant and vowel letter codes of Unicode are based on ISCII. ISCII allowed control over character formation by combining letters with the characters NUKTA, INV, and HALANT. Unicode provides similar control with the ZWJ and ZWNJ characters.

The prototypical example is the "explicit halant":

ISCII:

    Halant + Halant

Unicode:

    Halant + ZWNJ

The "soft halant" of ISCII is expressed:

ISCII:

    Halant + Nukta

Unicode:

    Halant + ZWJ

The "explicit halant" is discussed in the ISCII standard, section 6.3.1 and "soft halant" is discussed in 6.3.2.
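As a rough illustration of the Unicode side of the two mappings above, the following Python sketch builds both sequences for Devanagari. The choice of KA as the base consonant and the helper name `describe` are assumptions made for the example. How each sequence is actually displayed depends on the font and the rendering engine.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA (example consonant, chosen arbitrarily)
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA (= halant)
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# ISCII "explicit halant" (Halant + Halant) corresponds to Halant + ZWNJ.
explicit_halant = KA + HALANT + ZWNJ

# ISCII "soft halant" (Halant + Nukta) corresponds to Halant + ZWJ.
soft_halant = KA + HALANT + ZWJ


def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(describe(explicit_halant))
print(describe(soft_halant))
```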

There are several categories of such differences. See also Chapter 9, South Asian Scripts-I in the Unicode Standard for details. Unicode also includes the right side "pieces" of some two-part vowel signs for compatibility with some software. For more on vowel pieces, see below.

The ISCII Attribute code (ATR) is not represented in the Unicode Standard, which is a plain text standard. The ISCII Attribute code is intended to explicitly define a font attribute applicable to following characters, and thus represents an embedded control for the kinds of font and style information which is not carried in a plain text encoding.

The ISCII Extension code (EXT) is also not represented directly in the Unicode Standard. The Extension code is an escape mechanism, allowing the 8-bit ISCII standard to define an extended repertoire via an escaped reencoding of certain byte values. Such a mechanism is not required in the Unicode Standard, which simply uses additional code points to encode any additional character repertoire.

Q. Unicode doesn't have an "invisible letter" (INV) like ISCII. How can I form the combinations that use INV in ISCII?

A: There are four uses of nukta in ISCII. Unicode only uses the first two. Unicode doesn't use nukta for soft halant and doesn't use it for code extension. Unicode does use nukta to represent the nukta diacritic, either in cases such as "qa" U+0958 or in cases like "nnna" U+0929. Unicode doesn't use nukta for the "om" character (e.g. chandrabindu + nukta in ISCII, which is encoded as a separate character in Unicode). One other use of INV in ISCII is as a base letter. This may be expressed with a space or no-break space in Unicode, depending on whether the result is to be a "word-like" character or not:

    ISCII                 Unicode

    INV + vowel-sign      SPACE + vowel-sign
    INV + vowel-sign      NBSP + vowel-sign
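A small Python sketch of the two Unicode sequences follows. The Devanagari vowel sign I is used only as an example vowel sign, and the helper `code_points` is a name invented for this illustration.

```python
SPACE = "\u0020"         # SPACE
NBSP = "\u00A0"          # NO-BREAK SPACE
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (example vowel sign)

# A vowel sign shown on its own, carried by a space or a no-break space
# instead of the ISCII INV character.
isolated_with_space = SPACE + VOWEL_SIGN_I
isolated_with_nbsp = NBSP + VOWEL_SIGN_I


def code_points(text):
    """Format text as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points(isolated_with_space))  # U+0020 U+093F
print(code_points(isolated_with_nbsp))   # U+00A0 U+093F
```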

Q: Is India involved in Unicode?

A: The Government of India is a member of the Unicode Consortium, and has been engaged in a dialogue with the UTC about additional characters in the Indic blocks and improvements to the textual descriptions and annotations.

Q: How do the Indic scripts work in Unicode?

A: See Chapter 9 of the Unicode Standard, South Asian Scripts-I.

Particularly relevant is the section on Devanagari, which not only gives a detailed description of the Devanagari script but also outlines the model used for all similarly structured scripts in the standard. This model is based on the ISCII model.

Information about the OpenType format and Uniscribe can be found in the excellent article Windows Glyph Processing by John Hudson. [AJ]

Q: Does Unicode cover Vedic accents?

A: Yes. Characters used to indicate tone in Vedic Sanskrit appear in the Devanagari Extended block, the Vedic Extensions block, and the Devanagari block. A brief overview is given in the Devanagari Extended and Vedic Extensions block introductions in Chapter 9, South Asian Scripts-I in the Unicode Standard.

Q: What is the difference between Unicode fonts and other fonts?

A: First, for "What is a Unicode Font" see the Font FAQ. The font would need to contain a glyph for each allocated code point of the script. For example, Gujarati would contain glyphs for the allocated code points in the range: U+0A80 - U+0AFF. In addition to these, the font should have: (a) glyphs for conjuncts; (b) variants for vowel signs (matras), vowel modifiers (Chandrabindu, Anuswar), the consonant modifier (Nukta); (c) digits and any appropriate punctuation marks (perhaps some that are appropriate from the Latin ranges).
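To see which code points in a block are actually allocated, one option is to query the Unicode Character Database that ships with a programming language. The Python sketch below does this for the Gujarati range mentioned above. The function name is hypothetical, and the result depends on the Unicode version bundled with the Python interpreter, so no count is given here. It says nothing about whether a particular font contains glyphs for these code points.

```python
import unicodedata

GUJARATI_BLOCK = range(0x0A80, 0x0B00)  # U+0A80..U+0AFF inclusive


def allocated_code_points(block):
    """Return the code points in block that have a character name
    in the Unicode data bundled with this Python interpreter."""
    return [cp for cp in block if unicodedata.name(chr(cp), "")]


allocated = allocated_code_points(GUJARATI_BLOCK)
print(f"{len(allocated)} allocated code points "
      f"(Unicode {unicodedata.unidata_version})")
for cp in allocated[:5]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```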

The contents of (a) and (b) depend not only on the typographical quality the font is intended to achieve but also whether the font has glyphs just in contemporary use or also includes those used in traditional formats.

The contents of (a) and (b) can be accessed by providing a Glyph Substitution table in the font. Such a table is more often than not a necessity for Indic scripts. A Glyph Positioning table is also needed to achieve the minimal required mark positioning in such scripts. More information on these issues is contained in the OpenType Specifications.

There is also a specification for Creating and Supporting OpenType Fonts for Indic Scripts. [AJ]

Q: Are there separate Unicode fonts?

A: A font that has glyphs mapped as above is a Unicode font. Some tables for such fonts are common and necessary (cmap, name, OS/2, etc.), while others depend on the type of glyph outlines (TrueType, PostScript...). [AJ]

Q: If yes, where are they available?

A: Microsoft has made several OpenType Indic script fonts with TrueType outlines, such as:

Latha (Tamil)
Mangal (Devanagari)
Raavi (Gurmukhi and Devanagari)
Shruti (Gujarati and Devanagari)
Tunga (Kannada and Devanagari)

These fonts are also available for download from the community site of VOLT (see below). 

The Indic fonts shipped with Apple's OS X and iOS have the proper AAT tables to support Indic languages using the Unicode encoding.

There are also many other small development teams creating Indic fonts. Many of them are listed on Alan Wood's Unicode Fonts page.

Q: Is it possible to convert other fonts to Unicode?

A: Yes, many tools have been released that allow such a conversion. Some of the better-known ones are:

Microsoft's Visual OpenType Layout Tool (VOLT)
Apple's Font Tools
Adobe's Font Development Kit
Pyrus' FontLab
FontForge (X11-based; for Linux, Mac OS X, Cygwin, etc.)

Also see the specification for Creating and Supporting OpenType Fonts for Indic Scripts.

Q: Do I need an IME to properly input Indic script languages?

A: Indic languages can be input via a traditional keyboard, with a proper keyboard mapping. The work then falls to the rendering engine to display the characters in their proper order and shape. [CW]

Q: Is the keyboard arrangement in a Unicode system different from that of the regular "TTF" fonts?

A: Keyboarding questions are separate from the questions of encoding. Some of the keyboards provided with Windows can be seen on Microsoft's Windows Keyboard Layout website. [AJ]

Q: I have specific questions about Tamil. Where are the answers?

A: See the Tamil FAQ.

Q: What are the Bengali characters used to transcribe the sound "a" (as in English "bat") in Unicode?

A: Bengali uses the symbol অ্যা, and sometimes এ্যা, to represent this sound when it begins a word. These symbols graphically appear to be made up of the letter অ (U+0985) or এ (U+098F), plus a form known as ya-phalaa, and a final া (U+09BE).

Ya-phalaa is the form the letter YA often takes when it is the last component of a consonant conjunct. For example, TA + VIRAMA + YA may be displayed as TA + YA-PHALAA. In many cases, a sequence ... + VIRAMA + YA may be expected to produce a YA-PHALAA.

In view of the graphical appearance plus the common '+VIRAMA+YA' behavior, the recommendation is to encode these characters as follows:

[Figure: Bengali example 2]

If a candrabindu or other combining mark needs to be added in the sequence it comes at the end of the sequence. For example:

[Figure: Bengali example 4]
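The original illustrations are not reproduced here, so the following Python sketch is an inference from the description above rather than a copy of them. It assumes the recommended sequence is the base letter (U+0985 BENGALI LETTER A or U+098F BENGALI LETTER E) followed by VIRAMA, YA, and VOWEL SIGN AA, with any further combining mark, such as candrabindu, placed last. The function name `with_ya_phalaa` is hypothetical.

```python
import unicodedata

A = "\u0985"            # BENGALI LETTER A
E = "\u098F"            # BENGALI LETTER E
VIRAMA = "\u09CD"       # BENGALI SIGN VIRAMA
YA = "\u09AF"           # BENGALI LETTER YA
AA = "\u09BE"           # BENGALI VOWEL SIGN AA
CANDRABINDU = "\u0981"  # BENGALI SIGN CANDRABINDU


def with_ya_phalaa(base, trailing_marks=""):
    """Build base + VIRAMA + YA + AA, with extra combining marks at the end."""
    return base + VIRAMA + YA + AA + trailing_marks


for seq in (with_ya_phalaa(A), with_ya_phalaa(E), with_ya_phalaa(A, CANDRABINDU)):
    print(" ".join(f"U+{ord(ch):04X}" for ch in seq))
```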

Q: Can you provide a clarification of Bengali Reph and Ya-phalaa usage?

A: The formation of the Reph form is defined in Section 9.1, Rules for Rendering, R2 in the Unicode Standard. Basically, the Reph is formed when a Ra whose inherent vowel has been killed by the virama/halant begins a syllable. This is shown in the following example.

[Figure: Bengali reph example 1]

The Ya-phalaa is a post-base form of Ya and is formed when the Ya is the final consonant of a syllable cluster. In this case, the previous consonant retains its base shape and the virama/halant is combined with the following Ya. This is shown in the following example.

[Figure: Bengali reph example 2]

An ambiguous situation arises with the combination Ra + virama/halant + Ya.

[Figure: Bengali reph example 3]

To resolve the ambiguity with this combination and to have consistent behavior, we need to look at the processing order of the Bengali script. When parsing the text, the ability to form the Reph is identified first and therefore the Reph form should have priority in processing. Thus, it is necessary to insert a ZWNJ character into the stream between the Ra and virama/halant to allow the virama/halant and Ya to be grouped together during processing.

[Figure: Bengali reph example 4]

In the example above, the ZWNJ is used because we are saying that we want two characters that would join by default to remain as separate entities. When the Ra is not the first character in the cluster, the ZWNJ is not required for the formation of the Ya-phalaa. However, to make it easy to enter the Ya-phalaa with a single key, it should be permissible for the Ya-phalaa to be consistently formed by “ZWNJ + VIRAMA + YA” (U+200C + U+09CD + U+09AF). [PN]

Q: What about collation of Indic language data? Is that just a binary sort?

A: No. Collation order is not the same as code point order. A good treatment of some issues specific to collation in Indic languages can be found in the paper Issues in Indic Language Collation by Cathy Wissink.

Collation in general must proceed at the level of language or language variant, not at the script or code point levels. See also UTS #10: Unicode Collation Algorithm. Some Indic-specific issues are also discussed in that report.
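One way to see why a binary sort is not enough is to compare two canonically equivalent spellings of the same letter. The Python sketch below is only an illustration of the problem, not an implementation of UTS #10. It relies on the fact that Python compares strings by code point, and uses DEVANAGARI LETTER QA (U+0958) and its canonical decomposition KA + NUKTA as the example pair.

```python
import unicodedata

QA = "\u0958"              # DEVANAGARI LETTER QA (precomposed)
KA_NUKTA = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA
KHA = "\u0916"             # DEVANAGARI LETTER KHA

# The two spellings are canonically equivalent: QA decomposes to KA + NUKTA.
assert unicodedata.normalize("NFD", QA) == KA_NUKTA

# sorted() on str compares by code point, i.e. a binary sort.
precomposed_order = sorted([QA, KHA])
decomposed_order = sorted([KA_NUKTA, KHA])

# The same letter lands on different sides of KHA depending on its spelling.
print(precomposed_order.index(QA))       # 1: QA sorts after KHA
print(decomposed_order.index(KA_NUKTA))  # 0: KA + NUKTA sorts before KHA
```

A collation implementation avoids this kind of inconsistency by comparing language-specific sort keys instead of raw code points.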

Q: I cannot find the "half forms" of Devanagari letters (or any other Indic script) in the Unicode code charts. These characters are needed to form words such as "patni".

A: Unicode does not encode half or subjoined letters for the scripts of India. As in the ISCII standard, Unicode forms all "consonant clusters" (such as the "tn" in "patni") by inserting the character "virama" (or "halant") between the two relevant consonant letters. For instance, the Devanagari syllable "tna" ("ligature tna") is encoded with the following code points:

U+0924 DEVANAGARI LETTER TA
U+094D DEVANAGARI SIGN VIRAMA (= halant)
U+0928 DEVANAGARI LETTER NA

These three characters will normally be displayed using a single tna ligature glyph. But they may also be displayed using a half ta glyph followed by a full na glyph, or even with a full ta glyph combined with a virama glyph and followed by a full na glyph.

Which form is actually displayed is decided by an underlying software module called a "display engine", which bases this decision on the availability of glyphs in the font.

If the sequence U+0924, U+094D is not followed by another consonant letter (such as "na"), it is always displayed as a full ta glyph combined with the virama glyph.

Unicode provides a way to force the display engine to show a half letter form. To do this, an invisible character called ZERO WIDTH JOINER should be inserted after the virama:

U+0924 DEVANAGARI LETTER TA
U+094D DEVANAGARI SIGN VIRAMA (= halant)
U+200D ZERO WIDTH JOINER
U+0928 DEVANAGARI LETTER NA

This sequence is always displayed as a half ta glyph followed by a full na glyph. Even if the consonant "na" is not present, the sequence U+0924, U+094D, U+200D is displayed as a half ta glyph.

Unicode also provides a way to force the display engine to show the virama glyph. To do this, an invisible character called ZERO WIDTH NON-JOINER should be inserted after the virama:

U+0924 DEVANAGARI LETTER TA
U+094D DEVANAGARI SIGN VIRAMA (= halant)
U+200C ZERO WIDTH NON-JOINER
U+0928 DEVANAGARI LETTER NA

This sequence is always displayed as a full ta glyph combined with a virama glyph and followed by a full na glyph.
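The three encodings of "tna" described in this answer can be put side by side in a short Python sketch. The dictionary labels and the spelling of "patni" (PA, TA, VIRAMA, NA, VOWEL SIGN II) are assumptions added for illustration. The code only builds the character sequences, and the glyphs that appear still depend on the display engine and the font.

```python
import unicodedata

TA = "\u0924"      # DEVANAGARI LETTER TA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (= halant)
NA = "\u0928"      # DEVANAGARI LETTER NA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

sequences = {
    "default (display engine chooses)": TA + VIRAMA + NA,
    "half ta requested (ZWJ)": TA + VIRAMA + ZWJ + NA,
    "visible virama requested (ZWNJ)": TA + VIRAMA + ZWNJ + NA,
}

for label, text in sequences.items():
    cps = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {cps}")

# The word "patni", spelled with the default cluster encoding.
patni = "\u092A" + TA + VIRAMA + NA + "\u0940"
print([unicodedata.name(ch) for ch in patni])
```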

For more detailed information, see Chapter 9 of the Unicode Standard, South Asian Scripts-I. For related issues, see "Where is My Character?" [MC]

Q: Bangla should be used in the Unicode Standard instead of Bengali. What can I do to correct the spelling?

A: The Unicode Standard does not constrain the names that people use for their own scripts, languages or characters. The particular labels used in the standard to identify characters and blocks are subject to stability constraints and cannot be changed. In the case of Bengali, annotations and explanations have been added to the standard regarding preferred names, such as Bangla. See Section 9.2, Bengali.

Q: I cannot find the Bengali khanda ta letter in the Unicode code charts. This character is needed to form words such as utkarsha.

A: The khanda ta letter was added to the Unicode Standard as of Version 4.1. It is encoded at: U+09CE BENGALI LETTER KHANDA TA. Use of this character is described in Section 9.2, Bengali.

Q: The Bangla "fullstop" is similar to the Devanagari danda (U+0964) both being taken from the Brahmi script, but the corresponding point in the Bengali block at U+09E4 is reserved. To write Bangla end of sentence (dari) what should I use?

A: All Unicode characters are equally accessible, and many punctuation elements are used across several scripts. You should use U+0964 as the danda for several scripts, including Bengali. Also U+0965 is the double danda for these scripts.
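As a minimal sketch, the Python snippet below ends a run of Bengali text with U+0964. The sample word (the letters AA, MA, and VOWEL SIGN I) is an assumption chosen only to have some Bengali text to terminate.

```python
import unicodedata

DANDA = "\u0964"         # DEVANAGARI DANDA, shared by several scripts
DOUBLE_DANDA = "\u0965"  # DEVANAGARI DOUBLE DANDA

# Sample Bengali word: BENGALI LETTER AA, BENGALI LETTER MA, BENGALI VOWEL SIGN I.
bengali_word = "\u0986\u09AE\u09BF"

sentence = bengali_word + DANDA
print(unicodedata.name(sentence[-1]))
print(unicodedata.name(DOUBLE_DANDA))
```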

Q: Can you rename the character called VIRAMA in my script to HALANT?

A. In the Unicode Standard, the sign indicating the absence of an inherent vowel in Indic scripts is denoted by the Sanskrit word virama. In the particular languages another designation is often preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used; in Bengali, the word hasant is used, and so on.

The Unicode stability policies prevent character names from being changed. However, the code charts and character descriptions will contain annotations showing the preferred name, such as:

094D DEVANAGARI SIGN VIRAMA
= halant (the preferred Hindi name)
. suppresses inherent vowel

Q. KANNADA VOWEL SIGN I (U+0CBF) and KANNADA VOWEL SIGN E (U+0CC6) seem to have inconsistent character properties. They have General Category Mn and Bidi Class L. However, UAX #9 says that all Me and Mn category characters are Bidi Class NSM. Is this right?

A. Yes. This was an explicit decision by UTC for these characters, to preserve canonical equivalence under the Bidirectional Algorithm for two vowels involving these as parts of decompositions.

The BIDI algorithm is designed to maintain canonical equivalence. Normally all of the combining characters have the BIDI class NSM. A few combining characters would cause problems for canonical equivalence if they were given that class, and are thus given different BIDI classes.
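The properties in question, and the decompositions that motivate them, can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only. The two composite vowel signs shown (U+0CC0 and U+0CC7) are examples picked for this sketch, and the values printed come from whatever Unicode version the interpreter bundles.

```python
import unicodedata

# The two vowel signs discussed in the question.
for cp in (0x0CBF, 0x0CC6):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"gc={unicodedata.category(ch)} "
          f"bidi={unicodedata.bidirectional(ch)}")

# Example vowel signs whose canonical decompositions begin with
# U+0CBF or U+0CC6.
for cp in (0x0CC0, 0x0CC7):
    parts = unicodedata.normalize("NFD", chr(cp))
    decomposition = " ".join(f"U+{ord(p):04X}" for p in parts)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} -> {decomposition}")
```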

Q: How are the Sindhi implosives represented?

A. The characters U+097B DEVANAGARI LETTER GGA, U+097C DEVANAGARI LETTER JJA, U+097E DEVANAGARI LETTER DDDA, and U+097F DEVANAGARI LETTER BBA are used to write Sindhi implosive consonants, starting with Unicode 5.0. Previous versions of the Unicode Standard recommended representing those characters as a combination of the usual consonants with nukta and anudatta, but those combinations are no longer recommended. [EM]
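For reference, the four code points can be listed with their character names using Python's `unicodedata` module. This is a small sketch, and the list name `SINDHI_IMPLOSIVES` is invented for the example.

```python
import unicodedata

# The four Devanagari letters used for Sindhi implosive consonants.
SINDHI_IMPLOSIVES = ["\u097B", "\u097C", "\u097E", "\u097F"]

for ch in SINDHI_IMPLOSIVES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```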