Unicode Frequently Asked Questions

Arabic Script

Q: How are Arabic letters represented in Unicode?

In normal writing, the Arabic script employs the consonantal base letters only and omits the vowels. When vowels are written, combining marks that represent the vowels are applied to the base letter.

As the Arabic script has been adapted for writing new languages, often diacritical marks known as ijam are added to the "skeletal" consonantal letterforms in order to differentiate additional sounds (or letters) as needed. The creation of base letter plus diacritics is an ongoing process at work in language communities today. As new combinations are attested in language communities, the new letterforms are encoded as a unit in Unicode. The ijam diacritical marks are not encoded separately in the Unicode Standard. [LM]
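The difference between vowel marks and ijam can be checked against the character properties. The following is a minimal sketch using Python's standard `unicodedata` module; the sample characters (beh, fatha, and ain with three dots above) and the helper name `describe` are chosen here for illustration and do not come from the text above.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, general category, combining class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

BEH = "\u0628"    # base consonant; its dot is part of the encoded letter
FATHA = "\u064E"  # short vowel, encoded as a combining mark
NGA = "\u06A0"    # ain with three dots above, encoded as a single unit

for ch in (BEH, FATHA, NGA):
    print(describe(ch))

# A vocalized syllable is the base letter followed by the combining vowel mark.
ba = BEH + FATHA
print(len(ba))  # two code points: letter + vowel mark
```

The vowel mark reports a nonspacing-mark category and a nonzero combining class, while both letters, dots included, report as ordinary letters with combining class 0.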

Q: Why aren't Arabic ijam diacritical marks separately encoded?

The reasons for encoding the new letterforms as a unit and not encoding combining diacritical marks separately are historical, stemming from the evolution of the Unicode Standard. Although vowels, Koranic marks, and other pronunciation marks have been encoded as combining marks, the consonantal base letters have consistently been encoded in Unicode as a unit. To change this practice would open the door to multiple representations for the same letters.

The Unicode Standard provides a unique normalized representation for text, even when both precomposed and decomposed forms exist. This model is used for Latin and other scripts. However, to provide stability for the wide range of products that use Unicode, the normalized forms cannot change. For this reason, decomposed characters for Arabic cannot be added without creating duplicate representations, which would cause serious implementation problems, including security issues. Thus, the decision was made to keep Arabic base letterforms represented as indivisible units. [LM]
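The contrast with Latin can be seen by applying canonical decomposition (NFD) to one character from each script. This is an illustrative sketch only, using Python's standard `unicodedata` module; the two sample characters are assumptions picked for the demonstration.

```python
import unicodedata

E_ACUTE = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE (has a canonical decomposition)
NGA = "\u06A0"      # ARABIC LETTER AIN WITH THREE DOTS ABOVE (no decomposition)

latin_nfd = unicodedata.normalize("NFD", E_ACUTE)
arabic_nfd = unicodedata.normalize("NFD", NGA)

# The Latin letter splits into base letter + combining accent.
print([f"U+{ord(c):04X}" for c in latin_nfd])

# The Arabic letter stays a single code point: the dots are not separable.
print([f"U+{ord(c):04X}" for c in arabic_nfd])

# An empty decomposition mapping confirms there is nothing to decompose into.
print(repr(unicodedata.decomposition(NGA)))
```

Because the Arabic letter has no decomposition mapping, there is only one way to represent it, which is the property the stability policy protects.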

Q: Why are Arabic presentation forms encoded?

Arabic presentation forms are encoded for compatibility only, and are not recommended for use in regular Arabic text. Nor are they intended as a guide to the development of appropriate Arabic fonts. Arabic font designers should do whatever is necessary to add the full range of glyphic support to the fonts they develop. See also Presentation Forms.

Q: Unicode includes presentation forms for Arabic, Urdu and Persian letters, but not for letters added for Jawi (Malay written in the Arabic script). Will presentation forms be added for Jawi?

No, they won't. Arabic presentation forms for isolated, medial, initial, and final positional variants were added to the standard primarily for compatibility with some older, legacy character sets that encoded presentation forms directly. That style of text encoding is not encouraged by the Unicode Standard.

Positional variants of Arabic letters are handled by analyzing context when rendering text. Specific glyphs for each position (isolated, medial, initial, and final—or just isolated and final, depending on the letter) need to be defined properly in the font, of course, but no separate character code is required for that.
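To illustrate that the positional forms are compatibility variants of a single letter, the sketch below folds the four presentation forms of beh back to the one base character using compatibility normalization (NFKC). It uses Python's standard `unicodedata` module; the dictionary name `BEH_FORMS` is hypothetical, and NFKC is shown here only as one possible way to clean up legacy data, since it also alters many unrelated compatibility characters.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the character to use in regular text

# Presentation forms of beh from the Arabic Presentation Forms-B block.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

for position, form in BEH_FORMS.items():
    folded = unicodedata.normalize("NFKC", form)
    print(position, f"U+{ord(form):04X}", "->", f"U+{ord(folded):04X}")

# Legacy text stored as presentation forms folds to plain letters;
# choosing the positional glyphs is then left to the rendering system and font.
legacy = BEH_FORMS["initial"] + BEH_FORMS["medial"] + BEH_FORMS["final"]
print(unicodedata.normalize("NFKC", legacy) == BEH * 3)
```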

Q: I'm having trouble identifying the correct Unicode characters for some Jawi letters. Can you help?

Sure. Use U+06A0 for Jawi nga, U+06BD for Jawi nya, U+0762 for Jawi ga, and U+06CF for Jawi vi. Note that U+0762 for ga takes the shaping of the Persian/Urdu gaf (= U+06AF), but with a dot above, instead of a line above the letter skeleton. The letter U+06AC (a kaf with a dot above) is also sometimes used for the Jawi ga, but is not the preferred representation.
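The recommended code points can be collected in a small lookup table and checked against their formal character names. This is a minimal sketch using Python's standard `unicodedata` module; the table name `JAWI_LETTERS` and its keys are hypothetical labels introduced for this example.

```python
import unicodedata

# Preferred code points for the Jawi letters named above.
JAWI_LETTERS = {
    "nga": "\u06A0",
    "nya": "\u06BD",
    "ga": "\u0762",
    "vi": "\u06CF",
}

# Sometimes seen for Jawi ga, but not the preferred representation.
NON_PREFERRED_GA = "\u06AC"

for label, ch in JAWI_LETTERS.items():
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))

print("ga (not preferred)", f"U+{ord(NON_PREFERRED_GA):04X}",
      unicodedata.name(NON_PREFERRED_GA))
```

Printing the formal names makes it easy to see that the preferred ga is built on the keheh skeleton, while U+06AC is built on kaf.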

Q: How do I get signs spanning numbers in Arabic, such as End of Ayah U+06DD ۝, to work properly with digits?

These characters are intended to enclose or hold one or more digits (including European, Arabic-Indic, and Eastern Arabic-Indic digits). Many applications are able to display these properly, just by typing the spanning signs (such as U+06DD end of ayah) before the digit(s).

These signs may not yet be fully supported in all applications. For further suggestions, see: https://software.sil.org/arabicfonts/support/faq/#Ayah
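The character order described above (spanning sign first, then the digits) can be sketched as a small helper. The function name `ayah_marker` and the constant names are hypothetical; the code only builds the character sequence, and whether the sign visually encloses the digits still depends on the font and application.

```python
END_OF_AYAH = "\u06DD"               # ARABIC END OF AYAH
ARABIC_INDIC_ZERO = 0x0660           # ARABIC-INDIC DIGIT ZERO
EXTENDED_ARABIC_INDIC_ZERO = 0x06F0  # EXTENDED ARABIC-INDIC DIGIT ZERO

def ayah_marker(number, zero=ARABIC_INDIC_ZERO):
    """Return the end-of-ayah sign followed by `number` written in the chosen digit set."""
    if number < 0:
        raise ValueError("number must be non-negative")
    # Map each ASCII digit of the number onto the digit block starting at `zero`.
    digits = "".join(chr(zero + int(d)) for d in str(number))
    # The spanning sign comes first in the text, before the digits it encloses.
    return END_OF_AYAH + digits

marker = ayah_marker(12)
print([f"U+{ord(c):04X}" for c in marker])
```

Passing `zero=EXTENDED_ARABIC_INDIC_ZERO` produces the same sequence with Eastern Arabic-Indic digits instead.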

Q: Why aren't Urdu digits separately encoded?

There is some variation in the shapes of Eastern Arabic-Indic Digits for Persian, Sindhi, Urdu, and Kashmiri. The characters affected are U+06F4 EXTENDED ARABIC-INDIC DIGIT FOUR, U+06F6 EXTENDED ARABIC-INDIC DIGIT SIX, and U+06F7 EXTENDED ARABIC-INDIC DIGIT SEVEN. Rather than encoding an entirely separate set of digits for each of these languages, which would complicate numerical processing, the different glyph shapes are simply considered variant forms of the same characters, whose display can be handled by language-specific font mechanisms. The shapes in question are illustrated in Table 9-2 in The Unicode Standard.
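The benefit for numerical processing can be illustrated with a short sketch: because each digit is a single character regardless of its language-specific shape, its numeric value is the same for Persian, Sindhi, Urdu, and Kashmiri text. The example uses Python's standard `unicodedata` module and the built-in `int`; the variable name `VARIANT_DIGITS` is chosen here for illustration.

```python
import unicodedata

# The three digits whose glyph shapes vary by language.
VARIANT_DIGITS = "\u06F4\u06F6\u06F7"

for ch in VARIANT_DIGITS:
    # One code point, one decimal value, whatever shape the font draws.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.decimal(ch))

# Python's int() accepts any Unicode decimal digits, so no per-language
# digit table is needed to parse the number.
value = int(VARIANT_DIGITS)
print(value)
```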