Unicode Frequently Asked Questions

Middle Eastern Scripts and Languages

Q: Why are Arabic presentation forms encoded?

A: Arabic presentation forms are encoded for compatibility only, and are not recommended for use in regular Arabic text. Nor are they intended as a guide to the development of appropriate Arabic fonts. Arabic font designers should do whatever is necessary to add the full range of glyphic support to the fonts they develop.

Q: Can one use the Arabic presentation forms in a data file?

A: This is strongly discouraged, because it does not guarantee data integrity or interoperability. Data files should include only the Arabic letters in the Arabic block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F). See also Ligatures and Digraphs. [MK]
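As an illustration of checking a data file for such characters, here is a minimal Python sketch. The function names are hypothetical, and folding with NFKC is an assumption made for this example rather than a procedure the answer above prescribes. The block ranges are also a coarse filter: Arabic Presentation Forms-A contains a few symbols and noncharacters that are not positional forms.

```python
import unicodedata

# Arabic Presentation Forms-A and Forms-B. U+FEFF (the byte order mark) sits at
# the end of the Forms-B block but is not an Arabic character, so it is excluded.
PRESENTATION_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFE)]


def is_presentation_form(ch: str) -> bool:
    """Return True if ch lies in one of the presentation-form blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES)


def find_presentation_forms(text: str) -> list:
    """List (index, code point label) for every flagged character."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text) if is_presentation_form(ch)]


def fold_presentation_forms(text: str) -> str:
    """Apply NFKC only to flagged characters, leaving all other text untouched."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )
```

For example, fold_presentation_forms("\uFEFB") yields the two letters U+0644 U+0627 (lam, alef), because U+FEFB is a compatibility ligature of those letters.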

Q: How are Arabic letters represented in Unicode?

A: In normal writing, the Arabic script employs the consonantal base letters only and omits the vowels. When vowels are written, combining marks that represent the vowels are applied to the base letter.
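The following Python sketch shows this base-plus-mark model using the standard unicodedata module. The helper strip_marks is a hypothetical name, and it removes every nonspacing mark (general category Mn), not only vowel signs, so treat it as an illustration rather than a complete devocalization routine.

```python
import unicodedata

BEH = "\u0628"    # ARABIC LETTER BEH, a consonantal base letter
FATHA = "\u064E"  # ARABIC FATHA, a combining vowel mark

# A vocalized syllable is stored as the base letter followed by the mark.
vocalized = BEH + FATHA


def strip_marks(text: str) -> str:
    """Drop nonspacing marks (category Mn), leaving only the base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

Here vocalized has length 2 (one base letter, one combining mark), and strip_marks(vocalized) returns the bare letter beh.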

As the Arabic script has been adapted for writing new languages, modifier marks have often been added to the "skeletal" consonantal letterforms to differentiate additional sounds (or letters) as needed. The creation of base letter plus modifier combinations is an ongoing process at work in language communities today. As new combinations are attested in language communities, the new letterforms are encoded as a unit in Unicode. The modifier forms themselves are not encoded separately in the Unicode Standard. [LM]

Q: Why aren't Arabic combining modifier letters separately encoded?

A: The reasons for encoding the new letterforms as a unit, rather than encoding combining modifier forms separately, are historical and stem from the evolution of the Unicode Standard. While vowels, Koranic marks, and other pronunciation marks have been encoded as combining marks, the consonantal base letters have consistently been encoded in Unicode as a unit. To change this practice would open the door to multiple representations for the same letters.

The Unicode Standard provides a unique normalized representation for text, even when both precomposed and decomposed forms exist. This model is used for Latin and other scripts. However, to provide stability for the wide range of products that use Unicode, the normalized forms cannot change. For this reason, decomposed characters for Arabic cannot be added without creating duplicate representations, which would cause serious implementation problems, including security issues. Thus, the decision was made to keep Arabic base letterforms represented as indivisible units. [LM]
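The contrast with Latin can be observed with Python's unicodedata module. This is a small sketch for a few specific letters only; canonical_decomposition is a hypothetical helper name, and the sketch makes no claim about every character in the Arabic blocks.

```python
import unicodedata


def canonical_decomposition(ch: str) -> str:
    """Return the canonical decomposition mapping of ch, or "" if it has none.

    Mappings that start with a tag such as <initial> are compatibility
    mappings, not canonical ones, so they are reported as empty here.
    """
    mapping = unicodedata.decomposition(ch)
    return "" if mapping.startswith("<") else mapping


E_ACUTE = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
GAF = "\u06AF"      # ARABIC LETTER GAF (kaf skeleton with a line above)
PEH = "\u067E"      # ARABIC LETTER PEH (beh skeleton with three dots below)

# Latin: precomposed and decomposed forms both exist, and normalization
# maps between them.
latin_mapping = canonical_decomposition(E_ACUTE)   # "0065 0301"
latin_nfd = unicodedata.normalize("NFD", E_ACUTE)  # "e" + U+0301

# Arabic letters formed from a skeleton plus modifier: encoded as a unit,
# so normalization leaves them unchanged.
gaf_mapping = canonical_decomposition(GAF)         # ""
gaf_nfd = unicodedata.normalize("NFD", GAF)        # still U+06AF
```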

Q: Unicode includes presentation forms for Arabic, Urdu and Persian letters, but not for letters added for Jawi (Malay written in the Arabic script). Will presentation forms be added for Jawi?

A: No, they won't. Arabic presentation forms for isolated, medial, initial, and final positional variants were added to the standard primarily for compatibility with some older, legacy character sets that encoded presentation forms directly. That style of text encoding is not encouraged by the Unicode Standard. Instead, all Arabic text (including Jawi) should be represented using the Arabic letters in the Arabic block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F).

Positional variants of Arabic letters are handled by analyzing context when rendering text. Specific glyphs for each position (isolated, medial, initial, and final—or just isolated and final, depending on the letter) need to be defined properly in the font, of course, but no separate character code is required for that.
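To see that the positional variants are all the same underlying letter, one can inspect the compatibility mappings of the legacy presentation forms. The Python sketch below does this for the letter beh; describe is a hypothetical helper, and it assumes each input has a single-letter mapping of the form "<tag> XXXX".

```python
import unicodedata

# The four legacy presentation forms of beh, in code point order.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(ch: str) -> tuple:
    """Return (position tag, base letter) from the compatibility mapping of ch."""
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))


for form in BEH_FORMS:
    position, base = describe(form)
    print(f"U+{ord(form):04X} is the {position} form of U+{ord(base):04X}")
```

All four forms map to the single character U+0628. In text that follows the recommended model, only U+0628 is stored, and the rendering system selects the glyph for each position.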

Q: I'm having trouble identifying the correct Unicode characters for some Jawi letters. Can you help?

A: Sure. Use U+06A0 for Jawi nga, U+06BD for Jawi nya, U+0762 for Jawi ga, and U+06CF for Jawi va. Note that U+0762 for ga takes the shaping of the Persian/Urdu gaf (= U+06AF), but with a dot above the letter skeleton instead of a line. The letter U+06AC (a kaf with a dot above) is also sometimes used for the Jawi ga, but is not the preferred representation.
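The code points above can be checked against their formal character names with a short Python sketch. The helper names are hypothetical, and prefer_jawi_ga is an assumption made for illustration: it should only be applied to text known to be Jawi, since U+06AC may be intended as a distinct letter elsewhere.

```python
import unicodedata

# The four Jawi code points recommended above, in the order listed.
JAWI_CODE_POINTS = [0x06A0, 0x06BD, 0x0762, 0x06CF]

KAF_WITH_DOT_ABOVE = "\u06AC"  # sometimes used for Jawi ga, not preferred
PREFERRED_GA = "\u0762"        # preferred representation of Jawi ga


def describe(cp: int) -> str:
    """Format a code point together with its Unicode character name."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


def prefer_jawi_ga(text: str) -> str:
    """Replace U+06AC with U+0762 (intended for Jawi text only)."""
    return text.replace(KAF_WITH_DOT_ABOVE, PREFERRED_GA)


for cp in JAWI_CODE_POINTS:
    print(describe(cp))
```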