Ligatures, Digraphs and Presentation Forms

Q: What's the difference between a "Digraph" and a "Ligature"?

A: Digraphs and ligatures are both made by combining two glyphs. In a digraph, the glyphs remain separate but are placed close together. In a ligature, the glyphs are fused into a single glyph. [JC]

Q: I can't find the digraph "IE" in Unicode. Where do I look?

A: Look at "Where is my character?"

Q: Why are "my" presentation forms NOT included in Unicode?

A: The Unicode Standard encodes characters, and it is the function of rendering systems to select presentation forms as needed to render those characters. Thus there is no need to encode presentation forms. [EM]

Q: Is it necessary to use the presentation forms that are defined in Unicode?

A: No, it is not necessary to use those presentation forms. Those forms were selected and identified in the early days of developing Unicode when sophisticated rendering engines were not prevalent. A selected subset of the presentation forms was included to provide users with a simple method to generate them. [MK]

Q: Can one use the presentation forms in a data file?

A: Doing so is strongly discouraged, because it does not guarantee data integrity and interoperability. In the particular case of Arabic, data files should include only the characters in the Arabic block, U+0600 to U+06FF. [MK]
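As a rough illustration of checking a data file for this, the sketch below uses Python's standard `unicodedata` module to flag Arabic presentation forms and fold them back to characters in the Arabic block. The helper names `find_presentation_forms` and `to_nominal_forms` are hypothetical, and the sketch assumes that NFKC normalization is acceptable for the data in question (NFKC also folds other compatibility characters, which may not always be wanted).

```python
import unicodedata

# Compatibility tags that the Unicode Character Database uses for
# positional Arabic presentation forms.
POSITIONAL_TAGS = ("<initial>", "<medial>", "<final>", "<isolated>")

def find_presentation_forms(text):
    """Return (index, 'U+XXXX') for each positional presentation form (hypothetical helper)."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.decomposition(ch).startswith(POSITIONAL_TAGS):
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits

def to_nominal_forms(text):
    """Fold presentation forms to nominal characters via NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)

# LAM INITIAL FORM followed by ALEF FINAL FORM, as a legacy file might store them.
legacy = "\uFEDF\uFE8E"
print(find_presentation_forms(legacy))
print([f"U+{ord(c):04X}" for c in to_nominal_forms(legacy)])
```

After folding, the same text is stored as U+0644 followed by U+0627, and the rendering system is left to choose the contextual shapes.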

Q: My language needs the digraph "xy". That digraph is distinctly different from "x" + "y" and is treated as a unit in my language. Given all the available space in the Latin Extended area, why not just encode that digraph as another character? If my "xy" digraph were represented by a unique symbol, it would certainly be included with all the other letters from my language.

[Editor's note: "xy" is being used as a stand-in for particular digraphs from particular languages; this question, or something very similar to it, has been asked recently, for instance, about "ch" in Slovak, about "ng" in Tagalog, and about "ie" in Maltese.]

A: If the digraph "xy" were some strange symbol, then it would indeed be new; there would not be any existing way to represent it using already encoded Unicode characters. But it is not a strange symbol - it is just the digraph "xy", and there is already a way to represent it in Unicode: <U+0078, U+0079>.

While it may seem that there is a lot of available space for Latin letters in the Unicode Standard, and the upper- and lowercase versions of the digraph "xy" only constitute a couple of characters, in reality what's at stake here are not just a couple of characters, but hundreds. This is a recurrent pattern. There is a steady flow of requests for Latin digraphs or precomposed base form + diacritic letters for various languages. The reason is always some variant of "In my language 'xy' is a unit and not a sequence; it has its own behavior, and so should be encoded separately."

It isn't just a matter of the standards overhead faced by the Unicode Technical Committee in dealing with all these encoding proposals for letters that can already be represented by sequences of existing encoded characters. There are deeper issues pertaining to the implications for existing implementations and existing data. If a new digraph "xy" is added, that implies the addition of a new compatibility decomposition of "xy" to "x" "y" in the data tables. And that means people will have to revise their software to handle it. Then there is the fact that people will not represent data consistently; some will use the new digraph character and some will not - you can count on that. Existing data will not magically update itself to make use of the new digraph. Because of these considerations and others, there will be situations in which it will be necessary to represent data using the decomposed form anyway - as for example when passing around normalized data on the Internet. So the addition of a digraph character has a fairly substantial (and costly) set of consequences, in return for a minimal set of benefits. The net effect is generally negative, rather than positive. Multiply that hundreds of times over for all of the other digraphs and precomposed diacritic-marked letters that exist for other languages (and with perhaps a couple of thousand languages in the world currently written with the Latin script, there are lots), and you can see why the Unicode Technical Committee does not favor heading down this path.
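The inconsistency described above can already be observed with a digraph character that exists in Unicode for legacy compatibility, U+01C9 LATIN SMALL LETTER LJ. The following is a minimal Python sketch using the standard `unicodedata` module; the sample word and the helper name `same_text` are illustrative assumptions, not anything prescribed by the standard.

```python
import unicodedata

# The same word stored two ways: with the single digraph character
# U+01C9, and with the plain sequence <U+006C, U+006A>.
with_digraph = "\u01C9ubav"
with_sequence = "ljubav"

# The digraph character carries a compatibility decomposition to "l" "j".
print(unicodedata.decomposition("\u01C9"))

# A naive comparison treats the two spellings as different data.
print(with_digraph == with_sequence)

def same_text(a, b):
    """Compare two strings after NFKC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(same_text(with_digraph, with_sequence))
```

Every piece of software that searches, compares, or indexes such text needs a step like `same_text`, which is the kind of cost that each newly encoded digraph would add.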

At this point, the UTC has a default position: no new characters for digraphs or precomposed diacritic letters should be accepted for encoding as individual characters. If a convincing enough case can be presented, there may always be exceptions to that default position. To be convincing, the line of reasoning would have to be along the lines of: There are demonstrable processing issues in the writing system for this language that cannot adequately be dealt with using the existing encoded characters, but which could be resolved by the addition of this new character ("xy", or whatever). But the arguments have to be very convincing, and other approaches to dealing with the perceived problem have to be explored and shown to be inadequate. For example, citation of a different sorting order for "xy" in a language is not very convincing, because well-known collation techniques are used to handle sorting of digraphic sequences in various languages; for sorting, the alternative approaches available for using weights for sequences of letters are preferable to having a separately encoded digraph, because those approaches are more general and extensible. [PC] & [KW]
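To make the point about sorting concrete, here is a toy Python sketch of assigning a weight to a sequence of letters. It assumes a simplified alphabet in which "ch" sorts as a single unit after "h" (as in Slovak); the names `WEIGHT` and `sort_key` are hypothetical, and a real implementation would more likely rely on a full collation library with language-specific tailorings rather than a hand-built table.

```python
import string

# Simplified, assumed alphabet: the 26 basic Latin letters with the
# digraph "ch" inserted as its own sorting unit right after "h".
units = list(string.ascii_lowercase)
units.insert(units.index("h") + 1, "ch")
WEIGHT = {unit: i for i, unit in enumerate(units)}

def sort_key(word):
    """Map a word to a list of weights, treating "ch" as one unit."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        unit = "ch" if word.startswith("ch", i) else word[i]
        # Characters outside the toy alphabet sort after everything else.
        key.append(WEIGHT.get(unit, len(WEIGHT)))
        i += len(unit)
    return key

words = ["chata", "cena", "hora", "ihla", "dom"]
print(sorted(words))                # code point order: "chata" lands among the c-words
print(sorted(words, key=sort_key))  # tailored order: "chata" sorts after "hora"
```

The data itself stays as the ordinary sequence <c, h>; only the weighting table changes, and the same mechanism extends to other digraphs or other languages.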

Q: I have here a bunch of manuscripts which use the "hr" ligature (for example) extensively. I see you have encoded ligatures for "fi", "fl", and even "st", but not "hr". Can I get "hr" encoded as a ligature too?

A: The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.

Ligaturing is a behavior encoded in fonts: if a modern font is asked to display "h" followed by "r", and the font has an "hr" ligature in it, it can display the ligature. Some fonts have no ligatures, some (especially for non-Latin scripts) have hundreds. It does not make sense to assign Unicode code points to all these font-specific possibilities. [JC]
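The compatibility status of the existing ligature characters can be inspected with a short Python sketch using the standard `unicodedata` module. The sample string is an illustrative assumption; it shows why text stored with U+FB01 LATIN SMALL LIGATURE FI does not match the ordinary spelling until it is normalized.

```python
import unicodedata

# "final" stored with the compatibility ligature U+FB01 in place of "f" + "i".
stored = "\uFB01nal"

print(unicodedata.name("\uFB01"))
print(unicodedata.decomposition("\uFB01"))

# A plain search for the letters "fi" misses the ligature character.
print("fi" in stored)

# NFKC normalization replaces the ligature with the character sequence.
normalized = unicodedata.normalize("NFKC", stored)
print(normalized == "final")
```

Storing the sequence <f, i> and leaving ligation to the font avoids the mismatch altogether.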

Q: What about the "ct" ligature? Is there a character for that in Unicode?

A: No, the "ct" ligature is another example of a ligature of Latin letters commonly seen in older type styles. As in the case of the "hr" ligature, display of a ligature is a matter for font design, and does not require separate encoding of a character for the ligature. One simply represents the character sequence <c, t> in Unicode and depends on font design and font attribute controls to determine whether the result is ligated in display (or in printing). The same situation applies for ligatures involving long s and many others found in Latin typefaces.

Remember that the Unicode Standard is a character encoding standard, and is not intended to standardize ligatures or other presentation forms, or any other aspects of the details of font and glyph design. The ligatures which you can find in the Unicode Standard are compatibility encodings only—and are not meant to set a precedent requiring the encoding of all ligatures as characters. [KW]

Q: What are all those duplicated math alphabet characters FOR? Wouldn't it have made more sense to simply have introduced a few new combining characters in plane 0, such as: "make bold", "make italic", "make script", "make fraktur", "make double-struck", "make sans serif", "make monospace" and "make tag". This would not only have achieved the same effect (and with the same space requirements too, at least for things like "bold uppercase A" in UTF-16), but with much greater flexibility (in that you could also make other characters bold too, and you could create combinations of the attributes not currently represented).

A: It would have provided too much flexibility, and would have tempted people to use such characters to create "poor man's markup" schemes rather than using proper markup such as SGML/HTML/XML. The mathematical letters and digits are meant to be used only in mathematics, where the distinction between a plain and a bold letter is fundamentally semantic rather than stylistic. [JC]
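For readers who want to see how one of these characters is represented, the following Python sketch inspects U+1D400 MATHEMATICAL BOLD CAPITAL A with the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

bold_a = "\U0001D400"

# The character has its own name and a <font> compatibility mapping to "A".
name = unicodedata.name(bold_a)
decomp = unicodedata.decomposition(bold_a)

# It lies outside plane 0, so UTF-16 needs a surrogate pair (two code units).
utf16_units = len(bold_a.encode("utf-16-le")) // 2

# Compatibility normalization discards the mathematical distinction.
folded = unicodedata.normalize("NFKC", bold_a)

print(name, decomp, utf16_units, folded)
```

Because NFKC folds the bold letter to a plain "A", mathematical text that depends on the distinction should not be passed through compatibility normalization.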

Q: Why doesn't Unicode have a full set of superscripts and subscripts?

A: Unicode includes true superscripted Latin characters for round-trip compatibility with other standards. Unicode also includes other characters which look like and are typographically derived from superscripted Latin or Greek characters, such as U+02B0 MODIFIER LETTER SMALL H. Despite their appearance, these are not true superscripts and should not be used as such. The situation is the same for subscripts.

Unicode considers true superscripts and subscripts to be a matter of rich text formatting and, as such, out of the standard's scope. [JJ]
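The two kinds of characters mentioned above can be compared with a small Python sketch using the standard `unicodedata` module. U+00B2 SUPERSCRIPT TWO is used here only as a convenient example of a compatibility superscript; the choice of examples and the helper name `describe` are assumptions made for illustration.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, decomposition) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )

print(describe("\u00B2"))  # a compatibility superscript digit
print(describe("\u02B0"))  # a modifier letter used as a letter in its own right

# Compatibility normalization drops the superscript styling entirely,
# so "x" followed by U+00B2 becomes the two plain characters "x2".
print(unicodedata.normalize("NFKC", "x\u00B2"))
```

The loss of the superscript under NFKC is one reason general superscripting is better carried by markup or rich text than by individual characters.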

Q: What is the difference between "rich text" and "plain text"?

A: Rich text is text with all its formatting information: typeface, point size, weight, kerning, and so on. Plain text is the underlying content stream to which formatting is applied.

One key distinction between the two is that rich text breaks the text up into runs and applies uniform formatting to each run. As such, rich text is inherently stateful. Plain text is not stateful. It should be possible to lose the first half of a block of plain text without any impact on rendering.

Unicode, by design, only deals with plain text. It doesn't provide a generalized solution to rich text issues. [JJ]

Q: I'm reading a book which uses italic text to mean something distinct from roman text. Doesn't that mean that italics should be encoded in Unicode?

A: No. It's common for specific formatting to be used to convey some of the semantic content (the meaning) of a text. Unicode is not intended to reproduce the complete semantic content of all texts, but merely to provide plain text support required by minimum legibility for all languages. [JJ]

Q: What does "minimum legibility" mean?

A: Minimum legibility refers to the minimum amount of information necessary to provide legible text for a given language and nothing more. Minimally legible text can have a wide range of default formatting applied by the rendering system and remain recognizably text belonging to a certain language as generally written. [JJ]