Characters and Combining Marks
Combining Marks and Combining Character Sequences
Q: Does “text element” mean the same as “combining character sequence”?
The concept of text element is the more generic one; it describes any sequence of one or more characters that are treated as a unit by some process. In contrast, a combining character sequence has a more restricted definition. Typically, it is a base character followed by one or more combining characters. A combining character sequence is one type of a text element, but individual characters, words and sentences are also examples of text elements.
Q: Is a combining character sequence ever treated the same as a “character”?
That depends. For a programmer, a Unicode code point represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme: a minimally distinctive unit of writing in the context of a particular writing system.
For example, the letter “å” (whether represented as A + COMBINING RING or precomposed as A WITH RING) is a grapheme in the Danish writing system, while the syllable “क्तु” (represented by the sequence KA + VIRAMA + TA + VOWEL SIGN U) is a single grapheme in the Devanagari writing system. Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes.
Finally, there are a number of other cases where a user would not count “characters” the same way as a programmer would; for example where there are invisible characters such as the RIGHT-TO-LEFT MARK (RLM) used in BIDI, compatibility composites such as “Dz”, “ij”, or Roman numerals, and so on.
Q: Which character should I use for a given diacritical mark?
Normally, you would select the combining character that matches the desired appearance and placement relative to the base character.
Diacritic marks are not encoded by function, and are not specific to language or usage. Take the acute accent, for example. In some languages, it is a diacritic to indicate a distinct letter (with a distinct pronunciation); in other languages it marks a stress, or a quantity; in others it marks a tone. The implications for linguistic processing (including sorting) may be different in each case. In all cases, unless a precomposed character is used, it is encoded as U+0301 COMBINING ACUTE ACCENT. Similarly, the U+0308 COMBINING DIAERESIS may be used for diaeresis, trema, umlaut, as well as other, possibly unrelated uses.
Encoding separate diacritics for each function, when they are otherwise indistinguishable, would lead to confusion. There would be texts that look correct but were typed with the wrong character code. Moreover, splitting the encoding of combining marks would create a problem of interworking with legacy precomposed letters, for which all functions share a single character. Examples are the precomposed letters from the ISO 8859 family of 8-bit character sets widespread in European implementations. In ISO 8859-1 Latin-1 data, the letter “ö” is simply encoded as 0xF6, regardless of whether it is being used in the Dutch word “coördinaten” as an o with trema, or in the German word “böse” as an o-umlaut.
Q: Do I always use U+0323 COMBINING DOT BELOW when I need to put a dot under a character?
Some combining marks are intended for use with a specific script. So, for instance, to write a letter in Hindi with a dot below you would use U+093C DEVANAGARI SIGN NUKTA, and to write a pointed letter in Hebrew with a hiriq dot below you would use U+05B4 HEBREW POINT HIRIQ. In other cases, such as Latin characters with a dot below, you would use U+0323 COMBINING DOT BELOW.
Equivalent Sequences
Q: Why are there no compatibility decompositions defined for characters that seem to suggest them?
Many characters such as the following are “confusables” rather than compatibility characters.
2044 (FRACTION SLASH) → 002F (SOLIDUS)
2010 (HYPHEN) → 002D (HYPHEN-MINUS)
2013 (EN DASH) → 002D (HYPHEN-MINUS)
2014 (EM DASH) → 002D 002D (HYPHEN-MINUS, HYPHEN-MINUS)
They are characters that look similar, but have distinct behavior and generally distinct appearance (whether in length or angle). Consult the Unicode Standard for descriptions of the differences between these characters.
Compatibility characters are mostly particular presentation forms of another character (or sequence of characters), encoded to ensure round-trip conversion to legacy encodings.
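The distinction can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the ligature U+FB01 is an example chosen here for contrast and is not taken from the list above.

```python
import unicodedata

# The "confusables" listed above, plus one genuine compatibility character
# (U+FB01 LATIN SMALL LIGATURE FI), which is an assumption added for contrast.
confusables = ["\u2044", "\u2010", "\u2013", "\u2014"]
ligature_fi = "\uFB01"


def decomposition_of(ch: str) -> str:
    """Return the raw decomposition field for one character ('' if none)."""
    return unicodedata.decomposition(ch)


for ch in confusables:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_of(ch)!r}")
print(f"U+{ord(ligature_fi):04X} {unicodedata.name(ligature_fi)}: "
      f"{decomposition_of(ligature_fi)!r}")

# NFKC leaves a confusable such as EM DASH alone but folds the ligature.
nfkc_dash = unicodedata.normalize("NFKC", "\u2014")
nfkc_fi = unicodedata.normalize("NFKC", ligature_fi)
```

An empty decomposition field means that no normalization form will replace the character, so any folding of dashes or slashes has to be done by a separate, application-level mapping.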
Q: Why don't all the compatibility ideographs have equivalents?
In general, all compatibility ideographs are canonically equivalent to one of the unified ideographs, meaning that the distinction disappears with normalization. (To select the specific graphical appearance of the compatibility ideographs, use variation sequences.)
However, there are 12 ideographs that are not duplicates and should be treated as a small extension of the set of unified ideographs. Therefore they have no canonical equivalents. (They are FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28, and FA29.) They come from industry standards rather than from any of the preexisting national standards and, for historical reasons, did not make it into the main CJK unified ideograph block. [JC]
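As a rough check of the two cases, the sketch below normalizes the 12 listed code points and one ordinary compatibility ideograph. It relies on Python's `unicodedata` module; U+F900 is an example picked here, and the helper name is illustrative.

```python
import unicodedata

# The 12 code points listed above as having no canonical equivalents.
NON_DUPLICATES = [0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
                  0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29]


def has_canonical_equivalent(cp: int) -> bool:
    """True if NFC replaces the code point with a different character."""
    ch = chr(cp)
    return unicodedata.normalize("NFC", ch) != ch


# These survive normalization unchanged.
stable = [cp for cp in NON_DUPLICATES if not has_canonical_equivalent(cp)]

# U+F900 (an example chosen for this sketch) is a true duplicate and is
# replaced by its unified ideograph.
duplicate_example = unicodedata.normalize("NFC", "\uF900")
```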
Q: Do all the Unicode character set mappings cover control codes?
No, the control code mappings are often omitted from the tables on the Unicode site. For the ASCII family of character sets, these are usually one-to-one mappings from the Unicode set based on taking the lower 8 bits of the Unicode character. However, they may differ significantly for other sets, such as EBCDIC.
The correct Unicode mappings for the special graphic characters (01-1F, 7F) of CP437 and other DOS-type code pages are available at https://www.unicode.org/Public/MAPPINGS [JC]
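The contrast between ASCII-family sets and EBCDIC can be illustrated with Python's built-in codecs. This is only a sketch: `latin-1` and `cp037` are the codec names Python happens to ship, used here as stand-ins for "an ASCII-family set" and "an EBCDIC set".

```python
# In an ASCII-family set, each control byte decodes to the code point
# with the same numeric value.
latin1_controls = bytes(range(0x20)).decode("latin-1")
same_value = all(ord(ch) == i for i, ch in enumerate(latin1_controls))

# In EBCDIC (Python's "cp037" codec), HORIZONTAL TAB is byte 0x05,
# so the byte value and the Unicode code point differ.
ebcdic_tab = bytes([0x05]).decode("cp037")
print(same_value, f"U+{ord(ebcdic_tab):04X}")
```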
Q: How are characters counted when measuring the length or position of a character in a string?
Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.
Each of the four approaches is illustrated below with an example string <U+0061, U+0928, U+093F, U+4E9C, U+10083>. The example string consists of the Latin small letter a, followed by the Devanagari syllable "ni" (which is represented by the syllable "na" and the combining vowel character "i"), followed by a common Han ideograph, and finally a Linear B ideogram for an "equid" (horse):
1. Bytes: how many bytes (what the C or C++ programming languages call a char) are used by the in-memory representation of the string; this is relevant for memory or storage allocation and low-level processing.
Here is how the sample appears in bytes for the encodings UTF-8, UTF-16BE, and UTF-32BE:
Encoding | Byte Count | Byte Sequence |
---|---|---|
UTF-8 | 14 | 61 E0 A4 A8 E0 A4 BF E4 BA 9C F0 90 82 83 |
UTF-16BE | 12 | 00 61 09 28 09 3F 4E 9C D8 00 DC 83 |
UTF-32BE | 20 | 00 00 00 61 00 00 09 28 00 00 09 3F 00 00 4E 9C 00 01 00 83 |
2. Code units: how many of the code units used by the character encoding form are in the string; this may be relevant, for example, when declaring the size of a character array or locating the character position in a string. It often represents the "length" of the string in APIs.
Here is how the sample appears in code units for the encodings UTF-8, UTF-16, and UTF-32:
Encoding | Code Unit Count | Code Unit Sequence |
---|---|---|
UTF-8 | 14 | 61 E0 A4 A8 E0 A4 BF E4 BA 9C F0 90 82 83 |
UTF-16 | 6 | 0061 0928 093F 4E9C D800 DC83 |
UTF-32 | 5 | 00000061 00000928 0000093F 00004E9C 00010083 |
3. Code points: how many Unicode code points (that is, encoded characters) are in the string. The sample consists of 5 code points (U+0061, U+0928, U+093F, U+4E9C, U+10083), regardless of character encoding form. Note that this is equivalent to the UTF-32 code unit count.
4. Grapheme clusters: how many of what end users might consider "characters". In this example, the Devanagari syllable "ni" must be composed using a base character "na" (न) followed by a combining vowel for the "i" sound ( ि), although end users see and think of the combination of the two "नि" as a single unit of text. In this sense, the example string can be thought of as containing 4 “characters” as end users see them. A default grapheme cluster is specified in UAX #29, Unicode Text Segmentation, as well as in UTS #18, Unicode Regular Expressions.
The choice of which count to use and when depends on the use of the value, as well as the tradeoffs between efficiency and comprehension. For example, Java, Windows, and ICU use UTF-16 code unit counts for low-level string operations, but also supply higher level APIs for counting bytes, characters, or denoting boundaries between grapheme clusters, when circumstances require them. An application might use these to, say, limit user input based on a number of "screen positions" using the user-perceived "character" (grapheme cluster) count. Or the application might have an internal limit based on storage allocation in a database field counted in bytes. This approach allows for efficient low-level processing, with allowance for higher-level usage. However, for a very high-level application, such as word-processing macros, grapheme clusters alone may be sufficient.
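The four counts for the sample string can be reproduced with a short sketch. Python's standard library has no UAX #29 segmenter, so the last function is a deliberately crude approximation (it skips every code point whose General_Category starts with M) that happens to give the right answer for this sample; a real application would use a proper grapheme cluster implementation.

```python
import unicodedata

sample = "\u0061\u0928\u093F\u4E9C\U00010083"

# 1. Bytes, per encoding scheme.
byte_counts = {enc: len(sample.encode(enc))
               for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# 2. Code units: bytes divided by the code unit size of each encoding form.
code_units = {
    "utf-8": byte_counts["utf-8"],
    "utf-16": byte_counts["utf-16-be"] // 2,
    "utf-32": byte_counts["utf-32-be"] // 4,
}

# 3. Code points: a Python str is indexed by code point.
code_points = len(sample)


# 4. Grapheme clusters, approximated (NOT the UAX #29 algorithm).
def approx_grapheme_count(text: str) -> int:
    """Count code points that are not marks (General_Category M*)."""
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))


print(byte_counts, code_units, code_points, approx_grapheme_count(sample))
```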
Q: What should I expect for canonically equivalent sequences?
Canonical equivalence essentially means that a Unicode-conformant process should not insist that two canonically equivalent sequences be treated differently in a way that implies a difference in character meaning. Thus, it would be non-conformant for Process A to hand Process B a <00C1> meaning a precomposed A-acute, for Process B to acknowledge that it got <0041 0301>, meaning an A-acute represented as a sequence, and then for Process A to insist that Process B is non-conformant. That insistence would be non-conformant, since Process B was within its rights, by virtue of canonical equivalence, to choose the combining sequence as representing A-acute. (See also "Q: Are there cases where a Unicode-conformant process may treat two canonically equivalent sequences differently in any way?")
Q: Are there cases where a Unicode-conformant process may treat two canonically equivalent sequences differently in any way?
Such cases exist, and they are easiest to understand with an example:
The single character sequence <00C1> A WITH ACUTE and the sequence <0041 0301> A + COMBINING ACUTE are canonically equivalent sequences. However, that doesn't mean that “no Unicode-conformant process is allowed to treat them differently in any way.” A Unicode-conformant process might not interpret combining marks, in which case it would interpret <0041 0301> as a sequence of <0041> plus an uninterpreted character. This would be different from its interpretation of <00C1> as an interpreted character.
There is no guarantee that any process will interpret all canonically equivalent sequences, but if it does, the expectation is that they are interpreted as having the same meaning as characters. However, there are processes that deal with other aspects of data, and for those, even a process that interprets all canonically equivalent sequences will necessarily apply some differences. For example, a Unicode-conformant process allocating a buffer for character storage clearly has to treat <00C1> and <0041 0301> differently, since the amount of storage required differs.
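Both sides of this point can be seen in a few lines. The sketch below compares the two sequences after normalization (one common way to test canonical equivalence) and then measures their storage in UTF-8; the helper name is illustrative.

```python
import unicodedata

precomposed = "\u00C1"        # A WITH ACUTE
decomposed = "\u0041\u0301"   # A + COMBINING ACUTE


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Same meaning as characters...
equivalent = canonically_equivalent(precomposed, decomposed)

# ...but different storage requirements (UTF-8 byte counts).
storage = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
print(equivalent, storage)
```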
Precomposed Characters vs. Precomposed Glyphs
Q: My language needs a precomposed character, but only the base character and accent are available in Unicode.
See Where Is My Character? and the question “My language needs the digraph ‘xy’. Why is it not encoded as a single character?”
Q: Unicode doesn't contain the character I need, which is a Latin letter with a certain diacritical mark. Can you add it?
Unicode can already express almost anything you will ever need in any field of study by using a combination of Latin, IPA, or other base letters with the various combining diacritical marks. For example, if you need a highly specialized character such as “Z with stroke, cedilla, and umlaut”, you can get this combination by using three existing character codes in combination:
U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
U+0327 COMBINING CEDILLA
U+0308 COMBINING DIAERESIS
With appropriate rendering software, that sequence should produce a glyph combination like this: Ƶ̧̈
Even if the combination is not available in a particular font, it is unambiguous and Unicode conformant systems should transmit and retain the sequence without distortion, and it may be processed programmatically.
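As a sketch of what "processed programmatically" can mean, the sequence can be built, inspected, and normalized with Python's `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

# Z WITH STROKE + COMBINING CEDILLA + COMBINING DIAERESIS
z_sequence = "\u01B5\u0327\u0308"
names = [unicodedata.name(ch) for ch in z_sequence]

# No precomposed form exists, so NFC leaves the sequence as it is.
nfc = unicodedata.normalize("NFC", z_sequence)

# If the marks are entered in the other order, canonical ordering puts the
# cedilla (below the base) before the diaeresis (above the base).
reordered = unicodedata.normalize("NFC", "\u01B5\u0308\u0327")
print(names, nfc == z_sequence, reordered == z_sequence)
```

Because normalization yields the same sequence either way, the two input orders compare equal after normalization even though no single code point exists for the combination.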
The Navajo-specific question below is also applicable to a wide variety of similar cases.
Q: Unicode doesn't contain some of the precomposed characters needed for Navajo and other indigenous languages of the Americas. Will you add them?
The way to encode the various Navajo letters with diacritics is with the use of combining marks. For example, Navajo high-toned nasalized vowels:
a + ogonek + acute = <U+0061, U+0328, U+0301> ( ą́ )
and so on for the other vowels.
U+0328 is the combining ogonek, and U+0301 is the combining acute accent. (Navajo orthography uses the ogonek, which is the hook to the right, for nasalization; that is not the same as the cedilla, which is the hook to the left. See the difference between U+0119 e-ogonek, and U+0229 e-cedilla.)
In Unicode Normalization Form C, the a and the ogonek would be replaced by the single code for a-ogonek, producing:
a + ogonek + acute → a-ogonek + acute = <U+0105, U+0301> ( ą́ )
i + ogonek + acute → i-ogonek + acute = <U+012F, U+0301> ( į́ )
For display and printing, these combinations should just show the whole letters, with both accents placed properly. Most modern browsers and operating systems do that automatically and correctly for you, as shown for the actual character sequences in parentheses in the examples above.
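The NFC results quoted above can be reproduced with a minimal sketch using Python's `unicodedata` module; the helper name is illustrative.

```python
import unicodedata

a_seq = "\u0061\u0328\u0301"   # a + ogonek + acute
i_seq = "\u0069\u0328\u0301"   # i + ogonek + acute


def nfc_code_points(text: str) -> list:
    """Normalize to NFC and list the resulting code points as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


print(nfc_code_points(a_seq))
print(nfc_code_points(i_seq))

# Entering the acute before the ogonek gives the same NFC result, because
# canonical ordering moves the ogonek ahead of the acute first.
swapped = nfc_code_points("\u0061\u0301\u0328")
```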
See also the web page Where Is My Character?
Q: When I use a combining sequence, why does it not display correctly?
I'm trying to display “X” with a circumflex using this sequence: <U+0058, U+0302>. But it doesn't display correctly. The circumflex comes out misplaced, not properly over the “X”.
Your problem is most likely a limitation of the layout engine and/or font you are using. The real question is what repertoire of base+accent combinations your layout engine and fonts are supporting for display. Fonts that properly support a repertoire with the combination you need should have the correct display.
If the font doesn't support the repertoire, you can end up with various glitches in display. Exactly how things appear in that case will depend on internal details regarding how the font may handle combining marks.
To compare the possible displays of sequences with those that could have resulted if X-circumflex had been encoded as a precomposed character, see the following table.
 | Precomposed Character | Combining Character Sequence |
---|---|---|
Font Supports Repertoire | [image] | [image] |
Font Does Not Support Repertoire | [image] | [image] |
Some fonts, such as the Doulos and Charis fonts, which are freely available for download, contain large repertoires of appropriate precomposed glyphs for use by linguists and writers of minority languages. Try checking out those fonts to see if they might cover your repertoire needs. See also DisplayProblems.
Q: Just how hard is it for a font designer to support a sequence like X+circumflex, compared to supporting a precomposed character?
With modern font technologies, such as OpenType and AAT, the difference is relatively small. For example, in OpenType, it is a matter of adding an entry for the sequence in a ligature table, such as is discussed in the VOLT and InDesign Tutorial. There is no fundamental need for a precomposed character to be encoded in the standard at all in order for the font to have and display the correct precomposed glyph for the combination you need.
The hard work, in either case, is in the design for the precomposed glyph. Conceptually it seems simple enough to add a precomposed glyph to a font — after all, typically the base glyph will be in the font already. But professional font design requires considerable effort. Any time a new accented glyph is added, attention must be paid to design integrity compared to other accented glyphs, kerning issues with all other glyphs, and the possible need for yet other ligatures. Most of this work then has to be repeated for each face of the font: bold, italics, smallcaps, and their combinations. The amount of work for testing the font is multiplied many fold, because not only does the new glyph need testing by itself, but also in interaction with the other glyphs in the font. This is the fundamental reason why commercial fonts are relatively slow to adopt large new collections of precomposed glyphs into their supported repertoires.
Q: Is there a way for font designers to provide flexible support for arbitrary accented combinations?
Yes, many modern fonts support dynamic positioning of diacritical marks using aligning anchors on base and mark glyphs or similar mechanisms. For example, such mechanisms are defined in the OpenType font specification, and many fonts in Windows 7 and later versions have this feature. Other systems, such as Mac OS X, can provide such dynamic display even in the absence of explicit font support.
Q: Why are new combinations of Latin letters with diacritical marks not suitable for addition to Unicode?
There are several reasons. First, Unicode encodes many diacritical marks, and the combinations can already be produced, as noted in the answers to some questions above. If precomposed equivalents were added, the number of multiple spellings would be increased, and decompositions would need to be defined and maintained for them, adding to the complexity of existing decomposition tables in implementations.
Finally, normalization form NFC (the composed form favored for use on the Web) is frozen—no new letter combinations can be added to it. Therefore, the normalized NFC representation of any new precomposed letters would still use decomposed sequences, which can already be expressed by combining character sequences in Unicode. Nothing would be gained by adding the letter with diacritical mark as a precomposed character; on the contrary, adding such a letter would add one or more multiple spellings to be reckoned with, incrementally complicating all Unicode implementations for no net gain.
Combining Grapheme Joiner and Multiple Base Characters
Q: Is U+034F COMBINING GRAPHEME JOINER a combining mark?
The CGJ is not a format control character, but rather a combining mark. It has the General_Category value gc=Mn and the Canonical_Combining_Class value ccc=0. The presence of a COMBINING GRAPHEME JOINER in the midst of a combining character sequence does not interrupt the combining character sequence.
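These property values can be read directly from the character database, for example with Python's `unicodedata` module (a sketch; the variable names are illustrative):

```python
import unicodedata

CGJ = "\u034F"

cgj_name = unicodedata.name(CGJ)
cgj_category = unicodedata.category(CGJ)   # General_Category
cgj_ccc = unicodedata.combining(CGJ)       # Canonical_Combining_Class
print(cgj_name, cgj_category, cgj_ccc)
```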
Q: Does U+034F COMBINING GRAPHEME JOINER affect display of combining character sequences?
The CGJ affects neither cursive joining nor ligation (in contrast to U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER), and it has no visible display of its own. Of course, as for any such character in the Unicode Standard with no visible display, it is always possible to use a visible glyph when deliberately showing hidden characters, as in a text editor's Show Symbol or Show Hidden mode.
Q: Does U+034F COMBINING GRAPHEME JOINER join graphemes?
No. Despite its name, the COMBINING GRAPHEME JOINER neither joins graphemes together in the way punctuation might, nor creates new graphemes by combining other characters. In particular, it cannot be used to construct grapheme clusters out of arbitrary character sequences, or to extend the scope of subsequent combining characters. It has no impact on line breaking, except that, as for other combining marks, it should be kept with its base when breaking a line.
Q: What is the function of U+034F COMBINING GRAPHEME JOINER?
It has two functions: it affects the collation of adjacent characters for purposes of language-sensitive collation, searching, and matching, and it distinguishes sequences that would otherwise be canonically equivalent.
In collation, the primary function is to prevent contractions from forming. Thus, for example, while “ch” is sorted as a single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will sort as a “c” followed by an “h”. This usage requires no tailoring of either the COMBINING GRAPHEME JOINER or the sequence. (It is possible to give sequences of characters which include the COMBINING GRAPHEME JOINER special tailored weights; however, such an application of CGJ is not recommended.)
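The blocking effect can be pictured with a toy sort key. This is not the Unicode Collation Algorithm and not real Slovak tailoring data: the alphabet, the `toy_sort_key` function, and the word list are all assumptions made for this sketch, which handles only the letters a to z plus the contraction "ch".

```python
import string

CGJ = "\u034F"

# Hypothetical alphabet for the sketch: "ch" is one letter, placed after "h".
units = list(string.ascii_lowercase)
units.insert(units.index("h") + 1, "ch")
WEIGHT = {unit: rank for rank, unit in enumerate(units)}


def toy_sort_key(word: str) -> list:
    """Build a list of weights, treating "ch" as a single unit."""
    key = []
    text = word.lower()
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            i += 1                      # CGJ itself carries no weight
        elif text.startswith("ch", i):
            key.append(WEIGHT["ch"])    # contraction: one unit
            i += 2
        else:
            key.append(WEIGHT.get(text[i], len(WEIGHT)))
            i += 1
    return key


words = ["chata", "hrad", "cena", "c" + CGJ + "hata"]
ordered = sorted(words, key=toy_sort_key)
```

With the CGJ between "c" and "h", the scanner never sees the two letters as adjacent, so the word sorts among the "c" entries instead of after "h".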
Second, the insertion of a COMBINING GRAPHEME JOINER into a sequence of combining marks will block reordering of those combining marks when canonical ordering is applied. This can be used in some unusual circumstances where two sequences of combining marks need to be distinguished, but where the different sequences would be neutralized by normalization. For example, the sequence of Hebrew points <hiriq, patah> can be distinguished from the sequence <patah, hiriq> by inserting a COMBINING GRAPHEME JOINER: <patah, CGJ, hiriq>. The presence of the CGJ would prevent reordering of that sequence to <hiriq, patah>, thus enabling a reliable distinction to be maintained. Such usage will also cause differences in collation for the affected sequences.
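The Hebrew example can be tried out with Python's `unicodedata` module. In this sketch the base letter BET is an arbitrary choice made here; the points and the CGJ are the ones named above.

```python
import unicodedata

HIRIQ, PATAH, CGJ = "\u05B4", "\u05B7", "\u034F"
BET = "\u05D1"  # arbitrary base letter, chosen only for this sketch

# The combining classes determine the canonical order of the two points.
classes = (unicodedata.combining(HIRIQ), unicodedata.combining(PATAH))

plain = BET + PATAH + HIRIQ
guarded = BET + PATAH + CGJ + HIRIQ

# Without CGJ, canonical ordering swaps the points to <hiriq, patah>.
plain_nfd = unicodedata.normalize("NFD", plain)

# With CGJ (ccc=0) in between, the original order is preserved.
guarded_nfd = unicodedata.normalize("NFD", guarded)
print(classes, plain_nfd == BET + HIRIQ + PATAH, guarded_nfd == guarded)
```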
Q: Unicode doesn't seem to distinguish between tréma and umlaut, but I need to distinguish. What shall I do?
For some purposes, it may be necessary to maintain a distinction between tréma and umlaut, for example, in bibliographic records kept by the German library network. For the Latin script, the Unicode Standard does not distinguish identically appearing diacritical marks with different functions. Doing so would result in confusion in implementations and among users. (See also “Q: Which character should I use for a given diacritical mark?”)
The character U+034F COMBINING GRAPHEME JOINER (CGJ) may be used to make the relevant sorting, searching, and data mapping distinctions required for umlaut versus tréma. The semantics of CGJ are such that it should impact only searching and sorting, for systems which have been tailored to distinguish it, while being otherwise ignored in interpretation. The CGJ character was encoded with this purpose in mind.
The sequences <a, umlaut> and <a, CGJ, umlaut> are not canonically equivalent. This means that the distinction will not be normalized away on conversion in and out of bibliographic systems. This eases the interoperability problem. Both sequences will display as they should.
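A short sketch shows that normalization keeps the two sequences apart. It uses Python's `unicodedata` module, with U+0308 COMBINING DIAERESIS standing for the "umlaut" in the sequences above.

```python
import unicodedata

CGJ = "\u034F"
umlaut_a = "a\u0308"            # <a, umlaut>
trema_a = "a" + CGJ + "\u0308"  # <a, CGJ, umlaut>

# The plain sequence composes to the precomposed letter...
nfc_umlaut = unicodedata.normalize("NFC", umlaut_a)

# ...while the CGJ sequence passes through NFC unchanged.
nfc_trema = unicodedata.normalize("NFC", trema_a)
print(nfc_umlaut == "\u00E4", nfc_trema == trema_a)
```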
Implementations which need to distinguish the two for searching and sorting may systematically maintain weighting distinctions. <a, umlaut> = <ä> can be treated as equivalent to <a, e> for sorting purposes, while the tréma <a, CGJ, umlaut> can be weighted as a secondary variant of <a> thus resulting in the desired behavior for such systems. Existing collations which do not distinguish tréma and umlaut in their data will continue to work exactly as they currently do, since in default collation tables CGJ is ignored in weighting.
Existing collation, searching, and matching based on the Unicode Collation Algorithm will continue to behave as originally specified: they will not distinguish tréma and umlaut in German data. Only collation tables that add new weights for the sequence <CGJ, umlaut> will distinguish between that and a plain umlaut.
Q: Is it possible to apply a diacritic or combining enclosing mark to a sequence of more than one (non-combining) character?
No, with the exception of the “double diacritics” deliberately designed to be applied to a two-letter sequence, e.g. U+035D COMBINING DOUBLE BREVE. Neither ZWJ (U+200D ZERO WIDTH JOINER) nor CGJ (U+034F COMBINING GRAPHEME JOINER) “glues” characters together in a way that would affect the scope of any following combining character. To enclose a character sequence like “Esc” in something like the U+20E3 COMBINING ENCLOSING KEYCAP, you must resort to higher-level protocols. [KP]
Egyptological Yod and Using @ as a Letter
Q: What sequences of characters should I use for the Egyptological yod, which appears as an italic i with a half-ring diacritic above it?
As of Unicode 12.0, a dedicated character is encoded for the Egyptological yod: U+A7BD LATIN SMALL LETTER GLOTTAL I (with its uppercase counterpart, U+A7BC LATIN CAPITAL LETTER GLOTTAL I). This is an atomic character — not decomposable. It is documented as the preferred usage for Egyptological yod. Fonts which support it should provide proper italic forms for display.
Earlier versions of the Unicode Standard recommended representation of Egyptological yod by means of a sequence of U+0069 LATIN SMALL LETTER I followed by one of three possible diacritics: U+0313 COMBINING COMMA ABOVE, U+0357 COMBINING RIGHT HALF RING ABOVE, or U+0486 COMBINING CYRILLIC PSILI PNEUMATA. However, appropriate shaping of those sequences, particularly when using italic style, has not generally been well supported in fonts. Disagreement among Egyptologists as to which of those diacritics was semantically correct for this sequence also contributed to a lack of interoperability. Now, with U+A7BD available, that atomic character is the preferred choice for the Latin transliteration used for Egyptology.
Q: What should I do if I encounter Egyptological data containing the older sequences?
If continued small anomalies for display, especially in italicized text, are not a concern for you, then it is safe to just leave the sequences in the data as they are. For optimal display and for printing, it may be preferable to convert such sequences to the new character, U+A7BD LATIN SMALL LETTER GLOTTAL I, once this character is supported in the fonts you use. In any case, when processing Egyptological transliteration data, it is advisable to be aware of the various possible sequences which might be in use, so that appropriate equivalences can be made for searching and matching operations. Note that none of the older sequences would be automatically normalized to the new character for Egyptological yod.
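Since normalization will not do the conversion, it has to be an explicit step. The following is a minimal sketch, assuming the data is known to be Egyptological transliteration (elsewhere, i followed by one of these marks may mean something else) and covering only the lowercase sequences described above; `modernize_yod` is a hypothetical helper name.

```python
import re

GLOTTAL_I = "\uA7BD"  # LATIN SMALL LETTER GLOTTAL I

# i followed by COMBINING COMMA ABOVE, COMBINING RIGHT HALF RING ABOVE,
# or COMBINING CYRILLIC PSILI PNEUMATA.
OLD_YOD = re.compile("\u0069[\u0313\u0357\u0486]")


def modernize_yod(text: str) -> str:
    """Replace each older yod sequence with the atomic character."""
    return OLD_YOD.sub(GLOTTAL_I, text)


converted = [modernize_yod("i" + mark + "t")
             for mark in ("\u0313", "\u0357", "\u0486")]
```

The same function can serve as a folding step for searching and matching: applying it to both the query and the text makes all four spellings compare equal.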
Q: Are there other Unicode characters with issues similar to Egyptological yod?
Yes. Similar transliteration conventions also occur in Ugaritic studies, but affect the letters a and u, as well as the letter i. To cover those conventions, the Unicode Standard has also encoded atomic characters with these glottal diacritics: U+A7BB LATIN SMALL LETTER GLOTTAL A and U+A7BF LATIN SMALL LETTER GLOTTAL U, as well as their uppercase equivalents. The behavior and display of those characters, also typically used in italic style, are similar to that of the Egyptological yod.
Q: I am digitizing textual materials for a language whose script contains a small letter "@", as well as a capitalized version, depicted as the letter "A" with a circle around it. Which Unicode characters should I use to represent these?
The Unicode Standard does not contain a small letter character for "@", apart from the widely used "at" sign symbol itself, U+0040 COMMERCIAL AT. Nor does it contain a capitalized letter corresponding to the "at" sign symbol. The UTC has declined to encode separate letters for these or to create a case pairing for the existing "at" sign symbol, because of the potential for confusion and/or spoofing involving the "@"—a very common syntax character in email and many other functions.
Such language material could be represented by using the existing circled letter symbols, U+24D0 ⓐ CIRCLED LATIN SMALL LETTER A and U+24B6 Ⓐ CIRCLED LATIN CAPITAL LETTER A. These have the advantage of being already encoded and widely available in fonts. Additionally, those two symbols already form a case pair in the standard, which means that case mapping and other casing operations (including case-insensitive searching) involving the digitized material should work correctly. Although the default glyphs for the small circled a and capital circled a in most fonts might not have the optimal appearance, fonts can be adjusted for special purposes such as publication, to produce the desired appearances of the characters.
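The case pairing can be observed with ordinary string operations. This sketch uses Python's built-in case methods; the sample word is made up for illustration.

```python
small = "\u24D0"     # CIRCLED LATIN SMALL LETTER A
capital = "\u24B6"   # CIRCLED LATIN CAPITAL LETTER A

upper_of_small = small.upper()
lower_of_capital = capital.lower()

# Case-insensitive comparison of a made-up word that starts with the letter.
matches = (capital + "bc").casefold() == (small + "BC").casefold()
print(upper_of_small == capital, lower_of_capital == small, matches)
```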