Characters and Combining Marks
- Does “text element” mean the same as “combining character sequence”?
- So is a combining character sequence the same as a “character”?
- I would think that certain characters would have
compatibility decompositions. Why don't they?
- Do all the compatibility ideographs have equivalents?
- Do all the Unicode character set mappings cover control codes?
- Is the POSIX ctype.h model sufficient for Unicode?
- How are characters counted when measuring the length or position of a character in a string?
- Doesn't canonical equivalence mean that no
Unicode-conformant process can treat canonically equivalent sequences
differently in any way?
- My language needs a precomposed character, but only
the base character and accent are available in Unicode.
- I can't find the diacritical mark I need, but
Unicode contains one that looks the same but has a different function.
Can you add the one I need?
- Do I always use U+0323 COMBINING DOT BELOW when I need to put
a dot under a character?
- Unicode doesn't contain the character I need, which
is a Latin letter with a certain diacritical mark. Can you add it?
- Unicode doesn't contain some of the precomposed
characters needed for Navajo and other indigenous languages of the
Americas. Will you add them?
- Yes, I can represent (for example) X with
circumflex by use of X with a combining circumflex: <U+0058,
U+0302>. But it doesn't display correctly. The circumflex comes out
misplaced, not properly over the “X”.
- Just how hard is it for a font designer to support a sequence
like X+circumflex, compared to supporting a precomposed character?
- Is there a way for font designers to provide
flexible support for arbitrary accented combinations?
- Why are new combinations of Latin letters with
diacritical marks not suitable for addition to Unicode?
- Is U+034F COMBINING GRAPHEME JOINER a combining mark?
- Does U+034F COMBINING GRAPHEME JOINER affect display of combining character sequences?
- Does U+034F COMBINING GRAPHEME JOINER join graphemes?
- What is the function of U+034F COMBINING GRAPHEME JOINER?
- Unicode doesn't seem to distinguish between tréma and umlaut, but I need to distinguish. What shall I do?
- Is it possible to apply a diacritic or combining enclosing mark to a sequence of more than one (non-combining) character?
- I’ve looked through the Unicode code charts and don’t see the combination I need for transliterating Egyptological yod, which is made up of the letter i (/I) plus a half-ring-looking diacritic above it. What should I do?
- Should I expect issues when using this approach for representation of Egyptological yod?
- Should I use the combining diacritic over U+0131 LATIN SMALL LETTER DOTLESS I, or should
I use U+0069 LATIN SMALL LETTER I?
- I am digitizing textual materials for a language whose script contains a small letter "@", as well as a capitalized version, depicted as the letter "A" with a circle around it. Which Unicode characters should I use to represent these?
Q: Does “text element” mean the same as
“combining character sequence”?
A: No, this is a common misperception. A text element
just means any sequence of characters that are treated as a unit by some
process. A combining character sequence is a base
character followed by any number of combining characters. It is one type
of text element, but words and sentences are also examples of
text elements.
Q: So is a combining character sequence the
same as a “character”?
A: That depends. For a programmer, a Unicode code value
represents a single character (for exceptions, see below). For an end
user, it may not. The better word for what end-users think of as
characters is grapheme (as defined in the Unicode glossary): a minimally
distinctive unit of writing in the context of a particular writing
system. For example, å (A + COMBINING RING or A-RING) is a grapheme
in the Danish writing system, while KA + VIRAMA + TA + VOWEL SIGN U is
one in the Devanagari writing system. Graphemes are not necessarily
combining character sequences, and combining character sequences are not
necessarily graphemes. Moreover, there are a number of other cases where
a user would not count “characters” the same way as a programmer would:
where there are invisible characters such as the RLM used in BIDI,
compatibility composites such as “Dz”, “ij”, or Roman numerals, and so
on.
Q: I would think that certain
characters would have compatibility
decompositions. Why don't they?
A: Many characters such as the following are “confusables”
rather than compatibility characters.
2044 (FRACTION SLASH) → 002F (SOLIDUS)
2010 (HYPHEN) → 002D (HYPHEN-MINUS)
2013 (EN DASH) → 002D (HYPHEN-MINUS)
2014 (EM DASH) → 002D 002D (HYPHEN-MINUS, HYPHEN-MINUS)
They are characters that look
similar, but have distinct behavior and generally distinct appearance
(whether in length or angle). Consult the Unicode Standard for descriptions of the
differences between these characters.
Compatibility characters are really just particular
presentation forms of another character (or sequence of characters),
encoded to ensure round-trip conversion to legacy encodings.
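The distinction can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `decomposition_field` and the two contrast characters (the “fi” ligature and “Dz”) are choices made for this illustration, not part of the FAQ.

```python
import unicodedata

# The four "confusable" characters listed above.
confusables = ["\u2044", "\u2010", "\u2013", "\u2014"]
# Two genuine compatibility characters, added here for contrast.
compat_chars = ["\uFB01", "\u01F2"]  # LATIN SMALL LIGATURE FI, "Dz"

def decomposition_field(ch):
    """Return the raw decomposition mapping recorded for ch ('' if none)."""
    return unicodedata.decomposition(ch)

for ch in confusables + compat_chars:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: {decomposition_field(ch)!r}")

# Compatibility normalization leaves the confusables untouched,
# but replaces the compatibility characters.
em_dash_nfkc = unicodedata.normalize("NFKC", "\u2014")
ligature_nfkc = unicodedata.normalize("NFKC", "\uFB01")
```

The confusables report an empty decomposition, so NFKC does not fold them into their ASCII look-alikes, whereas the ligature is replaced by the two letters it stands for.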
Q: Do all the compatibility ideographs have equivalents?
A: No, the ideographs FA0E, FA0F, FA11, FA13, FA14, FA1F,
FA21, FA23, FA24, FA27, FA28, and FA29 have no canonical equivalents.
These 12 characters are not duplicates and should be treated as a small
extension of the set of unified ideographs. In fact, they are derived
from industry standards, but are not duplicates of anything. They didn't
make it into the main Unihan block because they aren't in any
preexisting national standard.
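As an illustration, the twelve code points can be checked with Python's `unicodedata` module. This is a sketch only; the helper `has_canonical_decomposition` is a name invented here, and U+F900 is used as an arbitrary example of a compatibility ideograph that does have a canonical equivalent.

```python
import unicodedata

# The twelve code points named in the answer above.
NO_EQUIVALENT = [0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
                 0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29]

def has_canonical_decomposition(cp):
    """True if the code point has a decomposition mapping without a <tag>."""
    mapping = unicodedata.decomposition(chr(cp))
    return bool(mapping) and not mapping.startswith("<")

standalone = [cp for cp in NO_EQUIVALENT if not has_canonical_decomposition(cp)]
print(len(standalone), "of", len(NO_EQUIVALENT), "have no canonical equivalent")

# For contrast: U+F900 maps canonically to a unified ideograph.
print("U+F900 ->", unicodedata.decomposition("\uF900"))
```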
Q: Do all the Unicode character
set mappings cover control codes?
A: No, the control code mappings are often omitted from the
tables on the Unicode site. For the ASCII family of character sets,
these are usually one-to-one mappings from the Unicode set based on
taking the lower 8 bits of the Unicode character. However, they may
differ significantly for other sets, such as EBCDIC.
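A quick way to see the difference is to decode the control bytes with two codecs. The sketch below uses Python's built-in `latin-1` and `cp037` (an EBCDIC code page) codecs; these are Python's own tables, used here only as an example, and are not the mapping files on the Unicode site.

```python
# The 32 control byte values 0x00..0x1F.
control_bytes = bytes(range(0x20))

# ASCII family: each control byte maps to the code point with the same value.
latin1_text = control_bytes.decode("latin-1")
ascii_is_identity = all(ord(ch) == b for ch, b in zip(latin1_text, control_bytes))

# EBCDIC: many control bytes map to a different code point.
ebcdic_text = control_bytes.decode("cp037")
ebcdic_moved = [b for ch, b in zip(ebcdic_text, control_bytes) if ord(ch) != b]

print("latin-1 identity:", ascii_is_identity)
print("cp037 bytes that do not map to themselves:",
      [f"0x{b:02X}" for b in ebcdic_moved])

# Example: LINE FEED is byte 0x0A in ASCII but byte 0x25 in this EBCDIC page.
ebcdic_line_feed = b"\x25".decode("cp037")
```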
The correct Unicode mappings for the special graphic
characters (01-1F, 7F) of CP437 and other DOS-type code pages are
Q: Is the POSIX ctype.h model sufficient for Unicode?
A: POSIX “ctype.h” knows but two cases, whereas Unicode
knows three. In POSIX, only European Arabic digits can pass “isdigit”,
whereas Unicode has many sets of digits, all putatively equal in their
status as digits. In POSIX
“ctype.h”, that which is “alnum” but not “alpha” must be a “digit”, but
Unicode is aware that not all numbers are digits, nor are all letters
alphabetic. Unicode groks spacing and non-spacing marks, but POSIX
comprehends them not.
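These points can be made concrete with a few sample characters. The sketch below uses Python's `unicodedata` module and `str` predicates, which are driven by Unicode properties; they are stand-ins for illustration and are not the POSIX functions themselves. The sample characters are chosen here as examples.

```python
import unicodedata

samples = {
    "titlecase letter": "\u01C5",   # D WITH SMALL LETTER Z WITH CARON
    "Devanagari digit": "\u0967",   # DEVANAGARI DIGIT ONE
    "Roman numeral": "\u2167",      # ROMAN NUMERAL EIGHT
    "non-spacing mark": "\u0301",   # COMBINING ACUTE ACCENT
    "spacing mark": "\u093F",       # DEVANAGARI VOWEL SIGN I
}

for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} category={unicodedata.category(ch)}")

# A third case besides upper and lower: titlecase (general category Lt).
third_case = unicodedata.category(samples["titlecase letter"])

# A digit outside 0-9 that still counts as a digit.
devanagari_is_digit = samples["Devanagari digit"].isdigit()

# "alnum" but neither "alpha" nor "digit": a number that is not a digit.
roman = samples["Roman numeral"]
roman_flags = (roman.isalnum(), roman.isalpha(), roman.isdigit())
```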
Q: How are characters counted when measuring the length or position of a character in a string?
A: Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.
Each of the four approaches is illustrated below with an example string <U+0061, U+0928, U+093F, U+4E9C, U+10083>. The example string consists
of the Latin small letter a, followed by the Devanagari syllable "ni" (which is represented by the syllable "na" and the combining vowel character "i"), followed by a common Han ideograph, and finally a Linear B ideogram for an "equid" (horse):
1. Bytes: how many bytes (what the C or C++ programming languages call a
char) are used by the in-memory representation
of the string; this is relevant for memory or storage allocation and low-level processing.
Here is how the sample appears in bytes for the encodings UTF-8, UTF-16BE, and UTF-32BE:
Encoding || Byte Count || Byte Sequence
UTF-8 || 14 || 61 E0 A4 A8 E0 A4 BF E4 BA 9C F0 90 82 83
UTF-16BE || 12 || 00 61 09 28 09 3F 4E 9C D8 00 DC 83
UTF-32BE || 20 || 00 00 00 61 00 00 09 28 00 00 09 3F 00 00 4E 9C 00 01 00 83
2. Code units: how many of the code units used by the character encoding form are in the string; this
may be relevant, for example, when declaring the size of a character array or locating the character position in a string. It often represents the "length" of the string in APIs.
Here is how the sample appears in code units for the encodings UTF-8, UTF-16, and UTF-32:
Encoding || Code Unit Count || Code Unit Sequence
UTF-8 || 14 || 61 E0 A4 A8 E0 A4 BF E4 BA 9C F0 90 82 83
UTF-16 || 6 || 0061 0928 093F 4E9C D800 DC83
UTF-32 || 5 || 00000061 00000928 0000093F 00004E9C 00010083
3. Code points: how many Unicode code points (that is, how many encoded characters) are in the string. The sample consists of 5 code points (U+0061, U+0928, U+093F, U+4E9C, U+10083), regardless of character encoding form. Note that this is equivalent to the UTF-32 code unit count.
4. Grapheme clusters: how many of what end users might consider "characters". In this example, the Devanagari syllable "ni" must be composed using a base character "na" (न) followed by a combining vowel for the "i" sound ( ि), although end users see and think of
the combination of the two "नि" as a single unit of text. In this sense,
the example string can be thought of as containing 4 “characters” as end users see them.
A default grapheme cluster is specified in
UAX #29, Unicode Text Segmentation,
as well as in UTS #18,
Unicode Regular Expressions.
The choice of which count to use and when depends on the use of the value, as well as the tradeoffs
between efficiency and comprehension. For example, Java, Windows, and ICU
use UTF-16 code unit counts for low-level string operations, but also
supply higher level APIs for counting bytes, characters, or denoting
boundaries between grapheme clusters, when
circumstances require them. An application might use these to, say, limit user input based on a number of "screen positions" using the user-perceived "character" (grapheme cluster) count. Or the application might have an internal limit based on storage allocation in a database field counted in bytes. This approach allows for efficient low-level
processing, with allowance for higher-level usage. However, for a very
high-level application, such as word-processing macros, grapheme clusters alone
may be sufficient.
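The four counts for the sample string can be reproduced with a short script. This is a minimal sketch in Python; the function names are invented for this example, and the grapheme count is a deliberately rough approximation (it skips code points whose general category is a mark) rather than an implementation of the UAX #29 rules.

```python
import unicodedata

# The example string: a, NA + VOWEL SIGN I, a Han ideograph, Linear B equid.
sample = "\u0061\u0928\u093F\u4E9C\U00010083"

def count_bytes(s, encoding):
    """Bytes used by the encoded form, e.g. "utf-8" or "utf-16-be"."""
    return len(s.encode(encoding))

def count_code_units(s, encoding, unit_size):
    """Code units = encoded bytes divided by the code unit size in bytes."""
    return len(s.encode(encoding)) // unit_size

def count_code_points(s):
    """Python 3 strings are sequences of code points, so len() counts them."""
    return len(s)

def approx_grapheme_count(s):
    """Rough approximation only: count code points that are not marks (M*).
    Real segmentation should follow UAX #29 instead."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

for enc, size in (("utf-8", 1), ("utf-16-be", 2), ("utf-32-be", 4)):
    print(enc, "bytes:", count_bytes(sample, enc),
          "code units:", count_code_units(sample, enc, size))
print("code points:", count_code_points(sample))
print("approximate grapheme clusters:", approx_grapheme_count(sample))
```

The approximation happens to give the right answer for this string, but it would miscount many other cases (Hangul jamo sequences, emoji sequences, CRLF), which is why a library implementing UAX #29 is the safer choice in production code.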
Q: Doesn't canonical equivalence mean that
no Unicode-conformant process can treat canonically equivalent sequences
differently in any way?
A: No. That is too strong a statement about canonical
equivalence. Let's take a look at a simple example:
<00C1> a-acute and the sequence <0041 0301> a+combining
acute are canonically equivalent sequences. However, that doesn't mean that “no Unicode-conformant process should
treat them differently in any way.” A Unicode-conformant process could
declare that it does not interpret combining marks, in which case, for
it, <0041 0301> is a sequence of <0041> plus an uninterpreted character.
And trivially, a Unicode-conformant process allocating a buffer for
character storage clearly has to treat <00C1> and <0041 0301>
differently, since the amount of storage required differs.
Canonical equivalence is supposed to mean that if a
Unicode-conformant process interprets all the code points involved in
the canonical equivalence, it should not insist on an interpretive
difference in the two as constituting some kind of character meaning
difference. Thus, what is non-conformant would be for Process A to hand
Process B <00C1>, i.e. a-acute, for Process B to acknowledge that it got
<0041 0301>, i.e. a-acute, and then for Process A to insist that Process
B is non-conformant. That insistence would itself be non-conformant,
since Process B was within its rights, by virtue of canonical
equivalence.
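The example can be sketched in Python with the standard `unicodedata` module. The helper `canonically_equivalent` is a name made up for this illustration; it compares the NFD forms of two strings, which is one common way to test canonical equivalence.

```python
import unicodedata

precomposed = "\u00C1"        # <00C1>
decomposed = "\u0041\u0301"   # <0041 0301>

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Equivalent in interpretation...
print("canonically equivalent:", canonically_equivalent(precomposed, decomposed))

# ...yet still different as code point sequences and in storage size.
print("identical code points:", precomposed == decomposed)
print("UTF-8 bytes:", len(precomposed.encode("utf-8")),
      "vs", len(decomposed.encode("utf-8")))
```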
Q: My language needs a precomposed
character, but only the base character and accent are available in
Unicode.
A: See Where Is My Character? and the question
My language needs the digraph “xy”. Why is it not encoded as a single character?
Q: I can't find the diacritical mark I
need. Unicode contains one that looks the same but has a different
function. Can you add the one I need?
A: Diacritic marks are not encoded by function, and are not
specific to language or usage. For example, look at the acute accent. In some languages, it is a diacritic to indicate a distinct
letter (with a distinct pronunciation); in other languages it marks a
stress, or a quantity; in others it marks a tone. The implications for linguistic
processing (including sorting) may be different in each case. Similarly,
the U+0308 COMBINING DIAERESIS is to be used for diaeresis, trema,
umlaut, as well as other, possibly unrelated uses.
Encoding separate diacritics for each function would have led to
confusion as to which was which in each instance, to user inability to
choose and enter the correct forms, and similar problems. Moreover, if
each function had been encoded, we would have had a legacy problem with
interworking with precomposed letters, as for the ISO 8859 family of
8-bit character sets widespread in European implementations. The letter
“ö” is simply encoded as 0xF6 in ISO 8859-1 Latin-1 data, regardless of
whether it is being used (in Dutch) as a trema, or (in German, e.g.,
böse) as an umlaut.
Q: Do I always use U+0323 COMBINING DOT BELOW when I need to put a dot under a character?
A: Some combining marks are intended for use with a specific script.
So, for instance, to write a letter in Hindi with a dot below you would use U+093C DEVANAGARI SIGN NUKTA, and to write a pointed letter in Hebrew with a hiriq dot below you would use U+05B4 HEBREW POINT HIRIQ.
In other cases, such as Latin characters with a dot below, you would use
U+0323 COMBINING DOT BELOW.
Q: Unicode doesn't contain the character I
need, which is a Latin letter with a certain diacritical mark. Can you
add it?
A: Unicode can already express almost anything you will
ever need in any field of study by using a combination of Latin, IPA, or
other base letters with the various combining diacritical marks. For
example, if you need a highly specialized character such as “Z with
stroke, cedilla, and umlaut”, you can get this combination by using
three existing character codes in combination:
U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
U+0327 COMBINING CEDILLA
U+0308 COMBINING DIAERESIS
With appropriate rendering software, that sequence should
produce a glyph combination like this:
Even if the combination is not available in a particular
font, it is unambiguous and Unicode conformant systems should transmit
and retain the sequence without distortion, and it may be processed
The Navajo-specific question below is also applicable to a wide variety
of similar cases.
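As a small illustration of the “Z with stroke, cedilla, and umlaut” sequence, the three characters can be looked up by name and concatenated. This is a sketch using Python's `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

# Build the sequence from the three character names given in the answer.
names = [
    "LATIN CAPITAL LETTER Z WITH STROKE",
    "COMBINING CEDILLA",
    "COMBINING DIAERESIS",
]
sequence = "".join(unicodedata.lookup(name) for name in names)

print([f"U+{ord(ch):04X}" for ch in sequence])

# No precomposed character exists for this combination, so normalization
# to NFC keeps the three code points exactly as they are.
nfc_form = unicodedata.normalize("NFC", sequence)
print("unchanged by NFC:", nfc_form == sequence)
```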
Q: Unicode doesn't contain some of the
precomposed characters needed for Navajo and other indigenous languages
of the Americas. Will you add them?
A: The way to encode the various Navajo letters with
diacritics is with the use of combining marks. For example, Navajo
high-toned nasalized vowels:
a + ogonek + acute = <U+0061, U+0328, U+0301>
and so on for the other vowels.
U+0328 is the combining ogonek, and U+0301 is the combining
acute accent. (Navajo orthography uses the ogonek, which is the hook to
the right, for nasalization; that is not the same as the cedilla, which
is the hook to the left. See the difference between U+0119 e-ogonek, and
U+0229 e-cedilla.)
In Unicode Normalization Form C, the a and the ogonek would
be replaced by the single code for a-ogonek, producing:
a + ogonek + acute → a-ogonek + acute = <U+0105, U+0301>
i + ogonek + acute → i-ogonek + acute = <U+012F, U+0301>
For display and printing, these combinations should just
show the whole letters, with both accents placed properly. Up-to-date
Microsoft Windows systems (for example) will do that automatically and
correctly for you.
See also the web page
Where Is My Character?
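The NFC behavior described for the Navajo vowels can be reproduced with Python's `unicodedata` module. This is a minimal sketch; the helper name `nfc` is just shorthand introduced here.

```python
import unicodedata

def nfc(s):
    """Normalize a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

# High-toned nasalized vowels typed as base + ogonek + acute.
a_typed = "a\u0328\u0301"
i_typed = "i\u0328\u0301"

a_nfc = nfc(a_typed)   # a-ogonek followed by combining acute
i_nfc = nfc(i_typed)   # i-ogonek followed by combining acute
print([f"U+{ord(ch):04X}" for ch in a_nfc])
print([f"U+{ord(ch):04X}" for ch in i_nfc])

# Typing the marks in the other order normalizes to the same result,
# because canonical ordering places the ogonek before the acute.
a_other_order = nfc("a\u0301\u0328")
```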
Q: Yes, I can represent (for example) X
with circumflex by use of X with a combining circumflex: <U+0058,
U+0302>. But it doesn't display correctly. The circumflex comes out
misplaced, not properly over the “X”.
A: Your problem is most likely a limitation of the layout
engine and/or font you are using. The real question is
what repertoire of base+accent combinations your
layout engine and fonts are supporting for display. Fonts that
properly support a repertoire with the combination you
need should have the correct display.
If the font doesn't support the repertoire, you can
end up with various glitches in display. Exactly how things
appear in that case will depend on internal details regarding how the font
may handle combining marks.
To compare the possible displays of sequences with those that could have
resulted if X-circumflex had been encoded as a precomposed character, see
the following table.
Some fonts, such as the
fonts, which are freely available for download, contain large
repertoires of appropriate precomposed glyphs for use by linguists and writers
of minority languages. Try checking out those fonts to see if they might
cover your repertoire needs. See also
Q: Just how hard is it for a font designer to
support a sequence like X+circumflex, compared to supporting a
precomposed character?
A: With modern font technologies, such as OpenType and AAT, the
difference is relatively small. For example, in OpenType, it is a matter
of adding an entry for the sequence in a ligature table, such as is
discussed in the
VOLT and InDesign Tutorial.
There is no fundamental need for a precomposed character to be encoded
in the standard at all in order for the font to have and display
the correct precomposed glyph for the combination.
The hard work, in either case, is in the design for the
precomposed glyph. Conceptually it seems simple enough to add a
precomposed glyph to a font — after all, typically the base glyph will
be in the font already. But professional font design requires considerable
effort. Any time a new accented glyph is added, attention
must be paid to design integrity compared to other accented glyphs,
kerning issues with all other glyphs, and the possible need for yet
other ligatures. Most of this work then has to be repeated for each
face of the font: bold, italics, smallcaps, and their combinations.
The amount of work for testing the font is multiplied many fold, because
not only does the new glyph need testing by
itself, but also in interaction with the other glyphs in the font.
This is the fundamental reason why commercial fonts are relatively
slow to adopt large new collections of precomposed glyphs into
their supported repertoires.
Q: Is there a way for font designers to
provide flexible support for arbitrary accented combinations?
A: Yes, in some cases, modern fonts support
anchors, which enable pretty good dynamic display even of
completely arbitrary base+accent combinations. For example, in
Windows Vista, the core fonts (Arial, Tahoma, Times New Roman,
and a few others) already have this feature.
Other systems, such as Mac OS X, can provide
such dynamic display even in the absence of explicit font support.
Q: Why are new combinations of Latin
letters with diacritical marks not
suitable for addition to Unicode?
A: There are several reasons. First, Unicode encodes many
diacritical marks, and the combinations can already be produced, as
noted in the answers to some questions above. If precomposed equivalents
were added, the number of multiple spellings would be increased, and
decompositions would need to be defined and maintained for them, adding
to the complexity of existing decomposition tables in implementations.
Finally, normalization form NFC (the composed form favored
for use on the Web) is frozen—no new letter combinations can be added
to it. Therefore, the normalized NFC representation of any new
precomposed letters would still use decomposed sequences, which can
already be expressed by combining character sequences in Unicode.
Nothing would be gained by adding the letter with diacritical mark as a
precomposed character; on the contrary, adding such a letter would add
one or more multiple spellings to be reckoned with, incrementally
complicating all Unicode implementations for no net gain.
Q: Is U+034F COMBINING GRAPHEME JOINER a combining mark?
A: Yes. It is not a format control character, but rather a combining mark. It has the General Category value gc=Mn
and the canonical combining class value ccc=0. The presence of a combining grapheme joiner in the midst of a combining character
sequence does not interrupt the combining character sequence.
Q: Does U+034F COMBINING GRAPHEME JOINER affect display of combining character sequences?
A: No. It does not impact cursive joining or ligation (contrast U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER).
And the CGJ does not have any visible display of its own. Of course, as for any such character in the Unicode Standard with no visible
display, it is always possible to use a visible glyph when deliberately showing hidden characters, as for an editor's Show Symbol or Show Hidden mode.
Q: Does U+034F COMBINING GRAPHEME JOINER join graphemes?
A: No. Despite its name, the combining grapheme joiner neither joins graphemes together in the way punctuation might,
nor does it create new graphemes by combinations of other characters. Especially, it cannot be used to
construct grapheme clusters out of arbitrary character sequences,
or extend the scope of subsequent combining characters. It has no impact on line breaking, except that as for other
combining marks, it should be kept with its base when breaking a line.
Q: What is the function of U+034F COMBINING GRAPHEME JOINER?
A: It has several functions: it is used to affect the collation of adjacent characters for purposes of language-sensitive collation,
searching, and matching, and it is used to distinguish sequences that would otherwise be canonically equivalent.
First, in collation, its primary function is to prevent contractions from forming. Thus, for example, while “ch” is sorted as a single unit in
a tailored Slovak collation, the sequence <c, CGJ, h> will sort as a “c” followed by an “h”. This usage requires no tailoring of either the combining
grapheme joiner or the sequence. (It is possible to give sequences of characters which include the combining grapheme joiner special tailored weights;
however, such an application of CGJ is not recommended.)
Second, the insertion of a combining grapheme joiner into a sequence of combining marks will block canonical reordering of those combining marks.
This can be used in some unusual circumstances where two sequences of combining marks need to be distinguished, but where the different sequences would be
neutralized by normalization. For example, the sequence of Hebrew points <hiriq, patah> can be distinguished from the sequence
<patah, hiriq> by inserting a
combining grapheme joiner:
<patah, CGJ, hiriq>. The presence of the CGJ would prevent reordering of that sequence to
<hiriq, patah>, thus enabling a reliable
distinction to be maintained. Such usage will also cause differences in collation for the affected sequences.
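The blocking effect on canonical reordering can be observed directly. The sketch below uses Python's `unicodedata` module; the base letter bet (U+05D1) is added here only so that the points have something to attach to, and the variable names are arbitrary.

```python
import unicodedata

BET, HIRIQ, PATAH, CGJ = "\u05D1", "\u05B4", "\u05B7", "\u034F"

def nfd(s):
    """Normalize to NFD, which applies canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)

# CGJ is a non-spacing mark with canonical combining class 0.
print(unicodedata.category(CGJ), unicodedata.combining(CGJ))

# Without CGJ, <patah, hiriq> is reordered to <hiriq, patah>, because
# hiriq has the lower canonical combining class.
plain = BET + PATAH + HIRIQ
print([f"U+{ord(ch):04X}" for ch in nfd(plain)])

# With CGJ between the points, the original order survives normalization.
guarded = BET + PATAH + CGJ + HIRIQ
print([f"U+{ord(ch):04X}" for ch in nfd(guarded)])
```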
Q: Unicode doesn't seem to distinguish between tréma and
umlaut, but I need to distinguish. What shall I do?
A: For some purposes, it may be necessary to maintain a distinction
between tréma and umlaut, for example, in bibliographic records kept by the
German library network. For the Latin script, the Unicode Standard does
not distinguish identically appearing diacritical marks with different
functions. Doing so would result in confusion in implementations and among
users.
The character U+034F COMBINING GRAPHEME JOINER (CGJ) may be used to make
the relevant sorting, searching, and data mapping distinctions required for
umlaut versus tréma. The semantics of CGJ are such that it should impact
only searching and sorting, for systems which have been tailored to
distinguish it, while being otherwise ignored in interpretation. The CGJ
character was encoded with this purpose in mind.
The sequences <a, umlaut> and <a, CGJ, umlaut> are not canonically
equivalent. This means that the distinction will not be normalized away on
conversion in and out of bibliographic systems. This eases the
interoperability problem. Both sequences will display as they should.
Implementations which need to distinguish the two for searching and
sorting may systematically maintain weighting distinctions. <a, umlaut> =
<ä> can be treated as equivalent to <a, e> for sorting purposes, while the tréma <a, CGJ, umlaut> can be weighted as a secondary variant of <a> thus
resulting in the desired behavior for such systems. Existing collations
which do not distinguish tréma and umlaut in their data will continue to
work exactly as they currently do, since in default collation tables CGJ is
ignored in weighting.
Existing collation, searching, and matching based on the Unicode Collation
Algorithm will continue to behave as originally specified: they will not
distinguish tréma and umlaut in German data. Only collation tables that
add new weights for the sequence <CGJ, umlaut> will distinguish between
that and a plain umlaut.
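A short sketch shows why the distinction survives normalization. It uses Python's `unicodedata` module; `default_key` is a hypothetical stand-in for an untailored comparison that ignores CGJ, and is not the Unicode Collation Algorithm itself.

```python
import unicodedata

CGJ = "\u034F"
umlaut = "a\u0308"            # <a, umlaut>
trema = "a" + CGJ + "\u0308"  # <a, CGJ, umlaut>

def nfc(s):
    return unicodedata.normalize("NFC", s)

# The plain sequence composes to the single character U+00E4,
# while the CGJ sequence is left untouched.
print([f"U+{ord(ch):04X}" for ch in nfc(umlaut)])
print([f"U+{ord(ch):04X}" for ch in nfc(trema)])

def default_key(s):
    """Hypothetical untailored key: drop CGJ, then normalize.
    A tailored system would keep CGJ and assign it a weight instead."""
    return nfc(s.replace(CGJ, ""))

same_by_default = default_key(umlaut) == default_key(trema)
```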
Q: Is it possible to apply a diacritic or combining enclosing mark to a sequence of more than one (non-combining) character?
A: No, with the exception of the “double diacritics” deliberately designed to be applied onto a two letter sequence, e.g. U+035D COMBINING DOUBLE BREVE. Neither ZWJ (U+200D ZERO WIDTH JOINER) nor CGJ
(U+034F COMBINING GRAPHEME JOINER) “glue” characters together in a way that the
scope of any following combining character would be affected. To get a character
sequence like “Esc” into something like the U+20E3 COMBINING ENCLOSING KEYCAP,
you must resort to higher-level protocols.
Q: I’ve looked through the Unicode code charts and don’t see the combination I need for transliterating Egyptological yod, which is made up of the letter i (/I) plus a half-ring-looking diacritic above it. What should I do?
A: Because the combination of the letter i or I and diacritic is already covered by characters in Unicode, no precomposed characters for Egyptological yod were separately encoded. (See Where Is My Character? My language needs the digraph “xy”. Why is it not encoded as a single character? and other questions above.)
For the diacritic, three choices are available: U+0313 COMBINING COMMA ABOVE, U+0357 COMBINING RIGHT HALF RING ABOVE, or U+0486 COMBINING CYRILLIC PSILI PNEUMATA. The placement of the diacritic is up to the font-designer and rendering engine, so you should test available fonts with the preferred diacritic. (For further information on the set of comma-form and half-ring diacritics in Unicode and their relationships, see Unicode Technical Note #32.)
Q: Should I expect issues when using this approach for representation of Egyptological yod?
A: Typing the Latin letter i/I and the combining diacritic should work, as long as you have a font with the proper glyphs and a recent computer whose rendering engine can display it. The display may work better with some fonts and on certain platforms. If the display doesn’t work on a webpage, see Display Problems. To help other users get the best results when viewing your webpages, it may be advisable to also include a note on your webpage, identifying which fonts and which browsers provide the best results.
Q: Should I use the combining diacritic over U+0131 LATIN SMALL LETTER DOTLESS I, or should
I use U+0069 LATIN SMALL LETTER I?
A: Rendering of diacritics over i automatically accounts for the removal of the dot on i, so the proper choice is to use LATIN SMALL LETTER I (U+0069).
Q: I am digitizing textual materials for a language whose script contains a small letter "@", as well as a capitalized version, depicted as the letter "A" with a circle around it. Which Unicode characters should I use to represent these?
A: The Unicode Standard does not contain a small letter character for "@", apart from the widely used "at" sign symbol itself, U+0040 COMMERCIAL AT. Nor does it contain a capitalized letter corresponding to the "at" sign symbol. The UTC has declined to encode separate letters for these or to create a case pairing for the existing "at" sign symbol, because of the potential for confusion and/or spoofing involving the "@"—a very common syntax character in email and many other functions.
Such language material could be represented by using the existing circled letter symbols, U+24D0 CIRCLED LATIN SMALL LETTER A and U+24B6 CIRCLED LATIN CAPITAL LETTER A (ⓐⒶ ). These have the advantage of being already encoded and widely available in fonts. Additionally, those two symbols already form a case pair in the standard, which means that case mapping and other casing operations (including case-insensitive searching) involving the digitized material should work correctly. Although the default glyphs for the small circled a and capital circled a in most fonts might not have the optimal appearance, fonts can be adjusted for special purposes such as publication, to produce the desired appearances of the characters.
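The case pairing of the two circled letters can be checked with ordinary string operations. This is a sketch in Python, whose `str.upper`, `str.lower`, and `str.casefold` methods follow the Unicode case mappings; the variable names are arbitrary.

```python
small = "\u24D0"    # CIRCLED LATIN SMALL LETTER A
capital = "\u24B6"  # CIRCLED LATIN CAPITAL LETTER A

# The two symbols map to each other under case conversion.
print(small.upper() == capital, capital.lower() == small)

# Case-insensitive matching therefore treats them as the same letter.
matches_ignoring_case = small.casefold() == capital.casefold()

# The "at" sign has no case mapping at all.
at_sign_unchanged = "@".upper() == "@" and "@".lower() == "@"
```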