Variation Sequences
General variation sequences FAQ
Q: What are variation sequences?
A: Every character in Unicode can be displayed
with many different glyphs: An "a" can be displayed with or without the top "hook" (a versus ɑ).
A not-equals sign (≠) can be displayed with an angled or vertical slash, and so on.
In some situations, however, it is important to
indicate in plain text that only a subset of the possible glyphs for a character should be
used, such as a vertical slash for ≠. The variation sequences are a standardized mechanism
for requesting such an appearance.
Q: What is the structure of a variation sequence?
A: Variation sequences consist of a base character followed by a variation selector.
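The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `is_variation_selector` and `find_variation_sequences` are made up for this example, and the code checks only the structure (a character followed by a variation selector), not whether the sequence is a valid one.

```python
# Code point ranges of the variation selectors.
VS_RANGES = (
    (0x180B, 0x180D),    # Mongolian free variation selectors FVS1..FVS3
    (0x180F, 0x180F),    # Mongolian free variation selector FVS4
    (0xFE00, 0xFE0F),    # VS1..VS16
    (0xE0100, 0xE01EF),  # VS17..VS256
)

def is_variation_selector(ch):
    """True if ch is one of the variation selector characters."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)

def find_variation_sequences(text):
    """Return (base, selector) pairs wherever a selector follows a non-selector."""
    pairs = []
    for prev, cur in zip(text, text[1:]):
        if is_variation_selector(cur) and not is_variation_selector(prev):
            pairs.append((prev, cur))
    return pairs

# <222A, FE00> is the "UNION with serifs" sequence mentioned later in this FAQ.
sample = "A \u222A\uFE00 B"
for base, selector in find_variation_sequences(sample):
    print(f"U+{ord(base):04X} U+{ord(selector):04X}")
```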
Q: What variation sequences are valid?
A: Only those listed in
StandardizedVariants.txt,
emoji-variation-sequences.txt,
or the registered sequences listed in the
Ideographic Variation Database (IVD).
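A validity check can be sketched as a lookup against a table built from the data files. The example below assumes the semicolon-separated layout used by StandardizedVariants.txt and emoji-variation-sequences.txt (code points, then a description, with `#` starting a comment). The function name `parse_variation_sequences` and the embedded sample lines are for illustration only; a real implementation would read the current files themselves.

```python
# Two sample lines in the layout of StandardizedVariants.txt (illustrative excerpt).
SAMPLE = """\
# base selector; description; # character name
2229 FE00; with serifs; # INTERSECTION
222A FE00; with serifs; # UNION
"""

def parse_variation_sequences(lines):
    """Map (base, selector) code point pairs to their description."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Assumption: the first field holds exactly two hex code points.
        base, selector = (int(cp, 16) for cp in fields[0].split())
        table[(base, selector)] = fields[1] if len(fields) > 1 else ""
    return table

table = parse_variation_sequences(SAMPLE.splitlines())
print((0x222A, 0xFE00) in table)   # listed, so valid
print((0x0041, 0xFE00) in table)   # not listed in this sample
```

A sequence that is absent from all three sources is not valid, no matter how plausible it looks.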
Q: What's the difference between standardized variation sequences and registered Ideographic Variation Sequences (IVSes)?
A: There is no difference in how these two types of variation sequences are used or
supported by implementations.
Standardized sequences, as the name implies, are defined in the Unicode Standard, as listed in
StandardizedVariants.txt.
The variation sequences which affect emoji presentation are listed in
emoji-variation-sequences.txt,
and are documented in UTS #51, Unicode Emoji.
IVSes are formally registered by their submitters according to the procedures listed in
UTS #37.
They are listed in the Ideographic Variation Database (IVD).
After an IVS has been registered, it can be used by anyone.
Q: Can I define my own sequences?
A: No, that is the equivalent of trying to define an unassigned code point to be your own character. Private use
characters should be used instead.
Q: When are variation sequences not appropriate?
A: Variation sequences are inappropriate if two different
shapes of a character carry distinct meanings. This was the case for IPA,
where a separate character was encoded for the hooked “g” (U+0261) and also for the “ɑ”
that doesn't have a handle (U+0251), instead of defining variation sequences to represent the different glyphic presentations.
Q: Can all glyph variations be represented with variation sequences?
A: No. The existing situation in Unicode sometimes requires a
font to follow particular conventions to be useful. For example, any font for IPA must display the "a" at U+0061 with a
handle glyph, not a bowl glyph. Otherwise, it would be impossible to express
the distinction from the IPA character with the bowl a. The same applies to the “g”.
Similar considerations apply to mathematical fonts and Greek
characters, where some forms of beta, theta, phi, etc. have been encoded
for mathematical purposes, so the font must supply the “other” shape
for the regular character.
Q: How are the glyphs for variation sequences described?
A: A standardized variation sequence, such as <222A,
FE00>, associates a sequence with a description, such as “UNION with
serifs”. Here, “with serifs” indicates that the presence of serifs
distinguishes the glyph variant from the ordinary glyph (which does not
have serifs). In this case, which involves a mathematical operator, the form without
serifs would be predominant. There are other cases, where glyph variants
occur more equally—in those cases, it would be problematic to assign
only one of them a variation sequence, as the other one isn't
necessarily a “default.”
The appearance of the variant glyph is not as tightly
restricted as the design of a logo, for example. It still can vary in
all aspects, except that it is expected to retain its distinguishing
characteristic—and it should remain a recognizable glyph for the
character.
In order to standardize a variation sequence, the variant
glyph at a minimum needs to be identified and described. It should also
be applicable generically rather than restricted to a single font, as are
the many stylistic variations of the ampersand found only in Poetica
Ampersand. [AF]
Q: What about positional forms?
A: Some characters adopt different shapes
depending on the characters around them. These are called positional forms.
Unlike “random” stylistic variations, these are standard forms for these characters,
in the sense that a reader can look at the shape and say “this is
the final form of character xxx.”
Where the display of positional
forms is predictable, such as in Arabic, variation sequences are
not necessary. In cases where positional variants need to be displayed
outside their normal context, this rendering can be handled with two special
characters, ZWJ (zero width joiner) and ZWNJ (zero width non-joiner),
instead of variation sequences. Mongolian is more complex, so there are special variation selectors
for it. For more information, see Section 13.5, Mongolian in The Unicode Standard.
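As an illustration of the ZWJ and ZWNJ approach, the strings below request the positional forms of ARABIC LETTER HEH (U+0647) outside of a word. This is only a sketch of the character sequences involved; whether the forms actually appear depends on the font and the rendering engine, and the dictionary name `forms` is chosen for this example.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
HEH = "\u0647"   # ARABIC LETTER HEH
BEH = "\u0628"   # ARABIC LETTER BEH

# A ZWJ stands in for a joining neighbor on that side of the letter.
forms = {
    "isolated": HEH,
    "initial": HEH + ZWJ,        # joins to what follows
    "medial": ZWJ + HEH + ZWJ,   # joins on both sides
    "final": ZWJ + HEH,          # joins to what precedes
}

# A ZWNJ between two letters asks that they not be joined.
unjoined = BEH + ZWNJ + HEH

for name, text in forms.items():
    print(name, " ".join(f"U+{ord(ch):04X}" for ch in text))
```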
Q: Are some positional forms encoded as separate characters?
A: Yes. In Greek, the small sigma has
a special form, which is used at the end of words. It was given an explicit
character code in early Greek character sets, so Unicode continued this
practice. In the Latin script, the contrast between “long s” and regular
(round) “s” is in some sense positional, but the rules are not easy to
automate, and even then exceptions would apply. Therefore, again, an
explicit character was encoded. Similar characters are encoded for Hebrew.
Variation sequence display and support FAQ
Q: How should variation sequences be displayed?
A: When they are valid variation sequences, they should be displayed as
illustrated in the Unicode code charts,
the emoji charts,
or in the Ideographic Variation Database.
When a variation sequence is not valid or its display is not supported,
the base character is displayed as usual, and the variation selector
is invisible. See Display of Unsupported Characters.
Q: What about applications that don't support variation sequences?
A: Applications not supporting variation sequences should act as if the
variation selector is not present. That normally applies to all text
processes such as searching, sorting, parsing, and so forth.
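One way to act as if the variation selector is not present is to remove selectors before comparing or sorting. The sketch below assumes that approach; `strip_variation_selectors` is a hypothetical helper, and an application might equally well skip selectors inside its comparison routine instead of building new strings.

```python
# Code point ranges of the variation selectors.
VS_RANGES = (
    (0x180B, 0x180D),    # Mongolian FVS1..FVS3
    (0x180F, 0x180F),    # Mongolian FVS4
    (0xFE00, 0xFE0F),    # VS1..VS16
    (0xE0100, 0xE01EF),  # VS17..VS256
)

def is_variation_selector(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)

def strip_variation_selectors(text):
    """Return text with every variation selector removed."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

# HEAVY BLACK HEART with text-style (VS15) and emoji-style (VS16) selectors.
text_heart = "\u2764\uFE0E"
emoji_heart = "\u2764\uFE0F"
print(strip_variation_selectors(text_heart) == strip_variation_selectors(emoji_heart))

# Sorting with the stripped form as the key ignores the selectors.
items = ["\u2764\uFE0F", "\u2603\uFE0F", "\u2603"]
ordered = sorted(items, key=strip_variation_selectors)
```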
Q: How can variation sequences be handled in fonts?
A: For handling variation sequences with OpenType fonts, see
“Format 14: Unicode Variation Sequences” in the OpenType specification.
The following font development tools are helpful for implementing and verifying variation sequences in OpenType fonts via the Format 14 'cmap' subtable:
A significant number of OpenType fonts now support variation sequences. Please consult the font's documentation to determine the extent to which variation sequences are supported.
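The lookup that a Format 14 subtable supports can be modeled roughly as follows. This is a simplified sketch of the logic only, not a parser for the binary table. The class `SelectorRecord`, the function `lookup_glyph`, and all glyph IDs are invented for the example. For each variation selector, the subtable can list "default" ranges of base characters (the sequence is supported and uses the glyph from the regular character map) and "non-default" mappings (the sequence has its own glyph).

```python
from dataclasses import dataclass, field

@dataclass
class SelectorRecord:
    # (start code point, additional count) ranges that use the regular glyph
    default_ranges: list = field(default_factory=list)
    # base code point -> glyph ID specific to the variation sequence
    non_default: dict = field(default_factory=dict)

def lookup_glyph(base, selector, cmap, uvs_records):
    """Return (glyph_id, supported) for the sequence <base, selector>."""
    record = uvs_records.get(selector)
    if record is not None:
        for start, additional in record.default_ranges:
            if start <= base <= start + additional:
                return cmap.get(base), True
        if base in record.non_default:
            return record.non_default[base], True
    # Sequence not supported by this font: use the base glyph, ignore the selector.
    return cmap.get(base), False

# Hypothetical font data.
cmap = {0x2229: 9, 0x222A: 10}
uvs_records = {0xFE00: SelectorRecord(non_default={0x222A: 57})}

print(lookup_glyph(0x222A, 0xFE00, cmap, uvs_records))
print(lookup_glyph(0x2229, 0xFE00, cmap, uvs_records))
```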
Q: What changes does a browser developer need to make to support variation sequences?
A: Browsers generally use a font fallback mechanism to display web pages. This allows users to read text when the font specified in the web page is unavailable or doesn't support all the characters that are referenced on that web page. A simple but insufficient mechanism is to display characters in a font up until the first character that can't be displayed. Such a mechanism fails with variation sequences. A better mechanism is to treat a combining character sequence as a single entity for the purpose of font substitution. Because variation selectors have the General_Category property value of Nonspacing_Mark, this treatment allows variation sequences to be handled correctly. This applies more generally, to developers of any OS or application, and not only to browser developers.
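The better mechanism can be sketched as follows, under simplifying assumptions. Text is split into clusters of one base character plus any following marks (General_Category M*), and a font is chosen per cluster rather than per character. The font model here is hypothetical: each font is a name plus a set of the clusters it can render, and `split_clusters` and `choose_font` are names invented for this example. Real combining character sequences and real font coverage are more involved.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # variation selectors are Mn, so they attach here
        else:
            clusters.append(ch)
    return clusters

def choose_font(cluster, fonts):
    """Pick the first font that renders the whole cluster.

    If none does, fall back to the first font that renders the base
    character alone (the selector is then simply not displayed).
    """
    for name, coverage in fonts:
        if cluster in coverage:
            return name
    for name, coverage in fonts:
        if cluster[0] in coverage:
            return name
    return None

# Hypothetical fonts: only "Fallback" supports the sequence <222A, FE00>.
fonts = [
    ("Primary", {"A", "\u222A"}),
    ("Fallback", {"A", "\u222A", "\u222A\uFE00"}),
]

clusters = split_clusters("A\u222A\uFE00")
chosen = [choose_font(c, fonts) for c in clusters]
print(chosen)
```

A per-character fallback would have picked "Primary" for U+222A and then had no way to honor the selector, which is the failure described above.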
Q: How should variation sequences be handled in search?
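A: There are a number of different methods. The first and simplest method is to ignore any variation selectors when doing a search. Another method is to have a query without variation selectors match terms with or without any variation selector, while a query with a specific variation selector matches only terms with that same variation selector. For example, under the second method a query for a base character alone finds that character whether or not a variation selector follows it, whereas a query for the base character followed by VS1 finds only occurrences followed by VS1.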
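The second search method can be sketched as a term-level match. The function names `to_units` and `term_matches` are hypothetical, and the sketch compares whole terms only; a real search engine would apply the same rule inside its own tokenization and indexing.

```python
# Code point ranges of the variation selectors.
VS_RANGES = (
    (0x180B, 0x180D), (0x180F, 0x180F),
    (0xFE00, 0xFE0F), (0xE0100, 0xE01EF),
)

def is_variation_selector(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)

def to_units(text):
    """Split text into (character, selector-or-None) units."""
    units = []
    for ch in text:
        if is_variation_selector(ch) and units and units[-1][1] is None:
            units[-1] = (units[-1][0], ch)   # attach selector to preceding character
        else:
            units.append((ch, None))
    return units

def term_matches(query, term):
    """A query unit without a selector matches any selector (or none);
    a query unit with a selector matches only that same selector."""
    q, t = to_units(query), to_units(term)
    if len(q) != len(t):
        return False
    return all(
        q_base == t_base and (q_vs is None or q_vs == t_vs)
        for (q_base, q_vs), (t_base, t_vs) in zip(q, t)
    )

print(term_matches("\u222A", "\u222A\uFE00"))        # bare query, term has VS1
print(term_matches("\u222A\uFE00", "\u222A"))        # query has VS1, term is bare
print(term_matches("\u222A\uFE00", "\u222A\uFE00"))  # same selector on both sides
```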
Q: How should variation sequences be handled in IMEs (input method editors for CJK)?
A: They can be listed as separate options to choose from, just like single code
points. However, if there are many options it may be worth having a
pull-down or fly-out menu associated with the base character.
Standardized variation sequences FAQ
Q: How can I propose a standardized variation sequence?
A: You can initiate the process of requesting a variation sequence by submitting an inquiry via the contact form. A thorough understanding of how variation selectors are used will make a proposal more likely to be accepted by the UTC. For background information, read Section 23.4, Variation Selectors, in The Unicode Standard, along with UTR #25, UAX #34, and the rest of this FAQ. [AF]
Q: I'm proposing an addition to a
historic script that is a variant of an existing character. Should I
propose it as a new character or as a new variation sequence?
A: Variation sequences provide a means to
specify a certain significant glyphic variation of a character, without
encoding each variation as a separate character. This is particularly
useful whenever such distinction is not universally necessary.
Because the character itself is part of the variation
sequence, one should be able to search and find all the instances of
that particular character, independent of variation in its appearance, a
task which would be more complicated if the variants were encoded as
separate characters. If you can replace the variant by the existing
character without significantly distorting the content of the text,
then a variation sequence is the appropriate way to represent the variant, and you
should propose your addition as a variation sequence.
For historic scripts, the variation sequence provides a useful tool, because it can show mistaken or nonce glyphs and relate them to the base character.
It can also be used to reflect the views of scholars, who may
see the relation between the glyphs and base characters differently.
Also, new variation sequences can be added for new variant appearances
(and their relation to the base characters) as more evidence is
discovered.
Q: In what situations does Unicode define variation sequences?
A: Standardized variation sequences are intended as an exceptional mechanism for certain difficult edge cases where the character-versus-glyph question cannot be decided. To qualify as a standardized variant, an entity must clearly be the same character in most contexts. That means that substituting the base character is not only harmless to the meaning of the text, but ideally not even noticeable to many readers. [AF]
Ideographic Variation Sequence (IVS) FAQ
Q: How can I register an IVS?
A: Registrations are subject to the requirements and process specified in UTS #37. [AF]
Q: Can Ideographic Variation Sequences (IVSes) be registered for non-variant or standard forms of CJK Unified Ideographs?
A: Yes. The Han Unification process that resulted in the standard repertoire of CJK Unified Ideographs treats all glyphic forms that can be represented by a CJK Unified Ideograph as variants. By definition, therefore, the glyph chosen as the standard representation of a CJK Unified Ideograph is itself a variant. To emphasize this point, starting with Unicode 5.2, the Unicode code charts no longer show a single glyph, but instead typically show several national variants for each CJK Unified Ideograph. The purpose of a registered IVS is to allow one to pin down a CJK Unified Ideograph to a more specific glyphic form, regardless of whether that glyph is commonly considered a variant or non-variant form.