
Emoji Encoding Principles


Introduction

Emoji are very different from other Unicode characters. For many years, they were considered out of scope for encoding as characters. Only in 2007 did the UTC agree to broaden the scope to provide for encoding of emoji, to allow for compatibility with Japanese carrier standards that were coming into widespread use for message exchange. This resulted in the first major addition of emoji characters in Unicode 6.0 (2010), and since then they have exploded in popularity. Yet emoji remain quite different in kind from other Unicode symbols, and this is reflected in fundamental differences in the way that new ones are encoded. It is important to understand those differences.

For more information about the history of emoji and their encoding, see the Introduction to UTS #51: Unicode Emoji.

Special Considerations for Emoji

For the majority of characters that are not emoji, the UTC looks for evidence of existing usage as text. Proposers need to establish that there is some reasonable body of text, either modern or historic, that uses that character. They also need to establish that the usage is not simple glyph variation: we don’t encode different characters for ordinary stylistic variants like {A A A A A A} or {a a}, except where they are used as distinct symbols in some notation. In theory, this makes for a relatively closed set of possible characters. A detailed character proposal must be supplied, following the directions at Submitting Character Proposals.

Emoji are very different from most other characters, including regular pictographic characters. They are colorful, playful representations of persons, places, or things, and combinations of those (such as a person riding a bicycle). For emoji, rather than look for evidence of existing textual use (since emoji effectively cannot exist in text until they are encoded), we look for evidence of likely high usage once they are encoded, plus a number of other factors. As with non-emoji characters, a detailed proposal must be supplied, following the directions at Submitting Emoji Proposals. However, that proposal is very different from the regular character proposal, reflecting the differences between the two processes.

In addition, in order to handle features such as skin tone and gender, emoji sometimes have complex sequence structure, which is generally not an issue for non-emoji pictographic characters.
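As a rough illustration of the sequence structure mentioned above, the following sketch lists the code points of two emoji sequences: a hand with a skin tone modifier, and a runner joined to a gender sign with a zero width joiner. The helper name `code_points` and the two sample sequences are chosen here for illustration and do not come from this page; the code point values are those assigned in the Unicode Standard.

```python
# Illustrative sketch: show how one displayed emoji can consist of several code points.

def code_points(text: str) -> list[str]:
    """Return the code points of `text` in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Emoji modifier sequence: WAVING HAND SIGN followed by a skin tone modifier.
waving_hand_medium = "\U0001F44B\U0001F3FD"

# Emoji ZWJ sequence: RUNNER + ZERO WIDTH JOINER + FEMALE SIGN + VARIATION SELECTOR-16.
woman_running = "\U0001F3C3\u200D\u2640\uFE0F"

for label, seq in [("waving hand, medium skin tone", waving_hand_medium),
                   ("woman running", woman_running)]:
    print(f"{label}: {' '.join(code_points(seq))}")
```

Each string is intended to display as a single emoji on platforms that support it, although it occupies two and four code points respectively.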

Limitations on Emoji Encoding

Emoji are effectively unlimited in variety. In theory, one could have emoji for 339 breeds of dogs, or 10,000 species of birds, and even variants of those (a large female Welsh Harlequin duck, looking over its right shoulder with an egg in the foreground). In practice there are many limitations — just not the same limitations as for other Unicode symbols:

  1. To be useful, emoji need to be widely deployed by major vendors. There is no desire or ability to burden Unicode with large numbers of pictographic symbols that are never “emojified.”
  2. The major vendors have indicated that they want to hold to an emoji “budget” each year of about 70 new characters (and limits on emoji sequences as well). Each additional emoji can be a burden on memory, UI usability and development cost — the memory impact is especially important for mobile devices in emerging markets.
  3. There is always the option of using emoji-style images (a.k.a. stickers) for more specific objects. That is another reason to keep to an emoji budget: every Unicode character is encoded forever, and if emoji go out of style, there is no desire to have an excessive number of them. Emoji cannot provide for all the different ways that people can identify themselves, but services are providing GIFs and stickers to fill that gap. For example, in Gboard you can search GIFs for “rainbow flag” or “Assyrian flag” and insert them. You can also construct images using tools such as Emoji Minis or Animoji.
  4. The process for encoding emoji needs to balance a number of factors. (See Submitting Emoji Proposals.) High among those is prospective usage — if a proposed emoji is not going to be used often by millions of people, then it is taking a slot in the budget that could be occupied by a more popular emoji. Another important feature is breadth: when there are multiple variants of an emoji, the usage just tends to be split among them, while a new kind of emoji permits new kinds of expression.
  5. The Unicode Consortium also tends to roll out small initial sets of new types of characters, such as gender-neutral forms, so that it can assess the frequency of usage before adding more of that type.
  6. The Unicode Consortium has developed a submission process that is open to anyone, along with factors for encoding that can be applied as objectively as possible to each proposal. Those factors are also applied to internal proposals from Unicode members, and to proposals from liaison members. The process is open to improvement; the Emoji Subcommittee welcomes proposals for improvements.

Non-emoji pictographic characters are typically limited to sets (such as Dingbats) that were encoded for compatibility, or for specialized domains such as math symbols or alchemical symbols. New non-emoji pictographic characters are subject to the same process as new letters or other symbols: demonstrated use as plain text characters in some body of literature, per Submitting Character Proposals.

Unicode is not open to all possible graphic images as non-emoji pictographs. The Unicode Consortium doesn’t approve non-emoji pictographic characters simply to fill in perceived gaps, such as fleshing out a complete taxonomic classification of animal species or varieties.

Uniqueness and Stability of Representation

A given emoji may have multiple valid encoded representations. However, there is only one representation that is “recommended for general interchange” (RGI). For example, an emoji flag for American Samoa has two valid representations:

  1. Using an Emoji Flag Sequence with a pair of REGIONAL INDICATOR characters that indicate the Unicode region subtag for American Samoa: “AS”.
  2. Using a Flag Emoji Tag Sequence with TAG characters indicating the Unicode subdivision id for American Samoa: “usas”.

However, only the former is RGI and is listed in the emoji data files. In general, of the possible valid representations of an emoji, the shortest is usually chosen as the recommended (RGI) form.
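The two representations of the American Samoa flag can be sketched in code as follows. This is a minimal illustration, not a reference implementation: the function names `emoji_flag_sequence` and `flag_tag_sequence` are hypothetical, and the code point values (REGIONAL INDICATOR SYMBOL LETTER A at U+1F1E6, WAVING BLACK FLAG at U+1F3F4, the tag characters at U+E0000 plus the ASCII value, and CANCEL TAG at U+E007F) are taken from the Unicode Standard rather than from this page.

```python
# Illustrative sketch: build both valid representations of a flag emoji.

REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
WAVING_BLACK_FLAG = 0x1F3F4     # base character of a flag tag sequence
TAG_BASE = 0xE0000              # tag characters mirror ASCII at this offset
CANCEL_TAG = 0xE007F            # terminates a tag sequence


def emoji_flag_sequence(region: str) -> str:
    """Pair of REGIONAL INDICATOR characters for a two-letter region subtag."""
    if len(region) != 2 or not (region.isascii() and region.isalpha()):
        raise ValueError(f"expected a two-letter region subtag, got {region!r}")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in region.upper()
    )


def flag_tag_sequence(subdivision: str) -> str:
    """WAVING BLACK FLAG + TAG characters for a subdivision id + CANCEL TAG."""
    if not (subdivision.isascii() and subdivision.isalnum()):
        raise ValueError(f"expected an alphanumeric subdivision id, got {subdivision!r}")
    tags = "".join(chr(TAG_BASE + ord(c)) for c in subdivision.lower())
    return chr(WAVING_BLACK_FLAG) + tags + chr(CANCEL_TAG)


rgi_form = emoji_flag_sequence("AS")    # the RGI representation
tag_form = flag_tag_sequence("usas")    # valid, but not RGI

print(len(rgi_form), [f"U+{ord(c):04X}" for c in rgi_form])
print(len(tag_form), [f"U+{ord(c):04X}" for c in tag_form])
```

In this sketch the RGI form is two code points long while the tag sequence is six, which matches the tendency described above for the shortest representation to be the recommended one. Whether a given platform actually displays the non-RGI tag sequence as a flag is up to the vendor.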