The Unicode® Standard: A Technical Introduction

The Unicode Standard is the universal character encoding standard used for representation of text for computer processing. Versions of the Unicode Standard are fully compatible and synchronized with the corresponding versions of International Standard ISO/IEC 10646. For example, Unicode 7.0 contains all the same characters and code points as ISO/IEC 10646:2012 plus Amd 1 and Amd 2. The Unicode Standard provides additional information about the characters and their use. Any implementation that is conformant to Unicode is also conformant to ISO/IEC 10646.

Unicode provides a consistent way of encoding multilingual plain text and brings order to a chaotic state of affairs that has made it difficult to exchange text files internationally. Computer users who deal with multilingual text—business people, linguists, researchers, scientists, and others—will find that the Unicode Standard greatly simplifies their work. Mathematicians and technicians, who regularly use mathematical symbols and other technical characters, will also find the Unicode Standard valuable.

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The Unicode Standard and ISO/IEC 10646 support three encoding forms (UTF-8, UTF-16, UTF-32) that use a common repertoire of characters. These encoding forms allow for encoding as many as a million characters. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world, as well as common notational systems.

What Characters Does the Unicode Standard Include?

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, and more. Diacritics are modifying marks, such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 7.0 provides codes for 112,956 characters from the world's alphabets, ideograph sets, and symbol collections.

The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are sixteen other supplementary planes available for encoding other characters, with currently over 860,000 unused code points. More characters are under consideration for addition to future versions of the standard.
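The plane layout described above can be sketched as simple arithmetic. This is a minimal illustration, assuming 17 planes of 65,536 code points each (the BMP plus the sixteen supplementary planes); the names `plane_of` and `is_bmp` are hypothetical helpers, not part of any standard API.

```python
PLANE_SIZE = 0x10000      # 65,536 code points per plane ("64K")
PLANE_COUNT = 17          # the BMP plus sixteen supplementary planes
CODESPACE_SIZE = PLANE_SIZE * PLANE_COUNT


def plane_of(code_point: int) -> int:
    """Return the plane number of a code point (0 is the BMP)."""
    if not 0 <= code_point < CODESPACE_SIZE:
        raise ValueError(f"outside the codespace: {code_point:#x}")
    return code_point // PLANE_SIZE


def is_bmp(code_point: int) -> bool:
    """True when the code point lies in the basic multilingual plane."""
    return plane_of(code_point) == 0


print(CODESPACE_SIZE)        # 1114112
print(plane_of(ord("A")))    # 0
print(plane_of(0x1F600))     # 1
```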

The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
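The two counts above can be checked with a short sketch. The range boundaries are not given in this introduction; they are filled in here as an assumption (U+E000..U+F8FF on the BMP, and all of planes 15 and 16 except their last two code points), and the helper name `is_private_use` is hypothetical.

```python
# Assumed private use ranges (boundaries are not stated in the text above).
BMP_PRIVATE_USE = range(0xE000, 0xF8FF + 1)
PLANE_15_PRIVATE_USE = range(0xF0000, 0xFFFFD + 1)
PLANE_16_PRIVATE_USE = range(0x100000, 0x10FFFD + 1)

PRIVATE_USE_RANGES = (BMP_PRIVATE_USE, PLANE_15_PRIVATE_USE, PLANE_16_PRIVATE_USE)


def is_private_use(code_point: int) -> bool:
    """True when the code point falls in one of the assumed private use ranges."""
    return any(code_point in r for r in PRIVATE_USE_RANGES)


bmp_count = len(BMP_PRIVATE_USE)
supplementary_count = len(PLANE_15_PRIVATE_USE) + len(PLANE_16_PRIVATE_USE)
print(bmp_count, supplementary_count)   # 6400 131068
```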

Encoding Forms

Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits. 

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte-, word-, or double-word-oriented format (that is, in 8, 16, or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. It transforms every Unicode character into a variable-length sequence of one to four bytes. It has two advantages: the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
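A brief sketch of the UTF-8 behavior described here, using Python's built-in codec. The sample characters are chosen only for illustration, and `utf8_bytes` is a hypothetical helper name.

```python
def utf8_bytes(ch: str) -> bytes:
    """Encode a single character as UTF-8 using the built-in codec."""
    return ch.encode("utf-8")


# One sample per encoded length: T, n with tilde, euro sign, an emoji.
samples = ["T", "\u00f1", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

# An ASCII character keeps its ASCII byte value under UTF-8.
ascii_is_unchanged = utf8_bytes("T") == "T".encode("ascii")
```

Because the ASCII range is unchanged, software that only inspects ASCII bytes (delimiters, digits, markup characters) can often pass UTF-8 text through untouched.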

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
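The split between single code units and pairs can be sketched by hand. This is an illustrative implementation of the usual surrogate-pair arithmetic, not text taken from the standard; `to_utf16_code_units` is a hypothetical name, and the result is compared against Python's own `utf-16-be` codec as a sanity check.

```python
def to_utf16_code_units(code_point: int) -> list[int]:
    """Return the 16-bit code units that represent one code point in UTF-16."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are reserved for UTF-16 itself")
    if code_point < 0x10000:
        return [code_point]                # BMP: a single code unit
    offset = code_point - 0x10000          # 20 bits to split across two units
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


units = to_utf16_code_units(0x1F600)
print([f"{u:04X}" for u in units])         # ['D83D', 'DE00']

# Cross-check against the built-in big-endian UTF-16 codec.
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
matches_codec = as_bytes == "\U0001F600".encode("utf-16-be")
```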

UTF-32 is useful where memory space is no concern, but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (32 bits) of data for each character.
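As a rough check of the size bound and of lossless conversion between forms, the sketch below uses Python's built-in codecs. The little-endian codec names are chosen so that no byte order mark is added; `encoded_sizes` and `transcode` are hypothetical helper names.

```python
FORMS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> dict[str, int]:
    """Number of bytes needed for one character in each encoding form."""
    return {form: len(ch.encode(form)) for form in FORMS}


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded text from one encoding form to another."""
    return data.decode(source).encode(target)


print(encoded_sizes("A"))             # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes(chr(0x10FFFF)))   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

# Round trip UTF-8 -> UTF-16 -> UTF-32 -> UTF-8 without loss.
original = "A\u00f1\U0001F600".encode("utf-8")
step1 = transcode(original, "utf-8", "utf-16-le")
step2 = transcode(step1, "utf-16-le", "utf-32-le")
round_tripped = transcode(step2, "utf-32-le", "utf-8")
```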

Defining Elements of Text

Written languages are represented by textual elements that are used to create words and sentences. These elements may be letters such as "w" or "M"; characters such as those used in Japanese Hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or concepts.

The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l".

To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is fundamental and useful for computer text processing. For the most part, code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The task of combining two "l" together for alphabetic sorting is left to the software processing the text.
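The division of labor described above can be illustrated with a toy sort key. This is only a sketch under simplifying assumptions: it treats "ll" as a single element ranked after "l", ignores every other rule of traditional Spanish collation (such as "ch", "ñ", and accents), and the names `text_elements` and `historic_key` are hypothetical.

```python
# Toy alphabet in which "ll" is one element that sorts after "l".
ALPHABET = list("abcdefghijkl") + ["ll"] + list("mnopqrstuvwxyz")
RANK = {element: index for index, element in enumerate(ALPHABET)}


def text_elements(word: str) -> list[str]:
    """Group the code elements of a word, pairing each 'll' into one element."""
    elements = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ll":
            elements.append("ll")
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements


def historic_key(word: str) -> list[int]:
    """Sort key for the toy ordering; unknown elements sort last."""
    return [RANK.get(e, len(ALPHABET)) for e in text_elements(word.lower())]


words = ["luz", "llama", "lata"]
print(sorted(words))                     # ['lata', 'llama', 'luz']
print(sorted(words, key=historic_key))   # ['lata', 'luz', 'llama']
```

The stored text is the same in both cases: "llama" is still five code elements. Only the sorting software groups them differently.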

Text Processing

Computer text handling involves processing and encoding. Consider, for example, a word processor user typing text at a keyboard. The computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054. The word processor stores the number in memory, and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a window manager or part of the word processor itself, uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.

The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text. For example, the word processor may check the typist's input as it is being entered, and display misspellings with a wavy underline. Or it may insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that it does not specify how to carry out these processes as long as the character encoding and decoding is performed properly.

Interpreting Characters and Rendering Glyphs

The difference between identifying a code point and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper, called a glyph, is a visual representation of the character.

The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The rendering engine of a computer, whether software or hardware, is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or style of on-screen characters.

Character Sequences

Text elements are encoded as sequences of one or more characters. Certain of these sequences are called combining character sequences, made up of a base letter and one or more combining marks, which are rendered around the base letter (above it, below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^" would be rendered as "â". For more information on how sequences of characters are used to represent text in different languages, see "Where is my Character?", and for information on grapheme clusters (what end-users think of as characters), see UAX #29, Unicode Text Segmentation.

The Unicode Standard specifies the order of characters in a combining character sequence. The base character comes first, followed by one or more non-spacing marks. If there is more than one non-spacing mark, the order in which the non-spacing marks are stored isn't important if the marks don't interact typographically. If they do interact, then their order is important. The Unicode Standard specifies how successive non-spacing characters are applied to a base character, and when the order is significant.
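One way to observe when mark order matters is to compare sequences after normalization with Python's `unicodedata` module. The sketch below is an illustration, not a statement of the standard's rules: it assumes that a mark drawn above and a mark drawn below do not interact, while two marks drawn above do. The helper name `same_after_normalization` is hypothetical.

```python
import unicodedata

CIRCUMFLEX = "\u0302"   # COMBINING CIRCUMFLEX ACCENT, drawn above the base
DIAERESIS = "\u0308"    # COMBINING DIAERESIS, drawn above the base
DOT_BELOW = "\u0323"    # COMBINING DOT BELOW, drawn below the base


def same_after_normalization(a: str, b: str) -> bool:
    """True when two sequences have the same fully decomposed (NFD) form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Marks that do not interact: stored order is not significant.
non_interacting = same_after_normalization(
    "a" + CIRCUMFLEX + DOT_BELOW,
    "a" + DOT_BELOW + CIRCUMFLEX,
)

# Marks that stack in the same position: stored order is significant.
interacting = same_after_normalization(
    "a" + CIRCUMFLEX + DIAERESIS,
    "a" + DIAERESIS + CIRCUMFLEX,
)

print(non_interacting, interacting)   # True False
```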

Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".

Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier for the collation to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters. It also defines normalization forms to provide for unique representations of characters.
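Decomposition and recomposition can be tried out with the normalization forms exposed by Python's `unicodedata` module. This is a minimal sketch: `decompose`, `compose`, and `base_letters` are hypothetical helper names, and stripping marks is shown only as one simple way a collation step might ignore modifiers.

```python
import unicodedata


def decompose(text: str) -> str:
    """Canonical decomposition (normalization form NFD)."""
    return unicodedata.normalize("NFD", text)


def compose(text: str) -> str:
    """Canonical decomposition followed by composition (normalization form NFC)."""
    return unicodedata.normalize("NFC", text)


def base_letters(text: str) -> str:
    """Drop combining marks after decomposition, keeping only base characters."""
    return "".join(ch for ch in decompose(text) if unicodedata.combining(ch) == 0)


u_umlaut = "\u00fc"                                      # precomposed u with diaeresis
print([f"U+{ord(c):04X}" for c in decompose(u_umlaut)])  # ['U+0075', 'U+0308']
print(compose("u\u0308") == u_umlaut)                    # True
print(base_letters("M\u00fcller"))                       # Muller
```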

Principles of the Unicode Standard

The Unicode Standard was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard, one easily used for text encoding everywhere. To that end, the Unicode Standard follows a set of fundamental principles:

  • Universal repertoire 
  • Logical order
  • Efficiency
  • Unification
  • Characters, not glyphs
  • Dynamic composition
  • Semantics
  • Stability
  • Plain Text
  • Convertibility

The character sets of many existing international, national and corporate standards are incorporated within the Unicode Standard. For example, its first 256 characters are taken from the widely used Latin-1 character set.

Duplicate encoding of characters is avoided by unifying characters within scripts across languages; characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation is achieved by assigning a single code for each ideograph that is common to more than one of these languages. This is instead of providing a separate code for the ideograph each time it appears in a different language. (These three languages share many thousands of identical characters because their ideograph sets evolved from the same source.)

The Unicode Standard specifies an algorithm for the presentation of text with bidirectional behavior, for example, Arabic and English. Characters are stored in logical order. The Unicode Standard includes characters to specify changes in direction when scripts of different directionality are mixed. For all scripts, Unicode text is kept in logical order in memory, corresponding to the order in which the text is typed on the keyboard.
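The sketch below does not implement the bidirectional algorithm. It only shows, as an illustration, that a mixed string is stored in typing order and that each character carries a directional class, which Python exposes through `unicodedata.bidirectional`. The helper name `bidi_classes` and the sample string are assumptions made for the example.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each character, in stored (logical) order, with its bidi class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Two Latin letters, a space, then two Hebrew letters, in the order typed.
mixed = "ab \u05d0\u05d1"

for label, bidi_class in bidi_classes(mixed):
    print(label, bidi_class)

# An Arabic letter and the RIGHT-TO-LEFT MARK control character.
arabic_alef_class = unicodedata.bidirectional("\u0627")
rlm_class = unicodedata.bidirectional("\u200f")
```

A display layer would use these classes to reorder the characters for presentation; the stored order itself does not change.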

Assigning Character Codes

A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.
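The "U+" notation can be produced and parsed with a few lines of code. This is a small sketch; `to_u_plus` and `from_u_plus` are hypothetical helper names, and padding to at least four hexadecimal digits follows the examples in the text.

```python
def to_u_plus(ch: str) -> str:
    """Format a character's code point as 'U+' followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_plus(label: str) -> str:
    """Return the character named by a 'U+XXXX' label."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))


print(to_u_plus("A"))             # U+0041
print(ord("A"))                   # 65
print(from_u_plus("U+0041"))      # A
print(to_u_plus("\U0001F600"))    # U+1F600
```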

Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
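Character names can be looked up in both directions with Python's `unicodedata` module, which ships its own copy of the Unicode character database. A short sketch using the two characters named above:

```python
import unicodedata

# Code point to name.
name_a = unicodedata.name("A")
name_cha = unicodedata.name("\u0a1b")
print(name_a)      # LATIN CAPITAL LETTER A
print(name_cha)    # GURMUKHI LETTER CHA

# Name to code point: each name identifies exactly one character.
char_cha = unicodedata.lookup("GURMUKHI LETTER CHA")
print(f"U+{ord(char_cha):04X}")   # U+0A1B
```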

The Unicode Standard groups characters together by scripts in blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order—alphabetic order, for example—the Unicode Standard arranges them in its codespace using the same order whenever possible. Blocks vary greatly in size. For example, the Cyrillic block does not exceed 256 code points, while the blocks for CJK ideographs contain many thousands of code points.

Code elements are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts, followed by symbols and punctuation. The codespace continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul. The range of surrogate code points is reserved for use with UTF-16. Towards the end of the BMP is a range of code points reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and to old implementations that made use of them.

A range of code points on the BMP and two very large ranges in the supplementary planes are reserved as private use areas. These code points have no universal meaning, and may be used for characters specific to a program or by a group of users for their own purposes. For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code points in user space. A set of page-layout programs may use the same code points as control codes to position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to these code points, and reserves them as user space, promising never to assign them meaning in the future.

Conformance to the Unicode Standard

The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles and encoding architecture it embodies. A conforming implementation has the following characteristics, as a minimum requirement:

  • characters are from the common repertoire;
  • characters are encoded according to one of the encoding forms;
  • characters are interpreted with Unicode semantics;
  • unassigned codes are not used; and,
  • unknown characters are not corrupted.

Implementations of the Unicode Standard are conformant as long as they follow the rules for encoding characters into sequences of bytes, words, or double words that are in effect for the chosen encoding form, and otherwise interpret characters according to the Unicode specification. The full conformance requirements are available in the latest version of the Unicode Standard.
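The requirement that unknown characters are not corrupted can be illustrated with a pass-through conversion. This is a sketch rather than a conformance test: it assumes that U+0378 is unassigned in the Unicode version bundled with the Python interpreter, and `is_unassigned` and `utf8_to_utf16` are hypothetical helper names.

```python
import unicodedata


def is_unassigned(ch: str) -> bool:
    """True when the bundled character database gives the code point category Cn."""
    return unicodedata.category(ch) == "Cn"


def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 to UTF-16 (little-endian, no byte order mark)."""
    return data.decode("utf-8").encode("utf-16-le")


# U+0378 is assumed to be unknown to this interpreter's character database.
sample = "A\u0378B"

converted = utf8_to_utf16(sample.encode("utf-8"))
preserved = converted.decode("utf-16-le") == sample

print(is_unassigned("A"), preserved)   # False True
```

The conversion neither drops nor replaces the code point it does not recognize, so a later implementation that does know the character can still interpret the data.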

Stability

The Unicode Standard has a lot of room to grow, and a considerable number of scripts will be encoded in upcoming versions. This process is strictly additive: characters may be added and new character properties may be defined, but no characters will be removed or reinterpreted in incompatible ways. These stability guarantees make it possible to encode data in Unicode and expect that future implementations conforming to a later version of the Unicode Standard will interpret the data in the same way as implementations conforming to an earlier version.

Unicode and ISO/IEC 10646

The Unicode Standard is very closely aligned with the international standard ISO/IEC 10646 (also known as the Universal Character Set, or UCS, for short). Close cooperation and formal liaison between the committees has ensured that all additions to either standard are coordinated and kept in sync, so that the two standards maintain exactly the same character repertoire and encoding.

Version 7.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646:2012 plus Amd 1 and Amd 2. This code-for-code identity is true for all encoded characters in the two standards, including the East Asian (Han) ideographic characters. Subsequent versions of the Unicode Standard track subsequent editions and amendments to ISO/IEC 10646.

The Unicode encoding forms correspond exactly to forms of use and transformation formats also defined in ISO/IEC 10646. UTF-8 and UTF-16 are defined in annexes to ISO/IEC 10646, and UTF-32 corresponds to the four-octet form UCS-4 of ISO/IEC 10646.


For Further Information

Authoritative information can be found in the latest version of the Unicode Standard, published on the Unicode web site. The Unicode Standard Updates and Errata are also posted on this web site.

This web site also contains additional technical material and information on using the Unicode Standard.