Authors: Ken Whistler (firstname.lastname@example.org), Mark Davis (email@example.com)
This document clarifies a number of the terms used to describe character encodings, and where the different forms of Unicode fit in. The document is in an initial phase and has not gone through the editing process. We welcome review feedback and suggestions on the content.
Status of this document
This document is an unpublished, preliminary working draft. It is posted for general review. At its next meeting, the Unicode Technical Committee (UTC) may reject this document, review it for suitability to progress to draft status, and/or further amend it. Please mail any comments to the authors.
There are a number of inconsistencies and misunderstandings about just what Unicode is in the context of character encodings of all types. These have been highlighted by the recent discussions about the process of registering "UTF-16BE" and "UTF-16LE" as IANA charsets for the Internet, as well as editorial problems resulting from the attempt to treat UTF-16 and UTF-8 uniformly in the revision of the text for the Unicode Standard, Version 3.0.
The main body of this document consists of an attempt at detailed definition of several terms related to character encoding. This section merely clarifies acronyms and a few other subsidiary terms used in various contexts.
The character encoding model proposed here draws on the character architecture promoted by the IAB for use on the Internet. It also draws in part on the CDRA used by IBM for organizing and cataloging its own vendor-specific array of character encodings. The focus here is on clarifying how these models should be extended and clarified to cover the needs of the Unicode Standard and ISO/IEC 10646.
The IAB model makes three distinctions with respect to level: Coded Character Set (CCS), Character Encoding Scheme (CES), and Transfer Encoding Syntax (TES). However, to adequately cover the distinctions required for the character encoding model, five levels need to be defined. One of these, the repertoire, is implicit in the IAB model. The other is an additional level between the CCS and the CES.
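The five levels can be summarized as: the repertoire, the set of abstract characters to be encoded; the coded character set (CCS), a mapping from a set of abstract characters to the set of nonnegative integers; the character encoding form, a mapping of those integers to sequences of a particular datatype; the character encoding scheme (CES), a mapping of encoded characters into serialized byte sequences; and the transfer encoding syntax (TES), a reversible transform of encoded data.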
A repertoire is defined as the set of abstract characters to be encoded.
Repertoires are unordered sets that come in two types: fixed and open. For most character encodings, the repertoire is fixed (and often small). Once the repertoire is decided upon, it is never changed. Addition of a new abstract character to a given repertoire is conceived of as creating a new repertoire, which then will be given its own catalogue number, constituting a new object.
For the Unicode Standard, on the other hand, the repertoire is inherently open. Because Unicode is intended to be the universal encoding, any abstract character that ever could be encoded is potentially a member of the actual set to be encoded, whether we currently know of that character or not.
Microsoft, for its Windows character sets, also makes use of a limited notion of open repertoires. The repertoires for particular character sets are periodically extended by adding a handful of characters to an existing repertoire. This recently occurred when the EURO SIGN was added to the repertoire for a number of Windows character sets, for example.
The Unicode Standard versions its repertoire by publication of major and minor editions of the standard: 1.0, 1.1, 2.0, 2.1, 3.0, ... The repertoire for each version is defined by the enumeration of abstract characters included in that version. There was a major glitch between versions 1.0 and 1.1, occasioned by the merger with ISO/IEC 10646, but starting with version 1.1 and continuing forward indefinitely into future versions, no character once included is ever removed from the repertoire. (There are three-level versions of the Unicode character database, such as 2.1.5. These versions do not differ in character repertoire, but may amend character properties and behavior.)
ISO/IEC 10646 has a different mechanism for extending its repertoire. The 10646 repertoire is extended by a formal amendment process. As each individual amendment is balloted, approved, and published, it may constitute an extension to the 10646 repertoire, depending on the content of the amendment. The tricky part about keeping the repertoires of the Unicode Standard and of ISO/IEC 10646 in alignment is coordinating the publication of major versions of the Unicode Standard with publication of a well-defined list of amendments for 10646 (or a major revision and republication of 10646).
Repertoires are the things that in the IBM CDRA architecture get CS ("character set") values.
Unlike most character repertoires, Unicode/10646 is deliberately intended to be universal in coverage. What this implies in practice, given the complexity of many writing systems, is that nearly all implementations will implement some subset of the total repertoire, rather than all the characters.
Formal subset mechanisms are occasionally seen in implementations of some Asian character sets, where for example, the distinction between "Level 1 JIS" and "Level 2 JIS" support refers to particular parts of the repertoire of the JIS X 0208 kanji characters to be included in the implementation.
However, subsetting is a major formal aspect of ISO/IEC 10646-1. The standard includes a set of internal catalog numbers for named subsets, and further makes a distinction between subsets that are "fixed collections" and open collections that are defined by a range of code positions. (See Technical Corrigendum No. 2 to ISO/IEC 10646-1:1993(E) for details.) The collections that are defined by a range of code positions are themselves open subsets of the repertoire, since they could be extended at any time by an addition to the repertoire which happens to get encoded in a code position between the range limits which define such a collection.
The current TC304 effort to define multilingual European subsets (MES-1, MES-2, and MES-3) of ISO/IEC 10646-1 is a CEN effort to define three more subsets (each a fixed collection) that will, no doubt, at some point be added as named subsets in 10646.
For the Unicode Standard, subsets are nowhere formally defined. It is considered up to the implementation to define and support the subset of the universal repertoire that it wishes to interpret.
A coded character set is defined to be a mapping from a set of abstract characters to the set of nonnegative integers.
Note: Mathematically, this mapping may not be 1:1. For example, katakana ka is a single abstract character, but it has two representations in both Unicode and in SJIS. Also, the range of integers used for the mapping need not be contiguous.
An abstract character is defined to be in a coded character set if the coded character set maps from it to an integer. That integer is said to be the value (or coded value) of the abstract character.
Effectively, coded character sets are the basic object that both ISO and vendor character encoding committees produce. They relate a defined repertoire to nonnegative integers, which then can be used unambiguously to refer to particular abstract characters from the repertoire.
The Unicode 2.0 concept of the Unicode scalar value (cf. D28, page 3-7 of the Unicode Standard, Version 2.0) is explicitly this nonnegative integer used for mapping of the Unicode repertoire.
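As an illustrative sketch (not part of the model itself), the mapping from abstract characters to nonnegative integers can be inspected with Python's standard unicodedata module. The helper name coded_value is an assumption made for this example, and Unicode character names stand in for abstract characters. It looks up the two coded values that Unicode assigns to katakana ka, as mentioned in the note above.

```python
import unicodedata

def coded_value(character_name: str) -> int:
    """Return the integer that Unicode assigns to the named character."""
    # unicodedata.lookup() finds the character by its name;
    # ord() then gives the nonnegative integer (the coded value).
    return ord(unicodedata.lookup(character_name))

fullwidth_ka = coded_value("KATAKANA LETTER KA")
halfwidth_ka = coded_value("HALFWIDTH KATAKANA LETTER KA")

# Print both coded values in hexadecimal.
print(f"{fullwidth_ka:04X} {halfwidth_ka:04X}")
```

The two values differ, which is one way to see that the mapping need not be 1:1.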
A coded character set may also be known as a character encoding, a coded character repertoire, a character set definition, and a code page.
Coded character sets are the things that in the IBM CDRA architecture get CP ("code page") values. (Note that this use of the term code page is quite precise and limited. It should not be, but generally is, confused with the generic use of code page to refer to character encoding schemes. See below.)
In the JTC1/SC2 context, coded character sets also require the assignment of unique character names to each abstract character in the repertoire to be encoded. This practice is not generally followed in vendor coded character sets or the encodings produced by standards committees outside SC2, where the names provided for characters, if any, are often variable and annotative, rather than normative parts of the character encoding.
The main rationale for the SC2 practice of character naming was to provide a mechanism to unambiguously identify abstract characters across different repertoires given different mappings to integers in different coded character sets. Thus LATIN SMALL LETTER A WITH GRAVE would be seen as the same abstract character, even when it occurred in different repertoires and was assigned different integers, depending on the particular coded character set.
This functionality of ensuring character identity across different coded character sets (or "code pages") is handled in the IBM CDRA model instead by assigning a catalogue number, known as a GCGID (graphic character global identifier), to every abstract character used in any of the repertoires accounted for by the CDRA. Abstract characters that have the same GCGID in two different coded character sets are by definition the same character. Other vendors have made use of similar internal identifier systems for abstract characters.
The advent of Unicode/10646 has largely rendered such schemes obsolete. The identity of abstract characters in all other coded character sets is increasingly being defined by reference to Unicode/10646 itself. Part of the pressure to include every "character" from every existing coded character set into Unicode results from the desire by many to get rid of subsidiary mechanisms for tracking bits and pieces, odds and ends that aren't part of Unicode, and instead just make use of Unicode as the universal catalog of characters.
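The following sketch shows one abstract character, identified by its character name, receiving different integers in different coded character sets. It assumes Python's built-in codecs latin-1, cp437, and mac_roman as stand-ins for three single-byte coded character sets; for such sets the single encoded byte equals the coded value. The variable names are illustrative only.

```python
import unicodedata

# Identify the abstract character by name, using Unicode as the common catalog.
character = unicodedata.lookup("LATIN SMALL LETTER A WITH GRAVE")

# For a single-byte coded character set, the one encoded byte is the coded value.
coded_values = {
    charset: character.encode(charset)[0]
    for charset in ("latin-1", "cp437", "mac_roman")
}

for charset, value in coded_values.items():
    print(f"{charset}: 0x{value:02X}")
```

The character is the same in all three cases; only the integer assigned to it changes.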
The range of nonnegative integers used for the mapping of abstract characters defines a related concept of code space. Traditional boundaries for types of code spaces are closely tied to the encoding forms (see below), since the mappings of abstract characters to nonnegative integers are not done arbitrarily, but with particular encoding forms in mind. Examples of significant code spaces are 0..7F, 0..FF, 0..FFFF, 0..10FFFF, 0..7FFFFFFF, 0..FFFFFFFF.
Code spaces can also have fairly elaborate structures, depending on whether the range of integers is conceived of as continuous, or whether particular ranges of values are disallowed. Most complications again result from considerations of encoding form; when an encoding form specifies that the integers used in encoding are to be realized as sequences of octets, there are often constraints placed on the particular values that those octets may have, mostly to avoid control code values. Expressed back in terms of code space, this results in multiple ranges of integers that are disallowed for mapping a character repertoire. (See Ken Lunde's publications on Asian information processing to see two-dimensional diagrams of typical code spaces for Asian coded character sets.)
A character encoding form is a datatype-specific width specification of each of the integers used in a CCS.
Another way of putting this is that the encoding form enables a character representation as actual data in a computer.
A datatype is an integer occupying a certain binary width in a computer architecture, such as an 8-bit byte.
A character encoding form is defined to be a mapping from abstract characters to sequences of the same datatype. The sequences do not necessarily have the same length.
An abstract character is said to be in a character encoding form if the character encoding form maps it to a datatype sequence. That sequence is said to be the datatype-specified value of the abstract character, and also is known as an encoded character.
A character encoding form for a coded character set is defined to be a character encoding form for all of the abstract characters in the coded character set, and whose datatype-specified values can be algorithmically generated from the values of the coded character set.
Note: In many cases, there is only one character encoding form for a given coded character set. In some such cases only the character encoding form has been specified. This leaves the coded character set implicitly defined, based on an implicit relation between the datatype sequences and integers.
The encoding form may result in either fixed-width or variable-width collections of datatypes associated with abstract characters. The encoding form may involve an arbitrary functional mapping (reversible and algorithmic) of the integers of the CCS to a set of datatype sequences.
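A minimal sketch of such a reversible, algorithmic mapping is the UTF-16 encoding form, which maps each integer in 0..10FFFF (excluding the surrogate range) to one or two 16-bit values. The function names to_utf16 and from_utf16 are assumptions made for this illustration; the arithmetic follows the standard surrogate-pair definition.

```python
def to_utf16(scalar):
    """Map one scalar value to a list of one or two 16-bit code units."""
    if not (0 <= scalar <= 0x10FFFF) or (0xD800 <= scalar <= 0xDFFF):
        raise ValueError("not a valid scalar value")
    if scalar < 0x10000:
        return [scalar]                      # fits in a single 16-bit unit
    offset = scalar - 0x10000                # 20 bits remain to be split
    high = 0xD800 | (offset >> 10)           # top 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)          # bottom 10 bits -> low surrogate
    return [high, low]

def from_utf16(units):
    """Inverse mapping for the code units of a single character."""
    if len(units) == 1:
        return units[0]
    high, low = units
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([f"{u:04X}" for u in to_utf16(0x1D11E)])
```

Because from_utf16 undoes to_utf16 exactly, the mapping is reversible, and the result is variable-width: one datatype for some characters, two for others.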
Encoding forms come in various types. Some of them are exclusive to Unicode/10646, whereas others represent general patterns that are repeated over and over for hundreds of coded character sets. Here are some of the more important examples of encoding forms.
Examples of fixed-width encoding forms:
Examples of variable-width encoding forms:
Note that it is at the level of an encoding form that most APIs must be specified, since it is here that characters are actually bound to datatypes. This is the fundamental difference between UTF-8 and UTF-16, which cannot coexist amicably for the same textual API (at least without playing type-switching tricks in the API); otherwise they represent exactly the same coded character set. However, the byte order of the platform is generally not relevant at the API level; the same API can be compiled on platforms with any byte polarity, and will simply expect character data (as for any integral-based data) to be passed to the API in the byte polarity for that platform.
The encoding form also defines one of the fundamental relations that internationalized software cares about: how many datatypes there are for each character. This used to be expressed in terms of how many bytes each character was represented by. With the introduction of UCS-2, UCS-4, and UTF-16, with wider datatypes for Unicode and 10646, we must now generalize this to two pieces of information: a specification of the width of the fundamental datatype used for representing character data, and the number of datatypes used to represent each character.
UTF-8 provides a good example:
0x00..0x7F ==> 1 byte
0x80..0x7FF ==> 2 bytes
0x800..0xD7FF, 0xE000..0xFFFF ==> 3 bytes
0x10000..0x10FFFF ==> 4 bytes
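The relation between a coded value and the number of datatypes can be sketched in a few lines. The helper names utf8_length and code_unit_count are assumptions made for this example; the range boundaries follow the standard UTF-8 definition, and the second helper relies on Python's built-in utf-8, utf-16-be, and utf-32-be codecs to count datatypes of a given width.

```python
def utf8_length(scalar):
    """Number of bytes UTF-8 uses for one scalar value."""
    if scalar <= 0x7F:
        return 1
    if scalar <= 0x7FF:
        return 2
    if scalar <= 0xFFFF:
        return 3
    return 4

def code_unit_count(text, encoding, width_in_bytes):
    """Number of datatypes of the given width needed to represent text."""
    return len(text.encode(encoding)) // width_in_bytes

sample = "\U0001D11E"  # one character outside 0..FFFF
print(code_unit_count(sample, "utf-8", 1),      # 8-bit datatypes
      code_unit_count(sample, "utf-16-be", 2),  # 16-bit datatypes
      code_unit_count(sample, "utf-32-be", 4))  # 32-bit datatypes
```

The same character therefore needs a different number of datatypes depending on the encoding form, which is why both the datatype width and the count have to be stated.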
Examples of encoding schemes as applied to particular coded character sets:
A character encoding scheme is a mapping of a sequence of abstract characters from one or more CCSs, each using a defined encoding form, into serialized byte sequences.
Character encoding schemes are the things that in the IAB architecture get IANA charset identifiers. The important thing, from the IANA charset point of view, is that a sequence of encoded characters must be unambiguously mapped onto a sequence of bytes by the charset. The charset (= CES) must be specified in all instances, as in Internet protocols, where textual content is treated as an ordered sequence of bytes, and where the textual content must be reconstructible from that sequence of bytes.
Character encoding schemes are the things that in the IBM CDRA architecture get CCSID (coded character set identifier) values.
A character encoding scheme may also be known as a charset, a character set, or a code page (broadly construed).
Character encoding schemes are also relevant to the issue of cross-platform persistent data involving datatypes wider than a byte, where byte-swapping may be required to put data into the byte polarity canonical for a particular platform.
Most fixed-width byte-oriented encoding forms have a trivial mapping into a CES: each 7-bit or 8-bit quantity maps to a byte of the same value.
Most mixed-width byte-oriented encoding forms also simply serialize the sequence of CC-data-elements to bytes. UTF-8, since it is already a byte-oriented encoding form, follows this pattern. UTF-16, on the other hand, which involves 16-bit quantities, must specify byte order for the byte serialization. This is the difference between UTF-16BE, where the two bytes of the 16-bit quantity are serialized in big-endian order, and UTF-16LE, where they are serialized in little-endian order.
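The step from encoding form to encoding scheme can be sketched as follows. The function name serialize_utf16 is an assumption made for this example; it takes a sequence of 16-bit quantities (the encoding form) and a byte order, and produces the serialized bytes (the encoding scheme) using Python's int.to_bytes.

```python
def serialize_utf16(code_units, byteorder):
    """Serialize 16-bit code units to bytes; byteorder is 'big' or 'little'."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in code_units)

# The same sequence of 16-bit quantities in both cases.
code_units = [0x0041, 0xD834, 0xDD1E]

utf16be = serialize_utf16(code_units, "big")
utf16le = serialize_utf16(code_units, "little")

print(utf16be.hex(" "))
print(utf16le.hex(" "))
```

The 16-bit quantities are identical in both cases; only the order of the two bytes within each quantity differs.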
Character encoding schemes may also partake of some of the features of transfer encoding syntaxes proper (see below). Thus both UTF-8 and UTF-7 are designed to be byte-oriented in their datatype and to avoid control code values for transmission and other protocols. UTF-7 goes further in incorporating some of the features of Base64 to avoid a number of byte values in the ASCII range. On the other hand, the Unicode-specific compression schemes, which convert directly from Unicode data in a specified encoding form to a sequence of bytes that compresses the textual data, can also be conceived of as character encoding schemes.
The important differences between a CES and an Encoding Form are:
A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more CES's.
Note: A more appropriate term for this might be Transfer Encoding Form, but Transfer Encoding Syntax already has widespread usage in the Internet community.
Typically TESs are engineered either to:
The Internet Content-Transfer-Encoding tags "7bit" and "8bit" are special cases. These are data width specifications relevant basically to mail protocols and which appear to predate true TESs like quoted-printable. Encountering a "7bit" tag doesn't imply any actual transform of data; it merely indicates that the charset of the data can be represented in 7 bits and will pass 7-bit channels. It is really an indication of the encoding form. In contrast, quoted-printable actually does a conversion of various characters (including some ASCII) to forms like "=2D", "=20", etc., and should be reversed on receipt to regenerate legible text in the designated character encoding scheme.
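A small sketch of quoted-printable acting as a reversible transform, using Python's standard quopri module. The sample text and the choice of latin-1 as the character encoding scheme are assumptions made for this example.

```python
import quopri

# Text serialized to bytes by a character encoding scheme (here latin-1).
original_bytes = "caf\u00e9".encode("latin-1")

# The TES converts the 8-bit byte to an "=XX" form so only 7-bit values remain.
transfer_encoded = quopri.encodestring(original_bytes)

# Reversing the transform on receipt regenerates the original byte sequence,
# which can then be interpreted in the designated character encoding scheme.
recovered_bytes = quopri.decodestring(transfer_encoded)

print(transfer_encoded)
print(recovered_bytes.decode("latin-1"))
```

The transfer encoding operates on bytes and knows nothing about characters; the charset is still needed to turn the recovered bytes back into text.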