In a new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes in characters that were encoded in a previous version of the standard. There are, however, limitations imposed by the consortium on the changes that can be made. These have important implications for implementors in anticipating the kinds of changes in future versions of the standard.
Once a character is encoded, it will not be moved or removed. This ensures
that implementors can always depend on each version of the Unicode Standard
being a superset of the previous version.
The character names are used to distinguish characters from other
characters, and do not always express the full meaning of the character.
They are designed to be used programmatically, and thus must be stable. In
some cases the original name chosen to represent the character is inaccurate
in one way or another. Any such problems can be dealt with by adding
annotations to the character. Organizations may also wish to produce
translated names for the characters, to make the information conveyed by the
name accessible to non-English speakers.
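As a small illustration of names used as stable programmatic identifiers, the sketch below round-trips a character through its name with Python's standard `unicodedata` module. The helper names `char_from_name` and `name_round_trips` are hypothetical, chosen only for this example. U+01A2 is used because its name, LATIN CAPITAL LETTER OI, is a commonly cited case of a name that does not describe the letter well, yet still works as an identifier.

```python
import unicodedata


def char_from_name(name):
    """Return the character that carries the given Unicode character name."""
    return unicodedata.lookup(name)


def name_round_trips(ch):
    """Return True if looking up the character's name yields the same character."""
    return char_from_name(unicodedata.name(ch)) == ch


# The name is an identifier, not a description: it stays the same even
# where it is considered inaccurate.
print(unicodedata.name("\u01a2"))
print(name_round_trips("\u01a2"))
```

Code that stores names like this in source files or configuration relies on the names never changing between versions.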
This means that a string that contains only characters from version X of
the Unicode Standard, once put into a normalization form, will remain in
that normalization form in any future version of the Unicode Standard.
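The guarantee above can be spot-checked in Python, whose `unicodedata` module exposes both the current character database and a frozen Unicode 3.2.0 database as `unicodedata.ucd_3_2_0`. The following is a minimal sketch, not a conformance test: the function name `stable_across_versions` is hypothetical, and the check is only meaningful for text whose characters were all assigned in Unicode 3.2.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def stable_across_versions(text, form):
    """Normalize under Unicode 3.2.0 data, then check that the result is
    unchanged when normalized again under the current data."""
    old = unicodedata.ucd_3_2_0.normalize(form, text)
    return unicodedata.normalize(form, old) == old


# Assumption: every character in this sample was already assigned in 3.2.
sample = "e\u0301\u0323 \ufb01 \u00c5"
for form in FORMS:
    print(form, stable_across_versions(sample, form))
```

An implementation that caches normalized strings depends on exactly this property: data normalized under an older version does not need to be renormalized after an upgrade.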
Further description of these is provided in UnicodeData.html.
- The General Category values will not be further subdivided.
- The Bidi Category values will not be further subdivided.
- Combining classes are limited to the values 0 to 255.
- All characters other than those of General Category M* have the combining class 0.
- Canonical and Compatibility mappings are always in canonical order, and the resulting recursive decomposition will also be in canonical order.
- Canonical mappings are always limited either to a single value, or to a pair. The second character in the pair cannot itself have a canonical mapping.
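Several of the constraints listed above can be checked mechanically against a character database. The sketch below scans the data shipped with Python's `unicodedata` module. The function name `find_violations` is hypothetical, and the code assumes that `unicodedata.decomposition` marks compatibility mappings with a leading tag such as `<compat>` and leaves canonical mappings untagged.

```python
import sys
import unicodedata


def is_canonical(mapping):
    """A non-empty mapping without a leading <tag> is a canonical mapping."""
    return bool(mapping) and not mapping.startswith("<")


def find_violations(limit=sys.maxunicode + 1):
    """Return (code point, reason) pairs for code points below `limit`
    that break one of the constraints checked here."""
    violations = []
    for cp in range(limit):
        ch = chr(cp)
        ccc = unicodedata.combining(ch)
        if not 0 <= ccc <= 255:
            violations.append((cp, "combining class outside 0..255"))
        if ccc != 0 and not unicodedata.category(ch).startswith("M"):
            violations.append((cp, "non-mark with nonzero combining class"))
        mapping = unicodedata.decomposition(ch)
        if is_canonical(mapping):
            parts = mapping.split()
            if len(parts) > 2:
                violations.append((cp, "canonical mapping longer than a pair"))
            elif len(parts) == 2:
                second = chr(int(parts[1], 16))
                if is_canonical(unicodedata.decomposition(second)):
                    violations.append((cp, "second character has a canonical mapping"))
    return violations


print(len(find_violations()))
```

The scan does not cover the canonical-order constraint or the subdivision of General Category and Bidi Category values, which would need additional logic.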
The consortium will endeavor to keep these properties as stable as possible, but some circumstances may arise that require changing them. In particular, as Unicode encodes less well-documented scripts (such as those for minority languages in Thailand), the exact character properties may not be known at the time the script is encoded.
- General Category
- Case Mappings
- Bidi Properties
- The type of compatibility decomposition (e.g. <font> vs. <compat>)
- etc.