
 L2/01-081

{Proposed Draft} Unicode Technical Report #23

SURVEY OF UNICODE CHARACTER PROPERTIES AND GUIDELINES

Revision 0d10
Author Asmus Freytag (asmus@unicode.org)
Date 2001-01-29
This Version http://www.unicode.org/unicode/reports/tr23-0d10.html
Previous Version none
Latest Version none

Summary

This report presents a survey of the character properties defined in the Unicode Standard as well as guidelines to their usage.

Status of this document

This is the tenth working draft of the author. It is the result of the discussion of a prior draft at UTC#85 as well as merging with chapter 4 from Unicode 3.0 (but without the long listings).

[Editorial notes for the benefit of reviewers are indicated like this.]

Contents

1 Scope
2 Overview
2.1 Origin of Character Properties
2.2 Character Behavior in Context
2.3 Relation of Character Properties to Algorithms
2.4 Normative Properties
2.5 Informative Properties
2.6 Issues
3 Definitions
4 Conformance
4.1 Conformance Requirements
4.2 Overriding Properties and Higher-level Protocols
4.3 Normative and Informative Properties
5 Table of Character Properties
6 Notes on specific Properties
7 Updating Character Properties and Extending the Standard
7.1 Updating
7.2 Guarantees
8 Special Property Values
8.1 N/A
8.2 Default Values
8.3 Undetermined Property Values
8.4 Preliminary Property Assignments
9 Working with properties
9.1 Subsection 2.1
9.2 Subsection 2.2
9.3 Subsection 2.1
9.4 Subsection 2.2
10 Data Management and Distribution
10.1 Versions
10.2 File Syntax Conventions
10.3 Beta
11. References
Acknowledgements
Revisions

1. Scope

This survey collects summary information about all Unicode character properties, their definition, and their usage in a single location. It also discusses common aspects of character properties. This survey is not intended to supersede chapter 4 in the book, nor the existing body of technical reports that provide detailed descriptions for particular character properties. Instead, it presents a capsule summary only and references the corresponding technical report for the full details.

This report specifically covers formal character properties, which are those attributes of characters that are specified according to the definitions set forth in this report. 

2. Overview

2.1 Origin of Character Properties

The Unicode Standard views character semantics as inherent to the definition of a character, and conformant processes are required to take these semantics into account when interpreting characters. The assignment of character semantics for the Unicode Standard is based on character behavior. For other character set standards, it is left to the implementer, or to unrelated secondary standards, to assign character semantics to characters. In contrast, the Unicode Standard supplies a rich set of character attributes and properties for each character contained in it. Many properties are specified in relation to processes or algorithms that interpret them, in order to implement the discovered character behavior.

2.2 Character Behavior in Context

The interpretation of some properties (such as the case of a character) is independent of context, whereas the interpretation of others (such as directionality) is applicable to a character sequence as a whole, rather than to the individual characters that compose the sequence.

Other examples that require context include the classification of neutrals in script assignments or title casing. The line breaking rules of TR#14 involve character pairs and triples, and in certain cases, longer sequences. The glyph(s) defined by a combining character sequence are the result of contextual analysis in the display shaping engine. Isolated character properties typically only tell part of the story.

2.3 Relation of Character Properties to Algorithms

When modeling character behavior with computer processes, formal character properties are assigned in order to achieve the expected results. Such modeling depends heavily on algorithms. In some cases, a given character property is specified in close conjunction with a detailed specification of an algorithm. In other cases, algorithms are implied but not specified, or several algorithms can make use of the same general character property. The last case may require occasional differences in character property assignment to make all algorithms work correctly. This can usually be achieved by overrides.

While it may be tempting to assign somewhat arbitrary property values, as long as the algorithm happens to produce the expected results, proceeding in this way hides the nature of the character and limits the re-use of character properties by related processes. Instead of tweaking the properties simply to make a particular algorithm easier, careful attention needs to be paid to the underlying essential linguistic identity of the character. However, not all aspects of a character's identity are relevant in all circumstances, and some characters can be used in many different ways, depending on context or circumstance. Therefore the formal character properties alone are not sufficient to describe the complete range of desirable or acceptable behaviors.

2.4 Normative Properties

As specified in Chapter 3, Conformance, the Unicode Standard [UNICODE] defines both normative and informative properties.

Normative Properties. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property must follow the specifications of the standard for that property to be conformant. The term normative when applied to a character property does not mean that the value of the property will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes.

By making a property normative, the Unicode Standard guarantees that conformant implementations can rely on the fact that other conformant implementations will interpret the character in the same way. This is most useful for those properties where the Unicode Standard provides precise rules for the interpretation of characters based on their properties. An example is the set of bidirectional properties and their use by the bidirectional algorithm. For some character properties, for example the general category, the Unicode Standard does not define what model of processing it is intended to support or what the required consequences are of a character being, for example, "Letter Other" as opposed to "Symbol Other". In the absence of such a definition, the only effect of conformance that can be tested in a strict manner is whether a character property library returns the correct value to its caller.

Note: one trivial, but important instance of conformant implementation is runtime access to a character property database. For normative properties, conformant implementations guarantee that the returned values match the values defined by the Unicode Consortium.
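
As an illustration of such runtime access, the following sketch queries a few properties through Python's standard unicodedata module. The choice of Python and of this module is an assumption made for illustration only; the module reports the values of whichever version of the Unicode Character Database ships with the interpreter, so individual values may differ from those of the version discussed in this report.

```python
import unicodedata

def describe(ch):
    """Return a few property values for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidirectional_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
    }

if __name__ == "__main__":
    # The database version is part of what a conformance claim refers to.
    print("Character database version:", unicodedata.unidata_version)
    for ch in ("A", "(", "\u0301"):
        print(describe(ch))
```

A library of this kind is conformant for a normative property only in the narrow sense described above: the values it returns match those published for the stated version.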

2.5 Informative Properties

Informative Properties. An informative character property is strongly recommended, but a conformant implementation is free to use or change such values as it may require, while still remaining conformant to the standard. Particular implementations may choose to override the properties that are not normative. In that case, the implementer has the option of establishing a protocol to convey that information.

Properties may be informative for two main reasons.

  1. The nature of the property or the precise set of characters to which it applies are not well known and it therefore is too early to assign a normative property. Even if there was a precise description of how to interpret such a property, the fact that it is subject to a (planned) revision makes it less interesting to communicating implementations to rely on the specified behavior.
  2. Existing implementations show a range of behaviors for the same character, many or all of which may be equally useful choices on the part of their designers. Assigning a normative property would imply an unwarranted restriction on existing and established practice.

3. Definitions

PD1. Character Property
A character property defines a set of values and a mapping from each Unicode code point to one of the values of the set.
PD2. Property Value
One of the set of values associated with a character property.

For example, the [East Asian Width] property has the possible values "Narrow", "Neutral", "Wide", "Ambiguous" and "Unassigned".
PD3. Universal (Required) Property
A universal property applies to all Unicode characters. A universal property does not have a 'does not apply' value.
PD4. Limited Property
A property which is not defined or does not apply to all characters is called a limited property. It applies to only a subset of Unicode characters, usually one script, or a related family of scripts.
For example, case and case mapping information is not needed for unicameral scripts. Such information can in principle be left 'undefined' for the characters to which it does not apply. Limited properties are often implemented by listing property values for all characters and giving a special 'does-not-apply' value to all characters to which the limited property does not apply.
PD4. Enumerated Property
A property with a fixed set of values. This is sometimes also known as a partition.
As characters are added to the Unicode Standard, the set of values may need to be extended in the future, but it is advantageous to think of enumerated properties as having a fixed set of possible values.
PD5. Closed Enumeration
An enumerated property for which the set of values is closed (i.e. it may not be extended for future versions of the Unicode Standard).
Note: Currently, the General Category is the only closed enumeration, other than Boolean properties.
PD6. Single Valued (Boolean) Property
A closed enumerated property whose set of values is limited to 'true', 'false', and possibly 'undetermined'.
Essentially the presence or absence of the property is the important information.
PD7. Integral Property
An integral property can take on any integer value. An example is the decimal digit value property. There is no implied limit to the number of possible distinct values for the property, short of the limitations of representing integers in computers.
PD8. General Numeric Property
A general numeric property can take on any integer, real, or complex value. An example is the numeric value property. There is no implied limit to the number of possible distinct values for the property, short of the limitations of representing integers, real or complex numbers in computers.
PD9. Normative Property
A normative property has conformance implications, see 4. Conformance. A normative character property must be paired with a precise description of how to interpret characters with each normative property value in a conformant way.
Note: A normative process that depends in a normative and testable way on a property causes the property to be normative. For example, the interpretation of the bidirectional class is precisely defined in [Bidirectional Algorithm].
If a process does not interpret a given character, it may remain unaware of its properties, but it is recommended that processes use carefully chosen default values for characters that they don't handle.
PD10. Informative Property
An informative property is provided as helpful information to implementers. There are no requirements to implementations of the Unicode Standard.
Note: Informative properties capture expert implementation experience and their use is strongly recommended by the Consortium.
PD11. Simple Property
A property that applies to a character in isolation.
PD12. Character Behavior
A property that applies to a character in the context of a longer character sequence.
PD13. Stable Property
A property is stable with respect to a particular algorithm or process, if changes in the assignment of property values produce no changes in the outcome of the process or algorithm.
For example, while the absolute values of the canonical combining classes are not guaranteed to be the same between versions of the Unicode Standard, their relative values will be maintained. As a result, they are stable with respect to the Normalization Forms as defined in [Normalization].
PD14. Immutable Property
A property whose values, once assigned to a character, are fixed and will not be changed.   

Examples of immutable, or fixed, properties are the code position and name of each Unicode character.
PD15. Overridable Property
A property whose values may be overridden by higher-level protocols.
PD16. Default Value
Value of a property to be used when encountering unassigned or unsupported characters. There may be more than one default value per property.
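
The definitions above can be read as a small data model. The following sketch is one possible rendering of PD1 (a mapping from code points to values), PD2 (the set of values), PD15 (overrides) and PD16 (default value). The class PropertyTable, its method names, and the sample assignments are hypothetical and serve only as an illustration; they do not correspond to any interface defined by the Unicode Standard.

```python
class PropertyTable:
    """Hypothetical model of a character property with a default value."""

    def __init__(self, name, values, default):
        self.name = name
        self.values = frozenset(values)   # the set of property values (PD2)
        self._check(default)
        self.default = default            # used for unlisted code points (PD16)
        self._assigned = {}               # code point -> value (PD1)
        self._overrides = {}              # set by a higher-level protocol (PD15)

    def _check(self, value):
        if value not in self.values:
            raise ValueError(f"{value!r} is not a value of {self.name}")

    def assign(self, first, last, value):
        """Assign one value to the inclusive code point range first..last."""
        self._check(value)
        for cp in range(first, last + 1):
            self._assigned[cp] = value

    def override(self, cp, value):
        """Record an override supplied by a higher-level protocol."""
        self._check(value)
        self._overrides[cp] = value

    def lookup(self, cp):
        """Return the effective value: override, then assignment, then default."""
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a Unicode code point")
        if cp in self._overrides:
            return self._overrides[cp]
        return self._assigned.get(cp, self.default)

if __name__ == "__main__":
    # Illustrative assignments only, not a complete or authoritative table.
    width = PropertyTable(
        "East Asian Width",
        {"Narrow", "Neutral", "Wide", "Ambiguous", "Unassigned"},
        default="Unassigned",
    )
    width.assign(0x4E00, 0x4E09, "Wide")
    print(width.lookup(0x4E00), width.lookup(0x0378))
```

Keeping overrides in a separate layer leaves the published assignments untouched, which makes it easy to tell which values come from the standard and which from a protocol.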

4. Conformance related considerations

4.1 Conformance Requirements

In Chapter 3, Conformance, The Unicode Standard [Unicode] states this conformance requirement:

Interpretation

C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation.

This conformance rule relies on two definitions in the Standard:

D1 Normative properties and behavior: The following are normative character properties and normative behavior of the Unicode Standard:

  1. Simple properties
  2. Character combination
  3. Canonical decomposition
  4. Compatibility decomposition
  5. Surrogate property
  6. Canonical ordering behavior
  7. Bidirectional behavior, as interpreted according to the Unicode bidirectional algorithm
  8. Conjoining jamo behavior, as interpreted according to Section 3.11, Conjoining Jamo Behavior

D2 Character semantics: The semantics of a character are established by its character name, representative glyph, and normative properties and behavior.

For a list of normative properties, see Table 5-1.

4.2 Overriding properties via Higher-level Protocols

The Unicode Standard makes these specific statements about overriding properties:

Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics.

• The character combination properties and the canonical ordering behavior cannot be overridden by higher-level protocols.

• Particular implementations may choose to override all  properties that are not normative. 

For interpreting directionality, higher-level protocols may:

Override the number handling to use information provided by a broader context. For example, information from other paragraphs in a document could be used to conclude that the document was fundamentally Arabic and that EN should generally be converted to AN.

Replace, supplement, or override the directional overrides or embedding codes. This task is accomplished by providing information via additional stylesheet or markup information about the embedding level or character direction. The interpretation of such information must always be defined by reference to the behavior of the equivalent explicit codes as given in the algorithm.

Override the bidirectional character types assigned to control codes to match the interpretation of the control codes within the protocol. (See also Section 13.1, Control Codes.)

Remap the number shapes to match those of another set. For example, remap the Arabic number shapes to have the same appearance as the European numbers.

 

5. Table of Character Properties

The following table attempts to list all character properties defined by the Unicode Standard and associated Unicode Standard Annexes, Unicode Technical Standards, or Unicode Technical Reports. Full listings for all Unicode properties are provided in the Unicode Character Database.

[ TBD: add the information specified in the PropList file into the table below.]

Table 5-1. Overview of Character Properties

Name Where Specified N/I Data Type Notes
Alias [NamesList] I Text see [NamesList-Format]
Alphabetic [Unicode], Section 4.10 I Boolean  
Arabic and Syriac Shaping Class [ArabicShaping] N Enum  
Bidirectional Class [UnicodeData], Field 4 N Enum see [Bidirectional Algorithm]
Canonical Decomposition [UnicodeData], Field 5 N Code Sequence no prefix, see also Compatibility Mapping
Case [UnicodeData], Field 2 = Lu | Ll | Lt N Enum Derived: General Category = Lu | Ll | Lt
Case Folding [CaseFolding] I Code point see [Case Mapping]
Case Mapping (Lower) [UnicodeData], Field 13 N Code point only 1:1 mappings that are locale independent
Case Mapping (Upper) [UnicodeData], Field 12 N Code point only 1:1 mappings that are locale independent
Case Mapping (Title) [UnicodeData], Field 14 N Code point only 1:1 mappings that are locale independent 
Case Mapping (Special) [SpecialCasing] I Code sequence see [Case Mapping]
Code point [UnicodeData], Field 0 N Code point  
Combining Class [UnicodeData], Field 3 N 0..255  
Comments [NamesList] I Text see [NamesList-Format]
Compatibility Mapping [UnicodeData], Field 5 I Prefix + Code Sequence prefix indicating mapping type
Control [UnicodeData], Field 2 = Cc N Boolean Derived: General Category = Cc
Cross Mappings Various I   see [Character Mapping Tables]
Dashes [Unicode], Table 6-2 N Boolean 207B, 208B, 2212 | General Category = Pd 
Default Sort Weight Various      See [Collation]
Digit, Decimal [UnicodeData], Field 2 = Nd N Boolean Derived: General Category = Nd
Digit, Decimal Value [UnicodeData], Field 6 N 0..9  
Digit, Value [UnicodeData], Field 7 N Integer  
East Asian Width [EastAsianWidth] I Enum see [East Asian Width]
General Category [UnicodeData], Field 2 N/I Enum see [UnicodeData-Format]
Identifier Extend [UnicodeData], Field 2 = Mn | Mc | Nd | Pc | Cf I Boolean Derived: General Category = Mn | Mc | Nd | Pc | Cf
Identifier Start [UnicodeData], Field 2 = Lu | Ll | Lt | Lm | Lo | Nl I Boolean Derived: General Category = Lu | Ll | Lt | Lm | Lo | Nl
Ideographic Derived I Boolean  
ISO comments [UnicodeData], Field 11 N Text  
Jamo Short Name [Jamo] N Text  
Letter Derived I Boolean see [Unicode], Section 4.10
Line Breaking Property [LineBreak] I Enum  
Mathematical Property [Unicode], Section 4.9 I Boolean  
Mirrored [UnicodeData], Field 9 N Boolean  
Name [UnicodeData], Field 1 N Text  
Numeric Value [UnicodeData], Field 8 I Real [Complex]  
Ideographs, Primary Numeric [Unicode], Table 4-7 I Integer  
Ideographs, Accounting Numbers [Unicode], Table 4-8 I Integer  
Private Use [UnicodeData], Field 2 = Co N Boolean Derived: General Category = Co
Related characters [NamesList] I Text  
Script [ScriptNames] I [Text / Enum?] see [Script Names]
Space [UnicodeData], Field 2 = Zs N Boolean Derived: General Category = Zs
Special [Unicode], Table 3-9 N various see chapter 13 of [Unicode]
Surrogates [UnicodeData], Field 2 = Cs N Text Derived: General Category = Cs
Unicode 1.0 Name [UnicodeData], Field 10 N Text  

Notes on the table

Code points and code sequences

Code points can range from 0000 to 10FFFF, and code sequences are sequences of code points. There is no pre-determined maximum length for a code sequence.

Derived Properties

Character properties that are noted as derived can be implied from other character properties. For practical and historical reasons, the [UnicodeData] file is considered the primary source of information in determining the direction of such 'derivation'. The notation Field = X | Y means that the value of the field is either X or Y for the property to be true.
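
The Field = X | Y notation translates directly into a set membership test on the General Category. The sketch below derives three of the Boolean properties from the table in this way. It assumes Python's standard unicodedata module as the source of General Category values; the function names are hypothetical.

```python
import unicodedata

# Value sets copied from the "Derived" entries of the table above.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}

def is_identifier_start(ch):
    """Derived: General Category = Lu | Ll | Lt | Lm | Lo | Nl."""
    return unicodedata.category(ch) in IDENTIFIER_START

def is_identifier_extend(ch):
    """Derived: General Category = Mn | Mc | Nd | Pc | Cf."""
    return unicodedata.category(ch) in IDENTIFIER_EXTEND

def is_control(ch):
    """Derived: General Category = Cc."""
    return unicodedata.category(ch) == "Cc"

if __name__ == "__main__":
    for ch in ("x", "1", "_", "\n"):
        print(repr(ch), is_identifier_start(ch),
              is_identifier_extend(ch), is_control(ch))
```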

6. Notes on Properties

Disclaimer

The content of all character property tables has been verified as far as possible by the Unicode Consortium. However, the Unicode Consortium does not guarantee that the tables printed in this volume or on the CD-ROM are correct in every detail, and it is not responsible for errors that may occur either in the character property tables or in software that implements these tables. The contents of all the tables in this chapter may be superseded or augmented by information on the Unicode Web site.

Cross correlations between properties: Some properties are maintained in or may be inferred from more than one location. As a result, several correlations between properties are true by design:

Decimal Digit Value present :=: General Category = Nd
Decimal Digit Value present :=: Decimal Digit Value = Numeric Value
Digit Value present :=: Digit Value = Numeric Value
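
Correlations of this kind lend themselves to mechanical checking. The following sketch tests the three correlations for a single character, assuming Python's standard unicodedata module as the property source; the function name is hypothetical. Whether the correlations hold for every character depends on the database version bundled with the interpreter.

```python
import unicodedata

def numeric_correlation_problems(ch):
    """Return descriptions of violated correlations (empty list if none)."""
    decimal = unicodedata.decimal(ch, None)   # decimal digit value, if present
    digit = unicodedata.digit(ch, None)       # digit value, if present
    numeric = unicodedata.numeric(ch, None)   # numeric value, if present
    is_nd = unicodedata.category(ch) == "Nd"

    problems = []
    if (decimal is not None) != is_nd:
        problems.append("decimal digit value present <-> General Category = Nd")
    if decimal is not None and decimal != numeric:
        problems.append("decimal digit value = numeric value")
    if digit is not None and digit != numeric:
        problems.append("digit value = numeric value")
    return problems

if __name__ == "__main__":
    for ch in ("7", "\u0663", "\u00b2", "\u00bd", "A"):
        print(f"U+{ord(ch):04X}", numeric_correlation_problems(ch))
```

The same function can be applied to every code point in turn to audit a whole database.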

Issues with overloaded enumerations: If an enumerated property is cobbled together from non-overlapping Boolean properties, the result may be difficult to apply or extend. The same applies if one attempts to use enumerated properties to convey distinctions they were not designed for. One either needs to keep subdividing and subdividing them into smaller subsets, using more values (something that cannot be done for a closed partition), or one must define alternative properties.

Issues with Boolean properties: If multiple Boolean properties are used to capture what are in effect mutually exclusive assignments of an enumerated value, an essential fact, the mutual exclusiveness, can no longer be expressed in the property itself.

Consistency of Properties: The Unicode Standard is the product of many compromises. It has to strike a balance between uniformity of treatment for similar characters and compatibility with existing practice for characters inherited from legacy encodings. Because of this balancing act, one can expect a certain number of anomalies in character properties. For example, some pairs of characters might have been treated as canonical equivalents but are left unequivalent for compatibility with legacy differences. This situation pertains to U+00B5 µ MICRO SIGN (cf. U+03BC μ GREEK SMALL LETTER MU) as well as to certain Korean jamo.
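
The MICRO SIGN case can be observed directly by comparing decompositions. The sketch below assumes Python's standard unicodedata module; the two helper functions are hypothetical names introduced for illustration.

```python
import unicodedata

MICRO_SIGN = "\u00b5"
GREEK_SMALL_MU = "\u03bc"

def canonically_equivalent(a, b):
    """True if both strings have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """True if both strings have the same compatibility decomposition."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

if __name__ == "__main__":
    # The decomposition field carries a <compat> prefix, so the mapping
    # is a compatibility mapping rather than a canonical one.
    print(unicodedata.decomposition(MICRO_SIGN))
    print(canonically_equivalent(MICRO_SIGN, GREEK_SMALL_MU))
    print(compatibility_equivalent(MICRO_SIGN, GREEK_SMALL_MU))
```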

Compatibility and Reanalysis: Some characters might have had existing properties differing in some ways from those assigned in this standard, but whose properties are left as is for compatibility with existing practice. This situation can be seen with the halfwidth voicing marks for Japanese (U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK and U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK), which might have been better analyzed as spacing combining marks, and with the conjoining Hangul jamo, which might have been better analyzed as an initial base character, followed by formally combining medial and final characters. In the interest of efficiency and uniformity in algorithms, implementations may take advantage of such reanalyses of character properties, as long as the results they produce do not overtly conflict with those specified by the normative properties of this standard.

6.1 Case—Normative

Case is a normative property of characters in certain alphabets whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter.

Because of the inclusion of certain composite characters for compatibility, such as U+01F1 LATIN CAPITAL LETTER DZ, a third case, called titlecase, is used where the first character of a word must be capitalized. An example of such a character is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z. The three case forms are UPPERCASE, Titlecase, lowercase.  

For those scripts that have case (Latin, Greek, Cyrillic, Armenian, and archaic Georgian), the case of a Unicode character can usually be obtained from the character’s name. This statement is true for only these five scripts. Uppercase characters typically contain the word capital in their names. Lowercase characters typically contain the word small. The word small in the names of characters from scripts other than the five just listed has nothing to do with case. (Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian. See Section 7.5, Georgian.)

Case Mappings. The default case mapping is between the small character and the capital character. The Unicode Standard case mapping tables, which are informative, are on the CD-ROM. Exceptions to the normal casing rules can be found in the data file SpecialCasing.txt. For more information on case mappings, see Section 5.18, Case Mappings, and Unicode Technical Report #21, "Case Mappings," on the CD-ROM or the up-to-date version on the Unicode Web site.
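
The three case forms and the difference between 1:1 mappings and special casing can be illustrated with the DZ digraph characters mentioned above. The sketch assumes Python 3 string methods, which apply the case mappings of the interpreter's bundled character database, including multi-character results of the kind listed in SpecialCasing.txt; the helper name case_forms is hypothetical.

```python
import unicodedata

DZ_UPPER = "\u01f1"   # LATIN CAPITAL LETTER DZ
DZ_TITLE = "\u01f2"   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
DZ_LOWER = "\u01f3"   # LATIN SMALL LETTER DZ

def case_forms(ch):
    """Return the (uppercase, titlecase, lowercase) forms of one character."""
    return ch.upper(), ch.title(), ch.lower()

if __name__ == "__main__":
    for ch in (DZ_UPPER, DZ_TITLE, DZ_LOWER):
        forms = case_forms(ch)
        print(unicodedata.category(ch), [f"U+{ord(c):04X}" for c in forms])
    # A mapping that is not 1:1: one character becomes two.
    print("\u00df".upper())
```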

6.2 Combining Classes—Normative

Each combining character has a normative canonical combining class. This class is used with the canonical ordering algorithm to determine which combining characters interact typographically and to determine how the canonical ordering of sequences of combining characters takes place. Class zero combining characters act like base letters for the purpose of determining canonical order. Combining characters with non-zero classes participate in reordering for the purpose of determining the canonical form of sequences of characters. (See Section 3.10, Canonical Ordering Behavior, for a description of the algorithm.)
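
A minimal sketch of the reordering step follows. It assumes Python's standard unicodedata module for the combining classes and operates on text that is already decomposed; the function name canonical_order is hypothetical, and the sketch is not a full normalization implementation.

```python
import unicodedata

def canonical_order(text):
    """Sort each run of non-zero-class characters by combining class.

    Class zero characters act as barriers. The sort is stable, so marks
    with equal classes keep their original relative order.
    """
    result = []
    run = []   # pending characters with a non-zero combining class
    for ch in text:
        if unicodedata.combining(ch) == 0:
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)

if __name__ == "__main__":
    # U+0301 has class 230, U+0323 has class 220, so they are swapped.
    sample = "a\u0301\u0323"
    ordered = canonical_order(sample)
    print([f"U+{ord(c):04X}" for c in ordered])
    print(ordered == unicodedata.normalize("NFD", sample))
```

Only the relative order of the class values matters to this step, which is why the classes are stable with respect to normalization even if their absolute values were to change.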

The list of combining characters and their canonical combining class appears in Table 4-3. Most combining characters are nonspacing. The spacing, class zero, combining characters are so noted.

Table 4-3. Combining Classes

<listing omitted>

6.3 Directionality—Normative

Directional behavior is interpreted according to the Unicode bidirectional algorithm (see Section 3.12, Bidirectional Behavior). For this purpose, all characters of the Unicode Standard possess a normative directional type. The directional types left-to-right and right-to-left are called strong types, and characters of these types are called strong directional characters. Left-to-right types include most alphabetic and syllabic characters, as well as all Han ideographic characters. Right-to-left types include Arabic, Hebrew, Syriac, and Thaana, and most punctuation specific to those scripts. In addition, the Unicode bidirectional algorithm also uses weak types and neutrals.
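
The grouping into strong, weak, and neutral types can be sketched as follows. The sketch assumes Python's standard unicodedata module, which returns the abbreviated bidirectional class of the interpreter's bundled database; the assignment of abbreviations to the three groups follows the author of this example's reading of the bidirectional algorithm and should be checked against [Bidirectional Algorithm]. The function name is hypothetical.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}

def directional_group(ch):
    """Classify a character as strong, weak, neutral, or other."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in STRONG:
        return "strong"
    if bidi in WEAK:
        return "weak"
    if bidi in NEUTRAL:
        return "neutral"
    return "other"   # explicit formatting codes, or no value available

if __name__ == "__main__":
    for ch in ("A", "\u05d0", "\u0627", "1", "\u0661", " ", "!"):
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch),
              directional_group(ch))
```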

For the directional types of Unicode characters, see the Unicode Character Database on the CD-ROM.

6.4 Jamo Short Names—Normative

The jamo short name is a normative property of the Unicode conjoining Hangul jamo characters. These short names, which are listed in Table 4-4, are used to determine the character names that are derived when decomposing Hangul syllables into their decomposition sequence.
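
The derivation of a syllable name from the short names can be sketched as arithmetic on the code point. The constants and the three short-name lists below are supplied from the example author's understanding of the Hangul syllable decomposition and of the jamo short names; they are not reproduced from this report and should be verified against [Jamo]. The function name is hypothetical.

```python
import unicodedata

S_BASE = 0xAC00                      # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # total number of syllables

# Short names of leading consonants, vowels, and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]

def hangul_syllable_name(ch):
    """Derive the name of a precomposed Hangul syllable from short names."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])

if __name__ == "__main__":
    for ch in ("\uac00", "\ud55c", "\ud4db"):
        print(f"U+{ord(ch):04X}", hangul_syllable_name(ch),
              hangul_syllable_name(ch) == unicodedata.name(ch))
```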

Table 4-4. Jamo Short Names

<listing omitted>

6.5 General Category—Normative

The General Category is a broad categorization of all characters according to their principal use. It constitutes a partition of the characters into several major classes, such as letters, punctuation, and symbols, and further subclasses for each of the major classes. It is specifically designed to support a wide variety of common parsing tasks, including, but not limited to, identifier syntax, regular expression processing, and word boundary detection. A common use of the General Category of a Unicode character is to assist in determination of boundaries in text, as in Section 5.15, Locating Text Element Boundaries. Another common use is in determining language identifiers for programming, scripting, and markup, as in Section 5.16, Identifiers. This property is also used to support common APIs such as isLetter(), isUppercase(), and so on.

Many tasks will require specific overrides or specializations for some characters, and in some cases different overrides depending on locale. For some tasks, the specializations needed were extensive enough to warrant a separate property; for an example, see [Line Breaking]. In other cases, alternative categorizations are needed that overlap some of the General Category values, but not others. For example, the Mathematical Property is true for all characters with General Category = Sm, but the reverse is not true.

Each Unicode character is assigned a General Category value. Each value of the General Category is defined as a two-letter abbreviation, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass "other" merely collects the remaining characters of the major class. For example, the subclass "No" (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
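
Because the first letter of the abbreviation identifies the major class, tests on a whole class reduce to a prefix check. The following sketch assumes Python's standard unicodedata module; the function names are hypothetical stand-ins for APIs of the isLetter() kind.

```python
import unicodedata

MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def split_category(ch):
    """Return (major class name, subclass letter) for one character."""
    gc = unicodedata.category(ch)
    return MAJOR_CLASSES[gc[0]], gc[1]

def is_letter(ch):
    """True for any value in the Letter class (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def is_uppercase(ch):
    """True only for General Category = Lu."""
    return unicodedata.category(ch) == "Lu"

if __name__ == "__main__":
    # Decimal digit, letter number, and "other" number respectively.
    for ch in ("5", "\u2167", "\u00bd"):
        print(f"U+{ord(ch):04X}", split_category(ch))
```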

Characters with General Category values Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character.  

Table 4-5 enumerates the values of General Category, with a short description of each value.

Table 4-5. General Category

Category                          Notes
Lu = Letter, uppercase
Ll = Letter, lowercase
Lm = Letter, modifier
Lo = Letter, other
Lt = Letter, titlecase
Mn = Mark, nonspacing
Mc = Mark, spacing combining
Me = Mark, enclosing
Nd = Number, decimal digit
Nl = Number, letter
No = Number, other
Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open            see also line break property
Pe = Punctuation, close           see also line break property
Pi = Punctuation, initial quote   may behave like Ps or Pe depending on usage
Pf = Punctuation, final quote     may behave like Ps or Pe depending on usage
Po = Punctuation, other
Sm = Symbol, math                 see also math property
Sc = Symbol, currency
Sk = Symbol, modifier
So = Symbol, other
Zs = Separator, space             also format characters
Zl = Separator, line              also a format character
Zp = Separator, paragraph         also a format character
Cc = Other, control
Cf = Other, format
Cs = Other, surrogate
Co = Other, private use
Cn = Other, not assigned

Limitations. Some General Category values are better defined than others. Letters, digits, combining marks, spaces, and formatting characters are some of the well defined values. The subdivision of the remainder into subclasses is then based on an implicit and unstated hierarchy where the less well defined categories, such as "Punctuation" and "Symbol", represent the left-overs. The subdivision of punctuation into subtypes, such as opening and closing, cannot be uniquely done (see the quotation marks) and is less detailed than needed for important implementations, such as line breaking. In addition, the assignment of Symbol cannot be made coherent, since there are instances of letters functioning as symbols (e.g., the letterlike symbols), and instances of symbols functioning as letters (e.g., some of the modifier letters). Where detailed analysis of symbols or punctuation is required, it is recommended that implementers not rely solely on General Category assignments, but consider other, more specific properties as well.

This can be compared to the use of 1:1 case transforms in [UnicodeData] and the existence of a separate SpecialCasing.txt to give the complete answer.

6.6 Numeric Value—Normative

Numeric value is a normative property of characters that represent numbers. This group includes characters such as fractions, subscripts, superscripts, Roman numerals, currency numerators, encircled numbers, and script-specific digits. In many traditional numbering systems, letters are used with a numeric value. Examples include Greek and Hebrew letters as well as Latin letters used in outlines (II.A.1.b). These special cases are not included here as numbers.

Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, but not characters such as Roman numerals (1 + 5 = 15 = fifteen, but I + V = IV = four), subscripts, or superscripts. Numbers other than decimal digits can be used in numerical expressions, but it is up to the user to determine the specialized uses.
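The distinction between decimal digits and other characters with a numeric value can be illustrated with a minimal sketch. It assumes Python's standard unicodedata module, which reflects whatever version of the Unicode Character Database ships with the interpreter; the helper name describe_number is hypothetical and used only for illustration.

```python
import unicodedata


def describe_number(ch):
    """Return (decimal digit value or None, numeric value or None) for one character."""
    # decimal() is defined only for characters usable in decimal-radix numbers.
    decimal = unicodedata.decimal(ch, None)
    # numeric() is defined for any character that carries a numeric value.
    numeric = unicodedata.numeric(ch, None)
    return decimal, numeric


samples = {
    "ARABIC-INDIC DIGIT THREE": "\u0663",
    "ROMAN NUMERAL FIVE": "\u2164",
    "VULGAR FRACTION ONE HALF": "\u00BD",
    "LATIN CAPITAL LETTER V": "V",
}

for name, ch in samples.items():
    print(name, describe_number(ch))
```

In this sketch the script-specific digit reports both values, the Roman numeral and the fraction report only a numeric value, and the Latin letter reports neither, matching the statement that letters used with numeric meaning are not treated as numbers.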

The Unicode Standard assigns distinct codes to the forms of digits that are specific to a given script or language. Examples are the digits used with the Arabic script, Chinese numbers, or those of the Indic languages. For naming conventions, see the introduction to Section 8.2, Arabic.

Table 6 gives the numeric values of Unicode characters that can represent numbers. Some CJK ideographs also have numeric values; those are not included in Table 6, but are discussed following the table.

Table 6. Numeric Properties

<listing omitted>

CJK ideographs from the Unified Repertoire and Ordering also may have numeric values. The primary numeric ideographs are shown in Table 7. When used to represent numbers in decimal notation, zero is represented by U+3007. Otherwise, zero is represented by U+96F6.

Ideographic accounting numbers are commonly used on checks and other financial instruments to minimize the possibilities of misinterpretation or fraud in the representation of numerical values. The set of accounting numbers varies somewhat between Japanese, Chinese, and Korean usage. Table 8 gives a fairly complete listing of the known accounting characters. Some of these characters are ideographs with other meanings pressed into service as accounting numbers; others are used only as accounting numbers.

Table 7. Primary Numeric Ideographs

Code point    Numeric value
U+96F6        0
U+4E00        1
U+4E8C        2
U+4E09        3
U+56DB        4
U+4E94        5
U+516D        6
U+4E03        7
U+516B        8
U+4E5D        9
U+5341        10
U+767E        100
U+5343        1,000
U+4E07        10,000
U+5104        100,000,000 (10,000 × 10,000)
U+5146        1,000,000,000,000 (10,000 × 10,000 × 10,000)

Table 8. Ideographs Used as Accounting Numbers

Numeric value    Code points
1                U+58F9, U+58F1, U+5F0C (a)
2                U+8CAE (a), U+8D30 (a), U+5F10 (a), U+5F0D (a)
3                U+53C3, U+53C2, U+53C1 (a), U+5F0E (a)
4                U+8086
5                U+4F0D
6                U+9678, U+9646
7                U+67D2 (b)
8                U+634C
9                U+7396
10               U+62FE
100              U+4F70 (a), U+964C
1,000            U+4EDF
10,000           U+842C

(a) These characters are used only as accounting numbers, and have no other meaning.
(b) In Japan, U+67D2 is also pronounced urusi, meaning "lacquer," and is treated as a variant of the standard character for "lacquer" U+6F06.

6.7 Mirrored—Normative

Mirrored is a normative property of characters such as parentheses, whose images are mirrored horizontally in text that is laid out from right to left. For example, U+0028 LEFT PARENTHESIS is interpreted as an opening parenthesis; in a left-to-right context it will appear as "(", while in a right-to-left context it will appear as the mirrored glyph ")". The list of mirrored characters appears in Table 4-9. Note that mirroring is not limited to paired characters; any character with the mirrored property will need two mirrored glyphs. This requirement is necessary to render the character properly in a bidirectional context.
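As a minimal sketch of how an implementation might query this property, the following assumes Python's standard unicodedata module, whose mirrored() function returns 1 for characters carrying the property in the database version bundled with the interpreter. The function name mirrored_characters is hypothetical.

```python
import unicodedata


def mirrored_characters(text):
    """Return the characters of text that carry the Mirrored property."""
    return [ch for ch in text if unicodedata.mirrored(ch)]


# Paired punctuation and the less-than sign are mirrored.
print(mirrored_characters("f(x) < [a]"))

# Mirroring is not limited to paired characters: U+221A SQUARE ROOT has the
# property although it has no counterpart character with the mirrored shape.
print(bool(unicodedata.mirrored("\u221A")))
```

A renderer would use such a test only to decide that a mirrored glyph is needed in a right-to-left run; selecting the glyph itself is outside the scope of this sketch.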

6.8 Unicode 1.0 Names

The Unicode 1.0 character name is an informative property of the characters defined in Version 1.0 of the Unicode Standard. The names of Unicode characters were changed in the process of merging the standard with ISO/IEC 10646. The Version 1.0 character names can be obtained from the CD-ROM accompanying the standard or from the ftp site. See also Appendix D, Changes from Unicode Version 2.0. Where the Version 1.0 character name provides additional useful information, it is listed in Chapter 14, Code Charts. For example, U+00B6 PILCROW SIGN has its Version 1.0 name, PARAGRAPH SIGN, listed for clarity.

6.9 Mathematical Property

The mathematical property is an informative property of characters that are used as operators in mathematical formulas. The mathematical property may be useful in algorithms that deal with the display of mathematical text and formulas. However, a number of these characters have multiple usages and may occur with nonmathematical semantics. For example, U+002D HYPHEN-MINUS may also be used as a hyphen—and not as a mathematical minus sign. Other characters, including some alphabetic, numeric, punctuation, spaces, arrows, and geometric shapes, are used in mathematical expressions as well, but are even more dependent on the context for their identification. The Unicode characters in the following lists have the mathematical property.

Characters with the math property and the Sm General Category:

002B, 003C..003E, 007C, 007E, 00AC, 00B1, 00D7, 00F7, 2044,

207A..207C, 208A..208C, 2190..2194, 219A..219B, 21A0, 21A3, 21A6,

21AE, 21CE..21CF, 21D2, 21D4, 2200..22F1, 2308..230B, 2320..2321,

25B7, 25C1, 266F, FB29, FE62, FE64..FE66, FF0B, FF1C..FF1E, FF5C,

FF5E, FFE2, FFE9..FFEC

Characters with the math property and other General Category values:

0028..002A, 002D, 002F, 005B..005E, 007B, 007D, 2016, 2032..2034,

207D..207E, 208D..208E, 20D0..20DC, 20E1, 2329..232A, 300A..300B,

301A..301B, FE35..FE38, FE59..FE5C, FE61, FE63, FE68, FF08..FF0A,

FF0D, FF0F, FF3B..FF3E, FF5B, FF5D

6.10 Letters and Other Useful Properties

The CD-ROM that accompanies the Unicode Standard contains data files that list other useful, informative properties of Unicode characters. The full list of those properties can be found in the data files; see, in particular, PropList.txt. This section highlights some of those properties that have a bearing on such implementation issues as parsing of identifiers. (See also Section 5.16, Identifiers.)

Computer language standards often characterize identifiers as consisting of letters, syllables, ideographs, and digits, but do not specify exactly what a "letter," "syllable," "ideograph," or "digit" is, leaving the definitions implicitly either to a character encoding standard or to a locale specification. The large scope of the Unicode Standard means that it includes many writing systems for which these distinctions are not as self-evident as they may once have been for systems designed to work primarily for Western European languages and Japanese. In particular, while the Unicode Standard includes various "alphabets" and "syllabaries," it also includes writing systems that fall somewhere in between. As a result, no attempt is made to draw a sharp property distinction between letters and syllables.

Letter. This informative property applies to characters that are used to write words. This group includes characters such as capital letters, small letters, ideographs, hangul, and spacing modifier letters. Combining marks generally assume the letter property of the preceding base character. For example, when searching for word boundaries, combining characters don’t break from previous letters. The letter property mappings can be obtained from the CD-ROM accompanying the standard.

Alphabetic. The alphabetic property is an informative property of the primary units of alphabets and/or syllabaries, whether combining or noncombining. Included in this group would be composite characters that are canonical equivalents to a combining character sequence of an alphabetic base character plus one or more combining characters; letter digraphs; contextual variants of alphabetic characters; ligatures of alphabetic characters; contextual variants of ligatures; modifier letters; letterlike symbols that are compatibility equivalents of single alphabetic letters; and miscellaneous letter elements. Notably, U+00AA FEMININE ORDINAL INDICATOR and U+00BA MASCULINE ORDINAL INDICATOR are simply abbreviatory forms involving a Latin letter and should be considered alphabetic rather than nonalphabetic symbols.

Ideographic. The ideographic property is an informative property of the Unified CJK Ideograph set (U+4E00..U+9FA5); the CJK Ideograph Extension A set (U+3400..U+4DB5); the CJK Compatibility Ideograph set (U+F900..U+FA2D); U+3007 IDEOGRAPHIC NUMBER ZERO; U+3006 IDEOGRAPHIC CLOSING MARK; and the Hangzhou-style numerals (U+3021..U+3029, U+3038..U+303A).
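The ranges listed above lend themselves to a simple sorted range lookup. The sketch below hard-codes exactly the ranges given in this section, so it reflects only the repertoire described here and not any later additions; the names IDEOGRAPHIC_RANGES and is_ideographic are hypothetical.

```python
from bisect import bisect_right

# (first, last) code point pairs, sorted by first code point,
# transcribed from the ranges given in the text above.
IDEOGRAPHIC_RANGES = [
    (0x3006, 0x3007),  # IDEOGRAPHIC CLOSING MARK, IDEOGRAPHIC NUMBER ZERO
    (0x3021, 0x3029),  # Hangzhou-style numerals
    (0x3038, 0x303A),  # Hangzhou-style numerals
    (0x3400, 0x4DB5),  # CJK Ideograph Extension A
    (0x4E00, 0x9FA5),  # Unified CJK Ideographs
    (0xF900, 0xFA2D),  # CJK Compatibility Ideographs
]

_STARTS = [first for first, _ in IDEOGRAPHIC_RANGES]


def is_ideographic(cp):
    """Return True if code point cp falls in one of the listed ranges."""
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= IDEOGRAPHIC_RANGES[i][1]


print(is_ideographic(0x4E00), is_ideographic(0x0041))
```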

6.11 East Asian Width

East Asian Width is a limited property, classifying characters based on their width in East Asian legacy implementations. While presented as a closed enumeration, the EAW value "Neutral" essentially is the 'does not apply' value.

6.12 ISO comment

ISO/IEC 10646 provides for a comment field in the character name itself, while Unicode uses a more extensive set of annotations. Combining the information in this field with the Unicode character name provides the complete name as it appears in ISO/IEC 10646.

6.13 Sort Weights

<see TR 10>

6.14 Line Breaking

<see TR 14>

6.15 Decompositions

<see TR 15>

6.16 Special Property

Table 3-9 in [Unicode] lists those characters that have particular properties. Their properties are described individually in chapter 13 of [Unicode]. These characters can be grouped as follows.

NEW: 3.1/3.2

 

6.21 Case Mapping

<see TR21>

6.22 Character Mappings

<see TR22>

6.24 Script Name

<see UTR 24: Script Names>

7. Updating Properties and Extending the Standard 

7.1 Updating Properties

Updates to the Unicode Character Database can be required for three reasons:

  1. To cover new characters added to the Unicode Standard
  2. To add new properties
  3. To change the assigned values for a property for some characters

Changing a character's property assignment invalidates existing implementations and is therefore something that is done judiciously and with great care when there is no better alternative.

7.2 Guarantees

For some properties, some of the following aspects are guaranteed to be invariant.

The status of a property as normative does not imply a stability guarantee.

8. Special Property Values

8.1 N/A Value

Limited properties apply to only a subset of characters. Where these properties are implemented as a partition (required property), the characters to which the property does not apply are given a special value denoting that the property does not apply.

8.2 Default Value

Implementations often need specific properties for all code points, including those that are unassigned. To meet this need, the Unicode Standard assigns default properties to ranges of unassigned code points.

All implementations of the Unicode Standard should endeavor to handle additions to the character repertoire gracefully. In some cases this may require that an implementation attempt to 'anticipate' likely property values for code points for which characters have not yet been defined, but where surrounding characters exist that make it probable that similar characters will be assigned to the code point in question.

There are three strategies:

  1. Rely on the recommendation from the Unicode Consortium. For example, for the Bidirectional Class, the Unicode Consortium has published recommended default values for all code points.
  2. Treat the unassigned areas of a block as if they had property values common to other characters of the block. A variation of this scheme bridges 'holes' in the allocation by using the property values for the characters bracketing the hole.
  3. Give unassigned code locations a different default property that will result in graceful, if not completely correct, behavior if encoded characters are later encountered at that location.

Each of these strategies has advantages and drawbacks, and none can guarantee that the behavior of an implementation that is conformant to a prior version of the Unicode Standard will support characters added in a later version of the Unicode Standard in precisely the same way as an implementation that is conformant to the later version. The most that can be hoped for is that the earlier implementation will behave gracefully in such circumstances.
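The hole-bridging variant of the second strategy, combined with a fallback default in the spirit of the third, might be sketched as follows. Everything here is an assumption made for illustration: the function name, the toy property values, the block boundaries, and in particular the rule that a hole is bridged only when the nearest assigned neighbors on both sides agree.

```python
def property_with_fallback(cp, assigned, block_ranges, default="Cn"):
    """Look up a property value, guessing a value for unassigned code points.

    assigned     -- dict mapping assigned code points to property values
    block_ranges -- list of (first, last) block boundaries
    default      -- value returned when no guess can be made
    """
    if cp in assigned:
        return assigned[cp]
    for first, last in block_ranges:
        if first <= cp <= last:
            # Nearest assigned neighbors below and above, within the block.
            below = next((assigned[c] for c in range(cp - 1, first - 1, -1)
                          if c in assigned), None)
            above = next((assigned[c] for c in range(cp + 1, last + 1)
                          if c in assigned), None)
            # Bridge the hole only if both bracketing characters agree.
            if below is not None and below == above:
                return below
    return default


# Toy data: one block with two holes; the values are illustrative only.
toy_assigned = {0x10: "L", 0x11: "L", 0x13: "L", 0x15: "N"}
toy_blocks = [(0x10, 0x1F)]

print(property_with_fallback(0x12, toy_assigned, toy_blocks))  # hole between two "L"
print(property_with_fallback(0x14, toy_assigned, toy_blocks))  # neighbors disagree
```

As the surrounding text notes, a guess of this kind can only aim at graceful behavior; it cannot guarantee agreement with the value eventually assigned.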

Default values are temporary: they will be superseded by final assignments, once characters are assigned to a given code point.

8.3 Undetermined Property Values

For many archaic scripts (as well as for not yet fully implemented modern ones) essential characteristics of many characters may not be knowable at the time of their publication. In these cases the proper assignments of property values for newly encoded characters cannot be reliably determined at the time the characters are first added to the Unicode Standard, or for a new property, when the property is first added to the Unicode Character Database. In these cases, and where the property is a required property, it will be given a value of 'undetermined', or 'unknown at time of publication'.

8.4 Preliminary Property Assignments

Sometimes, a determination and assignment of property values can be made, but the information on which it was based may be incomplete or preliminary. In such cases, the property value may be changed when better information becomes available. Currently, there is no machine readable way to provide information about the confidence of a property assignment; however, the text of the Standard or a Technical Report defining the property may provide general indications of preliminary status of property assignments where they are known.

9. Working with properties

[The text in this section is very preliminary].

There are two main issues in working with properties

9.1 Efficient storage and access to property information

The Unicode Standard provides detailed information on character properties (see Chapter 4, Character Properties, and the Unicode Character Database on the accompanying CD-ROM).

These properties can be used by implementers to implement a variety of low-level processes. Fully language-aware and higher-level processes will need additional information.

A two-stage table, as described in Section 5.1, Transcoding to Other Standards, can also be used to handle mapping to character properties or other information indexed by character code. For example, the data from the Unicode Character Database on the accompanying CD-ROM can be represented in memory very efficiently as a set of two-stage tables.

Individual properties are common to large sets of characters and therefore lend themselves to implementations using the shared blocks.

Many popular implementations are influenced by the POSIX model, which provides functions for separate properties, such as isalpha, isdigit, and so on. Implementers of Unicode-based systems and internationalization libraries need to take care to extend these concepts to the full set of Unicode characters correctly.

In Unicode-encoded text, combining characters participate fully. In addition to providing callers with information about which characters have the combining property, implementers and writers of language standards need to provide for the fact that combining characters assume the property of the preceding base character (see also Section 3.5, Combination, and Section 5.16, Identifiers). Other important properties, such as sort weights, may also depend on a character’s context.

Because the Unicode Standard provides such a rich set of properties, implementers will find it useful to allow access to several properties at a time, possibly returning a string of bit-fields, one bit-field per character in the input string.
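One possible shape for such an interface is sketched below. It assumes Python's standard unicodedata module as the property source and derives four illustrative bits from the General Category; the Prop flag names and the function property_bits are hypothetical, and a real library would likely expose many more properties.

```python
import unicodedata
from enum import IntFlag


class Prop(IntFlag):
    """Illustrative property bits; the selection is an assumption."""
    LETTER = 1
    DIGIT = 2
    COMBINING = 4
    SPACE = 8


def property_bits(text):
    """Return one bit-field per character of text."""
    result = []
    for ch in text:
        category = unicodedata.category(ch)
        bits = Prop(0)
        if category.startswith("L"):
            bits |= Prop.LETTER
        if category == "Nd":
            bits |= Prop.DIGIT
        if category.startswith("M"):
            bits |= Prop.COMBINING
        if category == "Zs":
            bits |= Prop.SPACE
        result.append(bits)
    return result


# "a" + COMBINING ACUTE ACCENT + SPACE + "1"
for ch, bits in zip("a\u0301 1", property_bits("a\u0301 1")):
    print(f"U+{ord(ch):04X}", int(bits))
```

Note that this sketch reports the combining mark's own property; letting it assume the property of the preceding base character, as discussed above, would be a further step left to the caller.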

In the past, many existing standards, such as the C language standard, assumed very minimalist "portable character sets" and geared their functions to operations on such sets. As the Unicode encoding itself is increasingly becoming the portable character set, implementers are advised to distinguish between historical limitations and true requirements when implementing specifications for particular text processes.

<< mapping  or properties?>>

Multistage Tables

Tables require space. Even small character sets often map to characters from several different blocks in the Unicode Standard, and thus may contain up to 64K entries in at least one direction. Several techniques exist to reduce the memory space requirements for mapping tables. Such techniques apply not only to transcoding tables, but also to many other tables needed to implement the Unicode Standard, including character property data, collation tables, and glyph selection tables.

Flat Tables. If disk space is not at issue, virtual memory architectures yield acceptable working set sizes even for flat tables because frequency of usage among characters differs widely and even small character sets contain many infrequently used characters. In addition, data intended to be mapped into a given character set generally does not contain characters from all blocks of the Unicode Standard (usually, only a few blocks at a time need to be transcoded to a given character set). This situation leaves large sections of the 64K-sized reverse mapping tables (containing the default character, or unmappable character entry) unused, and therefore paged to disk.

Ranges. It may be tempting to "optimize" these tables for space by providing elaborate provisions for nested ranges or similar devices. This practice leads to unnecessary performance penalties on modern, highly pipelined processor architectures because of branch penalties.

A faster solution is to use an optimized two-stage table, which can be coded without any test or branch instructions. Hash tables can also be used for space optimization, although they are not as fast as multistage tables.

Two-Stage Tables. Two-stage (high-byte) tables are a commonly employed mechanism to reduce table size (see Figure 5-1). They use an array of 256 pointers and a default value. If a pointer is NULL, the returned value is the default. Otherwise, the pointer references a block of 256 values.

Figure 5-1. Two-Stage Tables

Optimized Two-Stage Table. Wherever any blocks are identical, the pointers just point to the same block. For transcoding tables, this case occurs generally for a block containing only mappings to the "default" or "unmappable" character. Instead of using NULL pointers and a default value, one "shared" block of 256 default entries is created. This block is pointed to by all first-stage table entries, for which no character value can be mapped. By avoiding tests and branches, this strategy provides access time that approaches the simple array access, but at a great savings in storage.

Given an arbitrary 64K table, it is a simple matter to write a small utility that can calculate the optimal number of stages and their width.
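A minimal sketch of the optimized two-stage table follows. It replaces pointers with indices into a list of blocks and stores identical blocks only once, so a lookup is two array accesses with no test for a missing block. The function names are hypothetical, and the General Category data from Python's unicodedata module is used only as a convenient 64K sample table.

```python
import unicodedata


def build_two_stage(flat, block_size=256):
    """Compress a flat table into (stage1, blocks), sharing identical blocks."""
    assert len(flat) % block_size == 0
    stage1 = []   # one block index per block_size-wide slice of flat
    blocks = []   # the distinct blocks, each stored once
    seen = {}     # block contents -> index in blocks
    for start in range(0, len(flat), block_size):
        block = tuple(flat[start:start + block_size])
        index = seen.get(block)
        if index is None:
            index = len(blocks)
            seen[block] = index
            blocks.append(block)
        stage1.append(index)
    return stage1, blocks


def lookup(stage1, blocks, cp, block_size=256):
    """Return the table value for code point cp without any branch."""
    return blocks[stage1[cp // block_size]][cp % block_size]


def storage_cost(flat, block_size):
    """Number of entries needed for both stages at a given block width."""
    stage1, blocks = build_two_stage(flat, block_size)
    return len(stage1) + len(blocks) * block_size


# Sample 64K table: General Category of every BMP code point.
flat = [unicodedata.category(chr(cp)) for cp in range(0x10000)]
stage1, blocks = build_two_stage(flat)
print(len(stage1), "first-stage entries,", len(blocks), "distinct blocks")

# A small utility of the kind mentioned above: compare block widths.
for width in (64, 128, 256, 512):
    print(width, storage_cost(flat, width))
```

The comparison loop only varies the width of a two-stage layout; searching over the number of stages as well would follow the same pattern by applying build_two_stage to the first-stage array in turn.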

Range tables

[additional guidelines, TBD]

10. Data Management and Distribution

10.1 Versions of the Unicode Character Database

The Unicode Character Database is provided as a collection of flat text files, as described in [UnicodeCharacterDatabase]. Each version of the database has its own, numbered directory on the Unicode ftp site [ftp-http]. In each versioned directory, the filenames of all files carry their version number in the file names. The latest version of the database is replicated in the directory named UNIDATA [Unidata], which uses constant file names (i.e. without version number). It is thus possible to reference both a specific version as well as the latest version of a database file (or the database itself) without need to update the links.

For more information about the versions of the Unicode Standard see [UnicodeVersions].

10.2 File Syntax Conventions

All files need a (formal) syntax description at least as detailed as namelist.txt has - it's probably useful to set up three or four templates for this now, so that we don't get new formats for each proposed property.

The majority of plain text files in the Unicode Character Database use this general format:

LINE := COMMENT
      | CODE_RANGE ( ";" VALUE )+ { COMMENT }

VALUE := file specific, often a two letter abbreviation or a small number

COMMENT := '#' arbitrary text

CODE_RANGE := CODE { ".." CODE }

CODE := HHHH { { H } H }

H := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'

By convention, the file starts with a header, i.e. several comment lines identifying title, version, date, author, purpose and content of the file, as well as the permissible values and meaning for each field. By convention, following the last field in each line, a comment contains the character name. This comment should be ignored in machine processing of the file.
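A line-level parser for the general format described above might look like the following sketch. It assumes that '#' never occurs inside a value and that fields may be surrounded by optional white space; the function name parse_line and the sample lines are hypothetical and not taken from any actual data file.

```python
def parse_line(line):
    """Parse one line; return (first, last, values) or None for comment-only lines.

    first and last are the inclusive code point range as integers;
    values is the list of property fields following the code range.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None                        # blank or comment-only line
    fields = [field.strip() for field in data.split(";")]
    code_range, values = fields[0], fields[1:]
    if not values:
        raise ValueError(f"no value field in line: {line!r}")
    first, _, last = code_range.partition("..")
    lo = int(first, 16)
    hi = int(last, 16) if last else lo     # a single code is a one-element range
    if hi < lo:
        raise ValueError(f"descending range in line: {line!r}")
    return lo, hi, values


sample = [
    "# Hypothetical sample file header",
    "0028..002A ; Ps ; Y # illustrative range",
    "00AC;Sm # NOT SIGN",
]
for line in sample:
    print(parse_line(line))
```

Because the trailing comment is discarded before the fields are split, the character name that conventionally follows the last field is ignored, as recommended above.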

An XML version of the Unicode Character Database is under preparation.

10.3 Beta Versions of the Character Database

Whenever an update of the database is being developed, a beta version may be released in a directory whose name, like the names of all files it contains, includes the word 'beta'. This beta directory will be provided solely for review purposes and will not be maintained after the end of the beta period.

11. References

[ArabicShaping]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/ArabicShaping.txt>
[Bidirectional Algorithm]
Mark Davis, Unicode Standard Annex #9: The Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9>
[CaseFolding]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt>
[Case Mapping] 
Mark Davis, Unicode Technical Report #21: Case Mapping, <http://www.unicode.org/unicode/reports/tr21>
[Character Mapping Tables]
Mark Davis, Unicode Technical Report #22: Character Mapping Tables, <http://www.unicode.org/unicode/reports/tr22>
[Collation]
Mark Davis, Unicode Technical Report #10: Collation, <http://www.unicode.org/unicode/reports/tr10>
[EastAsianWidth]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt>
[East Asian Width]
Asmus Freytag, Unicode Standard Annex #11: East Asian Width, <http://www.unicode.org/unicode/reports/tr11>
[Jamo]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/Jamo.txt>
[LineBreak]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/LineBreak.txt>
[Line Breaking]
Asmus Freytag, Unicode Standard Annex #14: Line Breaking Properties, <http://www.unicode.org/unicode/reports/tr14>
[NamesList]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.txt>
[NamesList-Format]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.html>
[SpecialCasing]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt>
[Unicode]
The Unicode Standard, Version 3.0, Addison Wesley Longman, 2000.
[UnicodeCharacterDatabase]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>
[UnicodeData]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>
[UnicodeData-Format]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html>
[UnicodeVersions]
Versions of the Unicode Standard <http://www.unicode.org/unicode/standard/versions>

Acknowledgements

The author wishes to thank Ken Whistler and Mark Davis for their insightful comments.

Changes from previous drafts

Changes from second working draft: None, this is the first draft submitted to UTC.

Changes from third working draft: Reordered according to UTC feedback. Added information about maintenance, extended discussion on updates.

Changes from fourth working draft: Reworded the definitions based on feedback from Mark Davis. Some other minor changes.

Changes from fifth working draft: Added a few definitions. Cleaned up some of the text.

Changes from sixth working draft: Retargeted according to UTC feedback.

Changes from seventh working draft:

Changes from eighth working draft: incorporating individual feedback

Changes from ninth working draft: removing redundant text, word smithing


Copyright © 2000-2001 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.