Version | 1.0 |
Author | Asmus Freytag (asmus@unicode.org) |
Date | 2002-04-23 |
This Version | http://www.unicode.org/unicode/reports/tr23/tr23-1.html |
Previous Version | none |
Latest Version | http://www.unicode.org/unicode/reports/tr23/ |
Tracking Number | 1 |
This report presents a survey of the character properties defined in the Unicode Standard as well as guidelines to their usage.
Status
This document has been approved by the Unicode Technical Committee for public review as a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
Significant changes to the information in this proposed draft are expected as a result of public and UTC review.
Please send comments to the authors. A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).
[Editorial notes for the benefit of reviewers are indicated like this.]
Contents
This survey provides discussion of common aspects of character properties. This survey is not intended to supersede chapter 4 in the book, nor the existing body of technical reports and documentation files in the Unicode Character Database that provide detailed descriptions for particular character properties. Instead it presents a capsule summary only and references the corresponding technical report for the full details.
This report specifically covers formal character properties, which are those attributes of characters that are specified according to the definitions set forth in this report.
The Unicode Standard views character semantics as inherent to the definition of a character and conformant processes are required to take these into account when interpreting characters. The assignment of character semantics for the Unicode Standard is based on character behavior. For other character set standards, it is left to the implementer, or to unrelated secondary standards, to assign character semantics to characters. In contrast, the Unicode Standard supplies a rich set of character attributes, called properties, for each character contained in it. Many properties are specified in relation to processes or algorithms that interpret them, in order to implement the discovered character behavior.
The interpretation of some properties (such as the case of a character) is largely independent of context, whereas the interpretation of others (such as directionality) is applicable to a character sequence as a whole, rather than to the individual characters that compose the sequence.
Other examples that require context include the classification of neutrals in script assignments or title casing. The line breaking rules of TR#14 involve character pairs and triples, and in certain cases, longer sequences. The glyph(s) defined by a combining character sequence are the result of contextual analysis in the display shaping engine. Isolated character properties typically only tell part of the story.
When modeling character behavior with computer processes, formal character properties are assigned in order to achieve the expected results. Such modeling depends heavily on algorithms. In some cases, a given character property is specified in close conjunction with a detailed specification of an algorithm. In other cases, algorithms are implied but not specified, or several algorithms can make use of the same general character property. The last case may require occasional differences in character property assignment to make all algorithms work correctly. This can usually be achieved by overriding specific properties for specific algorithms.
When assigning character properties for use with a given algorithm, it may be tempting to assign somewhat arbitrary values to some characters, as long as the algorithm happens to produce the expected results. Proceeding in this way hides the nature of the character and limits the re-use of character properties by related processes. Therefore, instead of tweaking the properties simply to make a particular algorithm easier, the Unicode Standard pays careful attention to the underlying essential linguistic identity of the character. However, not all aspects of a character's identity are relevant in all circumstances, and some characters can be used in many different ways, depending on context or circumstance. Because of this, the formal character properties alone are not sufficient to describe the complete range of desirable or acceptable character behaviors.
As specified in Chapter 3, Conformance, the Unicode Standard [UNICODE] defines both normative and informative properties.
Normative Properties. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property must follow the specifications of the standard for that property to be conformant. The term normative when applied to a character property does not mean that the value of the property will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes.
By making a property normative, the Unicode Standard guarantees that conformant implementations can rely on other conformant implementations interpreting the character in the same way. This is most useful for those properties where the Unicode Standard provides precise rules for the interpretation of characters based on their properties. An example is the set of bidirectional properties and their use by the bidirectional algorithm. For some character properties, for example the general category, the Unicode Standard does not define what model of processing it is intended to support, or what the required consequences are of a character being, for example, "Letter Other" as opposed to "Symbol Other". In the absence of such a definition, the only effect of conformance that can be tested in a strict manner is whether a character property library returns the correct value to its caller.
Note: one trivial, but important instance of conformant implementation is runtime access to a character property database. For normative properties, conformant implementations guarantee that the returned values match the values defined by the Unicode Consortium.
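As an illustration of such runtime access, the following minimal sketch queries a few property values through Python's standard unicodedata module. The helper name describe is an assumption made for this example; the values returned reflect whichever version of the Unicode Character Database the Python runtime ships with (reported by unicodedata.unidata_version), not necessarily the version discussed in this report.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few character property values for a single character.

    'describe' is a hypothetical helper for this sketch; the underlying
    calls are the standard-library unicodedata functions.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
    }

alef = describe("\u05D0")   # HEBREW LETTER ALEF
acute = describe("\u0301")  # COMBINING ACUTE ACCENT
print(unicodedata.unidata_version, alef, acute)
```

A conformance check of the kind described above would compare such returned values against the data files published for the same version of the standard.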
Informative Properties. An informative character property is strongly recommended, but a conformant implementation is free to use or change such values as it may require, while still remaining conformant to the standard. Particular implementations may choose to override the properties that are not normative. In that case, the implementer has the option of establishing a protocol to convey that information.
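One possible way to structure such an override is sketched below. It is an assumption-laden illustration, not a mechanism defined by the standard: the class name TailoredWidth is hypothetical, and East Asian Width is used only as a stand-in for a property that an implementation might choose to tailor. Keeping the overrides in an explicit table makes them easy to document or to convey to other processes.

```python
import unicodedata

class TailoredWidth:
    """East Asian Width lookup with implementation-specific overrides.

    Hypothetical helper: the override table is consulted first, and the
    value from the character database is used for everything else.
    """

    def __init__(self, overrides=None):
        # Explicit table of tailored values, keyed by character.
        self.overrides = dict(overrides or {})

    def lookup(self, ch: str) -> str:
        return self.overrides.get(ch, unicodedata.east_asian_width(ch))

default = TailoredWidth()
# Assumed tailoring: treat PLUS-MINUS SIGN as wide in this implementation.
tailored = TailoredWidth({"\u00B1": "W"})
print(default.lookup("\u00B1"), tailored.lookup("\u00B1"))
```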
Properties may be informative for two main reasons.
PD3. Property Value
In Chapter 3, Conformance, the Unicode Standard [UNICODE] states that "A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation." The semantics of a character are established by taking its coded representation, character name and representative glyph in context and are further defined by its normative properties and behavior. Neither character name nor representative glyph can be relied upon absolutely; a character may have a broader range of use than the most literal interpretation of its character name, and the representative glyph is only indicative of one of a range of typical glyphs representing the same character.
The Unicode Standard makes these specific statements about overriding properties:
Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics.
• The character combination properties and the canonical ordering behavior cannot be overridden by higher-level protocols.
• Particular implementations may choose to override all properties that are not normative.
For interpreting directionality, higher-level protocols may:
• Override the number handling to use information provided by a broader context. For example, information from other paragraphs in a document could be used to conclude that the document was fundamentally Arabic and that EN should generally be converted to AN.
• Replace, supplement, or override the directional overrides or embedding codes. This task is accomplished by providing information via additional stylesheet or markup information about the embedding level or character direction. The interpretation of such information must always be defined by reference to the behavior of the equivalent explicit codes as given in the algorithm.
• Override the bidirectional character types assigned to control codes to match the interpretation of the control codes within the protocol. (See also Section 13.1, Control Codes.)
• Remap the number shapes to match those of another set. For example, remap the Arabic number shapes to have the same appearance as the European numbers.
[Ed. Note: Sections 5 and 6 have been removed from this draft]
Updates to the Unicode Character Database can be required for three reasons
Changing a character's property assignment invalidates existing implementations and is therefore something that is done judiciously and with great care, and only when there is no better alternative.
The consortium will endeavor to keep the values of all character properties as stable as possible, but some circumstances may arise that require changing them. In particular, as Unicode encodes less well-documented scripts (such as for minority languages in Thailand) the exact character properties and behavior may not be known at the time the script is first encoded.
For some properties, some of the following aspects are guaranteed to be invariant.
The status of a property as normative does not imply a stability guarantee.
Once a character is encoded, its code position and name are immutable properties.
Mistakes in naming are noted in the nameslist in a note or by using an alias, but the formal name remains unchanged.
Once a character is encoded, its canonical combining class and decomposition (canonical or compatibility) are stable with respect to normalization.
If a string contains only characters from a given version of the Unicode Standard (say Unicode 3.1), and it is put into a normalized form in accordance with that version of Unicode, then it will also be in normalized form according to any past or future version of Unicode.
Note: If an implementation normalizes a string that contains characters that are not assigned in the version of Unicode that it supports, that string may not be assessed as being in normalized form according to a future version of Unicode. For example, suppose that a Unicode 3.0 program normalizes a string that contains new Unicode 3.1 characters. That string may not be normalized according to Unicode 3.1.
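The precondition of this guarantee can be checked mechanically. The sketch below, which assumes Python's unicodedata module and uses hypothetical helper names (all_assigned, normalize_with_stability), normalizes a string and also reports whether every character in it is assigned in the version of the character database available to the implementation. Only in that case does the stability guarantee described above apply to the result.

```python
import unicodedata

def all_assigned(text: str) -> bool:
    """True if no character has General Category Cn.

    Cn covers unassigned code points (and noncharacters) in the version of
    the character database bundled with the running Python.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)

def normalize_with_stability(text: str, form: str = "NFC"):
    """Return (normalized text, whether the stability guarantee applies)."""
    return unicodedata.normalize(form, text), all_assigned(text)

# 'e' followed by COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
composed, stable = normalize_with_stability("e\u0301")
# U+0378 is used here as an example of a code point assumed to be unassigned.
_, stable_with_unassigned = normalize_with_stability("a\u0378")
print(repr(composed), stable, stable_with_unassigned)
```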
The General Category and Bidi Category are closed enumerations.
In other words they will not be further subdivided.
Further description of these is provided in UnicodeData.html.
Limited properties apply to only a subset of characters. Where these properties are implemented as a partition (required property), the characters to which the property does not apply are given a special value denoting that the property does not apply.
Implementations often need specific properties for all code points, including those that are unassigned. To meet this need, the Unicode standard assigns default properties to ranges of unassigned code points.
All implementations of the Unicode Standard should endeavor to handle additions to the character repertoire gracefully. In some cases this may require that an implementation attempt to 'anticipate' likely property values for code points for which characters have not yet been defined, but where surrounding characters exist that make it probable that similar characters will be assigned to the code point in question.
There are three strategies
Each of these strategies has advantages and drawbacks, and none can guarantee that an implementation that is conformant to a prior version of the Unicode Standard will support characters added in a later version in precisely the same way as an implementation that is conformant to the later version. The most that can be hoped for is that the earlier implementation will behave gracefully in such circumstances.
Default values are temporary: they will be superseded by final assignments, once characters are assigned to a given code point.
For non-character codes, a property-returning API would return the same value as the default value for unassigned characters.
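A lookup function that supplies default values for unassigned code points might be sketched as follows. This is an illustration under stated assumptions: the function name bidi_class and the table DEFAULT_BIDI_RANGES are hypothetical, and the ranges listed are only an illustrative subset; a real implementation would take every default range from the data files of the Unicode Character Database for the version it supports. The sketch relies on the documented behavior of Python's unicodedata.bidirectional, which returns an empty string when no value is defined.

```python
import unicodedata

# Illustrative subset of default ranges (an assumption for this sketch):
# unassigned code points in right-to-left areas default to R or AL so that
# characters added later are likely to behave sensibly.
DEFAULT_BIDI_RANGES = [
    (0x0590, 0x05FF, "R"),
    (0x0600, 0x07BF, "AL"),
    (0xFB1D, 0xFB4F, "R"),
]

def bidi_class(ch: str) -> str:
    """Return the bidirectional class, falling back to a default value."""
    value = unicodedata.bidirectional(ch)
    if value:
        return value  # assigned character: use the database value
    cp = ord(ch)
    for first, last, default in DEFAULT_BIDI_RANGES:
        if first <= cp <= last:
            return default
    return "L"  # fallback default used by this sketch for everything else

print(bidi_class("\u05D0"), bidi_class("\u05FF"), bidi_class("\u0378"))
```

Because default values are temporary, such a table has to be regenerated whenever the implementation moves to a newer version of the standard.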
For many archaic scripts (as well as for not yet fully implemented modern ones) essential characteristics of many characters may not be knowable at the time of their publication. In these cases the proper assignments of property values for newly encoded characters cannot be reliably determined at the time the characters are first added to the Unicode Standard, or for a new property, when the property is first added to the Unicode Character Database. In these cases, and where the property is a required property, it will be given a value of 'undetermined', or 'unknown at time of publication'.
Sometimes, a determination and assignment of property values can be made, but the information on which it was based may be incomplete or preliminary. In such cases, the property value may be changed when better information becomes available. Currently, there is no machine readable way to provide information about the confidence of a property assignment; however, the text of the Standard or a Technical Report defining the property may provide general indications of preliminary status of property assignments where they are known.
[The text in this section is very preliminary].
There are two main issues in working with properties
The Unicode Standard provides detailed information on character properties (see Chapter 4, Character Properties, and the Unicode Character Database on the accompanying CD-ROM).
These properties can be used by implementers to implement a variety of low-level processes. Fully language-aware and higher-level processes will need additional information.
A two-stage table, as described in Section 5.1, Transcoding to Other Standards, can also be used to handle mapping to character properties or other information indexed by character code. For example, the data from the Unicode Character Database on the accompanying CD-ROM can be represented in memory very efficiently as a set of two-stage tables.
Individual properties are common to large sets of characters and therefore lend themselves to implementations using the shared blocks.
Many popular implementations are influenced by the POSIX model, which provides functions for separate properties, such as isalpha, isdigit, and so on. Implementers of Unicode-based systems and internationalization libraries need to take care to extend these concepts to the full set of Unicode characters correctly.
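A POSIX-style interface extended to the full range of Unicode characters might look like the following minimal sketch. The function names mirror the POSIX ones but are otherwise assumptions, and the classification is approximated from the General Category; the standard's own Alphabetic property covers more characters than the letter categories used here.

```python
import unicodedata

def is_alpha(ch: str) -> bool:
    """Approximate 'isalpha': any of the letter categories Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith("L")

def is_digit(ch: str) -> bool:
    """Approximate 'isdigit': decimal digits of any script (category Nd)."""
    return unicodedata.category(ch) == "Nd"

# Letters and digits outside ASCII, including a supplementary character.
samples = ["a", "\u0416", "\u4E2D", "\U00010400", "1", "\u0663", "\u00B2"]
for ch in samples:
    print(f"U+{ord(ch):04X}", is_alpha(ch), is_digit(ch))
```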
In Unicode-encoded text, combining characters participate fully. In addition to providing callers with information about which characters have the combining property, implementers and writers of language standards need to provide for the fact that combining characters assume the property of the preceding base character (see also Section 3.5, Combination, and Section 5.16, Identifiers). Other important properties, such as sort weights, may also depend on a character’s context.
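The inheritance of properties by combining characters can be sketched as a single pass over the text. The function name effective_categories is an assumption for this example, and combining characters are identified here by the mark categories (Mn, Mc, Me) as reported by Python's unicodedata module.

```python
import unicodedata

def effective_categories(text: str):
    """Pair each character with the category it is to be treated as having.

    A combining mark takes on the category of the preceding base character;
    a mark with no preceding base keeps its own category.
    """
    result = []
    base_category = None
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M") and base_category is not None:
            result.append((ch, base_category))
        else:
            base_category = category
            result.append((ch, category))
    return result

# 'e' + COMBINING ACUTE ACCENT is treated as a lowercase letter throughout.
print(effective_categories("e\u0301x"))
```

With this treatment, an identifier parser that accepts letters would also accept the combining marks that follow them.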
Because the Unicode Standard provides such a rich set of properties, implementers will find it useful to allow access to several properties at a time, possibly returning a string of bit-fields, one bit-field per character in the input string.
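Such an interface can be sketched with one bit-field per input character. The flag names and the selection of properties below are assumptions chosen for illustration; a production interface would cover whichever properties its callers need.

```python
import unicodedata
from enum import IntFlag

class Prop(IntFlag):
    """Hypothetical bit assignments for a handful of properties."""
    LETTER = 1
    DIGIT = 2
    COMBINING = 4
    UPPERCASE = 8
    WHITESPACE = 16

def property_bits(text: str):
    """Return one bit-field per character of the input string."""
    fields = []
    for ch in text:
        category = unicodedata.category(ch)
        bits = Prop(0)
        if category.startswith("L"):
            bits |= Prop.LETTER
        if category == "Nd":
            bits |= Prop.DIGIT
        if category.startswith("M"):
            bits |= Prop.COMBINING
        if category == "Lu":
            bits |= Prop.UPPERCASE
        if ch.isspace():
            bits |= Prop.WHITESPACE
        fields.append(bits)
    return fields

fields = property_bits("A1 ")
# A caller can test several properties with a single mask operation.
letters_or_digits = [bool(f & (Prop.LETTER | Prop.DIGIT)) for f in fields]
print(fields, letters_or_digits)
```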
In the past, many existing standards, such as the C language standard, assumed very minimalist "portable character sets" and geared their functions to operations on such sets. As the Unicode encoding itself is increasingly becoming the portable character set, implementers are advised to distinguish between historical limitations and true requirements when implementing specifications for particular text processes.
Multistage Tables
Tables require space. Even small character sets often map to characters from several different blocks in the Unicode Standard, and thus may contain up to 64K entries in at least one direction. Several techniques exist to reduce the memory space requirements for mapping tables. Such techniques apply not only to transcoding tables, but also to many other tables needed to implement the Unicode Standard, including character property data, collation tables, and glyph selection tables.
Flat Tables. If disk space is not an issue, virtual memory architectures yield acceptable working set sizes even for flat tables because frequency of usage among characters differs widely and even small character sets contain many infrequently used characters. In addition, data intended to be mapped into a given character set generally does not contain characters from all blocks of the Unicode Standard (usually, only a few blocks at a time need to be transcoded to a given character set). This situation leaves large sections of the 64K-sized reverse mapping tables (containing the default character, or unmappable character entry) unused, and therefore paged to disk.
Ranges. It may be tempting to "optimize" these tables for space by providing elaborate provisions for nested ranges or similar devices. This practice leads to unnecessary performance penalties on modern, highly pipelined processor architectures because of branch penalties.
A faster solution is to use an optimized two-stage table, which can be coded without any test or branch instructions. Hash tables can also be used for space optimization, although they are not as fast as multistage tables.
Two-Stage Tables (Single index tries). Two-stage (high-byte) tables are a commonly employed mechanism to reduce table size (see Figure 5-1). They use an array of 256 pointers and a default value. If a pointer is NULL, the returned value is the default. Otherwise, the pointer references a block of 256 values. If full support for supplementary characters is required, three-stage tables (or double index tries) are a better solution.
Figure 5-1. Two-Stage Tables [TBD]
Optimized Two-Stage Table. Wherever any blocks are identical, the pointers just point to the same block. For transcoding tables, this case occurs generally for a block containing only mappings to the "default" or "unmappable" character. Instead of using NULL pointers and a default value, one "shared" block of 256 default entries is created. This block is pointed to by all first-stage table entries for which no character value can be mapped. By avoiding tests and branches, this strategy provides access time that approaches that of a simple array access, but at a great saving in storage.
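The optimized two-stage table can be sketched as follows, using the General Category of the first 64K code points as sample data taken from Python's unicodedata module. The function names build_two_stage and lookup are assumptions for this example, and Python lists stand in for the arrays and pointers of a lower-level implementation. Identical blocks are stored once, so no NULL test is needed on lookup.

```python
import unicodedata

BLOCK_SIZE = 256  # second-stage block width, matching a high-byte split

def build_two_stage(values):
    """Split a flat table into a first-stage index and shared blocks."""
    seen = {}    # block contents -> position in 'blocks'
    blocks = []  # distinct second-stage blocks
    index = []   # first stage: one block number per high byte
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])  # identical blocks share one entry
    return index, blocks

def lookup(index, blocks, cp):
    """Two array accesses, with no test or branch on the lookup path."""
    return blocks[index[cp >> 8]][cp & 0xFF]

flat = [unicodedata.category(chr(cp)) for cp in range(0x10000)]
index, blocks = build_two_stage(flat)
print(len(index), "index entries,", len(blocks), "distinct blocks")
```

Large uniform areas, such as ideographs, surrogates, and private use code points, collapse into a few shared blocks, which is where most of the saving comes from.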
Given an arbitrary 64K table, it is a simple matter to write a small utility that can calculate the optimal number of stages and their width.
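A minimal sketch of such a utility is given below. It is restricted to two stages and simply tries every power-of-two block width, counting table entries rather than bytes (that is, it assumes index entries and value entries have the same size). The function names two_stage_cost and best_shift are assumptions for this example.

```python
def two_stage_cost(values, shift):
    """Total entries needed for a two-stage table with blocks of 2**shift."""
    size = 1 << shift
    distinct = {tuple(values[s:s + size]) for s in range(0, len(values), size)}
    first_stage = len(values) // size    # one index entry per block
    second_stage = len(distinct) * size  # identical blocks stored once
    return first_stage + second_stage

def best_shift(values, shifts=range(1, 16)):
    """Return the block-width exponent with the smallest total cost."""
    return min(shifts, key=lambda s: two_stage_cost(values, s))

# A completely uniform 64K table is cheapest with 256 blocks of 256 entries.
uniform = [0] * 0x10000
print(best_shift(uniform), two_stage_cost(uniform, best_shift(uniform)))
```

Extending the search to three or more stages amounts to applying the same measurement recursively to the first-stage index.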
[additional guidelines, TBD]
The author wishes to thank Ken Whistler and Mark Davis for their insightful comments.
1 First version for public review
Copyright © 2000-2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.