Authors | Ken Whistler, Asmus Freytag (asmus@unicode.org) |
Date | 2004-10-27 |
This Version | http://www.unicode.org/reports/tr33/tr33-1.html |
Previous Version | None |
Latest Version | http://www.unicode.org/reports/tr33/ |
Revision | 1 |
Summary
This is the first working draft of a proposed Unicode Conformance Model.
Status
This document is a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
[Note to reviewers: Some sections are slated to be expanded in later drafts. In those cases, ed. notes indicate the content or direction of planned changes.]
The Unicode Standard [Unicode] is a very large and complex standard. Because of this, and because of the nature and role of the standard, it is often difficult to determine, in any particular case, exactly what conformance to the Unicode Standard means.
The Unicode Standard forms the foundation which supports a large variety of operations on textual data, from data interchange protocols to complex tasks like sorting, rendering or content analysis. All of these processes expose implementations to the complexities of human languages and writing systems.
Earlier character sets were either small or had a clearly limited field of application (for example, a geographical area), or both. By contrast, the Unicode Standard aims to be universal. A universal character encoding standard cannot rely on implicit agreements about the nature and behavior of the characters it encodes; it must provide explicit constraints on their identity and intended use. At the same time, the standard must allow implementations the flexibility needed to meet the expectations of their users, while providing enough constraints to guarantee predictable interchange of data and consistency between implementations.
This Conformance Model explains the issue of conformance relating to the Unicode Standard so that users better understand the contexts in which products are making claims for support of the Unicode Standard, and implementers better understand how to meet the formal conformance requirements while satisfying the expectations of their users. It does not alter, augment or override the actual Unicode conformance requirements found in the text of the Unicode Standard. Rather it attempts to provide a conceptual framework to make it easier for users and implementers to identify and understand the specific conformance requirements contained in [Unicode].
This model defines conformance terminology, specifies different areas and levels of conformance, and describes what it means to make a claim of conformance or "support" of the standard. This model is not a framework for conformance verification testing, although it could be used to develop such a framework, should that prove desirable. At this time no such framework has been developed by the Unicode Consortium, nor have any conformance verification tests been required or sanctioned.
Many of the concepts presented here are equally applicable to other standards developed by the Unicode Consortium, such as the Unicode Collation Algorithm [UCA], and the specifications for Unicode support in regular expressions [RegEx].
This section gives a basic introduction to the terminology that will be discussed in more detail in sections below.
In the context of formal standards, conformance refers to a set of rules or criteria whereby a relevant entity such as an element of information interchange, a device, an application, or a piece of hardware, can be evaluated as either meeting or not meeting the specification in the standard. In general, a formal standard will have a conformance clause or clauses, which will be stated in terms of conditionals, such as "X is in conformance with Y specification of this standard if Z", or modals, often in uppercase, such as "An X that conforms with Y specification of this standard SHALL Z". The modal verbs that standards language commonly associates with such statements are often carefully defined to avoid any ambiguities in interpretation. In common practice, they involve specialized usage of "SHALL" and "MUST" for requirements, but also "MAY" for permitted deviations and "SHOULD" for non-binding recommendations.
If a standard is complex, the conformance clause or clauses themselves may also be complex. Occasionally, a conformance clause may simply be stated along the lines of "X is in conformance with this standard if it follows the specification in section W" where section W may consist of hundreds of pages and constitute most of the rest of the standard.
The term compliance is often used synonymously with the term conformance and will be used that way in this model.
Formal standards often distinguish between normative and informative content. This distinction may be highly conventionalized, or even subject to rules specified in other standards, such as for ISO standards, or the distinction may be less formally maintained.
Normative content of a standard is content which is required for all of the conformance requirements to be meaningful. Typically a standard will have normative definitions for terms used in the rest of the specification, normative references to other standards or sources whose content is referred to indirectly, and normative clauses, specifications, or sections, which actually define the content of the standard to which the conformance clauses apply.
Informative content of a standard is all material which has been added for clarification, but which, in the judgment of the standard's maintainers, could in principle be omitted without materially affecting the specification to which the conformance clauses refer. If a standard is changed over time, the status of some particular content could change from informative to normative, or vice versa, depending on whether it became required for conformance or was no longer required for conformance.
In the context of the Unicode Conformance Model, conformance verification means an external (third party) determination that under some specified set of circumstances an entity meets one or more requirements of the conformance clauses of the standard. In other words, while conformance clauses are merely a logical statement of requirements, conformance verification implies the existence of conformance verification tests that have been applied to entities in order to make such determinations.
The Unicode Consortium does not endorse a particular methodology for conformance verification.
A standard may include tests or "benchmarks" as part of the text of the standard, or as external documents associated with the standard. While there is some overlap in general usage of the terms "conformance test" and "conformance verification tests", a systematic distinction is drawn between the two in the Unicode Conformance Model.
A conformance test for the Unicode Standard is a list of data certified by the Unicode Technical Committee [UTC] to be "correct" with regard to some particular requirement for conformance to the standard. In some instances, for example the implementation of the bidirectional algorithm, producing a definitive list of correct results is difficult or impossible, and in such cases a conformance test may consist of an implemented algorithm certified by the UTC to produce correct results for any pertinent input data. Conformance tests for the Unicode Standard are essentially benchmarks that someone can use to determine if their algorithm, API, etc., claiming to conform to some requirement of the standard, does in fact match the data that the UTC asserts define such conformance.
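As an illustration of how such benchmark data might be used, the following Python sketch checks a normalization implementation against records laid out in the five-field style of a normalization test file (source; NFC; NFD; NFKC; NFKD). The two sample records, the function names, and the choice of Python's unicodedata module as the implementation under test are assumptions made for this example; they are not part of this report.

```python
import unicodedata

# Illustrative records only. Fields: source;NFC;NFD;NFKC;NFKD;
SAMPLE_RECORDS = [
    "1E0A;1E0A;0044 0307;1E0A;0044 0307;",
    "212B;00C5;0041 030A;00C5;0041 030A;",
]
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def parse_field(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_record(record, normalize=unicodedata.normalize):
    """Return the normalization forms for which `normalize` disagrees with the record."""
    fields = record.split("#", 1)[0].strip().split(";")
    source = parse_field(fields[0])
    expected = [parse_field(f) for f in fields[1:5]]
    return [form for form, want in zip(FORMS, expected)
            if normalize(form, source) != want]


def run_conformance_test(records, normalize=unicodedata.normalize):
    """Map each failing record to the list of forms that failed."""
    failures = {}
    for record in records:
        failed = check_record(record, normalize)
        if failed:
            failures[record] = failed
    return failures
```

An empty result means the implementation matched every record supplied; it says nothing about inputs that the data does not cover.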
A conformance verification test for the Unicode Standard is a test, usually designed and implemented by a third party not associated with the Unicode Consortium or the UTC, intended to test a product which claims conformance to one or more aspects of the Unicode Standard, for actual conformance to the standard. Thus a conformance verification test is a test of a product. Such a test may, of course, make use of one or more of the Unicode conformance tests to determine the results of its conformance verification.
In the context of the Unicode Conformance Model, the term support refers to a more generalized claim of intent to conform to one or another requirement of the standard. A claim of Unicode support may in fact be difficult to verify, because it can be vague in detail. However, at least it indicates in principle that the developer or user of an entity intends conformance. More specifically, support often refers to a claim of particular repertoire coverage. For example, an application may claim support for Unicode Greek. That should be interpreted as meaning that Unicode Greek characters will be handled in conformance with the standard, and that all other relevant aspects of processing of those characters with which that particular application is concerned, will be done in such a way as not to violate the conformance clauses of the standard.
The Unicode Standard is regularly versioned, as new characters are added. A formal system of versioning is in place, involving three levels of versions: major, minor, and update.
All three levels have carefully controlled rules for the type of documentation required, handling of the associated data files, and allowable types of change between versions. For more information about Unicode versioning see [Versions]. Other standards developed by the Unicode Consortium may use a single level versioning scheme.
Conformance claims clearly must be specific to versions of the Unicode Standard, but the level of specificity needed for a claim may vary according to the nature of the particular conformance claim being made. Some standards developed by the Unicode Consortium require separate conformance to a specific version (or later) of the Unicode Standard. This version is sometimes called the base version. In such cases, the version of the standard and the version of the Unicode Standard to which the conformance claim refers must be compatible.
If a technical deficiency in the specifications of the Unicode Standard is identified, it may be corrected by a change in the next version, or, if sufficiently important, by a formal corrigendum. A corrigendum often applies to several earlier versions, but does not retroactively change them. Implementations can claim conformance to any of these versions with the given corrigendum applied. For more on corrigenda see [Versions].
This approach to corrigenda differs from the approach in other standards organizations, such as ISO.
Errata are used to describe other known defects in the text. Unlike corrigenda they cannot be referenced in a conformance claim. For more information on errata see [Errata].
Each version of the Unicode Standard, once published, is absolutely stable and will never change. Implementations or specifications that refer to a specific version of the Unicode Standard can rely upon this stability. If future versions of these implementations or specifications upgrade to a future version of the Unicode Standard, then some changes may be necessary.
Some formal standards are developed once and then are essentially frozen and stable forever. For such standards, stability of content and the corresponding stability of conformance claims is not an issue.
For a standard aimed at the universal encoding of characters, such stability is not possible. The standard is necessarily evolving and expanding over time, to extend its coverage to include all the writing systems of the world. And as experience in its implementation accumulates, further aspects of character processing are added to the formal content of the standard. This fundamentally dynamic quality of the Unicode Standard complicates issues of conformance, because of the continually expanding content to which conformance requirements pertain. This expansion is both an expansion in breadth by adding more characters, and scripts, and in depth by adding more aspects of character processing.
Invariance refers to those aspects of the content of the Unicode Standard that have been formally defined as unchangeable, even as the standard continues its development. The guarantee of the stability of the formal Unicode character names is a fairly trivial example. While in principle such names could be changed, and were changed once between Version 1.0 and Version 1.1, the [UTC] has determined that such changes are too disruptive and have too little benefit to be tolerated. Accordingly, the stability of character names has been promoted to the status of an invariant in the standard.
A further discussion of invariance and invariants can be found in [PropertyModel]. Invariants guard against change for the sake of change, or technological drift, but they also prevent the correction of clerical errors, which is not a negligible issue in a standard as large and complex as the Unicode Standard. For a current list of invariants and a discussion of the tradeoffs, see the Unicode Stability Policy for Character Encoding and Character Properties [Stability].
Conformance claims need to be distinguished in terms of their relationship to invariants and non-invariants in the standard because of their different risk levels for stability.
This section will serve as a guide to the particular way that the Unicode Standard expresses conformance requirements, both in terms of where they are located and how they are expressed. It also explores the peculiar aspects of conformance related to the synchronized status of the Unicode Standard and the independent but closely aligned International Standard ISO/IEC 10646, which has its own conformance clauses expressed using ISO conventions.
Chapter 3, "Conformance" of [Unicode] contains formal definitions of terms referenced in the conformance clauses. While modifications of these definitions between versions of the Unicode Standard have been, and will continue to be necessary, every effort is made to keep the numbering of the definitions stable. This makes it easier to maintain external specifications that cite a particular definition.
The conformance clauses in Section 3.2, "Conformance Requirements" of [Unicode] define the requirements for a conformant implementation. They are expressed in terms of the definitions, but also refer to additional specifications contained in Unicode Standard Annexes. While modifications of these clauses between versions of the Unicode Standard have been, and will continue to be necessary, every effort is made to keep the numbering of the clauses stable. This makes it easier to maintain external specifications that cite a particular clause.
A Unicode Standard Annex (UAX) contains part of the standard, published as a standalone document. The relation between conformance to the Unicode Standard and conformance to each of the Unicode Standard Annexes is spelled out in detail in Section 3.2, "Conformance Requirements" of [Unicode]. Some of the conformance clauses refer explicitly to specifications contained in UAXs, such as the Bidirectional Algorithm [Bidi] or Normalization Forms [Normalization]. Normative material in other UAXs is defined by any of the mechanisms described below.
Other standards developed by the Unicode Consortium have their own conformance model.
[Text for 3.4 TBD]
[Ed. Note: Mention that UCD.html defines which are normative properties. See also property model.]
Unicode algorithms are specified as a series of logical steps. In many cases, the input to the algorithm is a string of character properties: in other words, the results of the algorithm are identical for different input strings, as long as each input string maps to the same string of character property values. Conformance to a Unicode algorithm does not require repeating the steps as described, but rather requires achieving the same outputs for the same inputs. This provides the necessary flexibility for implementations to pursue optimizations. Whether or not conformance to a given algorithm is required by Unicode conformance, implementations claiming to implement one of these algorithms must do so in conformance with its specification.
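The point that conformance is judged by outputs rather than by steps can be sketched in Python. The example below compares two independent routes to the canonical decomposition of Hangul syllables: an arithmetic routine using the constants given for Hangul syllable decomposition in [Unicode], and the NFD result from Python's unicodedata module. The function names and the comparison helper are assumptions made for this illustration.

```python
import unicodedata

# Constants for arithmetic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588
S_COUNT = L_COUNT * N_COUNT      # 11172


def decompose_hangul_arithmetic(ch):
    """Decompose one Hangul syllable by arithmetic; return other input unchanged."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    parts = [chr(lead), chr(vowel)]
    if trail != T_BASE:          # T_BASE itself means "no trailing consonant"
        parts.append(chr(trail))
    return "".join(parts)


def decompose_with_library(ch):
    """Second implementation: delegate to the library's NFD."""
    return unicodedata.normalize("NFD", ch)


def same_outputs(impl_a, impl_b, inputs):
    """True if both implementations agree on every input."""
    return all(impl_a(x) == impl_b(x) for x in inputs)


ALL_SYLLABLES = [chr(S_BASE + i) for i in range(S_COUNT)]
```

If `same_outputs(decompose_hangul_arithmetic, decompose_with_library, ALL_SYLLABLES)` holds, the two routines are indistinguishable on this input set even though they share no internal steps.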
Some algorithms provide explicit methods for tailoring, or customizing a general algorithm to the needs of a specific language, locality or application. Other algorithms simply describe the best default practice, and customization is assumed for any practical application. An example of this is the line breaking algorithm in [LineBreak]. Whether or not conformance to a given algorithm is required by Unicode Conformance, implementations claiming to implement one of these algorithms must disclose the use of tailoring or customization.
The Unicode Standard and ISO/IEC 10646 share the same repertoire of coded characters, including the character code position, character name and identity. However, the two standards differ in the precise terms of their conformance specifications. Any conformant Unicode implementation will conform to ISO/IEC 10646, but because the Unicode Standard imposes additional constraints on character semantics and transmittability, not all implementations that are compliant with ISO/IEC 10646 will be compliant with the Unicode Standard. For a detailed description see Appendix C, "Relationship to ISO/IEC 10646" of [Unicode].
There are several broad areas of application where Unicode Conformance makes specific types of requirements. Because not all applications and implementations cover all these areas, some aspects of Unicode conformance may not be applicable to them.
Unicode Technical Report #32: Assessing Unicode Support [UTR32] discusses ways to assess the support for the Unicode standard in several common implementation areas.
Representation covers all aspects of being able to express and transmit Unicode data. It is a requirement applicable to certain protocols (for example, XML), but might apply to the storage aspects of databases and other file formats as well. Conformant representation applies to correct use of encoding forms and encoding schemes, as well as the ability to represent all Unicode code points. In addition, issues related to [Normalization] are important.
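As a small illustration of encoding forms, the Python sketch below shows one supplementary code point, U+10400, serialized in UTF-8, UTF-16, and UTF-32. The big-endian byte order and the function name are choices made for this example only.

```python
def encoded_bytes(ch):
    """Return the hex bytes of `ch` in three encodings (big-endian where applicable)."""
    return {
        "UTF-8": ch.encode("utf-8").hex(),
        "UTF-16BE": ch.encode("utf-16-be").hex(),
        "UTF-32BE": ch.encode("utf-32-be").hex(),
    }


def round_trips(ch, encoding):
    """True if encoding then decoding returns the original character."""
    return ch.encode(encoding).decode(encoding) == ch


SAMPLE = "\U00010400"
# encoded_bytes(SAMPLE) gives:
#   UTF-8    f0909080   (four one-byte code units)
#   UTF-16BE d801dc00   (a surrogate pair)
#   UTF-32BE 00010400   (one 32-bit code unit)
```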
Conformant transcoding between Unicode and all other, so-called legacy character encodings, retains the identity of the transcoded characters. In addition, it may claim to retain a specific normalization form for the converted data. See [Normalization]. [CharMapML] defines a format for expressing character mappings. Implementations may choose to conform to that format in order to be able to interchange mapping tables.
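A minimal Python sketch of these two checks follows, using ISO 8859-1 as the legacy encoding. The helper names are hypothetical, and Python's built-in codec stands in for whatever mapping table a real implementation would use.

```python
import unicodedata


def preserves_identity(data, legacy_encoding):
    """True if legacy bytes survive a round trip through Unicode unchanged."""
    text = data.decode(legacy_encoding)
    return text.encode(legacy_encoding) == data


def is_nfc(text):
    """True if `text` is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Every byte value is defined in ISO 8859-1, so all 256 can be tested.
ALL_LATIN1_BYTES = bytes(range(256))
CONVERTED = ALL_LATIN1_BYTES.decode("iso-8859-1")
```

Here `preserves_identity(ALL_LATIN1_BYTES, "iso-8859-1")` checks that no character identity is lost, and `is_nfc(CONVERTED)` checks a claim that the converted data is in a particular normalization form.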
String processing covers all operations on Unicode texts that can be carried out without considering layout and specifically without considering fonts. String processing encompasses a large variety of operations including, but not limited to, text segmentation, text parsing, handling regular expressions, searching, and sorting, as well as creating formatted text representation of data types. For a number of these operations, model algorithms and other specifications exist to which an implementation may claim conformance, such as [UCA], [RegEx], [Boundaries], and [LineBreak].
Layout comprises all operations that go from backing store to displayed text. The same operations are run in reverse for selection. These operations are dependent on font data, but are considered separately from fonts because the same implementation typically can work with a range of different fonts. Some operations, such as suppressing the display of certain ignorable code points are typically handled by the layout system without involving fonts. Conformance issues for layout processes include reordering from logical to display ordering, as well as positional shape selection. For bidirectional reordering, conformance to [Bidi] is required. For positional shaping and script-specific layout, model algorithms exist, or are being developed for Arabic and Syriac, Devanagari, Tamil and other Indic Scripts, as well as Mongolian. While the requirements of high end typography typically exceed these script-specific specifications, conformance requires a relation between specific constructs in the writing system and corresponding character code sequences, so that these constructs can be interchanged reliably.
[Ed. note. Add example, e.g. use of ZW(N)J in Indic scripts. (expressing linguistic constructs).]
The Unicode Standard does not standardize the actual appearance of characters, but instead intends that they should be depicted within a customary range of design interpretations. Conformance to the Unicode Standard therefore primarily refers to those tables in the fonts that correlate character codes with the glyphs in the font, for example 'cmap' tables, and to claims of "coverage" of the Unicode repertoire by fonts.
Conformance-related issues for character input consist of coverage of the Unicode repertoire, conversion of input to Unicode character values for storage, and consistency with the text models required for particular scripts and text layout. The entities here are mostly IMEs and keyboards (drivers).
Unicode Technical Standard #18, Unicode Regular Expressions [RegEx], is an example of a standard that has well-defined levels of conformance. Each implementation can claim conformance to a specific level, and each level makes specific conformance requirements. By contrast, conformance to the Unicode Standard is not organized into such discrete levels. However, there are some areas where the standard allows limited or partial support of some requirements.
The Unicode Standard explicitly does not require that all implementations support all Unicode characters. Any implementation may support an arbitrary subset of Unicode characters, and in fact, may support different sets of characters for different operations.
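One way to picture per-operation subsets is a table of supported code point ranges keyed by operation, as in the Python sketch below. The operations, the table contents, and the function name are all hypothetical; the ranges shown are Basic Latin (U+0000..U+007F) and Greek and Coptic (U+0370..U+03FF).

```python
# Hypothetical repertoire declaration: this implementation displays Greek
# but sorts only Basic Latin.
SUPPORTED_RANGES = {
    "display": [(0x0000, 0x007F), (0x0370, 0x03FF)],
    "sorting": [(0x0000, 0x007F)],
}


def unsupported_code_points(text, operation, supported=SUPPORTED_RANGES):
    """List the code points in `text` outside the declared repertoire for `operation`."""
    ranges = supported[operation]
    return [ord(ch) for ch in text
            if not any(lo <= ord(ch) <= hi for lo, hi in ranges)]
```

With this table, the Greek string "\u03b1\u03b2" is fully covered for display but not for sorting.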
However, for certain algorithms, any implementation that claims conformance is required to support the full range of Unicode code points covered by that algorithm. For example, an implementation of normalization, or a UTF-8 converter is required to support the entire range of Unicode code points.
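A full-range check for a UTF-8 converter can be sketched in Python as follows. The sketch assumes that surrogate code points (U+D800..U+DFFF) are excluded from the round trip and must instead be rejected, since they cannot be represented in well-formed UTF-8; the function names are illustrative.

```python
def all_scalar_values():
    """Yield every code point from U+0000 to U+10FFFF except the surrogates."""
    return (cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)


def utf8_round_trips_everywhere():
    """True if the platform's UTF-8 codec round-trips the whole range."""
    text = "".join(chr(cp) for cp in all_scalar_values())
    return text.encode("utf-8").decode("utf-8") == text


def rejects_surrogate(cp=0xD800):
    """True if encoding a lone surrogate code point is refused."""
    try:
        chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False
```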
Note: an implementation may define an algorithm, such as identifier matching, that uses normalization as part of the algorithm but also restricts the allowable set of input characters. In that case, any implementation of that algorithm is free to use a limited implementation of normalization, because the limit on the input makes it impossible to distinguish between a full and limited implementation of normalization.
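The note can be made concrete with a Python sketch. It assumes a hypothetical identifier syntax restricted to ASCII letters, digits, and underscore; for such input, NFC leaves every string unchanged, so a "limited" normalizer that does nothing cannot be told apart from a full one.

```python
import unicodedata

# Hypothetical identifier repertoire for this example.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "0123456789_")


def limited_nfc(text):
    """Limited normalizer: valid only for text drawn from ALLOWED."""
    return text


def full_nfc(text):
    """Full normalizer from the standard library."""
    return unicodedata.normalize("NFC", text)


def identifiers_match(a, b, normalize=limited_nfc):
    """Compare two identifiers after normalization, rejecting disallowed input."""
    for ident in (a, b):
        if not ident or not set(ident) <= ALLOWED:
            raise ValueError("identifier outside the allowed repertoire")
    return normalize(a) == normalize(b)
```

Because the input check runs before normalization, swapping `limited_nfc` for `full_nfc` cannot change any result of `identifiers_match`.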
[Ed. Extending the notion of covering a repertoire: Interpretation of sequences.]
This and the next section consider conformance separately for each of the major areas of Section 4. Full conformance in a given area is not necessarily the same as full support for that area, as conformance requirements in many cases are minimal requirements. Exceptions are certain well-defined areas such as encoding forms or normalization that have few or no options and few or no levels.
This section will provide a typology for levels of conformance in an area, presenting an alternative to the notion that all aspects of Unicode conformance are either/or issues, together with specific lists of levels of conformance and support where they can be pulled out of the standard.
[Ed.: For example, the standard explicitly talks about levels of surrogate support, which is an example that should be abstracted, along with others, to provide the basis for determining how to make various claims of conformance.]
[Ed. This section could describe best practices of deciding levels of conformance or it could describe how conformance requirements relate to best practices in a given area.]
[ The following content is just sketched out in outline form. Could also cover what should be tagged with a Unicode version and when.]
Conformant implementations will have to interact with both down-level and up-level implementations. This creates particular issues. The Unicode conformance requirements are structured to encourage implementations to passively support data containing characters assigned in future versions of the standard.
[Ed. Note: Describe any additional strategies the standard follows. What are implementation strategies?]
For several important properties, [Unicode] provides explicit support for implementations that need to be compatible with a down-level version of the relevant algorithm. This is usually done by guaranteeing the stability of property assignments [Stability]. In some cases, specific properties are introduced that isolate an algorithm from changes in a character's General Category. For an example, see the section on backwards compatibility in UAX#31: Identifier and Pattern Syntax [Identifier].
[Ed. Note: Describe any additional strategies the standard follows. Are there other implementation strategies?]
For most properties, there is a single default value that down-level systems can apply to unassigned characters when present in data sent from up-level systems. Where the fallback represented by such a default value would give particularly poor results, the [UCD] or [Unicode] provides for several ranges with different default values. Such default values increase the chance that an actual property assigned to a new character will be the same as the default value for its code point in the down-level version of [Unicode]. One example is the [Bidi] properties, which default to strong right-to-left for areas of the code space earmarked for RTL scripts.
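A range-based default lookup of this kind might look like the following Python sketch. The two ranges listed are illustrative only (the Hebrew block and the Hebrew presentation forms); the authoritative default ranges are those published with the [UCD]. The property table, the class labels "L" and "R", and the function name are assumptions made for the example.

```python
# Illustrative ranges whose unassigned code points default to right-to-left.
RTL_DEFAULT_RANGES = [(0x0590, 0x05FF), (0xFB1D, 0xFB4F)]

# Hypothetical excerpt of the property table known to a down-level system.
ASSIGNED_BIDI = {0x0041: "L", 0x05D0: "R"}


def bidi_class(cp, assigned=ASSIGNED_BIDI, rtl_ranges=RTL_DEFAULT_RANGES):
    """Return the known Bidi class, or a range-dependent default if unassigned."""
    if cp in assigned:
        return assigned[cp]
    if any(lo <= cp <= hi for lo, hi in rtl_ranges):
        return "R"   # area earmarked for RTL scripts
    return "L"       # general default for everything else
```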
A common implementation technique is to use dynamic assignment of implementation specific default values, based on the actual property values of characters surrounding an unassigned code point. Such interpolation of character properties can further increase the chance that any given code point is treated compatibly by a down-level system.
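A minimal Python sketch of such interpolation is shown below. It assumes property values are given as a list with None marking unassigned code points, and it prefers the nearest preceding known value over the nearest following one; both the representation and that preference are choices made for this example, not requirements of the standard.

```python
def interpolate_properties(values, fallback):
    """Fill None entries from the nearest known neighbour, else use `fallback`."""
    values = list(values)
    result = []
    for i, value in enumerate(values):
        if value is None:
            before = next((v for v in reversed(values[:i]) if v is not None), None)
            after = next((v for v in values[i + 1:] if v is not None), None)
            if before is not None:
                value = before
            elif after is not None:
                value = after
            else:
                value = fallback
        result.append(value)
    return result
```

For example, an unassigned code point between two characters whose property value is "Greek" would be treated as "Greek".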
It is generally not helpful to tag data created by an implementation with the version level of Unicode supported by that implementation. Because the repertoire of that version of Unicode is far larger than the actual set of characters used in the data, a large part of text data created and interchanged worldwide can be represented in all versions of Unicode. Therefore, the version level of the implementation bears little relation to the repertoire needed to cover the data.
Most implementations will not equally support the entire repertoire of Unicode characters for a given version. In fact, there is no conformance requirement to support any specific part of the repertoire. Therefore, even if the version level of a receiving implementation is higher than that of the creating implementation there is no guarantee that both support the repertoire covered by the data, or support it equally well.
[Unicode] defines no method for enumerating or identifying common sub-repertoires of the standard, but ISO/IEC 10646 does so. Implementations can use the [DerivedAge] for each character code to avoid sending character codes to a down-level system which lacks a definition for them. Because character coding is strictly additive, implementations receiving data can easily identify characters that are not defined in the version of the standard to which they conform and take appropriate action. In many cases, appropriate action consists of passing through such data, or treating them as characters possessing default properties. (See UTR#23: Unicode Character Property Model [PropertyModel] for more details on default properties).
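The sender-side check described above can be sketched in Python. The three data lines are an illustrative excerpt written in the layout of [DerivedAge] (code point or range, semicolon, version); a real implementation would read the complete file. All function names are hypothetical.

```python
# Illustrative excerpt in the DerivedAge layout; not the complete data.
SAMPLE_DERIVED_AGE = """\
0000..01F5    ; 1.1
20AC          ; 2.1
10300..1031E  ; 3.1
"""


def parse_derived_age(text):
    """Parse lines of the form 'XXXX[..YYYY] ; M.N' into (lo, hi, (M, N)) tuples."""
    ages = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code_points, version = (part.strip() for part in line.split(";"))
        first, _, last = code_points.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ages.append((lo, hi, tuple(int(n) for n in version.split("."))))
    return ages


def age_of(cp, ages):
    """Return the version in which `cp` was assigned, or None if not listed."""
    for lo, hi, version in ages:
        if lo <= cp <= hi:
            return version
    return None


def too_new_for(text, receiver_version, ages):
    """List code points in `text` that a receiver at `receiver_version` cannot know."""
    result = []
    for ch in text:
        age = age_of(ord(ch), ages)
        if age is None or age > receiver_version:
            result.append(ord(ch))
    return result


AGES = parse_derived_age(SAMPLE_DERIVED_AGE)
```

A sender could call `too_new_for(text, (2, 0), AGES)` before transmitting to a Version 2.0 system and then substitute, escape, or pass through the reported code points as its own policy dictates.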
[Ed. note: Additional input on differentiating implementations into input and output, extending the repertoire from characters to sequences. Define the responsibility of people assembling systems from systems: conformance of the whole from conformance of the parts.]
A mere matching of version numbers between an implementation and components it relies on will not be sufficient, because components may subset the repertoire they support or choose a different level of conformance, where available.
[10646] | International Organization for Standardization. Information Technology--Universal Multiple-Octet Coded Character Set (UCS). (ISO/IEC 10646:2003). For availability see http://www.iso.org |
[14651] | International Organization for Standardization. Information Technology--International String ordering and comparison--Method for comparing character strings and description of the common template tailorable ordering. (ISO/IEC 14651:2001). For availability see http://www.iso.org |
[Bidi] | Unicode Standard Annex #9: The Bidirectional Algorithm http://www.unicode.org/reports/tr9/ |
[Boundaries] | Unicode Standard Annex #29: Text Boundaries http://www.unicode.org/reports/tr29/ |
[CharMapML] | Unicode Technical Standard #22: Character Mapping Markup Language (CharMapML) http://www.unicode.org/reports/tr22/ |
[Charts] | The online code charts can be found at http://www.unicode.org/charts/ An index to characters names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html |
[DerivedAge] | The version in which a given character was added to the Unicode Standard is listed in: http://www.unicode.org/Public/UNIDATA/DerivedAge.txt |
[Errata] | Updates and errata to the Unicode Standard, as well as other technical standards developed by the Unicode Consortium can be found at http://www.unicode.org/errata |
[Feedback] | Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[Identifier] | Unicode Standard Annex #31: Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/ |
[LineBreak] | Unicode Standard Annex #14: Line Breaking Properties http://www.unicode.org/reports/tr14/ |
[Normalization] | Unicode Standard Annex #15: Normalization Forms http://www.unicode.org/reports/tr15/ |
[PropertyModel] | Unicode Technical Report #23: The Unicode Character Property Model http://www.unicode.org/reports/tr23/ |
[RegEx] | Unicode Technical Standard #18: Unicode Regular Expressions, http://www.unicode.org/reports/tr18/ |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[Stability] | Unicode Stability Policy for Character Encoding and Character Properties http://www.unicode.org/standard/stability_policy.html |
[UCA] | Unicode Technical Standard #10: Unicode Collation Algorithm http://www.unicode.org/reports/tr10/ |
[UCD] | Unicode Character Database http://www.unicode.org/ucd/ For an overview of the Unicode Character Database and a list of its associated files. |
[Unicode] | The Unicode Standard For the latest version see: http://www.unicode.org/versions/latest/. For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1) or online as http://www.unicode.org/versions/Unicode4.0.0/ |
[UTC] | The Unicode Technical Committee, see http://www.unicode.org/consortium/utc.html for more information on procedures etc. |
[UTR32] | Unicode Technical Report #32: Assessing Unicode Support http://www.unicode.org/reports/tr32/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
Thanks to Dr. Julie Allen for extensive copy-editing.
The following summarizes modifications from the previous version of this document.
1 | Initial proposed Draft. [AF] |
Copyright © 2001–2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.