Technical Reports |
Version | 2 |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | 2002-08-15 |
This Version | n/a |
Previous Version | n/a |
Latest Version | n/a |
Tracking Number | 1 |
This document describes guidelines for testing programs and systems to see if they support Unicode, and the level of support that they offer.
In today's world, software components must interact with a wide variety of other components. Systems often consist of components running on different machines and different platforms, all communicating with one another in complex ways. Unicode is fundamental to providing seamless support of all world languages, and it will appear in many different products: from operating systems to databases, from digital cameras to online games. When assembling systems, it is crucial to be able to ensure that all the different components of a system support Unicode; otherwise weak links may degrade or disable the internationalization support offered by the system as a whole.
This document describes guidelines for testing programs and systems to see if they support Unicode, and if so, to determine the level of support that they offer. These guidelines explicitly do not test for general internationalization or localization capabilities; those are out of scope for this document.
Because Unicode is such a fundamental technology, any tests for Unicode capabilities must be tailored to the specific type of product. Moreover, many of the requirements for Unicode support are only applicable to particular products. BIDI conformance, for example, may not be applicable if the product never displays text, but only processes it. Thus all of the following guidelines can only be applied to products that support or require the relevant kinds of processing described in each section.
Any assessment of Unicode support must start with conformance to the Unicode Standard itself. The Unicode Standard is a very large and complex standard. Because of this, and because of the nature of the material in the standard, it is often rather difficult to determine, in any particular case, just exactly what conformance to the Unicode Standard means. People have raised issues regarding this difficulty, both from a theoretical point of view, and from the practical standpoint of determining what products "support" Unicode, and what such claims of support actually mean.
A conformance test for the Unicode Standard is a list of data certified by the UTC to be "correct" in regard to some particular requirement for conformance to the standard. In some instances, as for example, the implementation of the bidirectional algorithm, producing a definitive list of correct results is difficult or impossible, and in such cases, a conformance test may itself consist of an implemented algorithm certified by the UTC to produce correct results for any pertinent input data. Conformance tests for the Unicode Standard are essentially benchmarks that someone can use to determine if their algorithm, API, etc., claiming to conform to some requirement of the standard, does in fact match the data that the UTC claims defines such conformance.
Some formal standards are developed once and then are essentially frozen and stable forever. For such standards, stability of content and the corresponding stability of conformance claims is not an issue. For a large, complex standard aimed at the universal encoding of characters, such as the Unicode Standard, such stability is not possible. The standard is necessarily evolving and expanding over time, to extend its coverage of all the writing systems of the world. And as experience in its implementation accumulates, further aspects of character processing also accrue to the formal content of the standard. This fundamentally dynamic quality of the Unicode Standard complicates issues of conformance, since the content to which conformance requirements pertain continually expands, both horizontally to more characters and scripts, and vertically to more aspects of character processing.
The Unicode Standard is regularly versioned, as new characters are added. A formal system of versioning is in place, involving major, minor, and update versions, all with carefully controlled rules for the type of documentation required, handling of the associated data files, and allowable types of change between versions. For more information about the details of Unicode versioning see [Versions]. Conformance claims clearly must be specific to versions of the Unicode Standard, but the level of specificity needed for a claim may vary according to the nature of the particular conformance claim being made.
Because the criteria for conformance to the Unicode Standard apply to a wide range of possible systems, they sometimes do not require the level of quality in behavior or display that one would require of a production system. For example, a badly drawn, low-resolution depiction of an 'a' is conformant, but would not be acceptable in practice. The guidelines and tests in this document therefore go beyond what the conformance clauses of the standard strictly require.
In many cases below, precise tests cannot be formulated, since the types of processes are so varied. In such cases, for example for Protocols, examples are given of the types of tests that can be formulated.
In a number of cases, examples are formulated using code snippets. These snippets are only examples; there is no implication about the use of any particular programming language, nor that any particular syntax is better or worse than any other.
The most fundamental requirements for Unicode support are the following:
Canonical equivalence is defined in the Unicode Standard: two strings are canonically equivalent when their NFD transformations are identical. For example, all of the following strings are canonically equivalent. (Note: the semicolon represents a real character in the string, while a comma is just for display, to show a sequence of characters. Character names are given in parentheses.)
It is not always clear what canonical equivalence requires of processes. Essentially, a process (or function) respects canonical equivalence when canonically equivalent inputs always produce canonically equivalent outputs.
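The definition above (identical NFD transformations) can be checked directly. The following is a minimal sketch in Python, assuming the standard `unicodedata` module as the normalization implementation; the helper name `canonically_equivalent` is hypothetical and is not part of any Unicode API.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True when the NFD forms of a and b are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Three spellings of A-ring: precomposed letter, letter plus combining ring,
# and ANGSTROM SIGN (illustrative inputs chosen for this sketch).
forms = ["\u00C5", "A\u030A", "\u212B"]
all_equivalent = all(canonically_equivalent(forms[0], f) for f in forms)
```

The result depends on the version of the Unicode Character Database bundled with the Python runtime, so a real test should record which Unicode version is being checked.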
Note:
The "canonically equivalent" inputs or outputs are not just limited to strings, but are also relevant to the offsets within strings, since those play a fundamental role in Unicode string processing.
A Unicode string is simply an ordered sequence of Unicode code units. An offset into a string is a number from 0 to n, where n is the length of the string, and indicates a position that is logically between code units (or at the very front or end in the case of 0 or n). Not only can we speak of two strings X and Y being canonically equivalent, we can also speak of two offsets P and Q (into X and Y) as being canonically equivalent; they are if the substring of X up to P and the substring of Y up to Q are canonically equivalent. Note that the length of a string is also an offset.
Example:
Given that, we can now provide four examples of processes that involve canonically equivalent strings and/or offsets.
Examples
If toLower(string) respects canonical equivalence, then if toLower(<A-ring>) equals <a-ring>, then toLower(<A, ring>) must be canonically equivalent to <a-ring>, such as <a-ring> or <a, ring>.
If isUpper(string) respects canonical equivalence, then if isUpper(<A-ring>) is true, then isUpper(<A, ring>) must be true.
If isWordBreak(string, offset) respects canonical equivalence, then if isWordBreak(<A-ring, @>, 1) is true, then isWordBreak(<A, ring, @>, 2) must be true.
If nextWordBreak(string, offset) respects canonical equivalence, then if nextWordBreak(<A-ring, @>, 0) equals 1, then nextWordBreak(<A, ring, @>, 0) must be equal to 2.
Notice in the last two cases that the numeric values of the corresponding offsets are not identical (the input offsets 1 and 2 for isWordBreak, and the output offsets 1 and 2 for nextWordBreak); they are canonically equivalent.
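The string and offset cases above can be expressed as a small test harness. This is only a sketch: the helper names are hypothetical, Python's `str.lower` stands in for toLower, and Python offsets count code points rather than the code units used in the definition above (the two coincide for the BMP characters used here).

```python
import unicodedata

def canonically_equivalent(a, b):
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def offsets_equivalent(x, p, y, q):
    # Offsets are equivalent when the prefixes they delimit are equivalent.
    return canonically_equivalent(x[:p], y[:q])

def respects_equivalence(func, inputs):
    # All inputs are assumed to be canonically equivalent to one another.
    outputs = [func(s) for s in inputs]
    return all(canonically_equivalent(outputs[0], out) for out in outputs)

composed = "\u00C5@"      # <A-ring, @>
decomposed = "A\u030A@"   # <A, ring, @>

lower_ok = respects_equivalence(str.lower, [composed, decomposed])
offsets_ok = offsets_equivalent(composed, 1, decomposed, 2)
```

A boundary API could be checked in the same way, by confirming that the offsets it returns for equivalent inputs satisfy `offsets_equivalent`.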
Respecting canonical equivalence is different from preserving normalization. In a process that preserves a canonical normalization form X (NFC or NFD; see UAX #15 [Reports] for information on NFC and NFD), whenever the input strings are normalized according to X, every output string is also normalized according to X. If a process preserves a canonical normalization form, then it respects canonical equivalence, but not vice versa.
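String concatenation is a convenient illustration of the "not vice versa" case: it respects canonical equivalence, yet it does not preserve NFC. The sketch below assumes Python's `unicodedata` module; `is_nfc` and `concat` are hypothetical helper names.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    return unicodedata.normalize("NFC", s) == s

def concat(a: str, b: str) -> str:
    return a + b

left, right = "a", "\u030A"            # letter, then a lone combining ring
inputs_nfc = is_nfc(left) and is_nfc(right)
output_nfc = is_nfc(concat(left, right))  # "a" + ring composes to U+00E5 in NFC
```

Both inputs are in NFC, but the output is not, so a system relying on normalized input has to renormalize after concatenation.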
In building a system that as a whole respects canonical equivalence, there are two common strategies:
There are trade-offs for both of these strategies.
Strategy A is more robust, in general. With strategy B, any component that fails can "leak" unnormalized text into the rest of the system. This can be ameliorated somewhat by including in the flow of text through the system certain 'gatekeeper' components, which either check or force normalization.
Strategy B can be more efficient. Since each component is assured that all of its input is in a particular normalization form, processing within the component does not need to worry about normalization. For more information, see UTN #5 Canonical Equivalence in Applications.
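A 'gatekeeper' component of the kind mentioned above might look like the following sketch. The function name, its parameters, and the choice of NFC as the default form are assumptions made for illustration only.

```python
import unicodedata

def gatekeeper(text: str, form: str = "NFC", force: bool = True) -> str:
    """Check or force normalization of text entering a component.

    With force=True, unnormalized text is normalized and passed on.
    With force=False, unnormalized text is rejected with ValueError.
    """
    normalized = unicodedata.normalize(form, text)
    if normalized == text:
        return text
    if force:
        return normalized
    raise ValueError(f"input is not in {form}")

forced = gatekeeper("A\u030A")  # decomposed input is composed to U+00C5
```

Forcing is the more forgiving choice, while checking makes leaks visible during testing; which is preferable depends on the system.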
With either strategy, the components must be tested to ensure that they meet the proper conditions. For more information on testing canonical equivalence and normalization, see 8 Transformations.
Note: A Unicode string datatype is simply an ordered sequence of code units. Such a sequence may or may not be required to be in a valid UTF format, depending on the environment in which it is used. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily valid UTF-16 sequences. Because of the structure of UTF-16, it is more efficient to allow those strings to contain invalid UTF-16 sequences, such as isolated surrogates, in normal processing.
However, whenever such strings are converted into a UTF sequence for storage or interchange, they must be checked for validity so that they do not violate the requirements of that UTF format.
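As an illustration of the validity check at the interchange boundary, the sketch below relies on the fact that Python strings may hold isolated surrogates while Python's UTF-8 encoder rejects them. The function name `to_utf8_checked` is hypothetical.

```python
def to_utf8_checked(s: str) -> bytes:
    """Convert a string to UTF-8, rejecting isolated surrogates."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError as exc:
        # Python raises this for surrogate code points U+D800..U+DFFF.
        raise ValueError(
            f"string is not valid for UTF interchange: {exc.reason}"
        ) from exc

valid_bytes = to_utf8_checked("a\U00010400")
```

Other environments expose this check differently; the point is only that validation happens at conversion time rather than during normal processing.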
In the tests referenced in this document, some test cases may have non-UTF format strings to test edge cases. If an implementation imposes the condition that all strings be in a UTF format, then those test cases should be skipped.
Roundtrip tests are fairly straightforward for components that store and retrieve data, such as databases. Here is one example:
Build a small table, insert Unicode data, select the data from the table and compare the results. For instance, use the following SQL statements to create a table named "langs", insert data, select all data and search for one record:
SQL Statements | Results |
---|---|
drop table langs; | The SQL command completed successfully. |
create table langs (L1 character(10), L2 varchar(18)); | The SQL command completed successfully. |
insert into langs values ('Russian', 'русский'); | The SQL command completed successfully. |
insert into langs values ('Spanish', 'Español'); | The SQL command completed successfully. |
insert into langs values ('Czech', 'čeština'); | The SQL command completed successfully. |
insert into langs values ('Greek', 'ελληνικά'); | The SQL command completed successfully. |
insert into langs values ('Japanese', '日本語'); | The SQL command completed successfully. |
insert into langs values ('Vietnamese', 'Tiếng Việt'); | The SQL command completed successfully. |
select * from langs; | Columns L1, L2; L2 values returned: русский, Español, čeština, ελληνικά, 日本語, Tiếng Việt |
select * from langs where L2 like '%λη%'; | Columns L1, L2; L2 value returned: ελληνικά |
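The same roundtrip can be automated. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for the database under test; the connection setup and the use of parameter binding are assumptions, not part of the procedure above.

```python
import sqlite3

rows = [
    ("Russian", "русский"),
    ("Spanish", "Español"),
    ("Czech", "čeština"),
    ("Greek", "ελληνικά"),
    ("Japanese", "日本語"),
    ("Vietnamese", "Tiếng Việt"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table langs (L1 character(10), L2 varchar(18))")
conn.executemany("insert into langs values (?, ?)", rows)

# Retrieve everything and compare with what was inserted.
selected = conn.execute("select L1, L2 from langs order by rowid").fetchall()
roundtrip_ok = selected == rows

# Search with a non-ASCII pattern.
matches = conn.execute(
    "select L1, L2 from langs where L2 like ?", ("%λη%",)
).fetchall()
conn.close()
```

For a real product, the comparison should be made on the bytes or code points actually returned by the client library, since a display layer can hide corruption.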
This section is optional, since not every product does--or needs to do--conversion. However, if a product does do conversion, here are the areas to test for:
For ISO/IEC 8859 tests, download the files in http://www.unicode.org/Public/MAPPINGS/ISO8859/. The files are of the following format, with two significant fields: the first is a byte, and the second is a code point.
    0x00 0x0000 # NULL
    ...
    0xFF 0x00FF # LATIN SMALL LETTER Y WITH DIAERESIS
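A mapping file in this format can drive a converter test mechanically. The sketch below parses the two sample lines shown above (rather than downloading a file) and checks Python's `iso-8859-1` codec as a stand-in for the converter under test; the function names are hypothetical.

```python
SAMPLE = """\
0x00 0x0000 # NULL
0xFF 0x00FF # LATIN SMALL LETTER Y WITH DIAERESIS
"""

def parse_mapping(text):
    """Yield (byte, code point) pairs, ignoring comments and blank lines."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].split()
        if len(data) >= 2:
            yield int(data[0], 16), int(data[1], 16)

def check_converter(mapping_text, encoding):
    """Return the list of (byte, code point) pairs that fail to roundtrip."""
    failures = []
    for byte, cp in parse_mapping(mapping_text):
        raw, char = bytes([byte]), chr(cp)
        if raw.decode(encoding) != char or char.encode(encoding) != raw:
            failures.append((byte, cp))
    return failures

failures = check_converter(SAMPLE, "iso-8859-1")
```

Running the same loop over the full mapping file for each supported ISO/IEC 8859 part gives complete coverage of single-byte mappings.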
For compression tests and UTF tests (and if CESU-8 is supported), for each converter:
Bytes | Encoding | Code Points |
---|---|---|
EF BB BF E1 88 B4 | UTF-8 | 1234 |
EF BB BF E1 88 B4 | UTF-16/LE/BE | EFBB BFE1 88B4 |
EF BB BF E1 88 B4 | UTF-32/LE/BE | error |
FE FF 12 34 | UTF-16 | 1234 |
FE FF 12 34 | UTF-16BE | FEFF 1234 |
FE FF 12 34 | UTF-16LE/UTF-32* | error |
FF FE 34 12 | UTF-16 | 1234 |
FF FE 34 12 | UTF-16LE | FEFF 1234 |
FF FE FF FE 34 12 | UTF-16 | FEFF 1234 |
FE FF FE FF 12 34 | UTF-16 | FEFF 1234 |
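Several rows of the table can be turned into a table-driven test. The sketch below exercises Python's built-in codecs as the converters under test and covers only a subset of the rows; the codec names (`utf-8-sig`, `utf-16`, and so on) are Python's, and `None` is used here as a hypothetical marker for "error".

```python
CASES = [
    # (bytes, Python codec, expected text, or None where an error is expected)
    (b"\xEF\xBB\xBF\xE1\x88\xB4", "utf-8-sig", "\u1234"),
    (b"\xEF\xBB\xBF\xE1\x88\xB4", "utf-32", None),
    (b"\xFE\xFF\x12\x34", "utf-16", "\u1234"),
    (b"\xFE\xFF\x12\x34", "utf-16-be", "\uFEFF\u1234"),
    (b"\xFF\xFE\x34\x12", "utf-16", "\u1234"),
    (b"\xFF\xFE\x34\x12", "utf-16-le", "\uFEFF\u1234"),
    (b"\xFF\xFE\xFF\xFE\x34\x12", "utf-16", "\uFEFF\u1234"),
    (b"\xFE\xFF\xFE\xFF\x12\x34", "utf-16", "\uFEFF\u1234"),
]

def decode_or_none(data: bytes, codec: str):
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

mismatches = [
    (data, codec) for data, codec, expected in CASES
    if decode_or_none(data, codec) != expected
]
```

Note that Python's plain `utf-8` codec keeps a leading U+FEFF in its output, which is why the signature-aware `utf-8-sig` codec is used for the first row.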
Unlike some of the other sections, no specific tests are available for this section. Moreover, only general guidelines can be described for protocols, including the following:
SMTP (with/without MIME) is given as a simple example. For SMTP, there are sending and receiving clients easily available: email applications like Outlook Express and Netscape Messenger.
Sample text for the email body:
Latin: U+00F0 ð
Cyrillic: U+0436 ж
Arabic: U+0628 ب
Hindi: U+0905 अ
Hiragana: U+3042 あ
Han Ideograph: U+4E0A 上
Deseret (plane 1): U+1040C 𐐌
Han Ideograph (plane 2): U+20021 𠀡
Verify that the email contents are preserved when stored and forwarded through
this server.
Requires proper configuration of the email client/network.
In this case, as with some other protocols, the server will almost always just
pass the contents through. The test will thus just verify that the server is
8-bit clean, which is almost always the case.
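The comparison step of such a test can be sketched with Python's standard `email` package. This does not contact a server: serializing and re-parsing the message stands in for the store-and-forward hop, and the addresses, subject, and subset of sample code points are placeholders chosen for this sketch.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# A subset of the sample code points, including planes 1 and 2.
SAMPLE = "\u0436 \u0628 \u0905 \u3042 \u4E0A \U0001040C \U00020021"

msg = EmailMessage()
msg["Subject"] = "Unicode roundtrip test"
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg.set_content(SAMPLE)          # str content is labeled as UTF-8

wire = msg.as_bytes()            # what would be handed to the server
received = message_from_bytes(wire, policy=policy.default)
body = received.get_content()

# Every sample character must survive; report any that did not.
lost = [ch for ch in SAMPLE if ch not in body]
```

In a real test, `wire` would be submitted over SMTP and the message fetched again from the destination mailbox before the comparison.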
Send the email to an address that is handled by a particular client
program. Make sure that the text is fully preserved and displayed in a
reasonable way (given available fonts etc.).
Examples of email clients that are expected to have problems in this area:
Eudora and Netscape 4.x (they do not use Unicode internally, so they must
convert to subset charsets).
Some email systems (Lotus Notes, X.500, VM) use protocols other than SMTP and transform emails between SMTP and their own formats. Send emails into such systems, forward or reply to them back to a globalization-capable client, and verify full roundtrip of the text. The following is an example of a gateway/system that is expected to have problems: VM (EBCDIC encodings have a subset of the Unicode repertoire).
Note: Lotus Notes is globalization-capable (should pass the test) because LMBCS can encapsulate Unicode; it will not fully roundtrip arbitrary MIME/HTML formatting, but this is out of scope for G11N certification. All of the characters should roundtrip.
A non-SMTP email client would have to get the test email through such a gateway. The client may show an email with higher or lower fidelity than the roundtrip test into and out of the gateway: higher if only the second part of the roundtrip loses information, and lower if the roundtrip preserves some or all of the original contents in a form that is not displayed in the non-SMTP client.
With many IETF (Internet) protocols it is possible to test at least some of
the protocol elements using a telnet client or a special-purpose client (e.g.,
Java application reading/writing to sockets) by reading and writing plain text
streams directly, and using UTF-8 text for the contents.
Generally, it may be necessary to write custom test clients/servers to perform
meaningful tests of a protocol at all or to automate such tests.
Some protocols (like HTTP) allow many more charsets in direct use than SMTP.
For example, UTF-16 can be used in SMTP emails only after a base64 (or
quoted-printable) transformation, while HTTP allows the contents to be
encoded in UTF-16 directly in the byte stream.
SOAP is an XML vocabulary being defined by the W3C. A SOAP message consists of a SOAP envelope, a SOAP header, and a SOAP body. The SOAP body contains the user data used for the RPC call.
    <SOAP-ENV:Envelope>
      <SOAP-ENV:Header>
        Additional Information for SOAP message transmission
      </SOAP-ENV:Header>
      <SOAP-ENV:Body>
        Body data of SOAP-RPC message transmission
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>
SOAP request between service requester and UDDI service provider
    POST /uddisoap/publishapi HTTP/1.1
    Host: abc.def.com
    Content-Type: text/xml; charset=utf-8
    Content-Length: nnn
    SOAPAction: ""

    <?xml version="1.0" encoding="UTF-8" ?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
      <SOAP-ENV:Body>
        <save_business generic="2.0" xmlns="urn:uddi-org:api_v2">
          <authInfo>uddiUser</authInfo>
          <businessEntity businessKey="">
            <name xml:lang="ru">русский</name>
            <name xml:lang="cs">čeština</name>
            <name xml:lang="el">ελληνικά</name>
            <name xml:lang="ja">日本語</name>
            <name xml:lang="vi">Tiếng Việt</name>
          </businessEntity>
        </save_business>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>
The expected SOAP response from the UDDI service provider for the request above.
    HTTP/1.1 200 OK
    Server: ABC
    Content-Type: text/xml; charset="utf-8"
    Content-Length: nnnn
    Connection: close

    <?xml version="1.0" encoding="UTF-8" ?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
      <SOAP-ENV:Body>
        <businessDetail generic="2.0" xmlns="urn:uddi-org:api_v2" operator="operator">
          <businessEntity businessKey="14821BDD-00EA-4398-8003-24BC35F0394A" operator="operator" authorizedName="uddiUser">
            <discoveryURLs>
              <discoveryURL useType="businessEntity">http://abc.def.com:9080/uddisoap/get?businessKey=14821BDD-00EA-4398-8003-24BC35F0394A</discoveryURL>
            </discoveryURLs>
            <name xml:lang="ru">русский</name>
            <name xml:lang="cs">čeština</name>
            <name xml:lang="el">ελληνικά</name>
            <name xml:lang="ja">日本語</name>
            <name xml:lang="vi">Tiếng Việt</name>
          </businessEntity>
        </businessDetail>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>
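Checking such a response amounts to extracting the name elements and comparing them with what was sent. The sketch below uses Python's `xml.etree.ElementTree`; the message bodies are abbreviated versions of the ones above built by a hypothetical `build_body` helper, and no HTTP transport is involved.

```python
import xml.etree.ElementTree as ET

UDDI = "{urn:uddi-org:api_v2}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

EXPECTED = {"ru": "русский", "cs": "čeština", "el": "ελληνικά", "ja": "日本語"}

def build_body(element: str) -> bytes:
    """Build an abbreviated SOAP body (illustrative, not a full UDDI message)."""
    names = "".join(
        f'<name xml:lang="{lang}">{text}</name>' for lang, text in EXPECTED.items()
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8" ?>'
        '<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
        "<SOAP-ENV:Body>"
        f'<{element} generic="2.0" xmlns="urn:uddi-org:api_v2">'
        f'<businessEntity businessKey="">{names}</businessEntity>'
        f"</{element}></SOAP-ENV:Body></SOAP-ENV:Envelope>"
    )
    return xml.encode("utf-8")

def names_by_language(soap_bytes: bytes) -> dict:
    root = ET.fromstring(soap_bytes)
    return {n.get(XML_LANG): n.text for n in root.iter(UDDI + "name")}

sent = names_by_language(build_body("save_business"))
received = names_by_language(build_body("businessDetail"))
```

In a real test, `received` would be parsed from the bytes of the HTTP response rather than from a locally built message.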
Java program (SaveBusinessExample.java) using SOAP interface in UDDI4J to generate sample 1.
Programming Language support includes both the basic programming language, and libraries that supplement the basic support with additional functionality. Thus, for example, even though the basic support in C for Unicode is fairly rudimentary, there are supplementary libraries that provide full-featured Unicode support. While it would be more efficient and more interoperable if C had the capabilities discussed in section 5.1 below, it is certainly possible to work around those limitations in providing Unicode support based on C.
The fundamental requirements for good programming language support include the following. The names of the datatypes are not important; the ones given below are only examples.
UTF-32 datatype: A Unicode code point datatype whose value space is the entire repertoire of Unicode, from U+0000 to U+10FFFF. For example:
    UTF32_t cp = '\U0001D434';
    UTF32_t* s = "ABC \U0002F884";
UTF-16 datatype: With the wide variety of Unicode libraries and operating system functions using 16-bit Unicode strings, for interoperability it is incumbent upon a programming language to also supply a UTF16 datatype, one that contains 16 unsigned bits. For example:
    utf16_t cp1 = '\u1234';
    utf16_t* s = "ABC \u5678";
Literals: Although not formally necessary (since any localizable text will be in resources), for full support a language will provide character literal and string literal representations with UTF32 and UTF16 literals, as in the above examples.
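The difference between the two datatypes shows up as soon as a string contains a supplementary character. The sketch below uses Python, whose strings are sequences of code points, and derives the 16-bit code units by encoding; `utf16_units` is a hypothetical helper name.

```python
def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

text = "ABC \U0002F884"
code_points = [ord(c) for c in text]   # the UTF-32 view: one value per character
units = utf16_units(text)              # the UTF-16 view: surrogate pair at the end
```

Here the UTF-32 view has five elements and the UTF-16 view has six, because U+2F884 is represented by the surrogate pair D87E DC84.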
Testing for full internationalization support is beyond the scope of this document, but the language (supplemented by libraries) can be tested for the following.
x = "αβγ", using the actual characters instead of hex codes.
Integrated Development Environments (IDEs), or the tools that can be part of a development environment, are subject to the general requirements of Section 9 Keyboard Input and Section 10 Rendering. Particular features to watch for are:
The StringTest.txt file contains machine-readable tests for code point operations. These can be used for iteration and extraction.
To test to see if all Unicode identifiers are supported, access the DerivedCoreProperties.txt file. Write a file that has each of the following strings in the context of an identifier. Verify that the resulting file (or files: they may need to be broken up to get past compiler memory limitations) can be successfully compiled and linked.
This is also a good test of string literals. For example:
    utf16_t* a = "a"; // U+0061 ( a )
    ...
    utf16_t* α = "α"; // U+03B1 ( α )
    ...
    utf16_t* 串 = "串"; // U+4E32 ( 串 )
    ...
    utf16_t* 𐐀 = "𐐀"; // U+10400 ( 𐐀 )
    ...
    utf16_t* a̖ = "a̖"; // U+0061 ( a ) U+0316 ( ◌̖ )
    ...
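The same idea can be tried against any language implementation. As a sketch, the code below treats Python itself as the language under test: each sample string is used both as an identifier and as a string literal, and the generated source is executed. The helper name is hypothetical, and the NFKC step reflects Python's own rule of normalizing identifiers.

```python
import unicodedata

# Sample identifiers: a, alpha, a Han ideograph, Deseret U+10400, a + U+0316.
NAMES = ["a", "\u03B1", "\u4E32", "\U00010400", "a\u0316"]

def identifier_works(name: str) -> bool:
    """Use name as both an identifier and a string literal, then read it back."""
    if not name.isidentifier():
        return False
    namespace = {}
    exec(f'{name} = "{name}"', namespace)
    # Python stores identifiers in NFKC form.
    return namespace.get(unicodedata.normalize("NFKC", name)) == name

results = {name: identifier_works(name) for name in NAMES}
```

For a compiled language, the equivalent test writes the generated declarations to source files and checks that they compile and link.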
Analysis includes character properties, regular expressions, and boundaries (grapheme cluster, word, line, and sentence breaks). In this area, typically the tests will check against the UCD properties, plus the guidelines for how those properties are used. The exact formulation of the test will depend on the API and language involved.
Caution: For the English language, functions such as isLetter() are sufficient for a variety of tasks, such as finding word breaks. With the wide variety of languages, scripts, and types of characters supported by the Unicode Standard, this is not true in general. The presence of non-spacing marks in Arabic, for example, will cause any naïve use of such functions to give incorrect results. More sophisticated mechanisms must be used for such tasks.
Note: It is perfectly conformant to supply additional, tailored behavior (such as the results of property APIs, or different word breaks) that is different from the Unicode default behavior, as long as such behavior does not purport to follow the Unicode default specifications.
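The caution about isLetter() can be demonstrated with a vocalized Arabic word. In the sketch below, Python's `str.isalpha` plays the role of isLetter(); the word splitter and both predicates are hypothetical, and the mark-aware variant is only an illustration, not an implementation of the Unicode word boundary rules.

```python
import unicodedata

def split_words(text, is_word_char):
    """Split text into maximal runs of characters accepted by is_word_char."""
    words, current = [], ""
    for ch in text:
        if is_word_char(ch):
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

def letter_only(ch):
    return ch.isalpha()

def letter_or_mark(ch):
    return ch.isalpha() or unicodedata.category(ch).startswith("M")

# One Arabic word: KAF, FATHA, TEH, FATHA, BEH, FATHA.
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
naive = split_words(word, letter_only)
aware = split_words(word, letter_or_mark)
```

The letter-only splitter breaks the single word into three fragments at each non-spacing mark, while the mark-aware one keeps it whole.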
The main features to test for are the following.
Alphabetic = true
Numeric_Type = Digit
For testing Unicode properties, a small test program should be written that for each property:
For regular expressions, UTR #18 defines three levels of support. The feature sets in these levels can be tested for explicitly. Note: the TR does not require any particular syntax, so any tests have to be adapted to the syntax of the regular expression engine.
For case detection, test with the following file [TBD]. In addition, verify that the functions respect canonical equivalence by applying all functions to each field in NormalizationTest.txt, and verifying that the same answer is produced.
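The canonical equivalence check for case functions can be sketched as follows. The two sample lines follow the NormalizationTest.txt layout (source; NFC; NFD; NFKC; NFKD) with comments omitted, only the first three fields (which are canonically equivalent to one another) are compared, and Python's `str.lower` and `str.isupper` stand in for the functions under test.

```python
import unicodedata

SAMPLE_LINES = [
    "00C5;00C5;0041 030A;00C5;0041 030A;",
    "212B;00C5;0041 030A;00C5;0041 030A;",
]

def fields(line: str) -> list:
    """Decode the five hex-encoded fields of a test line into strings."""
    data = line.split("#", 1)[0]
    return [
        "".join(chr(int(cp, 16)) for cp in field.split())
        for field in data.split(";")[:5]
    ]

def case_functions_consistent(line: str) -> bool:
    c1, c2, c3 = fields(line)[:3]
    answers = {
        (unicodedata.normalize("NFD", s.lower()), s.isupper())
        for s in (c1, c2, c3)
    }
    return len(answers) == 1   # same answer for every equivalent input

all_consistent = all(case_functions_consistent(line) for line in SAMPLE_LINES)
```

String-valued results are compared after normalization, since equivalent inputs only need to give canonically equivalent outputs.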
For grapheme-cluster, word, line and sentence boundaries, the following tests can be used.
Common APIs will test a particular offset to see if it is a boundary, and also iterate (e.g. find the next boundary). Verify that both APIs provide the same results on all of the test cases, by iterating over each test case, and independently determining the boundaries one at a time, then comparing the two sets of results.
The Well-Formed features, as mentioned above, are rarely worth testing for. However, if those features are important for a particular application, the following can be used.
Comparison includes both binary comparison and comparison based on the UCA (UTS #10). In the latter case, it includes string comparison, string search, and sortkey generation. In the case of collation (and the related string search), only the default collation ordering can be tested, since there is no accepted repository of machine-readable tailorings for different languages.
Binary comparison works by lexically comparing strings: the first code unit difference "wins". Unicode has three encoding forms for processing: UTF-8, UTF-16, and UTF-32. Binary comparison of UTF-8 or UTF-32 strings yields code point order, while binary comparison of UTF-16 strings yields a different order when supplementary code points are compared with code points from U+E000 to U+FFFF. Software needs to be able to perform either comparison regardless of its native Unicode encoding form, to achieve the same binary order for sorted data structures (lists, trees, etc.) as other software in a connected system: for example, Java Servlets working with a UTF-8 database.
Note: It is perfectly conformant to supply additional, tailored behavior (such as the results of collation ordering) that is different from the Unicode default behavior, as long as such behavior does not purport to follow the Unicode default specifications.
<string1> ; <string2> ; <code point relation> ; <UTF-16 relation>
For example:
    0061; 0062; LESS; LESS;
    FFFF; 10FFFF; LESS; GREATER;
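Lines in this format can be checked mechanically. The sketch below relies on two Python behaviors: strings compare in code point order, and big-endian UTF-16 bytes compare in the same order as UTF-16 code units. The function names and the EQUAL keyword are assumptions; only LESS and GREATER appear in the examples above.

```python
def relation(a, b) -> str:
    return "LESS" if a < b else "GREATER" if a > b else "EQUAL"

def parse(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def check_line(line: str) -> bool:
    s1, s2, cp_rel, u16_rel = [f.strip() for f in line.split(";")[:4]]
    a, b = parse(s1), parse(s2)
    code_point_ok = relation(a, b) == cp_rel
    utf16_ok = relation(a.encode("utf-16-be"), b.encode("utf-16-be")) == u16_rel
    return code_point_ok and utf16_ok

results = [
    check_line("0061; 0062; LESS; LESS;"),
    check_line("FFFF; 10FFFF; LESS; GREATER;"),
]
```

An implementation whose native form is UTF-16 would be tested the other way around, by confirming that its code point order comparison matches the third field.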
Collation (UCA UTS #10): If the process purports to support the UCA, verify the default collation sequence using the test files in http://www.unicode.org/unicode/reports/tr10/#Test. If both sortkey generation and
String Search: Verify that the locale-sensitive string search functions follow the UCA, according to StringSearchTest.txt. Note: This needs to be fleshed out more.
Case-Insensitive Compare: Verify that the results follow the guidelines in UAX #21.
Transformations are functions that take a string as input, and produce a (perhaps) modified string as output. They include case conversion and normalization. These are described in detail in UAX #21 and UAX #15.
Note: It is perfectly conformant to supply additional, tailored behavior (such as the results of case folding) that is different than the Unicode default behavior, as long as such behavior does not purport to follow the Unicode default specifications. However, it must be clear to programmers and end users that the default Unicode behavior is not being followed.
The following tests can be used.
The main goals of keyboard input and editing tests are to verify that:
For more detailed specific tests of input, see the end of Section 10.1 Rendering Tests.
The goals of rendering tests are to verify that for the repertoire supported by the product:
Testing rendering behavior is not generally possible programmatically. There is simply too much variation in the possible acceptable behavior. Moreover, if a system is not documented as supporting a given repertoire of characters (such as Hebrew), then tests of that repertoire are not applicable. The following, however, does provide some guidelines in assessing correct behavior for a supported repertoire.
One of the key features to test for is whether the visual appearance is legible and does reasonably reflect the correct sequence of code points in memory. This is especially important for BIDI scripts and other complex scripts such as Indic.
Legibility and sequencing are also important with non-spacing marks in general. There is a very wide degree of variation in the position, size and shape of non-spacing marks; rendering is only really unacceptable when it would lead a user to conclude that the non-spacing mark is on the wrong character, or if the rendering is of such low quality that the non-spacing mark is not visible. For more information on accent placement, see UTN #2.
Another feature to test for is that canonically equivalent sequences of accents display the same. The test cases cited in the Conformance Testing section of UAX #15 (filtered to the repertoire supported by the system or font) can be used to test this.
Code Point Sequence | Unacceptable Rendering | Acceptable Rendering: Preferred | Acceptable Rendering: Fallback |
---|---|---|---|
U+006C, | | | |
U+006C, | | | |
U+006F, | | | |
For detailed sample tests of support for different complex scripts, both rendering and input/editing, see the following:
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[Reports] | Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[U3.1] | Unicode Standard Annex #27: Unicode 3.1 http://www.unicode.org/unicode/reports/tr27/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
Thanks to Helena Shih Chapman, Julius Griffith, Markus Scherer, Baldev Soor, Akio Kido, Kentaroh Noji, Takaaki Shiratori, Xiao Hu Zhu, Geng Zheng, CP Chang, Matitiahu Allouche, Tarek Abou Aly, Ranat Thopunya, and Israel Gidali for their many contributions to this document.
The following summarizes modifications from the previous version of this document.
Copyright © 2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.