Technical Notes
Version | 1
Authors | Markus Scherer
Date | 2004-01-13
This Version | http://www.unicode.org/notes/tn12/tn12-1.html
Previous Version | [none]
Latest Version | http://www.unicode.org/notes/tn12/
This document attempts to make the case that it is advantageous to use UTF-16 (or 16-bit Unicode strings) for text processing. It is most important to use Unicode rather than older approaches to text encoding, but beyond that it simplifies software development even further to use the same internal form for text representation everywhere. UTF-16 is already the dominant processing form and therefore provides advantages.
This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes/.
Unless the distinction is particularly important, I use the term “UTF-x” to mean “UTF-x or x-bit Unicode strings” for brevity, and because “UTF-x” is the older and more familiar term compared with “x-bit Unicode strings” [U4ch2].
More important than the question of the preferred form of Unicode is to use Unicode at all: The most important lesson of several decades of handling text in software is to use a single, universal coded character set. Originally, a single legacy character set was assumed, which limited software to individual markets. The POSIX model of character set agnosticism was an improvement but made it hard to optimize for efficient text processing, and nearly impossible to handle truly multilingual documents or server data. Switching between different processing charsets requires specific text handling functions and character properties databases for each supported charset. Direct programming for Unicode makes it possible to develop and use optimized libraries and to hard-code critical paths without restricting the reach of the software.
Important: There are multiple encoding forms for Unicode. The standard [U4ch2] defines the UTF-8, UTF-16 and UTF-32 encoding forms for processing (as well as related encoding schemes for data exchange). It also defines 8/16/32-bit Unicode strings that are simply vectors of UTF-8/16/32 code units, i.e., such strings need not contain well-formed UTF-8/16/32 sequences during processing. All of these are Unicode. They are simply different ways to deal with the same character set and repertoire in software. All of them work.
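As an illustration of this distinction, here is a minimal Java sketch (my own example, not part of the original note; it only assumes the standard java.lang.String and java.lang.Character APIs). A Java String is a 16-bit Unicode string, so during processing it may legally hold an unpaired surrogate even though such content is not well-formed UTF-16.

    public class UnicodeStringDemo {
        public static void main(String[] args) {
            // A 16-bit Unicode string may contain any sequence of 16-bit code units,
            // including an unpaired surrogate, which is not well-formed UTF-16.
            String s = "A" + '\uD800';  // U+0041 followed by an unpaired lead surrogate

            System.out.println(s.length());                               // 2 code units
            System.out.println(Character.isHighSurrogate(s.charAt(1)));   // true
        }
    }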
For a good overview and illustration of the Unicode character model and encoding forms see Forms of Unicode [FormsU].
The following sections briefly compare the Unicode encoding forms for processing, but the main argument of this document is that consistent use of the same form simplifies the development of software and related standards.
The original Unicode design was for a fixed-width 16-bit encoding. Unicode 2.0 extended the architecture, and later Unicode versions assigned supplementary characters, in a way that is designed to maintain high performance with 16-bit processing. (See for example “International Programming with Unicode Surrogates” [ProgSurr].)
The optimization for 16-bit units consists mainly of assigning all commonly used characters and format controls to the BMP, where each is reachable with a single 16-bit unit. As a result, supplementary code points are very rare in most text.
Important: Supplementary code points must be supported for full Unicode support, regardless of the encoding form. Many characters are assigned supplementary code points, and even whole scripts are entirely encoded outside of the BMP. The opportunity for optimization of 16-bit Unicode string processing is that the most commonly used characters are stored with single 16-bit code units, so that it is useful to concentrate performance work on code paths for them, while also maintaining support and reasonable performance for supplementary code points.
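For illustration, the following Java sketch (my own example, using only standard library calls) shows the kind of fast path this makes possible: a code point counter that handles BMP characters with a single branch and falls back to surrogate handling only in the rare supplementary case.

    public class CodePointCounter {
        /** Counts code points, taking a fast path for BMP characters. */
        static int countCodePoints(String s) {
            int count = 0;
            for (int i = 0; i < s.length(); ) {
                char c = s.charAt(i);
                if (!Character.isHighSurrogate(c)) {
                    // Fast path: a single 16-bit unit covers all BMP characters.
                    i++;
                } else if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    // Rare path: a surrogate pair encodes one supplementary code point.
                    i += 2;
                } else {
                    // Unpaired surrogate: still a single unit; legal in a 16-bit Unicode string.
                    i++;
                }
                count++;
            }
            return count;
        }

        public static void main(String[] args) {
            // "Abc" followed by U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) -> 4 code points
            System.out.println(countCodePoints("Abc\uD835\uDC9C"));
        }
    }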
From a programming point of view, the fact that there are no invalid 16-bit words in 16-bit Unicode strings reduces the need for error handling. By contrast, there are code unit values that are invalid in 8/32-bit Unicode strings. All pairs of lead/trail surrogates in UTF-16 represent valid supplementary code points, and reading 16-bit Unicode requires looking ahead by at most one unit.
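A hypothetical low-level decoding routine, sketched below in Java for concreteness, makes the one-unit lookahead explicit: every lead/trail pair maps to a valid supplementary code point with simple arithmetic, and every other unit stands for itself.

    public class Utf16Decode {
        /** Returns the code point at index i, reading at most one unit ahead. */
        static int codePointAt(char[] units, int i) {
            char lead = units[i];
            if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < units.length) {
                char trail = units[i + 1];
                if (trail >= 0xDC00 && trail <= 0xDFFF) {
                    // Every lead/trail pair yields a valid supplementary code point.
                    return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
                }
            }
            return lead;  // BMP character (an unpaired surrogate stands for itself)
        }

        public static void main(String[] args) {
            char[] s = {'\uD800', '\uDC00'};                   // encodes U+10000
            System.out.printf("U+%04X%n", codePointAt(s, 0));  // prints U+10000
        }
    }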
Adapting existing software that used 8-bit strings to 16-bit Unicode strings requires adding or modifying APIs. On the other hand, good Unicode support requires at least a review of legacy code, and often modifications to eliminate limiting assumptions. A different string type for Unicode actually provides a useful demarcation between codepage-agnostic and Unicode-aware parts of software.
UTF-8 was mainly designed to store Unicode filenames in an ASCII-friendly way. It is suitable for processing, but it is significantly more complex to process than UTF-16. Lead bytes have a relatively complex encoding, and up to three trail bytes (or five, to cope with the original definition) must be counted, read and range-checked, and then the resulting code point must be range-checked as well. (It is possible to create a controlled environment where input strings are checked for correct UTF-8 encoding and string operations are guaranteed to maintain it; only in such an environment can a different set of processing functions be optimized to avoid the per-byte error checking.)
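The following Java sketch of a straightforward UTF-8 decoder (illustrative only; for brevity it does not reject non-shortest forms or surrogate code points, which a strict decoder must also do) shows the per-byte work involved:

    import java.util.ArrayList;
    import java.util.List;

    public class Utf8Decode {
        /** Decodes UTF-8 bytes into code points, with the basic per-byte checks. */
        static List<Integer> decode(byte[] b) {
            List<Integer> out = new ArrayList<>();
            int i = 0;
            while (i < b.length) {
                int lead = b[i++] & 0xFF;
                int cp, trailCount;
                if (lead < 0x80)      { cp = lead;        trailCount = 0; }  // ASCII
                else if (lead < 0xC0) { throw new IllegalArgumentException("stray trail byte"); }
                else if (lead < 0xE0) { cp = lead & 0x1F; trailCount = 1; }  // 2-byte sequence
                else if (lead < 0xF0) { cp = lead & 0x0F; trailCount = 2; }  // 3-byte sequence
                else if (lead < 0xF5) { cp = lead & 0x07; trailCount = 3; }  // 4-byte sequence
                else                  { throw new IllegalArgumentException("invalid lead byte"); }
                for (int t = 0; t < trailCount; t++) {
                    // Each trail byte must exist and must be in the range 0x80..0xBF.
                    if (i >= b.length || (b[i] & 0xC0) != 0x80) {
                        throw new IllegalArgumentException("truncated or invalid trail byte");
                    }
                    cp = (cp << 6) | (b[i++] & 0x3F);
                }
                if (cp > 0x10FFFF) {  // final range check on the resulting code point
                    throw new IllegalArgumentException("code point out of range");
                }
                out.add(cp);
            }
            return out;
        }

        public static void main(String[] args) {
            byte[] utf8 = {(byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80};  // U+10000
            System.out.printf("U+%04X%n", decode(utf8).get(0));                  // prints U+10000
        }
    }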
UTF-8 is often preferred for storage and data exchange, removing a conversion step if it is also used for processing. However, Unicode software almost always interfaces with legacy applications and data and needs to be prepared for conversion anyway. UTF-8 stores Latin text more compactly than UTF-16, but it provides no advantage for other scripts and uses more space for many of them. There are other Unicode charsets (for example, SCSU and BOCU-1) that are more efficient for storage and data exchange.
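For example, a quick measurement with the standard Java charset converters (my own illustration, nothing specific to this note) shows how the size comparison depends on the script:

    import java.nio.charset.StandardCharsets;

    public class SizeComparison {
        static void compare(String label, String s) {
            // UTF_16BE is used to measure pure code units, without a byte order mark.
            System.out.printf("%-7s UTF-8: %2d bytes, UTF-16: %2d bytes%n",
                    label,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }

        public static void main(String[] args) {
            compare("Latin:", "Hello");            // UTF-8 smaller (5 vs 10 bytes)
            compare("Greek:", "αβγδε");            // same size (10 vs 10 bytes)
            compare("CJK:",   "日本語のテキスト");  // UTF-8 larger (24 vs 16 bytes)
        }
    }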
UTF-8 can be necessary to move Unicode strings through existing “agnostic” or only ASCII-aware APIs. On the other hand, programming for Windows has shown that separate APIs with a different text data type provide an easy way to separate legacy encoding data from Unicode and make it obvious where conversions are needed.
Conventional wisdom suggests that Unix/Linux software always uses UTF-8 for Unicode processing (or UTF-32 with wchar_t), but the list below shows that software with good Unicode support tends to use UTF-16 even there.
UTF-32 has a trivial structure, but half of the memory usage and bandwidth is wasted, which decreases performance.
Fixed-width processing of all code points, including supplementary ones, is good for very low-level algorithms, but in many cases sequences of several code points must be treated as units in order to deal with “user characters”. UTF-16 already allows fixed-width processing of BMP characters.
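As a sketch of why fixed-width access is not sufficient (using only the standard java.text.BreakIterator, whose character instance approximates “user character” boundaries), counting code points and counting user characters give different answers as soon as combining marks are involved:

    import java.text.BreakIterator;

    public class UserCharacters {
        public static void main(String[] args) {
            // 'e' + COMBINING ACUTE ACCENT + U+1D49C (a supplementary letter)
            String s = "e\u0301\uD835\uDC9C";

            // Counting by code point gives 3; fixed-width UTF-32 gives the same answer.
            System.out.println(s.codePointCount(0, s.length()));

            // Counting "user characters" requires boundary analysis, not fixed-width access.
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            it.first();
            int userChars = 0;
            for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
                userChars++;
            }
            System.out.println(userChars);  // 2: "é" and the supplementary letter
        }
    }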
For software development it is best to use the same internal form everywhere to avoid conversion. Conversion among UTFs is fast and reliable, but still takes some time and code. Conversion also needs to extend beyond the string representation itself to string indexes, offsets and lengths, which can be visible across a protocol (e.g., SQL) or a software boundary (e.g., Java/JNI).
Another potential problem is that while conversion between UTFs is lossless, conversion between 8/16/32-bit Unicode strings which are not well-formed UTF-8/16/32 strings is not defined. (See Encoding Forms and Unicode Strings in Unicode 4 chapter 2 [U4ch2].) This means that strings would have to be well-formed at internal encoding form boundaries, instead of only at system boundaries to the outside. (For this reason, there have been proposals to standardize the conversion of malformed Unicode strings, especially between 8-bit and 16-bit Unicode.)
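The following Java sketch (my own illustration, not from the note) shows both issues: indexes do not carry over between UTF-16 and UTF-8, and a 16-bit Unicode string that is not well-formed UTF-16 does not survive conversion to a well-formed UTF unchanged.

    import java.nio.charset.StandardCharsets;

    public class ConversionPitfalls {
        public static void main(String[] args) {
            // 1. Indexes do not carry over between encoding forms.
            String s = "日本A";
            System.out.println(s.indexOf('A'));                    // UTF-16 index: 2
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length - 1);                   // UTF-8 byte index of 'A': 6

            // 2. A 16-bit Unicode string with an unpaired surrogate is not well-formed
            //    UTF-16; converting it to UTF-8 is lossy (String.getBytes substitutes
            //    the charset's replacement, typically '?', for the malformed unit).
            String illFormed = "a" + '\uD800' + "b";
            String roundTrip = new String(illFormed.getBytes(StandardCharsets.UTF_8),
                                          StandardCharsets.UTF_8);
            System.out.println(illFormed.equals(roundTrip));       // false
            System.out.println(roundTrip);                         // typically "a?b"
        }
    }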
Using the same form of Unicode results in seamless text handling without any conversions of strings or associated indexes. It simplifies the development and use of libraries and other reusable code and avoids the development of complex algorithms multiple times, each optimized for a different encoding form.
Most major software with good Unicode support uses UTF-16 (or 16-bit Unicode strings). Note that much of the software listed below runs on Unix/Linux systems as well as Windows and others.
The programming languages above support Unicode strings as part of the language and its standard functions. While an ISO Technical Report is being prepared for 16/32-bit string data types and string literals in C [ISO19769], use of Unicode strings additionally requires Unicode libraries like [ICU].
Of course, there are also examples of software that supports Unicode and uses UTF-8 and/or UTF-32 internally, sometimes even with good Unicode support, such as Perl.
Unicode is the best way to process and store text. While there are several forms of Unicode that are suitable for processing, it is best to use the same form everywhere in a system, and to use UTF-16 in particular for two reasons: the most commonly used characters take a single 16-bit unit while supplementary code points remain fully supported, and most major software with good Unicode support already uses UTF-16, which eases interoperability and the reuse of libraries.
[Ada95] | Ada 95 Reference Manual, section 3.5.2 Character Types http://www.grammatech.com/rm95html-1.0/rm9x-03-05-02.html |
[Cobol] | Enterprise Cobol for z/OS and OS/390 http://www.ibm.com/software/awdtools/cobol/zos/about/ See Support for Unicode. |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[FormsU] | Forms of Unicode http://www.icu-project.org/docs/papers/forms_of_unicode/ Overview and illustration of the Unicode character model and encoding forms. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[ICU] | International Components for Unicode http://www.icu-project.org/userguide/strings.html |
[ISO19769] | ISO/IEC JTC1/SC22/WG14 - C is working on a Technical Report on new character types, including support for UTF-16. The title is: TR 19769 - Extensions for the programming language C to support new character data types. http://std.dkuug.dk/JTC1/SC22/WG14/www/projects#19769 |
[ProgSurr] | International Programming with Unicode Surrogates MultiLingual Computing & Technology #47 Volume 13 Issue 3 pp. 51-55 http://www.multilingual.com/ Brief overview of Unicode/UTF-16 and some optimization techniques. |
[Python] | Unicode in Python http://www.jorendorff.com/articles/unicode/python.html See u'' strings and \Uhhhhhhhh |
[Rosette] | Rosette Core Library for Unicode http://www.basistech.com/products/rclu.html |
[SAP] | UTF-16 and C/C++ Language. Presentation at IUC 18 by Keishiro Tanaka (Fujitsu Limited) and Markus Eble (SAP AG). http://www.unicode.org/iuc/iuc18/a336.html Mentions that SAP uses UTF-16 and discusses UTF-8/16/32 tradeoffs. Quote from Wilhelm Nüßer and Markus Eble of SAP in an email from 2001-07-12 (http://mail.nl.linux.org/linux-utf8/2001-07/msg00064.html): “for our sort of application (ie. high memory load, cross platform, many, many strings in memory, networked etc.) utf16 based coding is the most efficient - internal - presentation, i.e. the one with the highest median information density, for the great majority of characters.” (Quote authorized for this document by Markus Eble via email on 2003-10-01.) |
[Sybase] | Sybase Unilib supports 16-bit string handling http://sybooks.sybase.com/onlinebooks/group-ucarc/ucg0200e/ulrefman/@Generic__BookTextView/178 |
[Symbian] | Character Conversion Overview http://www.symbian.com/developer/techlib/v70docs/sdl_v7.0/doc_source/devguides/cpp/base/characterconversion/characterconversionoverview.guide.html |
[U4ch2] | The Unicode Standard, Version 4.0.0, chapter 2, General Structure. http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf See 2.5 Encoding Forms and 2.7 Unicode Strings. |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
Thanks to Markus Eble and Nobuyoshi Mori for helpful information and to Mark Davis, Rick McGowan and Ken Whistler for their feedback on this document.
The following summarizes modifications from the previous version of this document.
2008-10-01 | Updated stale links in version 1 |
1 | Initial version |
© 2004 Markus W. Scherer. This publication is protected by copyright, and permission must be obtained from the author and Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use.
Use of this publication is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.