L2/01-220
==========================================================================
Proposal for a C/C++ language extension to support portable UTF-16
2001-5-5
At the most recent International Unicode Conference, SAP and Fujitsu
reported on the need for and experience with a C/C++ language extension
to support portable programs using UTF-16. (The conference paper is
document NCITS L2/01-195.) This document is intended to
summarize the issues and the proposed solution and bring them to the
attention of JTC1 SC22/WG20 for input into the relevant programming
language committees.
For a discussion of the Unicode Encoding Forms, see Appendix B.
Reasons why UTF-16 is a rational choice for processing code
-----------------------------------------------------------
UTF-16 is a common choice for processing code. It has the following
advantages:
1) For average text it almost always uses 50% less space than UTF-32
(see Appendix A)
2) It's easy to migrate from UCS-2 implementations
3) It is the platform character set of Java, OS X, and Windows
4) It is supported by databases and middleware
However, it is difficult to write portable C and C++ programs supporting
UTF-16, since it is neither a multibyte nor a wide character data type.
While UTF-16 is variable length, an average text encoded in it will contain
few if any pairs of UTF-16 code units. This is the opposite of the situation
with multibyte codes (including UTF-8), where single-byte characters are the
exception. As a result, the performance of UTF-16 as a processing code
tends to be quite good.
The performance of UTF-32 as processing code for the same data
may actually be worse, since the additional memory overhead means that
cache limits will be exceeded more often and paging will occur more
frequently. On systems whose processors penalize 16-bit aligned accesses
but that have very large memories, this effect may be smaller.
For these reasons, UTF-16 will remain a viable choice of
processing code for large, portable applications.
Difficulties of writing portable programs supporting UTF-16
-----------------------------------------------------------
It is certainly possible to write portable C and C++ programs using an
unsigned, fixed-width 16-bit integral data type as the character type.
However, doing so means that one cannot use literal strings or the
platform's runtime libraries.
The need to duplicate the runtime libraries is annoying, but it may
not be too significant an issue, especially for larger applications.
On the other hand, larger existing applications contain thousands of
literal strings, even after all the user-visible strings have been
externalized for internationalization. Lack of support for literal
strings is therefore a significant barrier to porting existing programs.
Supporting literal strings as UTF-16 is currently only possible by
providing a custom compiler extension, or by reformulating them into
arrays with static initializers, which degrades source code readability.
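As an illustration, here is a minimal sketch of the static-initializer
workaround (the typedef and the array name are ours, purely for
illustration; no standard defines them):

    typedef unsigned short utf16_t;  /* assumes unsigned short is 16 bits */

    /* The literal "abc" reformulated as an array with a static
       initializer, since no UTF-16 literal syntax is available. */
    static const utf16_t str_abc[] = { 0x0061, 0x0062, 0x0063, 0x0000 };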
Proposed solution
-----------------
In addition to the wchar_t datatype, support a utf16_t datatype, with the
following features:
utf16_t is an unsigned, 16-bit quantity on all platforms.
u"abc" is a literal that will be compiled into a string of utf16_t.
For the runtime library function names, the prefix "uc" can be used,
in analogy to the current use of "wc".
Since the goal is to support porting large existing bodies of
software to UTF-16, the recommendation is to provide all existing
"wc" runtime functions in a "uc" version, even though more modern
approaches for internationalization exist.
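A sketch of how code might look under the proposal, assuming a compiler
that implements it (uclen is a hypothetical "uc" counterpart of wcslen;
neither it nor utf16_t nor the u"..." literal exists in any current
standard):

    #include <stddef.h>

    /* Hypothetical "uc" counterpart of wcslen under the proposal. */
    size_t uclen(const utf16_t *s);

    void example(void)
    {
        /* Proposed: u"..." compiles into a string of utf16_t. */
        const utf16_t *msg = u"abc";
        size_t n = uclen(msg);   /* n == 3 */
        (void)n;
    }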
Appendix A: Frequency of characters that require two UTF-16 code units
----------------------------------------------------------------------
For average text, UTF-16 uses 49.99% to 50% less space than UTF-32. For
any "average" Han text, clearly more than 99.99% of character tokens are
going to be accounted for by the CJK Unified Ideographs block and CJK
Ideographs Extension A, both of which are encoded using single code units.
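As a back-of-the-envelope check (our arithmetic, not from the source
data): if a fraction p of characters require two UTF-16 code units, an
average character takes 2*(1+p) bytes in UTF-16 versus a fixed 4 bytes
in UTF-32, for a saving of (1-p)/2. Even at p = 0.0002 the saving is
49.99%, and as p approaches zero it approaches 50%.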
CJK Extension B characters, which require two code units, are going to be
quite rare in regular text, except perhaps in special applications such as
dictionaries. Estimating even "1 in 1000" is not stretching things by
any means.
As for the rest of the supplementary characters requiring two code units,
again, "average" text will never need to touch them. Only very unusual
corpora, such as historic texts (e.g. in Gothic), will make extensive
use of them, and those unusual corpora are themselves quite likely to
constitute less than 0.01% of text by bulk.
Documents using formal mathematical notation will make use of the
Mathematical Alphanumeric Symbols for a few percent of their characters,
but outside of, say, a scientific publications database, such texts will
rarely constitute the "average" text.
Appendix B: Unicode Encoding Forms
----------------------------------
The following has been adapted from the Technical Introduction
(http://www.unicode.org/unicode/standard/principles.html).
The Unicode Standard defines three encoding forms that allow the
same data to be transmitted in a byte, word or double word oriented
format (i.e. in 8, 16 or 32 bits per code unit). All three encoding
forms encode the same common character repertoire and can be
efficiently transformed into one another without loss of data.
The Unicode Consortium fully endorses the use of any of these
encoding forms as a conformant way of implementing the Unicode
Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way
of transforming all Unicode characters into a variable length
encoding of bytes. It has the advantages that the Unicode
characters corresponding to the familiar ASCII set have the
same byte values as ASCII, and that Unicode characters transformed
into UTF-8 can be used with much existing software without
extensive software rewrites. However, extensive processing of text
in UTF-8 is expensive, due to the variable length of its encoded
characters.
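For illustration (a sketch of ours, not from the Technical
Introduction), the number of bytes UTF-8 needs per code point:

    #include <stdint.h>

    /* Bytes required to encode a Unicode code point in UTF-8. */
    static int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)     return 1;  /* ASCII range: same byte values */
        if (cp < 0x800)    return 2;
        if (cp < 0x10000)  return 3;
        return 4;                     /* supplementary characters */
    }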
UTF-16 is popular in many environments that need to balance
efficient access to characters with economical use of storage.
It is reasonably compact and all the heavily used characters fit
into a single 16-bit code unit, while all other characters are
accessible via pairs of 16-bit code units.
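The pairing works via surrogates; a minimal sketch of ours (not a
library API) of how a supplementary code point maps to two code units:

    #include <stdint.h>

    /* Encode one code point as UTF-16; returns the count of 16-bit
       code units written (1 for the BMP, 2 for supplementary). */
    static int utf16_encode(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;                   /* single code unit */
            return 1;
        }
        cp -= 0x10000;                               /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }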
UTF-32 is popular where memory space is no concern, but fixed
width, single code unit access to characters is desired. Each
Unicode character is encoded in a single 32-bit code unit
when using UTF-32.
All three encoding forms need at most 4 bytes (or 32 bits)
of data for each character. For some common operations on text
data it is necessary to consider sequences of Unicode characters,
reducing the advantage of using a fixed-width encoding form.
For more information see Unicode Technical Report #17,
Unicode Character Encoding Model, available at
http://www.unicode.org/unicode/reports/tr17/