ISO/IEC JTC1 SC22 WG14 N959
L2/01-220
==========================================================================
Proposal for a C/C++ language extension to support portable UTF-16
2001-05-05
At the most recent International Unicode Conference, SAP and Fujitsu
reported on the need for and experience with a C/C++ language extension
to support portable programs using UTF-16. (The conference paper is
document NCITS L2/001-195.) This document is intended to summarize the
issues and the proposed solution and bring them to the attention of
JTC1 SC22/WG20 for input into the relevant programming language
committees.
For a discussion of the Unicode Encoding Forms, see Appendix B.
Reasons why UTF-16 is a rational choice for processing code
-----------------------------------------------------------

UTF-16 is a common choice for processing code. It has the following
advantages:

1) For average text it almost always uses 50% less space than UTF-32
   (see Appendix A)
2) It is easy to migrate from UCS-2 implementations
3) It is the platform character set of Java, OS-X, and Windows
4) It is supported by databases and middleware
However, it is difficult to write portable C and C++ programs
supporting UTF-16, since it is neither a multibyte nor a wide character
data type.

While UTF-16 is variable length, an average text encoded in UTF-16 will
contain few if any pairs of UTF-16 code units. This is the opposite of
the situation with multibyte codes (including UTF-8), where single-byte
characters are the exception. As a result, the performance of UTF-16 as
a processing code tends to be quite good.
The performance of UTF-32 as processing code for the same data may
actually be worse, since the additional memory overhead means that
cache limits will be exceeded more often and paging will occur more
frequently. For systems with processor designs that penalize 16-bit
aligned access, but that have very large memories, this effect may be
smaller.

For these reasons, UTF-16 will continue to remain a viable choice for
processing code for large, portable applications.
Difficulties of writing portable programs supporting UTF-16
-----------------------------------------------------------
It is certainly possible to write portable C and C++ programs using an
unsigned, fixed 16-bit integral data type as the character type.
However, doing so means that one cannot use literal strings or the
platform's runtime libraries.

The need to duplicate the runtime libraries is annoying, but it may not
be too significant an issue, especially for larger applications.

On the other hand, existing larger applications contain thousands of
literal strings, even after all the user-visible strings have been
externalized for internationalization. Lack of support for literal
strings is therefore a significant barrier to porting existing
programs.

Supporting literal strings as UTF-16 is currently only possible by
providing a custom compiler extension, or by reformulating them into
arrays with static initializers, which degrades source code
readability.
Proposed solution:
-----------------

In addition to the wchar_t datatype, support a utf16_t datatype with
the following features:

1) utf16_t is an unsigned, 16-bit quantity on all platforms
2) u"abc" is a literal that will be compiled into a string of utf16_t

For the runtime library function names, the prefix "uc" can be used in
analogy to the current use of "wc".

Since the goal is the support of porting large existing bodies of
software to UTF-16, the recommendation is to provide all existing "wc"
runtime functions in a "uc" version, even though more modern approaches
for internationalization exist.
Appendix A: Frequency of characters that require two UTF-16 code units
----------------------------------------------------------------------

For average text UTF-16 uses 49.99 - 50% less space than UTF-32. For
any "average" Han text, clearly more than 99.99% of character tokens
are going to be accounted for by the CJK Unified Ideographs block and
CJK Unified Ideographs Extension A, both of which are encoded using
single code units.

CJK Extension B characters, which require two units, are going to be
quite rare in regular text, except perhaps in special applications such
as dictionaries. Estimating even "1 in 1000" is not stretching things
by any means.
As for the rest of the supplementary characters requiring two code
units, again, "average" text will never need to touch them. Only very
unusual corpora, such as historic texts (e.g. Gothic), will make
extensive use of them, and those unusual corpora are themselves quite
likely to constitute less than 0.01% of text by bulk.

Documents using formal mathematical notation will make use of the
Mathematical Alphanumeric Symbols for a few percent of their
characters, but except in a scientific publications database, such
texts will rarely be the "average" text.
Appendix B: Unicode Encoding Forms
----------------------------------

The following has been adapted from the Technical Introduction
(http://www.unicode.org/unicode/standard/principles.html).

The Unicode Standard defines three encoding forms that allow the same
data to be transmitted in a byte, word or double word oriented format
(i.e. in 8, 16 or 32 bits per code unit). All three encoding forms
encode the same common character repertoire and can be efficiently
transformed into one another without loss of data.

The Unicode Consortium fully endorses the use of any of these encoding
forms as a conformant way of implementing the Unicode Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable-length encoding of
bytes. It has the advantages that the Unicode characters corresponding
to the familiar ASCII set have the same byte values as ASCII, and that
Unicode characters transformed into UTF-8 can be used with much
existing software without extensive software rewrites. However,
extensive processing of text in UTF-8 is expensive due to the variable
length of characters encoded in UTF-8.
UTF-16 is popular in many environments that need to balance efficient
access to characters with economical use of storage. It is reasonably
compact, and all the heavily used characters fit into a single 16-bit
code unit, while all other characters are accessible via pairs of
16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed-width,
single code unit access to characters is desired. Each Unicode
character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32 bits) of data for
each character. For some common operations on text data it is necessary
to consider sequences of Unicode characters, which reduces the
advantage of using a fixed-width encoding form.

For more information see Unicode Technical Report #17, Unicode
Character Encoding Model, available at
http://www.unicode.org/unicode/reports/tr17/