Developing UTF-8 support

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sat Sep 22 2001 - 14:08:36 EDT


When developing xIUA, I designed UTF-8 support to be used in two
different ways: as a form of Unicode, and as yet another code page. In
either case the two are handled, with few exceptions, in the same
manner. The only difference is when you want to convert from UTF-8 to
an underlying code page: in one case you have an underlying code page
such as iso-8859-1 or whatever, and when UTF-8 is itself the code page
there is no underlying code page.

To support UTF-8 data I have defined UChar8, which is unsigned char, to
ensure consistency and make sure that strings are treated consistently
across platforms. Functions starting with xiua_ are common routines.
Functions starting with xiu8_ are explicit UTF-8 support functions.
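
For illustration, the definition and naming convention look like this
(the prototypes are hypothetical examples of the pattern, not actual
xIUA signatures):

    typedef unsigned char UChar8;  /* UTF-8 code unit, unsigned everywhere */

    size_t xiua_strlen(const char *s);    /* common routine: current code page */
    size_t xiu8_strlen(const UChar8 *s);  /* explicit UTF-8 routine */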

There are conversion and UTF-8 transformation services. I started
developing these for ICU 1.4. I added full UTF-16 support, character
boundary support, and consistent null-terminated string support. Had I
started developing the package with the upcoming ICU 2.0 I might have
used the ICU support instead. There would have been a little extra
overhead going through the common converter interface, but there would
not have been the duplication of code.

I will retain this support. With lots of short transformations for
parameters I want as little overhead as possible. With very short
fields, the process of resetting a converter to ensure that it is in
the proper state could take as much processing as the transform itself.
Another reason is that the code is well integrated into the
application. For example, I use one of the translation tables to
determine UTF-8 character length in many places in the code. This code
also allows you to work in a mix of UTF-8 and UTF-32 in terms of
logical characters. In UTF-32, code units and characters are the same;
because UTF-8 is a multi-byte encoding, this is not true there. I can
specify the number of bytes, the number of characters, or the string
length when converting to UTF-32. There are direct transform routines
between any UTF format and any other, so a direct UTF-8 to UTF-32
transform is faster than going through UTF-16 first. I will use ICU's
converters for all code pages and then transform if needed, but not
between UTF formats. I also use the ICU UTF-8 support macros where
possible, and the performance is comparable.
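
As a sketch of what such a direct transform looks like, here is a
UTF-8 to UTF-32 conversion built on ICU's UTF-8 macros; the function
name and the replacement-character policy are mine, not xIUA's:

    #include <unicode/utf8.h>

    /* Decode UTF-8 straight to UTF-32 with no UTF-16 intermediate. */
    int32_t u8_to_u32(const char *src, int32_t srcLen,
                      UChar32 *dst, int32_t dstCap)
    {
        int32_t i = 0, n = 0;
        while (i < srcLen && n < dstCap) {
            UChar32 c;
            U8_NEXT(src, i, srcLen, c);   /* advances i past one character */
            if (c < 0) c = 0xFFFD;        /* ill-formed: substitute U+FFFD */
            dst[n++] = c;                 /* one code point per UTF-32 unit */
        }
        return n;
    }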

The actual UTF-8 support functions fall into two major categories.
There are explicit UTF-8 implementations, ranging from UTF-8 data
validation routines to string handling routines. The other routines
use common UTF-16 or UTF-32 routines.

Some routines like strcmp could use a common routine, but the overhead
would be too great. For UTF-8 a standard strcmp will do. To make
UTF-16 compare the same as the other forms of Unicode, xIUA uses the
ICU u_strcmpCodePointOrder function, which is a very efficient routine
for comparing UTF-16 in Unicode code point order. I use a very similar
routine for xiu2_strncmp. Some routines, like xiu8_strtok, not only
return a pointer into the original string but also insert nulls into
the string. This kind of code must have separate implementations.
Some functions have to be implemented slightly differently. UChar8 *
xiu8_strchrEx(UChar8 *string, UChar8 *charptr); is an example of such a
function. You are searching for logical characters that may be up to 4
bytes long, so it is impractical to pass the character you are
searching for as an int.
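
A rough sketch of how such a function can work, assuming well-formed
UTF-8 input (the helper and names are illustrative, not the xIUA
source):

    #include <string.h>

    typedef unsigned char UChar8;

    /* Byte length of the UTF-8 character starting at p, from its lead byte. */
    static int u8len(const UChar8 *p)
    {
        if (*p < 0x80) return 1;
        if ((*p & 0xE0) == 0xC0) return 2;
        if ((*p & 0xF0) == 0xE0) return 3;
        if ((*p & 0xF8) == 0xF0) return 4;
        return 1;                         /* ill-formed lead byte */
    }

    UChar8 *my_strchrEx(UChar8 *string, const UChar8 *charptr)
    {
        int len = u8len(charptr);
        for (UChar8 *p = string; *p; p += u8len(p))
            if (strncmp((char *)p, (char *)charptr, (size_t)len) == 0)
                return p;                 /* start of the matching character */
        return NULL;
    }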

Other routines are best handled by a common routine. For example, if
you do a strcoll you will want to convert the data from UTF-8 to UTF-16
and call ICU. Because xIUA is a starting model, if you are using UTF-8
you may want to tailor it in one of two ways. You will notice that
while there is an xiua_strcoll there is no xiu8_strcoll. This is
because some may want to use xiu8_strcollEx, where you specify
collation strength and normalization, and others will want to use only
a standard strcoll. Those who also want an xiu8_strcoll can add:

    #define xiu8_strcoll(a, b) xiu8_strcollEx(a, b, XCOL_TERTIARY)
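
A minimal sketch of what an xiu8_strcollEx-style routine does
internally, using ICU's collation API; the fixed buffers, locale, and
per-call ucol_open are simplifications (a real library would cache the
collator per locale), and the function name is illustrative:

    #include <unicode/ustring.h>
    #include <unicode/ucol.h>

    int my_u8_strcollEx(const char *a, const char *b,
                        UCollationStrength strength)
    {
        UChar ua[256], ub[256];           /* fixed work buffers for brevity */
        int32_t ualen, ublen;
        UErrorCode status = U_ZERO_ERROR;

        u_strFromUTF8(ua, 256, &ualen, a, -1, &status);
        u_strFromUTF8(ub, 256, &ublen, b, -1, &status);

        UCollator *coll = ucol_open("de", &status);  /* locale dependent */
        ucol_setStrength(coll, strength);
        UCollationResult r = ucol_strcoll(coll, ua, ualen, ub, ublen);
        ucol_close(coll);
        return (r == UCOL_LESS) ? -1 : (r == UCOL_GREATER) ? 1 : 0;
    }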

Routines like the case shifting routines will have to convert from
UTF-8 to UTF-16 and convert the result back to UTF-8. Because of
special casing you should use separate source and target buffers,
because the result may be larger or smaller than the original string.
The routine should also map lengths as it converts back and forth.
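
As a sketch of that round trip with separate buffers (the name and
buffer sizes are illustrative, and error handling is minimal):

    #include <unicode/ustring.h>

    /* Uppercase a UTF-8 string through UTF-16; the target may grow. */
    int32_t my_u8_toupper(const char *src, char *dst, int32_t dstCap,
                          const char *locale)
    {
        UChar u16[256], up[512];          /* special casing can expand text */
        int32_t len16, lenUp, lenOut;
        UErrorCode status = U_ZERO_ERROR;

        u_strFromUTF8(u16, 256, &len16, src, -1, &status);
        lenUp = u_strToUpper(up, 512, u16, len16, locale, &status);
        u_strToUTF8(dst, dstCap, &lenOut, up, lenUp, &status);
        return U_FAILURE(status) ? -1 : lenOut;   /* UTF-8 result length */
    }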

UTF-8 is not a good format for case shifting. There is special code in
xIUA for those cases where the application was not designed properly to
allow you to use a different result buffer.
xiua_strtoupperInplace(char *string); gives you a last-ditch workaround
that uses a common UTF-32 case shift routine. It uses the ICU
u_toupper function, which is actually a UTF-32 function, for
efficiency.
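
A simplified sketch of such an in-place workaround; unlike xIUA's
version it simply skips any character whose uppercase form would change
the byte length, since the buffer cannot grow:

    #include <unicode/uchar.h>
    #include <unicode/utf8.h>
    #include <string.h>

    void my_u8_toupper_inplace(char *s)
    {
        int32_t len = (int32_t)strlen(s);
        int32_t i = 0;
        while (i < len) {
            int32_t start = i;
            UChar32 c;
            U8_NEXT(s, i, len, c);
            if (c < 0) continue;          /* skip ill-formed sequences */
            UChar32 up = u_toupper(c);    /* simple, not special, casing */
            uint8_t buf[4];
            int32_t n = 0;
            UBool err = 0;
            U8_APPEND(buf, n, 4, up, err);
            if (!err && n == i - start)   /* same byte length: safe */
                memcpy(s + start, buf, (size_t)n);
        }
    }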

To make this UTF transformation work efficiently we need another piece:
storage for work areas and intermediate results. xIUA has its own
storage management that minimizes the malloc/free overhead. The large
segments are often for conversions. If it uses ICU to convert a code
page to UTF-8, it needs intermediate storage for the conversion. To
keep the intermediate storage at a minimum it will convert in chunks.
You can tune this so that the chunks are large enough to use the
conversion facilities efficiently but do not use too much storage.
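
As a sketch of the chunking idea with ICU's converter API (the sink
callback and chunk size are illustrative; a real implementation would
pass each UTF-16 chunk on to a UTF-8 transform):

    #include <unicode/ucnv.h>

    /* Convert a whole code page buffer through a small fixed UTF-16
       work area, one chunk at a time. */
    void convert_in_chunks(UConverter *cnv, const char *src, int32_t srcLen,
                           void (*sink)(const UChar *chunk, int32_t len))
    {
        UChar chunk[1024];    /* tunable: large enough to convert efficiently */
        const char *s = src;
        const char *sLimit = src + srcLen;
        UErrorCode status;

        ucnv_reset(cnv);      /* the reset overhead mentioned earlier */
        do {
            UChar *t = chunk;
            status = U_ZERO_ERROR;
            ucnv_toUnicode(cnv, &t, chunk + 1024, &s, sLimit,
                           NULL, 1, &status);     /* flush: source complete */
            sink(chunk, (int32_t)(t - chunk));    /* hand off this chunk */
        } while (status == U_BUFFER_OVERFLOW_ERROR);
    }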

To make UTF-8 really usable we also need other specialized routines.
There are the expected routines that deal with character issues, such
as character length, the number of logical characters in a string, and
string navigation aids, but often you need other, not quite so obvious
routines. A good example is xiu8_strncpyEx. This routine differs from
a normal strncpy function in that it will only copy complete characters
and always adds a null to the end of the string. This routine can be
used where you want to break data into chunks or limit the size of a
string, but it will not produce broken or split UTF-8 characters.
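
A sketch of the idea, assuming well-formed input (the name echoes
xiu8_strncpyEx but the code is illustrative, not the xIUA source):

    #include <string.h>

    /* Copy at most dstSize-1 bytes of whole UTF-8 characters and
       always null-terminate. Returns the number of bytes copied. */
    size_t my_u8_strncpyEx(char *dst, const char *src, size_t dstSize)
    {
        size_t out = 0;
        if (dstSize == 0) return 0;
        while (*src != '\0') {
            unsigned char b = (unsigned char)*src;
            size_t clen = (b < 0x80) ? 1
                        : ((b & 0xE0) == 0xC0) ? 2
                        : ((b & 0xF0) == 0xE0) ? 3
                        : ((b & 0xF8) == 0xF0) ? 4 : 1;
            size_t k;
            for (k = 1; k < clen && src[k] != '\0'; k++)
                ;                         /* make sure trail bytes exist */
            if (k < clen || out + clen >= dstSize)
                break;                    /* never emit a split character */
            memcpy(dst + out, src, clen);
            out += clen;
            src += clen;
        }
        dst[out] = '\0';                  /* always terminated */
        return out;
    }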

There are more details that go into UTF-8 support, but these are some
of the major issues. It is not hard to take existing MBCS code and
treat UTF-8 as just another character set by customizing the length
detection for UTF-8 like any other character set. These routines,
however, do not treat UTF-8 as Unicode. Some libraries only support
3-byte UTF-8 encoding and do not support characters in the other
planes. Others are even more limiting, in that they may restrict some
functions, like case shifting and character searching, to the ASCII
portion of UTF-8.

I have found that a proper UTF-8 support library will have full Unicode
3.1 support, including conformance to the new UTF-8 specifications.
Also look at functions like strchr: do they have a way to search for
the full range of UTF-8 characters? Also look at case shifting: do
they provide different input and output areas? If not, they probably
do not implement special casing, and you will not get proper case
shifting even for common languages like German. The other factor is to
check for good locale support. Many UTF-8 functions are locale
independent, but other routines, like collation, date/time and numeric
formatting, are locale dependent. Is the locale support thread
independent?

Carl


