Re: 3rd-party cross-platform UTF-8 support

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 20 2001 - 15:46:49 EDT


Changjian Sun said:

> For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party
> Unicode support I have found so far is IBM ICU. It provides very good
> support for cross-platform software internationalization. However, ICU
> internally uses UTF-16. Since our application uses UTF-8 for input and
> output, I have to convert from UTF-8 to UTF-16 before calling ICU
> functions (such as ucol_strcoll()).
>
> I'm worried about the performance overhead of this conversion.

You shouldn't be.

The conversion from UTF-8 to UTF-16 and back is algorithmic and very
fast.
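
To make "algorithmic" concrete, here is a minimal sketch in C of the
UTF-8 to UTF-16 direction. It is an illustration only: utf8_to_utf16()
is a hypothetical helper written for this note, not an ICU function and
not how ICU implements its converters. It assumes the caller supplies
the output buffer, and it reports malformed input or a too-small buffer
by returning -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): decode UTF-8 bytes into UTF-16
 * code units. Returns the number of units written to dst, or -1 if the
 * input is malformed or dst is too small. */
static long utf8_to_utf16(const unsigned char *src, size_t src_len,
                          uint16_t *dst, size_t dst_cap)
{
    size_t i = 0, out = 0;
    assert(src != NULL && dst != NULL);

    while (i < src_len) {
        uint32_t cp;
        size_t extra, k;
        unsigned char b = src[i++];

        /* The lead byte alone determines the sequence length. */
        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0)     { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else return -1;                  /* stray trail byte or invalid lead */

        if (src_len - i < extra) return -1;      /* truncated sequence */
        for (k = 0; k < extra; k++) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80) return -1;   /* not a trail byte */
            cp = (cp << 6) | (c & 0x3F);         /* append 6 payload bits */
        }

        /* Reject overlong forms, surrogate code points, and values
         * beyond U+10FFFF. */
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return -1;

        if (cp < 0x10000) {
            if (out + 1 > dst_cap) return -1;
            dst[out++] = (uint16_t)cp;           /* BMP: one code unit */
        } else {
            if (out + 2 > dst_cap) return -1;
            cp -= 0x10000;                       /* supplementary: surrogate pair */
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)out;
}
```

Each input byte is examined once, with only a few shifts, masks, and
comparisons per character, and no table lookups. The reverse direction
is similarly mechanical.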

If you are expecting better performance from a library that exposes
UTF-8 APIs and then does all its internal processing in UTF-8 *without*
converting to UTF-16, then I think you are mistaken. UTF-8 is a bad
form for much of the internal processing that ICU has to do, with
collation weighting being a prime example. Any library worth its salt
would *first* convert to UTF-16 (or UTF-32) internally, anyway, before
doing any significant semantic manipulation of the characters.

> Are there any other cross-platform 3rd-party Unicode libraries with
> better UTF-8 handling?

In my opinion, it is unlikely that there are *any* good Unicode libraries
that provide pure UTF-8 handling only, inside and out. It is simply
more efficient and more elegant to take the form conversion hit
and then use a better processing form for manipulating the characters.

UTF-8 shines as a legacy API and protocol compatibility form.
But it stinks as a processing form.

--Ken

> Thanks a lot.
>
> -Changjian Sun


