Re: UTF-8 <> UCS-2/UTF-16 conversion for library use

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 21 2001 - 13:21:04 EDT


Tree said:

> While the conversion between UTF-8 and UTF-16/UCS-2 is algorithmic and
> very fast, we need to remember that a buffer needs to be allocated to
> hold the converted result, and the data needs to be copied as things
> go in and out of the library.

Well, of course. But then I am mostly a C programmer, and tend to
think of these things in terms of preallocated static buffers that
get reused, or automatic allocation on the stack, with just pointers
getting passed around to reduce data copies. With such methods,
for practical purposes, the conversions tend to be insignificant
compared to the rest of the work the API is usually engaged in.
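
For concreteness, a rough sketch of the kind of calling pattern I
have in mind is below. The names utf8_to_utf16() and
library_call_utf16() are made up -- stand-ins for whatever
conversion routine and UTF-16 entry point are actually in play
(a sketch of the conversion itself appears further down):

    /* Sketch only: convert the UTF-8 argument into a fixed stack buffer
     * and hand the (hypothetical) UTF-16 API a pointer to it -- no heap
     * allocation, no copy beyond the conversion itself.  Real code would
     * fall back to malloc() for strings longer than the buffer. */
    #include <stddef.h>
    #include <stdint.h>

    size_t utf8_to_utf16(const unsigned char *src, size_t srclen,
                         uint16_t *dst);           /* sketched further down */
    void library_call_utf16(const uint16_t *str, size_t len); /* UTF-16 API */

    void call_with_utf8(const unsigned char *utf8, size_t utf8len)
    {
        uint16_t buf[512];             /* reusable stack scratch buffer    */

        if (utf8len > 512)             /* UTF-16 never needs more units    */
            return;                    /* than UTF-8 had bytes; real code  */
                                       /* would heap-allocate instead      */

        size_t units = utf8_to_utf16(utf8, utf8len, buf);
        library_call_utf16(buf, units);    /* only a pointer crosses over  */
    }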

But if you are doing object-oriented programming, there is always
a danger that you end up multiplying your object constructions
needlessly, and to paraphrase Everett Dirksen, for the other
old-timers out there: a billion nanoseconds here, a billion
nanoseconds there, and pretty soon you're talking real time. *hehe*

It is my impression, however, that most significant applications
tend, these days, to be I/O bound and/or network
transport bound, rather than compute bound. With a little care
in implementation, such things as string character set conversions
at interfaces do end up down in the noise, compared to the
other major issues that can affect overall performance and
throughput. Remember, we are now dealing with gigahertz+
processors -- these are not your father's CPUs.

My point was that character set conversion at the interface to
a library -- particularly such conversions as UTF-8 <==> UTF-16
that don't even involve loading a resource table for conversion --
should not be seen as a significant barrier or performance
bottleneck. Looking for a "UTF-8 library" because it would
be more "efficient" to avoid conversions, even when a good
UTF-16 API is available, is misconstruing the problem and
(mostly) misplacing concern about performance.
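
To make the "no resource table" point concrete, here is a rough
sketch of the entire UTF-8 --> UTF-16 conversion; it is nothing but
shifts and masks on the bytes. Validation is deliberately minimal
(malformed sequences just become U+FFFD), and the caller is assumed
to supply an output buffer with at least as many 16-bit units as
there are input bytes; a production converter would be stricter
about overlong and truncated sequences:

    /* Rough sketch of UTF-8 --> UTF-16 conversion: pure bit manipulation,
     * no tables.  Minimal validation only; dst must have room for at
     * least srclen 16-bit units.  Returns the number of units written. */
    #include <stddef.h>
    #include <stdint.h>

    size_t utf8_to_utf16(const unsigned char *src, size_t srclen,
                         uint16_t *dst)
    {
        size_t i = 0, out = 0;

        while (i < srclen) {
            unsigned char b = src[i];
            uint32_t cp;

            if (b < 0x80) {                       /* 1 byte: ASCII        */
                cp = b;
                i += 1;
            } else if ((b & 0xE0) == 0xC0 && i + 1 < srclen) {
                cp = ((uint32_t)(b & 0x1F) << 6)  /* 2-byte sequence      */
                   |  (src[i + 1] & 0x3F);
                i += 2;
            } else if ((b & 0xF0) == 0xE0 && i + 2 < srclen) {
                cp = ((uint32_t)(b & 0x0F) << 12) /* 3-byte sequence      */
                   | ((uint32_t)(src[i + 1] & 0x3F) << 6)
                   |  (src[i + 2] & 0x3F);
                i += 3;
            } else if ((b & 0xF8) == 0xF0 && i + 3 < srclen) {
                cp = ((uint32_t)(b & 0x07) << 18) /* 4-byte sequence      */
                   | ((uint32_t)(src[i + 1] & 0x3F) << 12)
                   | ((uint32_t)(src[i + 2] & 0x3F) << 6)
                   |  (src[i + 3] & 0x3F);
                i += 4;
            } else {
                cp = 0xFFFD;                      /* malformed byte       */
                i += 1;
            }

            if (cp < 0x10000) {                   /* BMP: one 16-bit unit */
                dst[out++] = (uint16_t)cp;
            } else {                              /* supplementary plane: */
                cp -= 0x10000;                    /* surrogate pair       */
                dst[out++] = (uint16_t)(0xD800 + (cp >> 10));
                dst[out++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
            }
        }
        return out;
    }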

> What is the real impact of this? I don't know: I haven't measured it
> myself. Obviously this could be handled a number of ways with various
> performance characteristics, but it does become an issue.

It's an issue, certainly, but to my mind it is more a cultural
issue, based on a somewhat dated set of worries, than a significant
performance issue.

I'm reminded somewhat of the clamor a decade ago about how
bad Unicode was because it would "double the size of our
data stores". At the time, I was working on a computer with
a 20 megabyte hard disk, and (ooh!) a new, modern, 1-megabyte
floppy disk drive. Today, my home computer has a 45-*giga*byte
hard drive. I could spend the rest of my life trying to
create enough *text* data to fill a significant portion of
that drive. It is mostly populated with code images, libraries,
artwork and other graphics, web pages, music, and what not,
as are most people's hard disks, I surmise. We don't hear
much, anymore, about how "wasteful" Unicode is in its storage
of characters.

--Ken


