Re: utf-8 and databases

From: Tex Texin (tex@i18nguy.com)
Date: Mon Jul 08 2002 - 03:47:50 EDT


Asmus is right that you shouldn't blithely assume that the encoding
itself gives a performance advantage.
However, I think this is more true of software program
efficiency than of database efficiency.

For example, some databases preallocate storage for fixed-width
records of n characters by allocating the maximum byte size of a
character times n. So a 100-character record requires 400 bytes per
record in UTF-8 (4 bytes maximum per character), even though much of
the data might actually consist of one- or two-byte characters.

You can then see substantial growth in UTF-8 databases compared to
UTF-16 ones (where the UTF-16 versions allocate 16 bits per character
instead of the maximal 32).
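To make the arithmetic concrete, here is a small sketch (the
100-character column and the ASCII-heavy sample data are my own
illustrative assumptions, not from the original post) comparing
preallocated storage against the size of the actual data:

```python
# Sketch: fixed-width preallocation vs. actual data size, assuming a
# database that reserves max-bytes-per-char * n for an n-character column.
UTF8_MAX_BYTES = 4    # widest UTF-8 encoded character
UTF16_UNIT_BYTES = 2  # 16 bits per character, as in the post

column_width = 100                # hypothetical column definition, in characters
sample = "hello " * 16 + "data"   # 100 mostly one-byte (ASCII) characters

allocated_utf8 = column_width * UTF8_MAX_BYTES     # 400 bytes reserved
allocated_utf16 = column_width * UTF16_UNIT_BYTES  # 200 bytes reserved
actual_utf8 = len(sample.encode("utf-8"))          # 100 bytes of real data

print(allocated_utf8, allocated_utf16, actual_utf8)  # → 400 200 100
```

For mostly-ASCII data, the UTF-8 column reserves four times the bytes
actually needed, and twice what the UTF-16 column reserves.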

Similarly, index keys are affected, and if the key size has a low
limit, choosing one encoding over the other might cause migration
headaches.

I think Asmus and I are both saying that you are likely asking the
wrong question. The encoding choice is a "don't care", since there is
a one-to-one relationship between the encoding forms and a simple,
efficient algorithm for converting between them.
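A minimal round-trip sketch of that one-to-one relationship, using
Python's built-in codecs as a stand-in for whatever conversion layer a
database would use:

```python
# Sketch: UTF-8 and UTF-16 are lossless, interchangeable encodings of
# the same Unicode string, so converting between them discards nothing.
s = "caf\u00e9 \u20ac \U0010FFFD"  # mixes 1-, 2-, 3-, and 4-byte UTF-8 characters

utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")

# UTF-8 -> string -> UTF-16 and back again; both directions round-trip exactly.
assert utf8_bytes.decode("utf-8").encode("utf-16-le") == utf16_bytes
assert utf16_bytes.decode("utf-16-le").encode("utf-8") == utf8_bytes
```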

What you really want to ask the vendor, and/or test for, is this:
given the kinds of data and operations you need to perform, how
efficient is the database at using its storage facilities, retrieving
the data, and executing the various operations (search, sort, etc.)
for each encoding?
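As a concrete illustration of the kind of round-trip check Asmus
describes below, here is a sketch using Python's bundled SQLite,
purely as a stand-in for whatever database you are actually
evaluating:

```python
import sqlite3

# Sketch: store a string containing U+10FFFD, the largest assignable
# character, and verify it comes back unmolested. SQLite is only a
# stand-in here; run the same probe against the database under test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")

probe = "edge:\U0010FFFD:end"
conn.execute("INSERT INTO t (s) VALUES (?)", (probe,))

(result,) = conn.execute("SELECT s FROM t").fetchone()
assert result == probe  # the full Unicode repertoire survived the round trip
conn.close()
```

A longer probe string mixing characters of every encoded length would
also help sniff out the truncation problems mentioned below.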

hth
tex

Asmus Freytag wrote:
>
> At 02:11 PM 7/7/02 +0700, Paul Hastings wrote:
> >is there a standard test that can determine whether a given
> >database can handle utf-8 (ie as "native" utf-8 not converting
> >to ucs-2 or whatever)?
>
> Why is that of any interest?
>
> The primary concern is whether a database is able to represent the entire
> repertoire of Unicode. Just create a string that contains the largest
> character 0x10FFFD, convert it to whatever encoding form the APIs require
> and see whether you get it back unmolested.
>
> A more sophisticated test would take a longer string and attempt to sniff
> out incorrect truncation of characters.
>
> A secondary concern is performance. If the choice of encoding form is a
> poor match for the actual data encountered, and if entering and retrieving
> the data requires too many transcoding steps, it's conceivable that this
> could be detected in the overall performance of the database.
>
> However, there's no reason to assume that a theoretical match in encoding
> efficiency translates automatically into a more efficient database
> implementation.
> Therefore, regular benchmarking tools should be fine to determine database
> performance, as long as the test data is representative for the installation.
>
> A./

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------



This archive was generated by hypermail 2.1.2 : Mon Jul 08 2002 - 02:11:27 EDT