RE: 3rd-party cross-platform UTF-8 support

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 24 2001 - 14:55:46 EDT


Tom,

> Andy Heninger writes:
> > Performance tuning is easier with UTF-16. You can optimize for
> > BMP characters, knowing that surrogate pairs are sufficiently uncommon
> > that it's OK for them to take a bail-out slow path.
>
> Sure, but if you are using UTF-16 (or any other multibyte encoding)
> you lose the ability to index characters in an array in constant
> time. For some applications that isn't desirable.

If you implement an array that is directly indexed by Unicode code point it
would have to have 1,114,112 entries, one for each code point from 0 through
1114111. (I love the number.) I don't think that many applications can
afford over a megabyte of storage per byte of table width. Even if it were
only an array of addresses pointing to valid entries, with 4-byte pointers it
would take about 4.5 MB. Because the new planes are sparsely populated you
can segment your table instead. In that case you have no real advantage
using UTF-32.
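
To make the segmenting idea concrete, here is a rough sketch of a two-stage
table, written in Python only for brevity. The block size of 256 and every
name in it (SegmentedTable, decode_utf16, and so on) are my own assumptions
for illustration, not anyone's actual implementation. The lookup takes a code
point, so it works the same whether the text arrived as UTF-32 or as UTF-16
with a surrogate-pair slow path.

```python
# Hypothetical two-stage (segmented) table indexed by Unicode code point.
# BLOCK_SIZE and all names here are illustrative assumptions.

MAX_CODE_POINT = 0x10FFFF                         # 1114111
BLOCK_SIZE = 256                                  # code points per segment (assumed)
NUM_BLOCKS = (MAX_CODE_POINT + 1) // BLOCK_SIZE   # 4352 first-stage slots


class SegmentedTable:
    def __init__(self, default=0):
        self.default = default
        # One shared block stands in for every unpopulated segment.
        self._empty = [default] * BLOCK_SIZE
        self._index = [self._empty] * NUM_BLOCKS

    def set(self, cp, value):
        hi, lo = divmod(cp, BLOCK_SIZE)
        block = self._index[hi]
        if block is self._empty:
            # Allocate a private block the first time a segment is written.
            block = [self.default] * BLOCK_SIZE
            self._index[hi] = block
        block[lo] = value

    def get(self, cp):
        # Two array lookups, constant time for any code point.
        hi, lo = divmod(cp, BLOCK_SIZE)
        return self._index[hi][lo]

    def allocated_blocks(self):
        return sum(1 for b in self._index if b is not self._empty)


def decode_utf16(units, i):
    """Return (code_point, next_index) for the UTF-16 code unit at index i."""
    u = units[i]
    # Slow path: a high surrogate followed by a low surrogate.
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
        u2 = units[i + 1]
        if 0xDC00 <= u2 <= 0xDFFF:
            return 0x10000 + ((u - 0xD800) << 10) + (u2 - 0xDC00), i + 2
    # Fast path: a BMP code unit (or an unpaired surrogate, passed through).
    return u, i + 1
```

In this sketch, setting one BMP entry and one supplementary-plane entry
allocates only two 256-entry blocks plus the 4352-slot index, rather than
1,114,112 slots.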

I thought that Basis Technology's software was developed using UCS-2. Have
you converted to full UTF-16 support or are you thinking of changing?

Carl


