RE: 3rd-party cross-platform UTF-8 support

From: Tom Emerson (tree@basistech.com)
Date: Mon Sep 24 2001 - 15:28:14 EDT


Carl W. Brown writes:
> If you implement an array that is directly indexed by Unicode code point it
> would have to have 1114111 entries. (I love the number) I don't think that
> many applications can afford to have over a megabyte of storage per byte of
> table width. If nothing else it would be an array of addresses pointing to
> valid entries that would take about 4.5 MB. Because the new planes are
> sparsely populated you can segment your table. In this case you have no
> real advantage using UTF-32.

That wasn't my point: obviously one would not create a lookup table
using raw Unicode values.

But if I have a text string, and that string is encoded in UTF-16, and
I want to access Unicode character values, then I cannot index that
string in constant time.

To find character n I have to walk all of the 16-bit values in that
string accounting for surrogates. If I use UTF-32 I don't need to do
that. This very issue came up during the discussion of how to handle
surrogates in Python.
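
To make the cost concrete, here is a rough sketch in Python of the walk
described above. It is only an illustration: the helper names
(utf16_units, code_point_at, code_point_at_utf32) are made up for this
example and are not taken from Rosette or from Python's internals, and
unpaired surrogates are simply returned as-is.

```python
def utf16_units(text):
    """Encode a str as a list of 16-bit UTF-16 code units (no BOM)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def code_point_at(units, n):
    """Return the n-th code point (0-based) of a UTF-16 sequence.

    The scan has to start at the beginning, because any earlier
    surrogate pair shifts the position of character n: O(n) time.
    """
    i = 0      # index into the 16-bit code units
    count = 0  # number of code points passed so far
    while i < len(units):
        u = units[i]
        # A high surrogate followed by a low surrogate encodes a single
        # supplementary code point (U+10000 and above).
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u  # BMP character (or an unpaired surrogate, kept as-is)
            step = 1
        if count == n:
            return cp
        count += 1
        i += step
    raise IndexError("code point index out of range")


def code_point_at_utf32(code_points, n):
    """With one 32-bit unit per code point, indexing is a plain lookup."""
    return code_points[n]
```

For the string "a", U+10400, "b", the UTF-16 form has four code units, so
units[2] is the low surrogate rather than "b", while code_point_at(units, 2)
gives "b" only after stepping over the pair.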

> I thought that Basis Technology was developed using UCS-2. Have you
> converted to full UTF-16 support or are you thinking of changing?

The current shipping version of Rosette uses UCS-2 internally. Current
development is focusing on UTF-16 and UTF-32 support.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


