Carl W. Brown writes:
> If you implement an array that is directly indexed by Unicode code point it
> would have to have 1114111 entries. (I love the number) I don't think that
> many applications can afford to have over a megabyte of storage per byte of
> table width. If nothing else it would be an array of addresses pointing to
> valid entries that would take about 4.5 MB. Because the new planes are
> sparsely populated you can segment your table. In this case you have no
> real advantage using UTF-32.

That wasn't my point: obviously one would not create a lookup table
indexed by raw Unicode values.

But if I have a text string, and that string is encoded in UTF-16, and
I want to access Unicode character values, then I cannot index that
string by character in constant time.

To find character n I have to walk the 16-bit code units from the start
of the string, accounting for surrogate pairs. If I use UTF-32 I don't
need to do that. This very issue came up during the discussion of how to
handle surrogates in Python.
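
To make the difference concrete, here is a minimal Python sketch. The
function names are made up for illustration, and this is not meant to
describe how Rosette or the Python interpreter do it internally. It
assumes the string is held as a list of integer code units and that
unpaired surrogates are simply returned as-is.

```python
def to_utf16_units(s):
    """Encode a str as a list of 16-bit UTF-16 code units (illustrative helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def to_utf32_units(s):
    """Encode a str as a list of 32-bit UTF-32 code units (illustrative helper)."""
    data = s.encode("utf-32-le")
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, len(data), 4)]


def code_point_at_utf16(units, n):
    """Return the n-th code point (0-based). Linear time: walks from the start."""
    i = 0      # position in code units
    count = 0  # code points passed so far
    while i < len(units):
        u = units[i]
        # A high surrogate followed by a low surrogate is one character.
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u  # BMP character (or an unpaired surrogate, passed through)
            step = 1
        if count == n:
            return cp
        count += 1
        i += step
    raise IndexError(n)


def code_point_at_utf32(units, n):
    """Return the n-th code point (0-based). Constant time: plain indexing."""
    return units[n]
```

With the string "a", U+10400, "b", character 2 is "b" in both encodings,
but in UTF-16 it sits at code unit 3, so units[2] alone would hand back
the low half of a surrogate pair.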
> I thought that Basis Technology was developed using UCS-2. Have you
> converted to full UTF-16 support or are you thinking of changing?
The current shipping version of Rosette uses UCS-2 internally. Current
development is focusing on UTF-16 and UTF-32 support.
-tree
-- 
Tom Emerson, Basis Technology Corp.
Sr. Sinostringologist, http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"