From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 24 2007 - 19:01:02 CST
Oliver Block asked:
> I started reading (some chapters of) the Unicode Standard last year. I was
> just curious: what is an appropriate way to implement an extensive amount
> of data like character properties?
The usual approaches are various schemes that compress the tables
while still giving good lookup speed. One widely used strategy is
the use of tries:
http://en.wikipedia.org/wiki/Trie
In the case of Unicode character properties, the intermediate node
keys are not strings, but instead tend to be bit partitions of the
code point values. For example, character properties for the 64K
code points 0..FFFF can be accessed efficiently by dividing the
16-bit values into 8 high bits and 8 low bits, and then compressing
the parts of the lookup where property values are shared by many of
the terminal nodes of the resulting table.
For characters in the range 10000..FFFFF, a different bit
partition might work better -- for example 10 high bits and
10 low bits.
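A minimal sketch of such a two-stage lookup in C might look like
the following; the names (stage1, stage2, get_property) and the
table contents are placeholders for illustration, not any
particular implementation:

    #include <stdint.h>

    /* Minimal two-stage lookup for a one-byte property over the BMP
     * (U+0000..U+FFFF), split into 8 high bits and 8 low bits.
     * stage1 maps the high byte to a block index; identical 256-entry
     * blocks in stage2 are stored only once, which is where the
     * compression comes from.  Table contents here are placeholders. */
    static const uint16_t stage1[256] = { 0 };       /* all map to block 0 */
    static const uint8_t  stage2[][256] = { { 0 } }; /* shared value blocks */

    uint8_t get_property(uint16_t cp)
    {
        unsigned hi = cp >> 8;    /* top 8 bits select a stage2 block */
        unsigned lo = cp & 0xFF;  /* low 8 bits index within the block */
        return stage2[stage1[hi]][lo];
    }

For supplementary code points, a 10/10 split works the same way,
just with larger stage tables.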
Another common strategy is to use bit arrays and compress them with
techniques that drop homogeneous ranges of values.
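Sketched in the same spirit, a range-compressed table with a binary
search lookup might look like this (the ranges and names here are
again just placeholders, not real property data):

    #include <stddef.h>
    #include <stdint.h>

    /* Each entry covers an inclusive range of code points that share
     * one property value, so long homogeneous runs collapse into a
     * single entry; lookup is a binary search for the containing
     * range.  The ranges shown are placeholders. */
    struct range { uint32_t first, last; uint8_t value; };

    static const struct range ranges[] = {
        { 0x000000, 0x00002F, 0 },
        { 0x000030, 0x000039, 1 },   /* e.g. ASCII digits */
        { 0x00003A, 0x10FFFF, 0 },
    };

    uint8_t range_lookup(uint32_t cp)
    {
        size_t lo = 0, hi = sizeof(ranges) / sizeof(ranges[0]) - 1;
        while (lo < hi) {            /* find first range with cp <= last */
            size_t mid = (lo + hi) / 2;
            if (cp > ranges[mid].last)
                lo = mid + 1;
            else
                hi = mid;
        }
        return ranges[lo].value;
    }

The idea is the same either way: long runs of identical property
values are stored once, not once per code point.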
> In fact, the data that needs to be stored for time zones is quite extensive, too.
For time zones, you are talking about the kinds of data which are
*not* part of the Unicode Standard and the Unicode Character
Database, but instead are part of all the localization data needed
to support programs running in different languages, locales,
time zones, and such. The Unicode Consortium maintains a
separate standard and a repository of locale data. See the
Common Locale Data Repository (CLDR):
http://www.unicode.org/cldr/
There is also a separate email discussion list for discussing
issues of locales (including time zones). See:
http://www.unicode.org/consortium/distlist.html
for information about that email discussion list and how to
join it.
--Ken