Re: unidata is big

From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Wed Apr 24 2002 - 04:26:55 EDT


andreas palsson wrote:
>
> Hi.
>
> I would just like to know if someone could give me a tip on how to
> structure all the Unicode information in memory?
>
> All the UNIDATA does contain quite a bit of information and I can't see
> any obvious method which is memory-efficient and gives fast access.

You might want to evaluate some of the open source libraries
mentioned under "Enabled Products" on the Unicode site. For my
own lib (http://www.let.uu.nl/~Theo.Veenker/personal/projects/ucp/)
I've created a separate table builder tool for each property or
mapping. The tools organize data in planes, and for each plane
all possible trie setups are determined (about 80 combinations
of one, two or three stage tables). Then the cheapest setup
is used. This still requires over 230 KB to store all data
(except character names and comments) from the following files:
UnicodeData.txt, EastAsianWidth.txt, LineBreak.txt, ArabicShaping.txt,
Scripts.txt, Blocks.txt, SpecialCasing.txt, CaseFolding.txt,
BidiMirroring.txt, PropList.txt, DerivedCoreProperties.txt,
DerivedNormalizationProperties.txt, and DerivedJoiningType.txt.
For some mappings I've stored 32-bit code points where 16 bits
would have been enough, but I decided API uniformity is more
important than memory efficiency.
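
To make the staged-table idea concrete, here is a minimal sketch in
Python of a two-stage table for one plane, with a search over block
sizes that keeps the cheapest layout. It is not the code of the table
builder tools described above: the function names, the toy property
data and the byte-cost estimate are all assumptions made for
illustration, and only two-stage layouts are tried (no one- or
three-stage variants).

```python
def build_two_stage(values, shift):
    """Split values into blocks of 2**shift entries, sharing identical blocks."""
    block_size = 1 << shift
    assert len(values) % block_size == 0
    stage1 = []   # block number -> index of a unique block in stage2
    stage2 = []   # concatenation of the unique blocks
    seen = {}     # block contents -> index of that block in stage2
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(stage2) // block_size
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, shift, cp):
    """Return the property value for code point cp (offset within the plane)."""
    mask = (1 << shift) - 1
    return stage2[(stage1[cp >> shift] << shift) | (cp & mask)]


def cost(stage1, stage2):
    """Rough size in bytes, assuming 1, 2 or 4 bytes per entry per table."""
    def width(table):
        largest = max(table, default=0)
        return 1 if largest < 256 else 2 if largest < 65536 else 4
    return len(stage1) * width(stage1) + len(stage2) * width(stage2)


def cheapest_two_stage(values, shifts=range(1, 13)):
    """Try several block sizes and return (cost, shift, stage1, stage2)."""
    best = None
    for shift in shifts:
        s1, s2 = build_two_stage(values, shift)
        c = cost(s1, s2)
        if best is None or c < best[0]:
            best = (c, shift, s1, s2)
    return best


# Toy property for one plane (hypothetical data, not taken from the UCD):
# 0 = other, 1 = ASCII letter, 2 = ASCII digit.
plane = [0] * 0x10000
for cp in range(0x30, 0x3A):
    plane[cp] = 2
for cp in range(0x41, 0x5B):
    plane[cp] = 1
for cp in range(0x61, 0x7B):
    plane[cp] = 1

size, shift, stage1, stage2 = cheapest_two_stage(plane)
```

A lookup is then two array reads, e.g. lookup(stage1, stage2, shift,
0x41). Real property data is far less regular than this toy plane, so
the savings will be smaller, and a three-stage layout may win for
sparse planes.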

I wouldn't worry too much about memory efficiency; it's irrelevant
these days. Even your mobile phone has enough memory to store all
Unicode data 10 to 20 times over. The same goes for lookup speed: all
you have to do to make it fast is wait (a few seasons).

Theo


