Re: Name Compression. Comparison and Tweaks

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed May 17 2000 - 20:29:15 EDT


hi,

i just wanted to advertise my own implementation of this, which i did a few months ago for icu. the generated file contains all names as defined in Unicode 3.0.0, allows fast random access, and includes data about the algorithmic names for the cjk and hangul blocks.
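
for readers who have not met the algorithmic names: hangul syllable names are composed from jamo short names by the arithmetic given in the Unicode Standard, and cjk unified ideograph names are just a fixed prefix plus the code point in hex. the following is a minimal sketch of that rule, not the icu code. the function name algorithmic_name() is made up for illustration, and the cjk ranges are assumed to be the Unicode 3.0 ones (U+3400..U+4DB5 and U+4E00..U+9FA5).

```c
#include <stdio.h>
#include <stdint.h>

/* Jamo short names used by the Unicode Hangul syllable name algorithm. */
static const char *const JAMO_L[19] = {
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
    "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
};
static const char *const JAMO_V[21] = {
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
    "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"
};
static const char *const JAMO_T[28] = {
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS", "LT",
    "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
};

/*
 * Hypothetical helper (not an ICU function): writes the algorithmic name
 * of `code` into `buf` and returns its length, or returns 0 and writes an
 * empty string if `code` has no algorithmic name. If `buf` is too small the
 * name is truncated and the returned value is the untruncated length.
 */
static int algorithmic_name(uint32_t code, char *buf, size_t size)
{
    if (code >= 0xAC00 && code <= 0xD7A3) {
        uint32_t s = code - 0xAC00;             /* syllable index         */
        uint32_t l = s / (21 * 28);             /* leading consonant      */
        uint32_t v = (s % (21 * 28)) / 28;      /* vowel                  */
        uint32_t t = s % 28;                    /* trailing consonant     */
        return snprintf(buf, size, "HANGUL SYLLABLE %s%s%s",
                        JAMO_L[l], JAMO_V[v], JAMO_T[t]);
    }
    /* Assumed ranges: CJK Extension A and the main CJK block in Unicode 3.0. */
    if ((code >= 0x3400 && code <= 0x4DB5) ||
        (code >= 0x4E00 && code <= 0x9FA5)) {
        return snprintf(buf, size, "CJK UNIFIED IDEOGRAPH-%04X", (unsigned)code);
    }
    if (size > 0) {
        buf[0] = '\0';
    }
    return 0;
}
```

because these names can be computed, a data file only needs to record the ranges and the rule that applies, not one string per character.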

the data file is 83860 bytes long, with a word list and tweaks as discussed here. it does not use huffman coding or anything similar.
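
to illustrate what a word list buys you, here is a toy sketch of the general idea: every distinct word is stored once, each name is a short sequence of word indices, and a sorted table of code points gives random access by binary search. this is an assumption-laden illustration only. the tables, the 0xFF terminator, and the names find_tokens() and expand_name() are all hypothetical and do not describe the actual gennames file format.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical word list: each distinct word appears exactly once. */
static const char *const WORDS[] = {
    "LATIN", "CAPITAL", "SMALL", "LETTER", "A", "B"
};

#define END 0xFF  /* hypothetical terminator for a token sequence */

/* Each name is stored as word indices instead of characters. */
static const uint8_t NAME_0041[] = { 0, 1, 3, 4, END };
static const uint8_t NAME_0042[] = { 0, 1, 3, 5, END };
static const uint8_t NAME_0061[] = { 0, 2, 3, 4, END };
static const uint8_t NAME_0062[] = { 0, 2, 3, 5, END };

typedef struct {
    uint32_t code;
    const uint8_t *tokens;
} NameEntry;

/* Sorted by code point so that lookup can use binary search. */
static const NameEntry NAMES[] = {
    { 0x0041, NAME_0041 }, { 0x0042, NAME_0042 },
    { 0x0061, NAME_0061 }, { 0x0062, NAME_0062 }
};

/* Returns the token sequence for `code`, or NULL if it has no entry. */
static const uint8_t *find_tokens(uint32_t code)
{
    size_t lo = 0, hi = sizeof NAMES / sizeof NAMES[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (NAMES[mid].code == code) {
            return NAMES[mid].tokens;
        }
        if (NAMES[mid].code < code) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return NULL;
}

/*
 * Expands the name of `code` into `buf`, joining words with single spaces.
 * Returns the number of characters written (0 if there is no entry).
 * Stops early, leaving a truncated name, if the next word would not fit.
 */
static size_t expand_name(uint32_t code, char *buf, size_t size)
{
    const uint8_t *t = find_tokens(code);
    size_t len = 0;

    if (size == 0) {
        return 0;
    }
    buf[0] = '\0';
    if (t == NULL) {
        return 0;
    }
    for (; *t != END; ++t) {
        const char *w = WORDS[*t];
        size_t wlen = strlen(w);
        size_t need = wlen + (len > 0 ? 1 : 0);
        if (len + need >= size) {
            break;
        }
        if (len > 0) {
            buf[len++] = ' ';
        }
        memcpy(buf + len, w, wlen);
        len += wlen;
        buf[len] = '\0';
    }
    return len;
}
```

in this toy table a 20-character name such as the one for U+0061 costs five bytes of tokens plus its share of the word list, which is where the savings come from when the same words recur in thousands of names.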

the generator code is part of icu, thus open source. have a look at http://oss.software.ibm.com/developerworks/opensource/cvs/icu/source/tools/gennames/

with a command-line option on the generator tool you can also include the Unicode 1.0 names, so that you can ask for either name of a character. this option bumps the file size to just over 100000 bytes.

icu has an api (u_charName() in icu/source/common/unicode/uchar.h) that will give you the name of each of the 49194 characters in Unicode 3.0, including the algorithmic ones.

markus


