Re: Name Compression. Comparison and Tweaks

From: Markus Scherer
Date: Wed May 17 2000 - 20:29:15 EDT


i just wanted to advertise my own implementation of this, which i did a few months ago for icu. the generated file contains all names as defined in Unicode 3.0.0, allows fast random access, and includes data about the algorithmic names for the cjk and hangul blocks.

the data file is 83860 bytes long. it uses a word list with tweaks as discussed in this thread, and no huffman coding or anything similar.
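to illustrate the general word-list idea (this is only a rough sketch and not the actual icu data format): each distinct word is stored once, each name becomes a short run of word indices, and an offset table gives random access by code point. the function names, the one-byte-per-word token width, and the tiny sample table are all assumptions made for the illustration.

```python
from collections import Counter

# Tiny hypothetical sample; a real table would hold every character name.
SAMPLE_NAMES = {
    0x41: "LATIN CAPITAL LETTER A",
    0x42: "LATIN CAPITAL LETTER B",
    0x61: "LATIN SMALL LETTER A",
    0x62: "LATIN SMALL LETTER B",
}


def build_word_list(names):
    """Return the distinct words, most frequent first (ties sorted alphabetically)."""
    counts = Counter(word for name in names.values() for word in name.split(" "))
    return sorted(counts, key=lambda w: (-counts[w], w))


def compress(names):
    """Encode every name as a run of one-byte word indices inside one blob.

    Assumption for this sketch: at most 256 distinct words, so one byte per
    word is enough. A real format needs a wider or variable-length token.
    """
    words = build_word_list(names)
    if len(words) > 256:
        raise ValueError("sketch supports at most 256 distinct words")
    index_of = {word: i for i, word in enumerate(words)}

    blob = bytearray()
    offsets = {}  # code point -> (start offset in blob, number of tokens)
    for cp in sorted(names):
        tokens = [index_of[word] for word in names[cp].split(" ")]
        offsets[cp] = (len(blob), len(tokens))
        blob.extend(tokens)
    return words, bytes(blob), offsets


def lookup(words, blob, offsets, cp):
    """Random-access lookup: rebuild one name without touching the others."""
    if cp not in offsets:
        return ""
    start, count = offsets[cp]
    return " ".join(words[token] for token in blob[start:start + count])


words, blob, offsets = compress(SAMPLE_NAMES)
print(lookup(words, blob, offsets, 0x61))  # LATIN SMALL LETTER A
```

in this sample the four names take 16 token bytes plus a six-word list, instead of 84 bytes of plain text. the offset table is kept as a python dict here for brevity; a compact file format would pack it as well.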

the generator code is part of icu, thus open source. have a look at

with a command-line option on the generator tool you can also include the Unicode 1.0 names, and you can then ask for either name. this option bumps the file size to just over 100000 bytes.

icu has an api (u_charName() in icu/source/common/unicode/uchar.h) that will give you the name for each of the 49194 characters in Unicode 3.0, including the algorithmic ones.
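for readers unfamiliar with the algorithmic names, here is a minimal sketch of how they can be derived from the code point alone, which is why they need no per-character entry in the name table. it follows the hangul syllable name rule and the "CJK UNIFIED IDEOGRAPH-" naming rule from the Unicode standard. it is not icu's code; the function name algorithmic_name is hypothetical, and the cjk ranges are the ones assumed here for Unicode 3.0.

```python
# Jamo short names used to compose Hangul syllable names.
LEADS = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
         "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
VOWELS = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TRAILS = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]

HANGUL_BASE = 0xAC00
HANGUL_COUNT = len(LEADS) * len(VOWELS) * len(TRAILS)  # 19 * 21 * 28 = 11172

# Assumed Unicode 3.0 ranges: Extension A and the main unified ideograph block.
CJK_RANGES = [(0x3400, 0x4DB5), (0x4E00, 0x9FA5)]


def algorithmic_name(cp):
    """Return the algorithmic name for cp, or None if cp has no such name."""
    if HANGUL_BASE <= cp < HANGUL_BASE + HANGUL_COUNT:
        # Split the syllable index into lead, vowel and trail indices.
        lead, rest = divmod(cp - HANGUL_BASE, len(VOWELS) * len(TRAILS))
        vowel, trail = divmod(rest, len(TRAILS))
        return "HANGUL SYLLABLE " + LEADS[lead] + VOWELS[vowel] + TRAILS[trail]
    for first, last in CJK_RANGES:
        if first <= cp <= last:
            # The name is the fixed prefix plus the code point in uppercase hex.
            return "CJK UNIFIED IDEOGRAPH-%04X" % cp
    return None


print(algorithmic_name(0xD55C))  # HANGUL SYLLABLE HAN
print(algorithmic_name(0x4E00))  # CJK UNIFIED IDEOGRAPH-4E00
```

a name lookup would try a rule like this first and fall back to the stored name table only when it returns None.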

