Re: Name Compression. Comparison and Tweaks

From: Markus Kuhn
Date: Sat May 13 2000 - 04:40:43 EDT

Robert Brady wrote on 2000-05-13 02:04 UTC:
> a) it is still a totally pointless re-invention of axle-based disc
> locomotion. better compression algorithms already exist, and more to the
> point, they have been tested and won't suffer odd bugs due to being
> reimplemented thousands of times.
> b) it is not as easy to implement as a few calls to zlib.

Quite a number of people have already done exactly that in their
applications. It is not very elegant, though.

The problem with zlib is that you have to decompress each time from the
very beginning of the compressed data block. This costs far more time
and memory than the task really needs -> ugly memory bloat. A more
ad-hoc compression method, as outlined by John, Kenneth, and various
other people before, allows direct-access decompression.

What we really need in Unicode libraries is a very memory-efficient,
fast codenumber_to_name() conversion function. Ideally, this should be
based on some common compressed database file /etc/unidata that is
simply memory-mapped by the Unicode library of each application and then
traversed efficiently for each query. It would be extremely helpful if
we could agree (e.g., in the form of a Unicode technical report) on a
file format and standard path for such a compressed /etc/unidata
(or /usr/share/unidata) file, which we could then soon expect on any
recent POSIX installation, just like /etc/services and /etc/passwd.

It is important that the file format be specified, as opposed to just
some API, so that the compressed unidata database becomes binary
compatible across platforms and programming languages.


Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at,  WWW: <>

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT