Name Compression. Comparison and Tweaks

From: Kenneth Whistler (
Date: Fri May 12 2000 - 18:37:03 EDT

John Cowan started the thread re the names list compression,
and before we pack it in for the weekend, I thought a short
summary comparison and some tweaks might be in order.

Approach 1: John Cowan

Create a word frequency table for the entire names list. Recode
the most common words as single bytes. In the Perl code that
John provided, the 124 most common words are recoded as single
bytes, and the rest of the names are left alone.

Claimed compression: about 50%.

Approach 2: Torsten Mohrin

Create a word frequency table for the entire names list. Recode
the 64 most common words as a single byte, and represent all other
words in the list as a two-byte combination. Use 2 bits of each
byte to mark the following character as either EOS, SPACE, hyphus,
or lead byte.

Claimed compression: 262,000 bytes ==> 59,000 bytes for the recoded
names + 29,000 bytes for the code->word lookup table. Total size,
around 88,000 bytes.

Approach 3: Pierpaolo

In a Lisp environment, just put each unique word in the names into
a table, and then treat each name as a vector of vectors.

Claimed compression: not designated. However, I don't think this
approach would be a very competitive one when it comes to compression.

My analysis of the Unicode 3.0.0 names list shows the names alone
as comprising 278,346 bytes (counting a one-byte delimiter between
each) for 10,538 names (omitting the control codes and the ranges
of characters with algorithmically derivable names). If SPACE and
hyphus are taken to be word delimiters for the purposes of name
analysis, the list has 43,430 word tokens in it.

Without doing some other fancy work, representation of a list of
43,430 word tokens by pointing to a table of unique word tokens
is going to take 43,430 pointers, and at 4 bytes each for a typical
32-bit implementation, that is 173,720 bytes, without counting the
table of unique words itself.


Now for the tweaks. In thinking about John Cowan's approach, I
turned up a several quick optimizations.

1a. By giving up on John's ASCII preservation, and using more of
    the available bytes above 0x20 for encoding, you can push the
    number of byte encodings for words from 124 to 184 without
    impinging on control bytes. (Of course, you can go even higher
    if you don't expect the recoded names to be processible as
    C strings and make use of control bytes as well.) The only
    bytes I avoided were for SPACE, '-', '_', '=' (see below),
    'A-Z', '0-9'. All others were fair game for use in the recoding.

1b. By some simple preprocessing of the names list for long,
    multiple-word, recurring patterns, you can end up with those
    complete patterns as "words" for recoding as single bytes.
    This also results in considerable savings. Thus, for example,
    treating "LATIN CAPITAL LETTER" as a unit is much better than
    recoding each individual word with a byte. I did this preprocessing
    by simply inspecting the names list for obvious repeated patterns,
    and then choosing 30 of them for handling as phrases. This was
    easy to do -- just by replacing the SPACE's by "_" for each of
    these patterns, and then letting the frequency algorithm find
    the entire phrases as units for coding assignment. You could do
    the same for phrases such as "LESS-THAN" separated by a hyphus.
    Replace the '-' with a '=' and then let the frequency parser
    find the entire unit. The art here is in picking the right number
    of the right multiple-word patterns to get the best results.

1c. Rather than picking the top candidates for recoding by their
    absolute frequency, a much more optimal solution is to pick
    the top candidates based on their net displacement of characters
    if recoded as a single byte. Thus 100 instances of "TO" would
    only displace 100 bytes if recoded as a single byte, but
    100 instances of "IDEOGRAPH" would displace 800 bytes. Sorting
    by net displacement of characters rather than raw frequency gives
    a considerably better high frequency list to recode using the
    184 single bytes available.

1d. When replacing a word (or phrase) by a single byte, I just swallow
    up any following SPACE character, but leave the less common hyphuses
    alone. When reconstituting the name, any single byte in the recoding
    byte range gets expanded to the full word (or phrase) plus a following
    SPACE, unless the following character is a hyphus or the EOS. This
    allows accurate reconstitution of the names, without having to store
    any bit twiddles, and compresses a lot of additional SPACE characters
    out of the names.

This approach can result in *very* good compressions for common characters
like the Latin, Greek, and Cyrillic character names. LATIN CAPITAL LETTER A,
for example, compresses into two bytes: "LATIN CAPITAL LETTER " + "A".

With all of these optimizations of John Cowan's basic algorithm in place,
you can get the following results:

Using 184 byte codes, with 30 common phrases preprocessed as word units:

   Name data size: 86,427 bytes (including a single byte delimiter between
   Number of unique "words": 4309.
   Number of unique "words" in the code->name table: 184.
   Code->name table size: 1769 bytes (including the delimiters).

This is comparable to the sizes reported for Torsten's approach.
Of course, this is not the *complete* table size. You still need to put
in place an index access mechanism to be able to look up the name strings
from their Unicode value. But that is overhead that any of the compression
approaches will have to deal with.

I know Juliucz looks askance at the creation of new, ad hoc,
compression algorithms by the Unicode crowd. But if your problem is finding
a memory- and computation-efficient way to store the entire Unicode name list
(or some significant subset of it) in memory for rollover hints, or whatever,
the approach I have outlined above results in a pretty good tradeoff and
is rather easy to implement.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT