Name Compression (was: Longest Names)

From: John Cowan (jcowan@reutershealth.com)
Date: Tue May 09 2000 - 15:26:38 EDT


Kenneth Whistler wrote:
 
> My own rule of thumb for processing UnicodeData.txt is to use 128 bytes
> for transient buffers for names -- which gives me a 99.999% confidence
> feeling that future versions of the data file will never break it.
> But for persistent storage I use variable length arrays anyway, since the
> average name length is so much shorter than the longest name length.
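To see how much headroom a fixed buffer leaves, the longest and average name lengths can be measured directly. The following is only a rough sketch: `name_stats` is a name made up for this example, and the sample lines are trimmed to their first three fields for illustration rather than taken whole from a real data file.

```perl
use strict;
use warnings;

# Return (longest, average) name length for UnicodeData-style lines.
sub name_stats {
    my @lines = @_;
    my ($longest, $total, $count) = (0, 0, 0);
    for my $line (@lines) {
        my (undef, $name) = split /;/, $line;
        next if !defined $name || $name =~ /</;   # "<control>" etc. are not names
        my $len = length($name);
        $longest = $len if $len > $longest;
        $total += $len;
        $count++;
    }
    return ($longest, $count ? $total / $count : 0);
}

# Illustrative sample only: three lines cut down to three fields each.
my @sample = (
    '0000;<control>;Cc',
    '0030;DIGIT ZERO;Nd',
    '0041;LATIN CAPITAL LETTER A;Lu',
);
my ($longest, $average) = name_stats(@sample);
printf "longest %d, average %.1f\n", $longest, $average;
# prints "longest 22, average 16.0"
```

Run over a complete data file, the same two numbers show how far the longest name sits from the chosen buffer size.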

That reminds me that it's time to regenerate my Unicode character name
compression code. I might as well distribute it while I'm at it.
Here are two Perl programs: one compresses a UnicodeData file into a
two-column file containing code point and compressed name, and the
other reverses the process. Both load the word table from a file named
"table", which follows them. No copyright, no warranty, use as you will.

The compression reduces the names by about a factor of two; the longest
name is now 50 characters, for U+09F8. The algorithm assigns single bytes to
124 common words in the file, and encodes uncommon words as themselves.
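As a rough illustration of the scheme just described, the sketch below encodes a single name using a four-entry excerpt of the word table. The helper `encode_name` and the hash `%word_code` are names invented for this example; the byte values are the ones in the table attached to this message. Single-letter words become the matching lowercase letter, table words become one byte, and anything else is copied through, with a space only between two adjacent copied words.

```perl
use strict;
use warnings;

# Excerpt of the word table (hypothetical name %word_code); byte values
# copied from the full table attached to this message.
my %word_code = (
    'LETTER', 0x21,
    'WITH',   0x22,
    'SMALL',  0x25,
    'LATIN',  0x27,
);

# Encode one name: a sketch of the three cases, not the full program.
sub encode_name {
    my ($name) = @_;
    my $out = '';
    my $prev_literal = 0;    # true if the previous word was copied as-is
    for my $word (split / /, $name) {
        if ($word =~ /^[A-Z]$/) {
            $out .= lc($word);                 # single letter -> lowercase
            $prev_literal = 0;
        }
        elsif (exists $word_code{$word}) {
            $out .= chr($word_code{$word});    # common word -> one byte
            $prev_literal = 0;
        }
        else {
            $out .= ' ' if $prev_literal;      # separate adjacent literals
            $out .= $word;
            $prev_literal = 1;
        }
    }
    return $out;
}

my $packed = encode_name('LATIN SMALL LETTER A WITH OGONEK');
printf "%d bytes: %s\n", length($packed),
    join(' ', map { sprintf('%02X', ord($_)) } split(//, $packed));
# prints "11 bytes: 27 25 21 61 22 4F 47 4F 4E 45 4B"
```

The 32-character name shrinks to 11 bytes here because five of its six words are either in the table or a single letter; names built mostly from uncommon words gain much less.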

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

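#!/usr/bin/perl
# Compress: read UnicodeData lines, write "codepoint;compressed name".

require "./table";      # defines %map (word <-> byte code, both directions)

while (<>) {
        chomp;
        ($unicode, $name) = split(/;/);
        next if $name =~ /</; # not a name
        @name = split(/ /, $name);
        $new = "";
        $mode = 0;      # 1 if the previous word was written out literally
        foreach $name (@name) {
                if ($name =~ /^[A-Z]$/) {
                        # single-letter word: store as the lowercase letter
                        $new .= chr(ord($name) + 0x20);
                        $mode = 0;
                        }
                elsif ($code = $map{$name}) {
                        # common word: store as its one-byte code
                        $new .= chr($code);
                        $mode = 0;
                        }
                else {
                        # uncommon word: store as itself, with a space
                        # only between two adjacent literal words
                        $new .= " " if $mode;
                        $mode = 1;
                        $new .= $name;
                        }
                }
        print "$unicode;$new\n";
        }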


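#!/usr/bin/perl
# Decompress: read "codepoint;compressed name", write the full name back.

require "./table";      # defines %map (word <-> byte code, both directions)

while (<>) {
        chomp;
        ($unicode, $name) = split(/;/);
        $new = "";
        for ($i = 0; $i < length($name); $i++) {
                $char = substr($name, $i, 1);
                if ($char =~ /[ A-Z0-9-]/) {
                        # part of a literal word: copy through
                        $new .= $char;
                        }
                elsif ($char =~ /[a-z]/) {
                        # lowercase letter: a single-letter word
                        $new .= " " . chr(ord($char) - 32) . " ";
                        }
                else {
                        # any other byte: look up the word it stands for
                        $new .= " " . $map{ord($char)} . " ";
                        }
                }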
        $new =~ s/ +/ /g;       # collapse the doubled padding spaces
        $new =~ s/^ //;
        $new =~ s/ $//;
        print "$unicode;$new\n";
        }
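The expansion step pads every restored word with a space on each side and tidies up afterwards. A minimal sketch of that step on one compressed name follows; `decode_name` and `%code_word` are names invented for this example, and the four table entries are an excerpt of the table attached to this message.

```perl
use strict;
use warnings;

# Excerpt of the code-to-word direction of the table (hypothetical
# name %code_word); values copied from the table in this message.
my %code_word = (
    0x21, 'LETTER',
    0x22, 'WITH',
    0x25, 'SMALL',
    0x27, 'LATIN',
);

# Decode one compressed name: a sketch, not the full program.
sub decode_name {
    my ($packed) = @_;
    my $out = '';
    for my $char (split //, $packed) {
        if ($char =~ /[ A-Z0-9-]/) {
            $out .= $char;                                # literal text
        }
        elsif ($char =~ /[a-z]/) {
            $out .= ' ' . uc($char) . ' ';                # single-letter word
        }
        else {
            $out .= ' ' . $code_word{ord($char)} . ' ';   # table word
        }
    }
    $out =~ s/ +/ /g;      # padding leaves doubled spaces; collapse them
    $out =~ s/^ | $//g;    # trim both ends
    return $out;
}

print decode_name("\x27\x25\x21" . "a" . "\x22" . "OGONEK"), "\n";
# prints "LATIN SMALL LETTER A WITH OGONEK"
```

Padding on both sides and cleaning up once at the end keeps the per-character loop free of any look-ahead.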


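# table: word table shared by both programs.  The first half maps byte
# code to word (used when expanding), the second half maps word to byte
# code (used when compressing).
%map = (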
0x21, "LETTER",
0x22, "WITH",
0x23, "SYLLABLE",
0x24, "YI",
0x25, "SMALL",
0x26, "ARABIC",
0x27, "LATIN",
0x28, "FORM",
0x29, "CAPITAL",
0x2A, "SYLLABICS",
0x2B, "CANADIAN",
0x2C, "LIGATURE",
0x2E, "HANGUL",
0x2F, "AND",
0x3A, "CJK",
0xFD, "SIGN",
0x3C, "RADICAL",
0x3D, "ETHIOPIC",
0x3E, "GREEK",
0x3F, "COMPATIBILITY",
0x40, "FINAL",
0x5B, "DIGIT",
0x5C, "PATTERN",
0x5D, "BRAILLE",
0x5E, "SQUARE",
0x5F, "CIRCLED",
0x60, "SYMBOL",
0x7B, "CYRILLIC",
0x7C, "ISOLATED",
0x7D, "KANGXI",
0x7E, "ABOVE",
0xA0, "VOWEL",
0xA1, "KATAKANA",
0xA2, "TIBETAN",
0xA3, "MEEM",
0xA4, "CARRIER",
0xA5, "INITIAL",
0xA6, "BELOW",
0xA7, "YEH",
0xA8, "DOT",
0xA9, "MONGOLIAN",
0xAA, "RIGHT",
0xAB, "MARK",
0xAC, "LEFT",
0xAD, "ARROW",
0xAE, "BOX",
0xAF, "FOR",
0xB0, "HEBREW",
0xB1, "DRAWINGS",
0xB2, "DOUBLE",
0xB3, "HALFWIDTH",
0xB4, "ALEF",
0xB5, "COMBINING",
0xB6, "HEAVY",
0xB7, "IDEOGRAPHIC",
0xB8, "PARENTHESIZED",
0xB9, "TO",
0xBA, "DEVANAGARI",
0xBB, "WHITE",
0xBC, "KHMER",
0xBD, "FULLWIDTH",
0xBE, "HAH",
0xBF, "VERTICAL",
0xC0, "JEEM",
0xC1, "CHOSEONG",
0xC2, "CHARACTER",
0xC3, "ARMENIAN",
0xC4, "BENGALI",
0xC5, "WEST-CREE",
0xC6, "THAI",
0xC7, "HIRAGANA",
0xC8, "ACUTE",
0xC9, "TWO",
0xCA, "IDEOGRAPH",
0xCB, "HOOK",
0xCC, "CHEROKEE",
0xCD, "BLACK",
0xCE, "ONE",
0xCF, "MEDIAL",
0xD0, "DASIA",
0xD1, "LIGHT",
0xD2, "JONGSEONG",
0xD3, "RUNIC",
0xD4, "TELUGU",
0xD5, "SINHALA",
0xD6, "KANNADA",
0xD7, "ORIYA",
0xD8, "PSILI",
0xD9, "MYANMAR",
0xDA, "MALAYALAM",
0xDB, "GUJARATI",
0xDC, "GEORGIAN",
0xDD, "GURMUKHI",
0xDE, "THREE",
0xDF, "SYRIAC",
0xE0, "CIRCUMFLEX",
0xE1, "DIAERESIS",
0xE2, "HAMZA",
0xE3, "FUNCTIONAL",
0xE4, "APL",
0xE5, "TELEGRAPH",
0xE6, "MAKSURA",
0xE7, "RIGHTWARDS",
0xE8, "JUNGSEONG",
0xE9, "TILDE",
0xEA, "DOWN",
0xEB, "UP",
0xEC, "LAO",
0xED, "BOPOMOFO",
0xEE, "VARIA",
0xEF, "OXIA",
0xF0, "GRAVE",
0xF1, "TAMIL",
0xF2, "KHAH",
0xF3, "MACRON",
0xF4, "MODIFIER",
0xF5, "HALF",
0xF6, "EQUAL",
0xF7, "OMEGA",
0xF8, "NUMBER",
0xF9, "VOCALIC",
0xFA, "ALPHA",
0xFB, "LAM",
0xFC, "ACCENT",
"LETTER", 0x21,
"WITH", 0x22,
"SYLLABLE", 0x23,
"YI", 0x24,
"SMALL", 0x25,
"ARABIC", 0x26,
"LATIN", 0x27,
"FORM", 0x28,
"CAPITAL", 0x29,
"SYLLABICS", 0x2A,
"CANADIAN", 0x2B,
"LIGATURE", 0x2C,
"HANGUL", 0x2E,
"AND", 0x2F,
"CJK", 0x3A,
"SIGN", 0xFD,
"RADICAL", 0x3C,
"ETHIOPIC", 0x3D,
"GREEK", 0x3E,
"COMPATIBILITY", 0x3F,
"FINAL", 0x40,
"DIGIT", 0x5B,
"PATTERN", 0x5C,
"BRAILLE", 0x5D,
"SQUARE", 0x5E,
"CIRCLED", 0x5F,
"SYMBOL", 0x60,
"CYRILLIC", 0x7B,
"ISOLATED", 0x7C,
"KANGXI", 0x7D,
"ABOVE", 0x7E,
"VOWEL", 0xA0,
"KATAKANA", 0xA1,
"TIBETAN", 0xA2,
"MEEM", 0xA3,
"CARRIER", 0xA4,
"INITIAL", 0xA5,
"BELOW", 0xA6,
"YEH", 0xA7,
"DOT", 0xA8,
"MONGOLIAN", 0xA9,
"RIGHT", 0xAA,
"MARK", 0xAB,
"LEFT", 0xAC,
"ARROW", 0xAD,
"BOX", 0xAE,
"FOR", 0xAF,
"HEBREW", 0xB0,
"DRAWINGS", 0xB1,
"DOUBLE", 0xB2,
"HALFWIDTH", 0xB3,
"ALEF", 0xB4,
"COMBINING", 0xB5,
"HEAVY", 0xB6,
"IDEOGRAPHIC", 0xB7,
"PARENTHESIZED", 0xB8,
"TO", 0xB9,
"DEVANAGARI", 0xBA,
"WHITE", 0xBB,
"KHMER", 0xBC,
"FULLWIDTH", 0xBD,
"HAH", 0xBE,
"VERTICAL", 0xBF,
"JEEM", 0xC0,
"CHOSEONG", 0xC1,
"CHARACTER", 0xC2,
"ARMENIAN", 0xC3,
"BENGALI", 0xC4,
"WEST-CREE", 0xC5,
"THAI", 0xC6,
"HIRAGANA", 0xC7,
"ACUTE", 0xC8,
"TWO", 0xC9,
"IDEOGRAPH", 0xCA,
"HOOK", 0xCB,
"CHEROKEE", 0xCC,
"BLACK", 0xCD,
"ONE", 0xCE,
"MEDIAL", 0xCF,
"DASIA", 0xD0,
"LIGHT", 0xD1,
"JONGSEONG", 0xD2,
"RUNIC", 0xD3,
"TELUGU", 0xD4,
"SINHALA", 0xD5,
"KANNADA", 0xD6,
"ORIYA", 0xD7,
"PSILI", 0xD8,
"MYANMAR", 0xD9,
"MALAYALAM", 0xDA,
"GUJARATI", 0xDB,
"GEORGIAN", 0xDC,
"GURMUKHI", 0xDD,
"THREE", 0xDE,
"SYRIAC", 0xDF,
"CIRCUMFLEX", 0xE0,
"DIAERESIS", 0xE1,
"HAMZA", 0xE2,
"FUNCTIONAL", 0xE3,
"APL", 0xE4,
"TELEGRAPH", 0xE5,
"MAKSURA", 0xE6,
"RIGHTWARDS", 0xE7,
"JUNGSEONG", 0xE8,
"TILDE", 0xE9,
"DOWN", 0xEA,
"UP", 0xEB,
"LAO", 0xEC,
"BOPOMOFO", 0xED,
"VARIA", 0xEE,
"OXIA", 0xEF,
"GRAVE", 0xF0,
"TAMIL", 0xF1,
"KHAH", 0xF2,
"MACRON", 0xF3,
"MODIFIER", 0xF4,
"HALF", 0xF5,
"EQUAL", 0xF6,
"OMEGA", 0xF7,
"NUMBER", 0xF8,
"VOCALIC", 0xF9,
"ALPHA", 0xFA,
"LAM", 0xFB,
"ACCENT", 0xFC,
        );
1;
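The table above writes out both directions by hand. One way to keep the two halves in sync is to write a single direction and derive the other. The sketch below does that for a three-entry excerpt and also checks that no code collides with a character the expansion step copies through literally. It is an illustration, not a drop-in replacement: `%code_word` and `%word_code` are hypothetical names, and the two programs expect a single `%map`. Including `;` in the collision check is an assumption on my part, based on both programs splitting fields on it and the table skipping 0x3B.

```perl
use strict;
use warnings;

# One direction only (hypothetical name %code_word); values copied
# from the table in this message.
my %code_word = (
    0x21, 'LETTER',
    0x22, 'WITH',
    0xFD, 'SIGN',
);

# Codes must not be characters that are copied through as literal text
# (space, letters, digits, hyphen) or the field separator (assumed: ';').
for my $code (keys %code_word) {
    my $char = chr($code);
    die sprintf("code 0x%02X collides with literal text\n", $code)
        if $char =~ /[ A-Za-z0-9;-]/;
}

# reverse() swaps keys and values in the flattened hash; this is safe
# only because every word appears once.
my %word_code = reverse %code_word;

printf "%s -> 0x%02X\n", $_, $word_code{$_} for sort keys %word_code;
# prints:
# LETTER -> 0x21
# SIGN -> 0xFD
# WITH -> 0x22
```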


