Re: Compression and Unicode [was: Name Compression]

From: Juliusz Chroboczek
Date: Fri May 12 2000 - 09:45:03 EDT

Asmus Freytag:

AF> You apparently don't seem to realize that SCSU bridges the gap between
AF> an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the
AF> extra redundancy that is part of the encoding (sequences of every other
AF> byte being null) and not a redundancy in the content. The output of SCSU
AF> should be sent to LZW for block compression where that's desired.

After re-reading the SCSU (sorry for the typo) definition, I realise
that the use of SCSU as a predictor for LZW or arithmetic coding does
indeed make sense. Contrary to what I said earlier, I have convinced
myself that using the SCSU in this manner might be a significant win
for some scripts.

This fully answers my question about the rationale for the SCSU.
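
A minimal sketch of the effect under discussion, with two loudly stated
assumptions. First, Python's zlib (DEFLATE) stands in for LZW, since the
standard library has no LZW coder. Second, the SCSU stage is approximated
by a plain Latin-1 encoding: to my understanding, an SCSU encoder in its
initial state emits printable ASCII text and line feeds as the same single
bytes, but this shortcut is not an SCSU implementation and does not
generalise to other scripts. The sample names are illustrative only.

```python
import zlib

# Illustrative sample: printable ASCII only, so the 8-bit form below is
# assumed to match what SCSU would emit in its initial state.
names = [
    "LATIN CAPITAL LETTER A",
    "LATIN CAPITAL LETTER B",
    "LATIN SMALL LETTER A WITH GRAVE",
    "GREEK SMALL LETTER ALPHA",
    "CYRILLIC CAPITAL LETTER ZHE",
]
text = "\n".join(names)

# 16-bit form: for ASCII text every other byte is 0x00.
utf16 = text.encode("utf-16-be")

# Stand-in for the SCSU output (assumption, see above): one byte per character.
eight_bit = text.encode("latin-1")


def deflated_size(data: bytes) -> int:
    """Size in bytes after DEFLATE at maximum level (stand-in for LZW)."""
    return len(zlib.compress(data, 9))


print("raw bytes:     ", len(utf16), "vs", len(eight_bit))
print("deflated bytes:", deflated_size(utf16), "vs", deflated_size(eight_bit))
```

Comparing the two "deflated" figures on a larger and more varied sample
would give a rough idea of how much the interleaved null bytes cost a
byte-oriented block compressor.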

AF> Another design point of SCSU is that it is editable (you can
AF> replace a piece in the middle, w/o having to change the stuff at
AF> the beginning or the end.)

Agreed, although you need to carefully keep track of the state of the
encoder when you do this.

AF> Another factor is probably (I didn't check this) the different
AF> ability to do semi random accesses into the middle of compressed
AF> text.

I don't think this is important. If you encode each character name
separately, you get as much random access ability as you might
reasonably need. Most character names are a dozen or so characters
long, and decoding all of one name in order to access a substring is
a most reasonable approach.
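
A rough sketch of what I mean by encoding each name separately. The
functions encode_name, decode_name, build_table and lookup are
hypothetical names introduced for this illustration, and Latin-1 is only
a placeholder for whatever per-name codec is actually chosen. The point
is the offset table: any single name can be decoded without touching
its neighbours.

```python
def encode_name(name: str) -> bytes:
    # Placeholder codec (assumption): swap in the real per-name encoder.
    return name.encode("latin-1")


def decode_name(data: bytes) -> str:
    # Inverse of the placeholder codec above.
    return data.decode("latin-1")


def build_table(names):
    """Concatenate independently encoded names and record their offsets."""
    blob = bytearray()
    offsets = [0]
    for name in names:
        blob += encode_name(name)
        offsets.append(len(blob))  # end of this name, start of the next
    return bytes(blob), offsets


def lookup(blob: bytes, offsets, index: int) -> str:
    """Decode only the name at the given index."""
    return decode_name(blob[offsets[index]:offsets[index + 1]])


names = [
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "CYRILLIC CAPITAL LETTER ZHE",
]
blob, offsets = build_table(names)

# Substring access: decode the whole (short) name, then search in it.
print("ALPHA" in lookup(blob, offsets, 1))
```

Since each entry is only a dozen or so characters, decoding the whole
entry before searching it costs very little.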

AF> PS:) coders will be coders, they like to invent new coding schemes

He nodded, wiping a tear with a distraught gesture.

Thanks for your answers,

