Re: DUDE-8, a compression proposal

From: Mark Davis (mark@macchiato.com)
Date: Sat Jun 23 2001 - 16:09:03 EDT


When we were doing BOCU, I thought of a way to allow semi-random access.
This also applies to DUDE-8.

Reserve a particular byte value (or sequence if you have to) as a
synchronization token. Every 33 (or so) characters, emit one of those
sequences. You can then access into a buffer, and backup at most 33 x 3
bytes to find a synchronization point. The overhead is only about 33% (if
you can use a single byte, as in BOCU).

A stronger notion of random access allows you to not only get to character
boundaries, but also to determine which character (from 0 .. n) you are on
in the buffer. That can also be done, with a little bit more work. Left to
the reader.

Mark

----- Original Message -----
From: "John Cowan" <cowan@mercury.ccil.org>
To: <unicode@unicode.org>
Sent: Friday, June 22, 2001 18:23
Subject: DUDE-8, a compression proposal

> It's DUDE-8! It's quick! It's easy! It does it all!
>
> The DUDE-8 algorithm compresses 21-bit Unicode Scalar Values into bytes.
> It is based on the DUDE(-6) algorithm currently being proposed for
> i18n of DNS names.
>
> 1. Let prev = 0.
> 2. For each Unicode Scalar value usv, let prev = prev xor USV.
> 3. Let the value of prev be expressed in 21 bits as
xxxxxxxyyyyyyyzzzzzzz.
> 4. Encode prev in 3 bytes as 0xxxxxxx 0yyyyyyy 1zzzzzzz.
> 5. Emit all non-zero bytes.
> 6. Repeat until done.
>
> The optional signature of DUDE-8 is 03 FD FF, which is how U+FEFF appears
at the
> beginning of a DUDE-8 compression stream.
>
> DUDE-8 is simpler and simpler than SCSU, but doesn't allow recovery from
> garbles or even partial random or backwards access.
>
> --
> John Cowan cowan@ccil.org
> One art/there is/no less/no more/All things/to do/with sparks/galore
> --Douglas Hofstadter
>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT