Re: Compression - binary ordered

From: Mark Davis (markdavis34@home.com)
Date: Fri Jun 01 2001 - 12:54:18 EDT

Next message: Edward Cherlin: "Silliness (was RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))"
Previous message: Mark Davis: "Re: Compression - binary ordered"
In reply to: Carl W. Brown: "RE: Compression - binary ordered"

Thanks for your comments.

We aren't worried about the transitions to and from space around
supplementary characters, since as a fraction of all text those characters
will be exceedingly rare (< 0.01% by our estimate).

As to Korean, it might save some storage to always reset at space, but
(a) I don't see an obvious way to modify the algorithm and still preserve
binary order, which is, after all, the whole point.
(b) the algorithm only takes one "long jump" for a single space, which is
the typical case. Always adding a Resync byte would probably increase the
storage needed. One would have to run the numbers again.

Mark

----- Original Message -----
From: "Carl W. Brown" <cbrown@xnetinc.com>
To: "Unicode" <unicode@unicode.org>
Sent: Friday, June 01, 2001 09:26
Subject: RE: Compression - binary ordered

> Mark,
>
> This sounds like a great idea. I was wondering, however, whether spaces
> in non-plane-0 character sets will cause problems with the compression
> efficiency. Maybe you should consider a special case for spaces.
>
> Maybe you could use something like offsetting the displacement values to
> accommodate special markings.
>
> Encoded Offset    Actual Offset
>       +2          +1
>       +1           0
>        0          Space character
>       -1          Restart next character from offset 0 (Resync)
>       -2          -1
>       -3          -2
>
> If nothing else it should give better Korean compression.
>
> The Resync could be used prior to a null character as a string
> terminator. Nulls not preceded by a Resync are not terminating nulls.
>
> This scheme would require a slight modification to comparison routines.
> However, you should still be able to compare without full decoding. Resync
> will cause a problem with compares. The space insertion will only require
> minor adjustments.
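
The offset remapping in the table above can be sketched in a few lines of
Python. This is only an illustration of the table as written, not code from
the BOCU draft: the names encode_offset, decode_offset, SPACE and RESYNC are
hypothetical, and the sketch assumes offsets are plain signed integers with
the two encoded values 0 and -1 reserved for the special markings.

```python
# Hypothetical names throughout; this is not the BOCU draft's code.
SPACE = "SPACE"    # marker for the reserved encoded value 0
RESYNC = "RESYNC"  # marker for the reserved encoded value -1


def encode_offset(actual):
    """Map an actual offset to its encoded value, skipping 0 and -1."""
    return actual + 1 if actual >= 0 else actual - 1


def decode_offset(encoded):
    """Invert encode_offset, returning a marker for the reserved values."""
    if encoded == 0:
        return SPACE
    if encoded == -1:
        return RESYNC
    return encoded - 1 if encoded > 0 else encoded + 1


# Reproduce the rows of the table, highest encoded value first.
table = [(e, decode_offset(e)) for e in range(2, -4, -1)]
for encoded, actual in table:
    print(f"{encoded:+d} -> {actual}")
```

Because encode_offset is strictly increasing, ordinary offsets keep their
relative order after the remapping. The sketch says nothing about how the
Space and Resync values should sort against them, which is the part that
affects comparison.
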
>
> Carl
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Mark Davis
> Sent: Thursday, May 31, 2001 11:27 PM
> To: Unicode
> Cc: Unicore
> Subject: Compression - binary ordered
>
>
> As a by-product of our recent work on collation, we developed a method of
> Unicode compression that is similar to SCSU, in that small alphabets are
> about a byte per character and large alphabets are about two bytes per
> character.
>
> The main difference from SCSU is that this method preserves binary order.
> As this is a hot topic right now, I thought it might be of interest. The
> latest draft description is on
> http://oss.software.ibm.com/icu/develop/bocu.htm.
> Comments are welcome.
>
> Mark
> —————
>
> πάντων µέτρον ἄνθρωπος — Πρωταγόρας
>
> [http://www.macchiato.com]

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT