Thanks for your comments.
We aren't worried about the transition to and from space for supplementary
characters, since as a fraction of all text they will be exceedingly rare
(< 0.01%, by our estimate).
As to Korean, it might save some storage to always reset at space, but:
(a) I don't see an obvious way to modify the algorithm and still preserve
binary order, which is, after all, the whole point.
(b) The algorithm takes only one "long jump" for a single space, which is
the typical case. Always adding a Resync byte would probably degrade
storage efficiency. One would have to run the numbers again.
Mark
----- Original Message -----
From: "Carl W. Brown" <cbrown@xnetinc.com>
To: "Unicode" <unicode@unicode.org>
Sent: Friday, June 01, 2001 09:26
Subject: RE: Compression - binary ordered
> Mark,
>
> This sounds like a great idea. I was wondering, however, whether spaces in
> non-plane-0 character sets will cause problems with the compression
> efficiency. Maybe you should consider a special case for spaces.
>
> Maybe you could use something like offsetting the displacement values to
> accommodate special markings.
>
> Encoded Offset    Actual Offset
> +2                +1
> +1                0
> 0                 Space character
> -1                Restart next character from offset 0 (Resync)
> -2                -1
> -3                -2
>
> If nothing else it should give better Korean compression.
>
> The Resync could be used prior to a null character as a string terminator.
> Nulls not preceded by a resync are not termination nulls.
>
> This scheme would require a slight modification to comparison routines.
> However, you should still be able to compare without full decoding. Resync
> will cause a problem with compares. The space insertion will only require
> minor adjustments.
>
> Carl
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> On Behalf Of Mark Davis
> Sent: Thursday, May 31, 2001 11:27 PM
> To: Unicode
> Cc: Unicore
> Subject: Compression - binary ordered
>
>
> As a by-product of our recent work on collation, we developed a method of
> Unicode compression that is similar to SCSU, in that small alphabets are
> about a byte per character and large alphabets are about two bytes per
> character.
>
> The main difference from SCSU is that this method preserves binary order.
> As this is a hot topic right now, I thought it might be of interest. The
> latest draft description is on
> http://oss.software.ibm.com/icu/develop/bocu.htm.
> Comments are welcome.
>
> Mark
> —————
>
> πάντων μέτρον ἄνθρωπος — Πρωταγόρας
>
> [http://www.macchiato.com]
>
>
>
>
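For illustration, the encoded-offset table in Carl's message can be read as a
small mapping, sketched below in Python. This is only one reading of that
table, not code from either author or from the draft description; the names
encode_offset, decode_offset, SPACE and RESYNC are hypothetical, and the
sketch says nothing about how the values would be serialized to bytes.

```python
# Hypothetical markers for the two reserved encoded values in the table.
SPACE = "SPACE"
RESYNC = "RESYNC"


def encode_offset(actual):
    """Map an actual offset to its encoded value, per the quoted table.

    Offsets >= 0 shift up by one and offsets < 0 shift down by one,
    which leaves the encoded values 0 and -1 free for the two markers.
    """
    if actual >= 0:
        return actual + 1
    return actual - 1


def decode_offset(encoded):
    """Inverse mapping: return SPACE, RESYNC, or the actual offset."""
    if encoded == 0:
        return SPACE
    if encoded == -1:
        return RESYNC
    if encoded > 0:
        return encoded - 1
    return encoded + 1


if __name__ == "__main__":
    # Walk the same rows as the table, top to bottom.
    for encoded in (2, 1, 0, -1, -2, -3):
        print(f"{encoded:+d} -> {decode_offset(encoded)}")
```

Under this reading the mapping stays monotonic for ordinary offsets, since a
larger actual offset always gets a larger encoded value. Whether the two
reserved values can coexist with binary ordering of the whole encoded string
is exactly the open question raised in point (a) of the reply.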