Mark,
This sounds like a great idea. I was wondering, however, whether spaces in non-plane 0 character sets will cause problems with compression efficiency. Maybe you should consider a special case for spaces.
Maybe you could offset the displacement values to make room for special markers, something like this:
Encoded Offset    Actual Offset
    +2            +1
    +1             0
     0            Space character
    -1            Restart next character from offset 0 (Resync)
    -2            -1
    -3            -2
If nothing else, it should give better Korean compression.
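To make the table concrete, here is a rough sketch of the mapping in Python. It works on signed offsets only, not on the byte serialization in Mark's draft, and it rests on assumptions of mine that the table does not spell out: the base starts at 0, a space leaves the base unchanged, and a Resync resets the base to 0. The names shift, unshift, encode and decode are made up for illustration.

```python
SPACE = 0    # encoded value reserved for the space character
RESYNC = -1  # encoded value reserved for "restart from offset 0"


def shift(actual):
    """Map an actual offset to its encoded value, skipping 0 and -1."""
    return actual + 1 if actual >= 0 else actual - 1


def unshift(encoded):
    """Inverse of shift(); not meaningful for SPACE or RESYNC."""
    return encoded - 1 if encoded > 0 else encoded + 1


def encode(text):
    """Turn text into a list of encoded offsets."""
    prev = 0  # assumption: the base starts at 0
    out = []
    for ch in text:
        if ch == " ":
            out.append(SPACE)  # assumption: the base is left unchanged
            continue
        cp = ord(ch)
        out.append(shift(cp - prev))
        prev = cp
    return out


def decode(values):
    """Rebuild text from a list of encoded offsets."""
    prev = 0
    chars = []
    for v in values:
        if v == SPACE:
            chars.append(" ")
        elif v == RESYNC:
            prev = 0  # next character is coded from offset 0
        else:
            prev += unshift(v)
            chars.append(chr(prev))
    return "".join(chars)
```

With this, encode("a b") gives [98, 0, 2]: the space costs one small value and the "b" is still coded relative to the "a", which is the effect I am hoping for between Hangul syllables.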
A Resync followed by a null character could serve as a string terminator. A null not preceded by a Resync would not be a terminating null.
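A small sketch of how the terminator might be recognized, again on signed offsets rather than bytes. It assumes that after a Resync the base is 0, so the terminating null has actual offset 0 and therefore encoded value +1. The function names are hypothetical.

```python
RESYNC = -1            # encoded value from the table: restart from offset 0
NULL_AFTER_RESYNC = 1  # U+0000 coded from base 0: actual offset 0, encoded +1


def terminate(values):
    """Append the Resync + null terminator to a list of encoded offsets."""
    return list(values) + [RESYNC, NULL_AFTER_RESYNC]


def find_terminator(values):
    """Return the index of the terminating Resync, or -1 if there is none."""
    for i in range(len(values) - 1):
        if values[i] == RESYNC and values[i + 1] == NULL_AFTER_RESYNC:
            return i
    return -1
```

For example, "a", an embedded null, then "b" would be [98, -98, 99] under the table, and the embedded null is not mistaken for the end; after terminate() the terminator is found at index 3.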
This scheme would require a slight modification to comparison routines, but you should still be able to compare without full decoding. Resync will cause a problem for comparisons; the space insertion will only require minor adjustments.
Carl
-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Mark Davis
Sent: Thursday, May 31, 2001 11:27 PM
To: Unicode
Cc: Unicore
Subject: Compression - binary ordered
As a by-product of our recent work on collation, we developed a method of
Unicode compression that is similar to SCSU, in that small alphabets are
about a byte per character and large alphabets are about two bytes per
character.
The main difference from SCSU is that this method preserves binary order. As
this is a hot topic right now, I thought it might be of interest. The latest
draft description is on http://oss.software.ibm.com/icu/develop/bocu.htm.
Comments are welcome.
Mark
—————
πάντων μέτρον ἄνθρωπος — Πρωταγόρας