From: Doug Ewell (dewell@adelphia.net)
Date: Wed Jan 21 2004 - 01:59:53 EST
Elliotte Rusty Harold <elharo at metalab dot unc dot edu> wrote:
>> BZZZT! Sorry, thanks for playing. You can't get the
>> advantages of both with no drawbacks. Given the octets 0x5B5B, how
>> would you know if you had "[[" or a Chinese character?
>
> Actually, it looks like SCSU may do exactly that. If I'm
> understanding the algorithms, it actually encodes most BMP characters
> in a single byte, compressing quite a bit better than my naive idea
> to switch between UTF-8 and UTF-16.
I too missed the point in Elliotte's original post that it was OK for
this transformation to be stateful. Since that is the case, SCSU
definitely will fit the bill.
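
The 0x5B5B objection quoted above is easy to reproduce, and it shows why a stateless mix of UTF-8 and UTF-16 cannot work while a stateful format can. A minimal sketch, assuming big-endian UTF-16 (the byte order is my assumption, not something stated in the thread):

```python
# The same two octets, read under two different encodings.
octets = bytes([0x5B, 0x5B])

as_utf8 = octets.decode("utf-8")        # two ASCII left brackets
as_utf16 = octets.decode("utf-16-be")   # one CJK ideograph, U+5B5B

print(repr(as_utf8), len(as_utf8))
print(f"U+{ord(as_utf16):04X}", len(as_utf16))
```

Without some out-of-band state (a mode tag earlier in the stream, as in SCSU), a decoder has no way to pick between the two readings.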
> All schemes I've seen do involve some sort of flag characters in the
> data stream to switch between different code ranges. As long as you
> can keep the number of flag characters added down below the savings,
> you're good to go. My original idea was to simply use a null to
> switch between ASCII and UTF-16. SCSU looks a lot more sophisticated.
SCSU *can be* a lot more sophisticated, but as Markus noted, a subset of
full-blown SCSU will often achieve really good compression.
> Of course, neither of those schemes will compress truly random data,
> but most data isn't random.
No scheme will compress truly random data, at least not consistently.
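
The reason is a counting argument: there are 2^n distinct n-bit inputs but only 2^n - 1 bit strings that are shorter, so no lossless scheme can shorten every input. A rough illustration using zlib as a stand-in general-purpose compressor (the helper name `compression_ratio` and the sample sizes are mine, chosen only for the demonstration):

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data, 9)) / len(data)


random_bytes = os.urandom(100_000)                  # effectively incompressible
text_bytes = b"most data isn't random. " * 4_000    # highly repetitive text

print(f"random: {compression_ratio(random_bytes):.3f}")
print(f"text:   {compression_ratio(text_bytes):.3f}")
```

The random input should come out at or slightly above its original size, while the repetitive text shrinks dramatically.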
>> Hmmm - again, this may be asking for too much. The
>> UTF-8/UTF-16 transform is pretty simple. Is it bogging you down?
>
> It is a noticeable point in my profiling. I really did have to make a
> choice between speed and space here. According to
> http://www.unicode.org/notes/tn6/#Performance it looks like SCSU is
> faster for a lot of languages but 10-25% slower for English, French
> and Japanese than the UTF-8/UTF-16 conversion.
If you are using the "mini" version of SCSU where Latin-1 characters are
stored as 1 byte each and everything else is stored as UTF-16 (using SCU
and UC0 tags to switch between modes), you ought to achieve really good
speed.
>> If space usage is random/indeterminate/evenly distributed, then,
>> assuming that any given string is primarily in a single language, a
>> TLV type discriminating between UTF-8 and UTF-16 should do nicely.
>> Precede each string with an OR of the MSB (0 for UTF-8, 1 for UTF-16)
>> and the length, in octets, of the string (therefore max of 32,767
>> octets per string, which shouldn't ordinarily be a problem).
>
> That would be a problem. I definitely cannot rule out long strings,
> where long is quite a bit larger than 32K.
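
For reference, the TLV proposal quoted above can be sketched as follows. The 16-bit big-endian header, the big-endian UTF-16 payload, the "pick whichever encoding is shorter" rule, and the function names are all my assumptions; the quoted text specifies only the MSB flag and the 15-bit octet length.

```python
import struct

MAX_PAYLOAD = 0x7FFF   # 15 bits of length: 32,767 octets
UTF16_FLAG = 0x8000    # MSB set means the payload is UTF-16


def tlv_encode(text: str) -> bytes:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    use_utf16 = len(utf16) < len(utf8)       # assumed selection rule
    payload = utf16 if use_utf16 else utf8
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("string too long for a 15-bit length field")
    header = (UTF16_FLAG if use_utf16 else 0) | len(payload)
    return struct.pack(">H", header) + payload


def tlv_decode(data: bytes) -> str:
    (header,) = struct.unpack_from(">H", data)
    length = header & MAX_PAYLOAD
    payload = data[2:2 + length]
    return payload.decode("utf-16-be" if header & UTF16_FLAG else "utf-8")
```

The `ValueError` branch is exactly the limitation objected to above: any string whose encoded form exceeds 32,767 octets cannot be represented without widening the header.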
Despite the often-stated claims that SCSU and BOCU-1 are "optimized for
short strings," they work just as well on arbitrarily long strings.
It's just that the performance of general-purpose compression schemes
gets *much* better as the input text gets larger, so the relative
benefit of SCSU and BOCU-1 (compared to GP compression) is greatly
reduced. But for an internal-storage need like Elliotte's, and
especially where speed and simplicity are important, the compression
formats look like winners.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/