Re: Worst case scenarios on SCSU

From: DougEwell2@cs.com
Date: Fri Nov 02 2001 - 01:20:14 EST


In a message dated 2001-10-31 15:54:34 Pacific Standard Time,
dstarner98@aasaa.ofe.org writes:

> Has anyone done worst case scenarios on SCSU, with respect to other
> methods of encoding Unicode characters?

In addition to theoretical worst-case scenarios, it might also be worthwhile
to consider the practical limitations of certain encoders. SCSU does not
require an encoder to utilize the entire syntax, so in the extreme case, a
maximally stupid SCSU "compressor" could simply quote every character as
Unicode:

    SQU hi-byte lo-byte SQU hi-byte lo-byte ...

This would result in a uniform 50% expansion over UTF-16, which is pretty bad.
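As an illustration only (not anyone's real encoder), the quote-everything
scheme can be sketched in a few lines of Python. SQU is the one-byte tag
0x0E in SCSU's single-byte mode; each BMP character then costs three bytes
instead of UTF-16's two:

```python
# Hypothetical "maximally stupid" SCSU compressor: quote every BMP
# character as raw Unicode with SQU (0x0E), i.e. 3 bytes per character.
SQU = 0x0E

def naive_scsu_encode(text):
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        assert cp <= 0xFFFF  # BMP only, for this illustration
        out += bytes([SQU, cp >> 8, cp & 0xFF])
    return bytes(out)

encoded = naive_scsu_encode("hello")
# 3 bytes per character vs. 2 in UTF-16: the uniform 50% expansion
print(len(encoded), len("hello".encode("utf-16-be")))  # 15 10
```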

On a more realistic level, even "good" SCSU encoders are not required by the
specification to be infinitely intelligent and clever in their encoding. For
example, I think my encoder is pretty decent, but it encodes the Japanese
example in UTS #6 in 180 bytes rather than the 178 bytes illustrated in the
report. This is because the Japanese data contains a couple of sequences of
the form <kanji><kana><kanji>, where <kanji> are not compressible and <kana>
are. If there is only one <kana> between the two <kanji>, as in this case,
it is more efficient to just stay in Unicode mode for the <kana> rather than
switching modes. My encoder isn't currently bright enough to figure this
out. In the worst case, then, a long-enough sequence of
<kana><kanji><kana><kanji>... would take 5 bytes for every 2 BMP characters.

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Fri Nov 02 2001 - 02:14:40 EST