In a message dated 2001-10-31 15:54:34 Pacific Standard Time, 
dstarner98@aasaa.ofe.org writes:
>  Has anyone done worst-case scenarios on SCSU, with respect to other
>  methods of encoding Unicode characters?
In addition to theoretical worst-case scenarios, it might also be worthwhile 
to consider the practical limitations of certain encoders.  SCSU does not 
require an encoder to utilize its entire syntax, so in the 
extreme case, a maximally stupid SCSU "compressor" could simply quote every 
character as Unicode:
    SQU hi-byte lo-byte SQU hi-byte lo-byte ...
This would result in a uniform 50% expansion over UTF-16, which is pretty bad.
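Just to illustrate, here is a rough Python sketch of such a do-nothing 
"compressor."  The SQU tag value 0x0E is taken from the SCSU specification; 
everything else is my own illustration, not anyone's production encoder:

    SQU = 0x0E  # single-byte-mode tag: quote next two bytes as a UTF-16 code unit

    def naive_scsu_encode(text):
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:
                raise ValueError("sketch handles BMP characters only")
            out.append(SQU)        # one tag byte ...
            out.append(cp >> 8)    # ... plus hi-byte
            out.append(cp & 0xFF)  # ... plus lo-byte = 3 bytes per character
        return bytes(out)

    # len(naive_scsu_encode("hello")) == 15, vs. 10 bytes in UTF-16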
On a more realistic level, even "good" SCSU encoders are not required by the 
specification to be infinitely intelligent and clever in their encoding.  For 
example, I think my encoder is pretty decent, but it encodes the Japanese 
example in UTS #6 in 180 bytes rather than the 178 bytes illustrated in the 
report.  This is because the Japanese data contains a couple of sequences of 
the form <kanji><kana><kanji>, where <kanji> are not compressible and <kana> 
are.  If there is only one <kana> between the two <kanji>, as in this case, 
it is more efficient to just stay in Unicode mode for the <kana> rather than 
switching modes.  My encoder isn't currently bright enough to figure this 
out.  In the worst case, then, a long-enough sequence of 
<kana><kanji><kana><kanji>... would take 5 bytes for every 2 BMP characters.
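For the record, here is the arithmetic as I figure it, assuming the usual 
one-byte costs for the UCn and SCU tags and a single-byte window that covers 
the kana:

    Stay in Unicode mode:      <kana> 2 + <kanji> 2                 = 4 bytes
    Switch around the <kana>:  UCn 1 + <kana> 1 + SCU 1 + <kanji> 2 = 5 bytes

So an encoder that switches modes for every isolated <kana> pays an extra 
byte per pair, which is where the 5-bytes-per-2-characters figure comes from.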
-Doug Ewell
 Fullerton, California