Compressing Vietnamese with SCSU

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Apr 18 2000 - 00:06:39 EDT


According to Unicode Technical Report #6, the Standard Compression
Scheme for Unicode (SCSU) was designed to compress text written in
"languages using small alphabets [that] contain runs of characters
that are coded close together in Unicode." This includes almost all
languages written in the Latin script, but one notable exception seems
to be Vietnamese.

Using pre-composed characters, Vietnamese employs characters from four
Unicode blocks (in addition to Basic Latin):

    U+00C0 to U+00FD Latin-1 Supplement
    U+0102 to U+0169 Latin Extended-A
    U+01A0 to U+01B0 Latin Extended-B
    U+1EA0 to U+1EF9 Latin Extended Additional

This requires the use of at least four windows in SCSU.
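
As a rough check of the window count, here is a small Python sketch. It
assumes each dynamic window covers 128 code points and starts at a multiple
of 0x80, which is the general form of SCSU window offsets in this part of
the code space; it ignores the special offsets (such as 0x00C0) that the
scheme also allows. The names VIETNAMESE_SAMPLES, window_base and
windows_needed are made up for illustration.

```python
# A few code points from each of the four ranges listed above.
VIETNAMESE_SAMPLES = [
    0x00C0, 0x00FD,  # Latin-1 Supplement
    0x0102, 0x0169,  # Latin Extended-A
    0x01A1, 0x01B0,  # Latin Extended-B
    0x1EA0, 0x1EF9,  # Latin Extended Additional
]


def window_base(cp: int) -> int:
    """Return the start of the 0x80-aligned, 128-code-point window holding cp."""
    return cp & ~0x7F


def windows_needed(code_points):
    """Return the sorted window bases required for the non-ASCII code points."""
    return sorted({window_base(cp) for cp in code_points if cp >= 0x80})


if __name__ == "__main__":
    for base in windows_needed(VIETNAMESE_SAMPLES):
        print(f"window at U+{base:04X}")
```

Under these assumptions the samples fall into four distinct windows, one
per block, and no single 128-code-point window spans two of the ranges.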

I tested my newly written encoder with a fairly large Vietnamese file
(59,871 characters) and came up with the following results:

    VISCII: 59,871 bytes
    UTF-16: 119,744 bytes (including BOM)
    UTF-8: 79,269 bytes
    SCSU: 67,781 bytes

The SCSU encoding is more than 13% larger than the legacy encoding
(VISCII). I was hoping to do better than that. Almost all of the
additional bytes are tag bytes for switching windows.
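
For reference, the overhead figures can be recomputed from the sizes
reported above. This is only arithmetic on those four numbers; the names
sizes and overhead_percent are illustrative.

```python
# File sizes in bytes, as reported above.
sizes = {
    "VISCII": 59_871,
    "UTF-16": 119_744,  # includes a 2-byte BOM
    "UTF-8": 79_269,
    "SCSU": 67_781,
}


def overhead_percent(size: int, baseline: int) -> float:
    """Return how much larger size is than baseline, as a percentage."""
    return 100.0 * (size - baseline) / baseline


if __name__ == "__main__":
    base = sizes["VISCII"]
    for name, size in sizes.items():
        extra = size - base
        print(f"{name}: {extra} extra bytes, {overhead_percent(size, base):.1f}% over VISCII")
```

The SCSU file is 7,910 bytes longer than the VISCII file, which is where
the "more than 13%" figure comes from.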

My encoder makes very limited use of non-locking shifts. It uses them
only with static windows, and only when the character being encoded fits
in a static window and does NOT fit in any currently defined dynamic
window. It emitted only 148 single-quote (SQn) tags for this 59K text file.

The problem is, none of the four extended-Latin blocks is used much less
frequently than the other three, so none is an obvious candidate for
automatic single-quoting. It would be necessary to look ahead in the
file, perhaps arbitrarily far, to determine the best strategy.
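
To make the trade-off concrete, here is a minimal Python sketch of one
possible lookahead rule. It is not what my encoder does. It assumes all
four windows are already defined, that window 0 is active at the start,
that a locking shift (SCn) and a non-locking shift (SQn) each cost one tag
byte, and that Basic Latin characters need no window at all. The rule is:
quote a character if the next non-ASCII character falls back in the
currently active window, otherwise switch windows. The function name
tag_bytes and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def tag_bytes(windows: Sequence[Optional[int]], lookahead: bool) -> int:
    """Count the tag bytes needed for a sequence of per-character window ids.

    windows[i] is the id of the dynamic window holding character i, or None
    for a Basic Latin character. With lookahead=False every change of window
    uses a locking shift; with lookahead=True a character is quoted instead
    when the next non-ASCII character returns to the active window.
    """
    active = 0  # assumption: window 0 is active at the start
    tags = 0
    for i, w in enumerate(windows):
        if w is None or w == active:
            continue  # no tag needed
        # Window of the next non-ASCII character, if there is one.
        nxt = next((x for x in windows[i + 1:] if x is not None), None)
        if lookahead and nxt == active:
            tags += 1  # SQn: quote this one character, active window unchanged
        else:
            tags += 1  # SCn: lock into the new window
            active = w
    return tags


if __name__ == "__main__":
    sample = [0, 1, 0, 1, 0]
    print(tag_bytes(sample, lookahead=False), tag_bytes(sample, lookahead=True))
```

For an alternating pattern such as windows 0, 1, 0, 1, 0 this rule halves
the number of tag bytes, but it only looks one character ahead. Real
Vietnamese text mixes all four windows, so a better choice may depend on
text much further along.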

Bottom line: I am trying to decide whether to improve my encoder's use
of non-locking shifts to try to get better compression of Vietnamese
(a language which I believe will benefit greatly from increased use of
Unicode), or whether this rather mediocre compression is unavoidable
due to the widely scattered Vietnamese code points in Unicode.

Has anyone who has implemented an SCSU encoder tested it with Vietnamese?
Would anyone be willing to do so? I can provide the 59K text file to
anyone who is interested.

-Doug Ewell
 Fullerton, California


