Re: SCSU implementations

From: Adrian Havill (havill@turbolinux.co.jp)
Date: Fri Apr 07 2000 - 13:22:07 EDT


Doug Ewell wrote:

> I'm looking for any references to encoders for the Standard Compression
> Scheme for Unicode (SCSU), *other* than the Java implementation on the
> Unicode FTP site.
>
> Either C or C++ source code or a Windows/DOS executable would be most
> welcome.

<URL:ftp://ftp.turbolinux.co.jp/pub/fugu/fugu.tar.gz>

Yet another iconv-like library, with the specific goal of being
source-portable to _any_ Standard C environment. (The code was originally
designed for the AS/400 EBCDIK environment.) Makefiles and batch scripts
are included for GNU environments and MS NT nmake/cl environments (I
haven't maintained the MS batch/nmake scripts for some time, although the
GNU makefiles should work in a Cygwin environment).

The SCSU code you want is in fugu/source/cesxfrm/uxfrm.c in the functions
prefixed with UZI (from SCSU to UTF-32) and UZO (from UTF-32 to SCSU).

To use the example iconv/native2ascii-like app with SCSU (using conversion
to/from UTF-8 as an example, although any encoding/charset is allowed), do:

ucconv -ie utf-8 -oe x-scsu < in.utf-8.txt > out.scsu.txt

or

ucconv -ie x-scsu -oe utf-8 < in.scsu.txt > out.utf-8.txt

I haven't checked the Java source in a while, but if memory serves, the
Java source only used one or two "windows"/registers, rather than all
eight.

The C code above uses all eight and uses an LRU strategy to avoid
redefining windows more than necessary. Also, the code can do the RLE
compression of repeating characters and the escaping of controls mentioned
in the TR, as well as the U+FEFF signature recommendations mentioned in
the revised TR.
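
To give a rough idea of what an LRU window strategy looks like, here is a
minimal sketch. It is not the code from uxfrm.c; the names WindowSet,
windows_init and windows_select are made up for illustration. It starts
from the default dynamic window positions in the TR and, as a simplifying
assumption, redefines a window to the code point rounded down to a multiple
of 0x80. A real encoder has to pick from the offsets the TR actually allows
and emit the matching SDn/SCn tags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCSU_WINDOWS     8
#define SCSU_WINDOW_SIZE 0x80u

/* Hypothetical bookkeeping for the eight dynamic windows. */
typedef struct {
    uint32_t      base[SCSU_WINDOWS];      /* first code point in each window */
    unsigned long last_used[SCSU_WINDOWS]; /* logical clock at last use */
    unsigned long clock;
} WindowSet;

/* Start from the default dynamic window positions given in the TR. */
static void windows_init(WindowSet *ws)
{
    static const uint32_t defaults[SCSU_WINDOWS] = {
        0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
    };
    int i;

    assert(ws != NULL);
    for (i = 0; i < SCSU_WINDOWS; i++) {
        ws->base[i] = defaults[i];
        ws->last_used[i] = 0;
    }
    ws->clock = 0;
}

/*
 * Return the index of a window covering cp. If none covers it, redefine
 * the least recently used window (lowest index wins ties) and set
 * *redefined to 1 so the caller knows a define-window tag is needed.
 * Assumption: the new base is cp rounded down to a multiple of 0x80.
 */
static int windows_select(WindowSet *ws, uint32_t cp, int *redefined)
{
    int i, victim = 0;

    assert(ws != NULL && redefined != NULL);
    ws->clock++;

    for (i = 0; i < SCSU_WINDOWS; i++) {
        if (cp >= ws->base[i] && cp - ws->base[i] < SCSU_WINDOW_SIZE) {
            ws->last_used[i] = ws->clock;
            *redefined = 0;
            return i;
        }
    }

    for (i = 1; i < SCSU_WINDOWS; i++) {
        if (ws->last_used[i] < ws->last_used[victim])
            victim = i;
    }
    ws->base[victim] = cp & ~(uint32_t)(SCSU_WINDOW_SIZE - 1);
    ws->last_used[victim] = ws->clock;
    *redefined = 1;
    return victim;
}
```

With this, text that alternates between several scripts keeps reusing the
windows it touched most recently, and only the stalest window gets
redefined when a new range shows up.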

By using all eight windows/registers, I could sometimes produce smaller
SCSU output than the Java reference code when the source text used more
than one range of Unicode outside of the predefined ranges.

> I would also be interested to know if anyone on the list has implemented
> (or tried to implement) a SCSU encoder and what problems they ran into.

Testing. There isn't a lot of real world SCSU code out there, so you
basically have to test against the examples in the TR and against what the
Java reference code does.
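
As a sketch of what testing against the TR examples can look like, the
snippet below decodes only the trivial subset of SCSU that needs no tags:
ASCII bytes pass through, and bytes 0x80-0xFF map through the initial
active window at U+0080. The function name scsu_decode_initial is made up
for illustration; anything that needs a tag byte is rejected rather than
handled. The German sample "Öl fließt" from the TR, as I recall it, encodes
to D6 6C 20 66 6C 69 65 DF 74, which makes a convenient first check (do
verify the bytes against the TR itself).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode SCSU input that stays in the initial single-byte state.
 * Returns the number of code points written, or -1 if a tag byte is
 * seen or the output buffer is too small. This is only a test aid,
 * not a full decoder.
 */
static long scsu_decode_initial(const unsigned char *in, size_t in_len,
                                uint32_t *out, size_t out_cap)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < in_len; i++) {
        unsigned char b = in[i];
        uint32_t cp;

        if (b >= 0x80) {
            /* Initial active window is window 0 at U+0080. */
            cp = 0x0080u + (uint32_t)(b - 0x80);
        } else if (b >= 0x20 || b == 0x00 || b == 0x09 ||
                   b == 0x0A || b == 0x0D) {
            /* ASCII and the controls that pass through unchanged. */
            cp = b;
        } else {
            return -1; /* tag byte: outside the scope of this sketch */
        }

        if (n == out_cap)
            return -1;
        out[n++] = cp;
    }
    return (long)n;
}
```

Feeding the nine German bytes through this should give back U+00D6 U+006C
U+0020 U+0066 U+006C U+0069 U+0065 U+00DF U+0074. The same harness shape
works for the other TR samples once the decoder under test handles tags.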


