Re: UTF16 <=> Reuters format?

From: Roman Czyborra
Date: Wed Sep 30 1998 - 09:38:58 EDT

Dear Arnt:

> I'm looking for something related: reference code or the algorithm
> for converting between UTF16 and the compact Reuters format, which
> I've heard is either part of Unicode 2.1 or scheduled to become part
> of Unicode. Is that available anywhere?

Go one directory up and take the exit "SCSU". That directory contains a
sample Java implementation which can convert text files back and forth
between UTF-16 and SCSU.

I think that initialDynamicOffset[1] in has to be changed
from 0x0100, // Latin Extended A
into 0x00C0, // combined partial Latin-1/-A
to align it with
but I haven't heard the final word on this.
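
To make concrete what that offset controls, here is a minimal Python sketch
of how a dynamic window maps a byte to a character in SCSU single-byte mode.
The helper name window_char is hypothetical and is not taken from the sample
implementation; the only assumption is the usual SCSU rule that a dynamic
window covers 128 consecutive code points starting at its offset.

```python
def window_char(offset: int, byte: int) -> str:
    """Map a byte 0x80..0xFF to a character through a dynamic window.

    A dynamic window covers the 128 code points offset .. offset + 0x7F.
    """
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("dynamic windows only apply to bytes 0x80..0xFF")
    return chr(offset + (byte - 0x80))


if __name__ == "__main__":
    # Compare the two candidate initial offsets for window 1.
    for offset in (0x0100, 0x00C0):
        first = ord(window_char(offset, 0x80))
        last = ord(window_char(offset, 0xFF))
        print(f"offset {offset:#06x} covers U+{first:04X}..U+{last:04X}")
```

With 0x0100 the window reaches U+0100..U+017F, while with 0x00C0 it reaches
U+00C0..U+013F, that is the Latin-1 letters plus the first half of
Latin Extended-A.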

The SCSU/*.java user interface is a bit object-oriented:

        $ cd SCSU
        $ javac *.java
        $ echo test > test.csu
        $ java CompressMain /expand test.csu < /dev/null
        Expanded test.csu: 5 bytes to test.txt 5 chars. Ratio: 200%.
        Done. Press enter to exit
        $ od -t x1 test.txt
        0000000 fe ff 00 74 00 65 00 73 00 74 00 0a
        $ java CompressMain /compress test.txt < /dev/null
        Compressed test.txt: 5 chars to test.csu 5 bytes. Ratio: 50%.
        Done. Press enter to exit
        $ od -t x1 test.csu
        0000000 74 65 73 74 0a

If you also take a look at you will find a
simpler decompressor in C that translates
SCSU standard input into UTF-8 standard output and does not require
you to store the text in files nor to use certain extensions:

        $ echo schön | scsu
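
As a rough illustration of what such a filter has to do, here is a Python
sketch (not the C program itself) that decodes only a small subset of SCSU:
literal bytes in single-byte mode with the initial window 0 at offset 0x0080,
the SCU tag, and plain or UQU-quoted code units in Unicode mode. The function
names are hypothetical, and every other tag is rejected rather than handled.

```python
SCU = 0x0F  # single-byte mode tag: change to Unicode mode
UQU = 0xF0  # Unicode mode tag: quote the next code unit
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}  # control bytes that are not tags


def scsu_subset_to_utf16be(data: bytes) -> bytes:
    """Decode a small SCSU subset into UTF-16BE code units (no BOM)."""
    out = bytearray()
    unicode_mode = False
    i = 0
    while i < len(data):
        b = data[i]
        if not unicode_mode:
            if b == SCU:
                unicode_mode = True
            elif b >= 0x20 or b in PASS_THROUGH:
                # ASCII, or window 0 at offset 0x0080: the code point
                # equals the byte value in both cases.
                out += bytes((0x00, b))
            else:
                raise ValueError(f"unsupported single-byte tag {b:#04x}")
            i += 1
        else:
            if b == UQU:
                i += 1  # the quoted code unit follows the tag
            elif 0xE0 <= b <= 0xF2:
                raise ValueError(f"unsupported Unicode-mode tag {b:#04x}")
            unit = data[i:i + 2]
            if len(unit) != 2:
                raise ValueError("truncated code unit")
            out += unit
            i += 2
    return bytes(out)


def scsu_subset_to_utf8(data: bytes) -> bytes:
    """Same subset, re-encoded as UTF-8."""
    return scsu_subset_to_utf16be(data).decode("utf-16-be").encode("utf-8")
```

Under these assumptions ASCII and Latin-1 input decodes to the same
characters, so the Latin-1 bytes of "schön" come out as the UTF-8 encoding
of "schön".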

You could easily change its putwchar function to output UTF-16 instead
of UTF-8, see
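
Independently of the C source, the change amounts to emitting one or two
16-bit code units per character instead of a UTF-8 byte sequence. A minimal
Python sketch, assuming big-endian output and using the hypothetical name
utf16be_units in place of the real putwchar:

```python
def utf16be_units(code_point: int) -> bytes:
    """Encode one Unicode scalar value as big-endian UTF-16."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit code unit.
        return code_point.to_bytes(2, "big")
    # Supplementary planes: split the 20-bit value into a surrogate pair.
    v = code_point - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


if __name__ == "__main__":
    print(utf16be_units(ord("t")).hex())  # BMP character
    print(utf16be_units(0x10000).hex())   # first supplementary character
```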

I did not yet program a compressor to SCSU because it is probably an
exercise in combinatorial optimization to do that well and I currently
find other questions more important.

A trivial UTF-16 to SCSU converter simply replaces the byte order mark
with an SCU tag (0x0F, change to Unicode mode) at the beginning:

        $ LC_ALL=C sed '1s/^\(\xfe\xff\)*/\x0f/' test.txt | od -t x1
        0000000 0f 00 74 00 65 00 73 00 74 00 0a
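
The same idea as a Python sketch, for cases where sed is not at hand. It
assumes big-endian UTF-16 input and goes one step beyond the one-liner: code
units whose high byte falls in 0xE0..0xF2 would be read as tags in Unicode
mode, so they are prefixed with UQU (0xF0). The function name is
hypothetical.

```python
SCU = 0x0F  # single-byte mode tag: change to Unicode mode
UQU = 0xF0  # Unicode mode tag: quote the next code unit


def utf16be_to_trivial_scsu(utf16: bytes) -> bytes:
    """Wrap big-endian UTF-16 as SCSU without compressing anything."""
    if utf16[:2] == b"\xfe\xff":
        utf16 = utf16[2:]  # drop the byte order mark
    if len(utf16) % 2:
        raise ValueError("odd number of bytes in UTF-16 input")
    out = bytearray([SCU])
    for i in range(0, len(utf16), 2):
        if 0xE0 <= utf16[i] <= 0xF2:
            # High bytes in this range are tags in Unicode mode,
            # so the code unit has to be quoted.
            out.append(UQU)
        out += utf16[i:i + 2]
    return bytes(out)


if __name__ == "__main__":
    sample = bytes.fromhex("feff0074006500730074000a")  # test.txt from above
    print(utf16be_to_trivial_scsu(sample).hex(" "))
```

This never makes the text smaller; it only yields a stream that an SCSU
decoder should accept.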

