Dear Arnt:
> > ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/
> I'm looking for something related: reference code or the algorithm
> for converting between UTF16 and the compact Reuters format, which
> I've heard is either part of Unicode 2.1 or scheduled to become part
> of Unicode. Is that available anywhere?
Go one directory up and take the exit "SCSU".
ftp://ftp.unicode.org/Public/PROGRAMS/SCSU/ contains a sample Java
implementation which can convert text files back and forth between
UTF-16 and SCSU.
I think that initialDynamicOffset[1] in SCSU.java has to be changed
from 0x0100, // Latin Extended A
to 0x00C0, // combined partial Latin-1/-A
so that it matches http://www.unicode.org/unicode/reports/tr6.html
but I haven't heard the final word on this.
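For reference, here are the initial dynamic window offsets as I read them in TR6, written down as a small table. This is only a sketch in Python, not a patch for SCSU.java; the names INITIAL_DYNAMIC_OFFSET and windows_covering are made up for illustration.

```python
# Initial dynamic window offsets as listed in TR6.
# Each dynamic window spans 128 code points starting at its offset.
INITIAL_DYNAMIC_OFFSET = [
    0x0080,  # Latin-1 Supplement
    0x00C0,  # combined partial Latin-1/Latin Extended-A
    0x0400,  # Cyrillic
    0x0600,  # Arabic
    0x0900,  # Devanagari
    0x3040,  # Hiragana
    0x30A0,  # Katakana
    0xFF00,  # Fullwidth ASCII
]


def windows_covering(code_point, offsets=INITIAL_DYNAMIC_OFFSET):
    """Return the indices of all dynamic windows that contain code_point."""
    return [i for i, base in enumerate(offsets)
            if base <= code_point < base + 0x80]
```

With this table, windows_covering(0x00F6) gives [0, 1], so the ö of "schön" is reachable from both of the first two windows, while U+0131 is reachable only through window 1.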
The SCSU/*.java user interface is a bit object-oriented:
$ cd SCSU
$ javac *.java
$ echo test > test.csu
$ java CompressMain /expand test.csu < /dev/null
Expanded test.csu: 5 bytes to test.txt 5 chars. Ratio: 200%.
Done. Press enter to exit
$ od -t x1 test.txt
0000000 fe ff 00 74 00 65 00 73 00 74 00 0a
0000014
$ java CompressMain /compress test.txt < /dev/null
Compressed test.txt: 5 chars to test.csu 5 bytes. Ratio: 50%.
Done. Press enter to exit
$ od -t x1 test.csu
0000000 74 65 73 74 0a
0000005
If you also take a look at http://czyborra.com/scsu/ you will find a
simpler decompressor, http://czyborra.com/scsu/scsu.c, written in C.
It translates SCSU on standard input into UTF-8 on standard output and
requires you neither to store the text in files nor to use particular
file name extensions:
$ echo schön | scsu
schön
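To show why plain Latin-1 text like the above already is valid SCSU, here is a rough sketch of an expander for a small subset of the format. It is not a transcription of scsu.c; the function name scsu_expand_subset is hypothetical, the tag values are taken from my reading of TR6, and every tag outside the handled subset (SQn, SQU, SDn, SDX, UDn, UDX) is simply rejected.

```python
INITIAL_DYNAMIC_OFFSET = [0x0080, 0x00C0, 0x0400, 0x0600,
                          0x0900, 0x3040, 0x30A0, 0xFF00]


def scsu_expand_subset(data):
    """Expand SCSU bytes to a str; handles literals, SCn, SCU, UCn, UQU only."""
    offsets = list(INITIAL_DYNAMIC_OFFSET)
    active = 0             # dynamic window 0 is active at the start
    unicode_mode = False   # SCSU starts in single-byte mode
    units = []             # collected UTF-16 code units
    i = 0
    while i < len(data):
        b = data[i]
        if unicode_mode:
            if 0xE0 <= b <= 0xE7:      # UCn: back to single-byte mode, window n
                active = b - 0xE0
                unicode_mode = False
                i += 1
                continue
            if b == 0xF0:              # UQU: the next two bytes are a literal unit
                i += 1
            elif 0xE8 <= b <= 0xF2:    # UDn, UDX, reserved: outside this subset
                raise ValueError("unsupported Unicode-mode tag 0x%02X" % b)
            if i + 1 >= len(data):
                raise ValueError("truncated code unit")
            units.append((data[i] << 8) | data[i + 1])
            i += 2
        else:
            if b in (0x00, 0x09, 0x0A, 0x0D) or 0x20 <= b <= 0x7F:
                units.append(b)                            # passed through as is
            elif b >= 0x80:
                units.append(offsets[active] + b - 0x80)   # active dynamic window
            elif 0x10 <= b <= 0x17:
                active = b - 0x10                          # SCn: select window n
            elif b == 0x0F:
                unicode_mode = True                        # SCU
            else:
                raise ValueError("unsupported single-byte tag 0x%02X" % b)
            i += 1
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be")
```

For the Latin-1 bytes of the example, scsu_expand_subset(b"sch\xf6n\n") returns "schön\n": byte 0xF6 falls into the active window 0 at offset 0x0080 and therefore maps to U+00F6.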
You could easily change its putwchar function to output UTF-16 instead
of UTF-8; see http://czyborra.com/utf/#UTF-16
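What such an output routine has to do can be sketched as follows. This is Python rather than C, and put_utf16_be is a hypothetical stand-in, not the putwchar from scsu.c; it assumes the caller hands it one full code point at a time and wants big-endian output without a byte order mark.

```python
def put_utf16_be(code_point):
    """Return the UTF-16 big-endian bytes for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("lone surrogates are not encodable")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit code unit.
        return code_point.to_bytes(2, "big")
    # Supplementary planes: split the 20-bit value into a surrogate pair.
    v = code_point - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

For example, put_utf16_be(0xF6) yields the two bytes 00 f6, and put_utf16_be(0x10000) yields d8 00 dc 00.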
I have not yet written a compressor to SCSU because doing that well is
probably an exercise in combinatorial optimization, and I currently
find other questions more important.
A trivial UTF-16 to SCSU converter simply inserts an SCU tag (0x0F,
change to Unicode mode) at the beginning:
$ sed '1s/^\(þÿ\)*/\x0f/' test.txt | od -t x1
0000000 0f 00 74 00 65 00 73 00 74 00 0a
0000013
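The same trivial conversion can be sketched in a few lines of Python. The function name utf16be_to_trivial_scsu is hypothetical, and the input is assumed to be big-endian UTF-16 with an optional FE FF byte order mark. One caveat that the sed one-liner ignores: as I read TR6, bytes E0 to F2 are tags in Unicode mode, so a code unit whose high byte falls in that range has to be quoted with UQU (0xF0).

```python
SCU = 0x0F  # single-byte mode tag: change to Unicode mode
UQU = 0xF0  # Unicode mode tag: quote the following code unit


def utf16be_to_trivial_scsu(data):
    """Wrap big-endian UTF-16 bytes as SCSU without any real compression."""
    if len(data) % 2:
        raise ValueError("UTF-16 input must have an even number of bytes")
    if data[:2] == b"\xfe\xff":
        data = data[2:]                 # drop the byte order mark
    out = bytearray([SCU])              # stay in Unicode mode for the whole text
    for i in range(0, len(data), 2):
        if 0xE0 <= data[i] <= 0xF2:     # high byte would be read as a tag
            out.append(UQU)
        out += data[i:i + 2]
    return bytes(out)
```

Fed the twelve bytes of test.txt from above, this returns the same eleven bytes that od shows: 0f 00 74 00 65 00 73 00 74 00 0a.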
Cheers
Roman