Re: SCSU implementations

From: Asmus Freytag
Date: Sun Apr 09 2000 - 15:33:03 EDT

At 12:18 AM 4/7/00 -0800, Adrian Havill wrote:

>I haven't checked the Java source in a while, but if memory serves, the
>Java source only used one or two "windows"/registers, rather than all

Despite the comment, the Java code now uses all eight, but in a round robin

>The code in the C code above uses all eight and uses a LRU strategy to
>avoid redefining new windows more than necessary.

>By using all eight windows/registers, I could sometimes beat the Java
>reference code in terms of output SCSU when the source text used more than
>one range of Unicode outside of the predefined ranges.

I would love to learn of a realistic scenario where this situation occurs.
It's trivially possible to generate test strings that work to the strengths
(or design features) of a given compressor, but the task is to create one
that works well with real-life text.
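
To make the difference between the two replacement policies under discussion
concrete, here is a minimal sketch. It is not taken from either implementation:
the class WindowPolicyDemo, its methods, and the input sequence are all
hypothetical, and the cursor starts at window 0 rather than 3 for simplicity.
The input is exactly the kind of synthetic string described above (one
recurring block alternating with blocks that are each used only once), so it
shows how the policies can differ, not how often they differ on real text.

```java
import java.util.Arrays;

/** Hypothetical harness, not part of either SCSU implementation. */
class WindowPolicyDemo {

    /** Counts how often a window must be redefined for a sequence of window offsets. */
    static int countRedefinitions(int[] blocks, int windows, boolean lru) {
        int[] offset = new int[windows];      // offset currently held by each window
        long[] lastUsed = new long[windows];  // 0 = never used
        Arrays.fill(offset, -1);              // -1 = window not yet defined
        int next = 0;                         // round-robin cursor
        int redefinitions = 0;
        long clock = 0;

        for (int block : blocks) {
            clock++;
            int win = -1;
            for (int w = 0; w < windows; w++) {
                if (offset[w] == block) { win = w; break; }
            }
            if (win < 0) {
                if (lru) {
                    // pick the window whose last use lies furthest in the past
                    win = 0;
                    for (int w = 1; w < windows; w++) {
                        if (lastUsed[w] < lastUsed[win]) win = w;
                    }
                } else {
                    // round robin: take the next window regardless of usage
                    win = next;
                    next = (next + 1) % windows;
                }
                offset[win] = block;
                redefinitions++;
            }
            lastUsed[win] = clock;
        }
        return redefinitions;
    }

    /** Synthetic input: one recurring block alternating with n blocks used once each. */
    static int[] hotBlockSequence(int n) {
        int[] blocks = new int[2 * n];
        for (int i = 0; i < n; i++) {
            blocks[2 * i] = 0x0400;                 // the recurring block
            blocks[2 * i + 1] = 0x1000 + i * 0x80;  // a block never seen again
        }
        return blocks;
    }

    public static void main(String[] args) {
        int[] blocks = hotBlockSequence(16);
        System.out.println("round robin: " + countRedefinitions(blocks, 8, false));
        System.out.println("LRU:         " + countRedefinitions(blocks, 8, true));
    }
}
```

With eight windows and this input, round robin needs 18 redefinitions and LRU
needs 17: round robin eventually overwrites the recurring block's window and
has to define it again, while LRU never evicts it.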

>Also, the code can do the
>RLE compression of repeating characters and the escaping of controls
>mentioned in the TR, as well as the U+FEFF sig recommendations mentioned in
>the revised TR.

Some of these features would not be interpreted correctly by vanilla decoders,
so it's important to use them only where the same extensions are supported on
both ends.

>Testing. There isn't a lot of real world SCSU code out there, so you
>basically have to test against the examples in the TR and against what the
>Java reference code does.

I'd love to learn of anything that people have discovered by testing against my


PS: I have not been able to double check my local copy of the SCSU final code
against the server copy - but I'm convinced they are the same. Just in case,
I've appended the source code here (including the slightly out of date

As you can see, the compressor now starts with window 3 and cycles through
the windows as they are redefined. No attempt is made to take account of which
other windows have been used - this is based on the idea that SCSU texts that
require lots of redefined windows most often need only a few, and then don't
use a lot of predefined windows. Real-life test data that contradict these
assumptions would be

     static int iNextWindow = 3;

     /** redefine a window so it surrounds a given character value
         For now, this function uses window 3 exclusively (window 4
         for extended windows);
         @return true if a window was successfully defined
         @param ch - character around which window is positioned
         @param out - output byte array
         @param iOut - starting offset in output byte array
     */
     private boolean positionWindow(int ch, byte [] out, boolean fUnicodeMode)
         throws IllegalInputException, EndOfOutputException
     {
         int iWin = iNextWindow % 8; // simple LRU
         int iPosition = 0;

         // iPosition 0 is a reserved value
         if (ch < 0x80)
         {
             throw new Assert("ch < 0x80");
             //return false;
         }

         // Check the fixed offsets
         for (int i = 0; i < fixedOffset.length; i++)
         {
             if (ch >= fixedOffset[i] && ch < fixedOffset[i] + 0x80)
             {
                 iPosition = i;
             }
         }

         if (iPosition != 0)
         {
             // DEBUG
             Debug.out("FIXED position is ", iPosition + 0xF9);

             // ch fits in a fixed offset window position
             dynamicOffset[iWin] = fixedOffset[iPosition];
             iPosition += 0xF9;
         }
         else if (ch < 0x3400)
         {
             // calculate a window position command and set the offset
             iPosition = ch >>> 7;
             dynamicOffset[iWin] = ch & 0xFF80;

             // DEBUG
             Debug.out("iPosition="+iPosition+" for char", ch);
         }
         else if (ch < 0xE000)
         {
             // attempt to place a window where none can go
             return false;
         }
         else if (ch <= 0xFFFF)
         {
             // calculate a window position command, accounting
             // for the gap in position values, and set the offset
             iPosition = ((ch - gapOffset)>>> 7);

             dynamicOffset[iWin] = ch & 0xFF80;

             // DEBUG
             Debug.out("iPosition="+iPosition+" for char", ch);
         }
         else
         {
             // if we get here, the character is in the extended range.
             // Always use Window 4 to define an extended window

             iPosition = (ch - 0x10000) >>> 7;
             // DEBUG
             Debug.out("Try position Window at ", iPosition);

             iPosition |= iWin << 13;
             dynamicOffset[iWin] = ch & 0x1FFF80;
         }

         // Outputting window definition command for the general cases
         if ( iPosition < 0x100 && iOut < out.length-1)
         {
             out[iOut++] = (byte) ((fUnicodeMode ? UD0 : SD0) + iWin);
             out[iOut++] = (byte) (iPosition & 0xFF);
         }
         // Output an extended window definition command
         else if ( iPosition >= 0x100 && iOut < out.length - 2)
         {
             Debug.out("Setting extended window at ", iPosition);
             out[iOut++] = (byte) (fUnicodeMode ? UDX : SDX);
             out[iOut++] = (byte) ((iPosition >>> 8) & 0xFF);
             out[iOut++] = (byte) (iPosition & 0xFF);
         }
         else
         {
             throw new EndOfOutputException();
         }
         return true;
     }
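
The position arithmetic in positionWindow can be tried out in isolation. The
following is a minimal, stand-alone sketch, not the reference code: the class
WindowPositionSketch and its method names are hypothetical, and the table
contents and gap value are the ones given in the TR, since the excerpt above
does not show fixedOffset or gapOffset. Unlike the posted code, the sketch uses
-1 rather than 0 as its "no match" marker and stops at the first matching fixed
window. It also includes the reverse mapping a decoder would apply, so each
result can be checked by a round trip.

```java
/** Hypothetical stand-alone version of the window-position arithmetic. */
class WindowPositionSketch {
    // Table values as given in the TR; the posted excerpt does not show them.
    static final int[] FIXED_OFFSET = {0x00C0, 0x0250, 0x0370, 0x0530, 0x3040, 0x30A0, 0xFF60};
    static final int GAP_OFFSET = 0xAC00;

    /** Encoder side: one-byte argument for SDn/UDn, or -1 if no BMP window can cover ch. */
    static int positionByte(int ch) {
        if (ch < 0x80 || ch > 0xFFFF) return -1;
        for (int i = 0; i < FIXED_OFFSET.length; i++) {
            if (ch >= FIXED_OFFSET[i] && ch < FIXED_OFFSET[i] + 0x80) return 0xF9 + i;
        }
        if (ch < 0x3400) return ch >>> 7;     // positions 0x01..0x67
        if (ch < 0xE000) return -1;           // no window can be placed here
        return (ch - GAP_OFFSET) >>> 7;       // positions 0x68..0xA7
    }

    /** Decoder side: window offset selected by a position byte, or -1 if reserved. */
    static int offsetFor(int position) {
        if (position >= 0xF9 && position <= 0xFF) return FIXED_OFFSET[position - 0xF9];
        if (position >= 0x01 && position <= 0x67) return position << 7;
        if (position >= 0x68 && position <= 0xA7) return (position << 7) + GAP_OFFSET;
        return -1;
    }

    /** Two-byte argument for SDX/UDX: window number in the top 3 bits, offset in the low 13. */
    static int extendedArgument(int ch, int window) {
        return (window << 13) | ((ch - 0x10000) >>> 7);
    }

    public static void main(String[] args) {
        int[] samples = {0x0410, 0x3042, 0xE000, 0xFF61};
        for (int ch : samples) {
            int pos = positionByte(ch);
            System.out.printf("U+%04X -> position 0x%02X, offset 0x%04X%n",
                              ch, pos, offsetFor(pos));
        }
    }
}
```

For example, U+0410 maps to position 0x08 (offset 0x0400), U+3042 falls in the
fixed window 0xFD (offset 0x3040), and U+4E00 has no window at all, which is
the case where positionWindow returns false.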
