From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 07 2007 - 11:51:48 CDT
Philippe Verdy wrote on Monday, May 07, 2007 8:53 AM
Subject: RE: UTN #31 and direct compression of code points
> In fact I am a bit puzzled by the comment on the second line of the sample
> code below:
> length = DecodeLength(&input);
> offset = DecodeOffset(&input); // same algorithm as DecodeLength
> For encoding the offset (in matches only), what is the use of bits 7 and
> 6?
> Couldn't we store up to 7 bits of the offset value (instead of 6 bits) in
> the same byte without requiring an extra byte?
>
> If so, the two functions DecodeLength() and DecodeOffset() need to be
> different.
Perhaps the gain is small.
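
To put a rough number on that gain, here is a minimal sketch. It does not reproduce the byte layout from UTN #31. It assumes a hypothetical variable-length integer whose first byte carries either 6 or 7 payload bits and whose continuation bytes each carry 7, then counts the bytes needed. The function name `encoded_size` and both parameters are made up for illustration.

```python
def encoded_size(value: int, first_payload_bits: int, next_payload_bits: int = 7) -> int:
    """Bytes needed for a non-negative integer in a HYPOTHETICAL varint.

    Assumption (not taken from UTN #31): the first byte holds
    `first_payload_bits` bits of the value, and every following byte
    holds `next_payload_bits` bits plus a continuation flag.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    size = 1
    value >>= first_payload_bits      # bits that fit in the first byte
    while value:                      # remaining bits need more bytes
        size += 1
        value >>= next_payload_bits
    return size


# Compare 6 and 7 payload bits in the first byte for a few offsets.
for offset in (63, 64, 127, 128, 8191, 8192):
    print(offset, encoded_size(offset, 6), encoded_size(offset, 7))
```

Under these assumptions the extra bit saves one byte only for offsets in 64..127 (and again in 8192..16383, and so on); all other offsets cost the same either way.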
> However I wonder if the choice of the fixed-size little-endian 16-bit
> format for the first character in a literal is appropriate. Why couldn't
> it be represented as a code point difference, as is done in the rest of
> the literal?
The algorithm given is clearly for compressing UTF-16 data. Look at the
sign test for three-byte difference values. (It could be adjusted/corrected
to handle arbitrary code point differences.) I wonder if SCSU would
out-perform the algorithm on, say, Shavian.
Richard.