From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Jun 04 2010 - 12:22:42 CDT
On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
> In a compression format, that doesn't matter; you can't expect random 
> access, nor many of the other features of UTF-8.
>
> The minimal expectation for these kinds of simple compression is that 
> when you write a string with a particular /write/ method, and then 
> read it back with the corresponding /read/ method, you get exactly the 
> original string contents back, and you consume exactly as many bytes 
> as you had written. There are really no other guarantees.
Actually, SCSU makes an additional guarantee, which is that you can edit 
the compressed string. In other words, you can insert a substring such 
that the new string remains a valid compressed string and the parts 
preceding and following the insertion, when read, match the 
corresponding portion of the original after decoding. I remember that 
this was an important design criterion for the precursor RCSU.  Their 
implementation required the ability to deliver a "patch" to a compressed 
string, something that isn't possible with many other compression formats.
So there is a sliding scale in features, each compression method being 
designed to address the specific requirements of a given application.
A./
>
> Mark
>
> — Il meglio è l’inimico del bene —
>
>
> On Fri, Jun 4, 2010 at 06:35, Otto Stolz <Otto.Stolz@uni-konstanz.de> wrote:
>
>     Hello,
>
>     On 2010-06-03 07:07, Kannan Goundan wrote:
>
>         This is currently what I do (I was referring to this as the
>         "compact UTF-8-like encoding").  The one difference is that I
>         put all the marker bits in the first byte (instead of in the
>         high bit of every byte):
>           0xxxxxxx
>           10xxxxxx xyyyyyyy
>           110xxxxx xxyyyyyy yzzzzzzz
>
>
>     The problem with this encoding is that the trailing bytes
>     are not clearly marked: they may start with any of
>     '0', '10', or '110'; only '111' would mark a byte
>     unambiguously as a trailing one.
>
>     In contrast, in UTF-8 every single byte carries a marker
>     that unambiguously marks it as either a single ASCII byte,
>     a starting, or a continuation byte; hence you do not have to
>     go back to the beginning of the whole data stream to recognize,
>     and decode, a group of bytes.
>
>     Best wishes,
>      Otto Stolz
>
>
>
>
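To make the quoted trade-off concrete, here is a minimal Python sketch of the
1-to-3-byte layout Kannan describes, together with the resynchronization
problem Otto points out. Everything beyond the three bit patterns is an
assumption made for illustration: the payload bits are taken as a plain
big-endian integer with no offset, the encoder always picks the shortest
form, and the names encode_cp, decode_at, encode, decode and
is_utf8_continuation are hypothetical, not taken from anyone's actual code.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point with all marker bits in the first byte."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:                       # 0xxxxxxx (7 payload bits)
        return bytes([cp])
    if cp < 0x4000:                     # 10xxxxxx + 1 byte (14 payload bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:                   # 110xxxxx + 2 bytes (21 payload bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point too large for this sketch")


def decode_at(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one code point at pos; return (code point, next position)."""
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, pos + 1
    if b0 < 0xC0:
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise ValueError("lead byte 111xxxxx is unused in this sketch")


def encode(text: str) -> bytes:
    return b"".join(encode_cp(ord(ch)) for ch in text)


def decode(data: bytes) -> str:
    chars, pos = [], 0
    while pos < len(data):
        cp, pos = decode_at(data, pos)
        chars.append(chr(cp))
    return "".join(chars)


def is_utf8_continuation(byte: int) -> bool:
    """In UTF-8 every trailing byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


if __name__ == "__main__":
    text = "\u0141"                     # LATIN CAPITAL LETTER L WITH STROKE
    data = encode(text)                 # two bytes: 0x81 0x41
    assert decode(data) == text         # write/read round trip holds

    # Starting in the middle, the trailing byte 0x41 is indistinguishable
    # from a stand-alone ASCII 'A', so the decoder cannot notice the error.
    cp, _ = decode_at(data, 1)
    print(hex(cp))

    # UTF-8 encodes the same character as C5 81; the second byte announces
    # itself as a continuation byte, so a reader can skip to the next start.
    print(is_utf8_continuation(text.encode("utf-8")[1]))
```

The round trip through encode and decode is the one guarantee Mark asks
for, and decode_at reports exactly how many bytes it consumed. What the
layout gives up is the ability to start reading at an arbitrary byte
offset, which is the property Otto describes for UTF-8.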