From: Mark Davis ☕ (mark@macchiato.com)
Date: Fri Jun 04 2010 - 10:34:28 CDT
In a compression format, that doesn't matter; you can't expect random
access, nor many of the other features of UTF-8.
The minimal expectation for this kind of simple compression is that when
you write a string with a particular *write* method, and then read it back
with the corresponding *read* method, you get exactly the original string
contents back, and you consume exactly as many bytes as you had written.
There are really no other guarantees.
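
To make that contract concrete, here is a rough sketch in Python. It assumes
the three-form layout from the quoted message below (7, 14, or 21 payload
bits, all marker bits in the first byte). The names write_string,
read_string, _put and _get are made up for illustration, and the
code-point-count prefix is my own assumption about how a reader would know
where the string ends; it is not something the quoted message specifies.

```python
def _put(value, out):
    """Append one value (0 .. 0x1FFFFF) in the assumed 1/2/3-byte layout."""
    if value < 0x80:                      # 0xxxxxxx
        out.append(value)
    elif value < 0x4000:                  # 10xxxxxx xyyyyyyy
        out.append(0x80 | (value >> 8))
        out.append(value & 0xFF)
    elif value < 0x200000:                # 110xxxxx xxyyyyyy yzzzzzzz
        out.append(0xC0 | (value >> 16))
        out.append((value >> 8) & 0xFF)
        out.append(value & 0xFF)
    else:
        raise ValueError("value does not fit in 21 bits")


def _get(data, pos):
    """Decode one value starting at pos; return (value, next position)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | data[pos + 1], pos + 2
    if first < 0xE0:
        value = ((first & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2]
        return value, pos + 3
    raise ValueError("lead byte 111xxxxx is not used in this sketch")


def write_string(text, out):
    """Write a code point count, then each code point; return bytes written."""
    start = len(out)
    _put(len(text), out)                  # assumed length prefix
    for ch in text:
        _put(ord(ch), out)
    return len(out) - start


def read_string(data, pos):
    """Read one string written by write_string; return (text, next position)."""
    count, pos = _get(data, pos)
    chars = []
    for _ in range(count):
        code_point, pos = _get(data, pos)
        chars.append(chr(code_point))
    return "".join(chars), pos


# Round trip: two strings written back to back, then read back in order.
buf = bytearray()
n1 = write_string("héllo", buf)
n2 = write_string("日本語 𝄞", buf)
s1, p1 = read_string(bytes(buf), 0)
s2, p2 = read_string(bytes(buf), p1)
```

The only properties checked are the two named above: s1 and s2 equal the
original strings, and each read advances by exactly the number of bytes the
matching write produced (p1 == n1, p2 == n1 + n2).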
Mark
— Il meglio è l’inimico del bene —
On Fri, Jun 4, 2010 at 06:35, Otto Stolz <Otto.Stolz@uni-konstanz.de> wrote:
> Hello,
>
> On 2010-06-03 07:07, Kannan Goundan wrote:
>
>> This is currently what I do (I was referring to this as the "compact
>> UTF-8-like encoding"). The one difference is that I put all the
>> marker bits in the first byte (instead of in the high bit of every
>> byte):
>> 0xxxxxxx
>> 10xxxxxx xyyyyyyy
>> 110xxxxx xxyyyyyy yzzzzzzz
>>
>
> The problem with this encoding is that the trailing bytes
> are not clearly marked: they may start with any of
> '0', '10', or '110'; only '111' would mark a byte
> unambiguously as a trailing one.
>
> In contrast, in UTF-8 every single byte carries a marker
> that unambiguously identifies it as a single ASCII byte,
> a starting byte, or a continuation byte; hence you do not have to
> go back to the beginning of the whole data stream to recognize,
> and decode, a group of bytes.
>
> Best wishes,
> Otto Stolz
>
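
Otto's point about unmarked trailing bytes can be illustrated with a small
Python sketch. The helper names compact_encode and utf8_char_start are
hypothetical, and compact_encode assumes the plain 7/14/21-bit reading of
the layout quoted above. In UTF-8, a reader dropped at an arbitrary offset
can step back over 10xxxxxx bytes to find the start of the character; in
the compact layout, a trailing byte can be indistinguishable from an ASCII
byte.

```python
def compact_encode(code_point):
    """Encode one code point in the assumed compact layout (markers in byte 1 only)."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x4000:
        return bytes([0x80 | (code_point >> 8), code_point & 0xFF])
    if code_point < 0x200000:
        return bytes([0xC0 | (code_point >> 16),
                      (code_point >> 8) & 0xFF,
                      code_point & 0xFF])
    raise ValueError("code point does not fit in 21 bits")


def utf8_char_start(data, pos):
    """Step back from pos to the first byte of the UTF-8 sequence containing it."""
    while pos > 0 and (data[pos] & 0xC0) == 0x80:   # 10xxxxxx = continuation
        pos -= 1
    return pos


text = "aŁb"                                        # Ł is U+0141

utf8 = text.encode("utf-8")                         # bytes 61 C5 81 62
start = utf8_char_start(utf8, 2)                    # offset 2 is the 0x81 byte

compact = b"".join(compact_encode(ord(ch)) for ch in text)   # bytes 61 81 41 62
stray = compact[2]                                  # trailing byte of Ł
```

Here start comes out as 1, the offset of the lead byte 0xC5, so the UTF-8
reader recovers the character boundary locally. In the compact form, stray
is 0x41, which on its own reads as the ASCII letter "A"; nothing in that
byte says it belongs to the preceding 0x81.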