Re: Least used parts of BMP.

From: Mark Davis ☕ (mark@macchiato.com)
Date: Fri Jun 04 2010 - 10:34:28 CDT

Next message: Doug Ewell: "RE: Least used parts of BMP."

Previous message: Mark E. Shoulson: "Re: Hexadecimal digits"
In reply to: Otto Stolz: "Re: Least used parts of BMP."
Next in thread: Asmus Freytag: "Re: Least used parts of BMP."
Reply: Asmus Freytag: "Re: Least used parts of BMP."

In a compression format, it doesn't matter that trailing bytes are not
clearly marked; you can't expect random access, nor many of the other
features of UTF-8.

The minimal expectation for this kind of simple compression is that when
you write a string with a particular *write* method, and then read it back
with the corresponding *read* method, you get exactly the original string
contents back, and you consume exactly as many bytes as you had written.
There are really no other guarantees.
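
A minimal sketch of such a write/read pair, using the compact layout
quoted below (marker bits only in the first byte). The function names and
the length prefix (a code point count written with the same scheme) are
assumptions made for illustration, not something specified in this thread.

```python
from typing import Tuple


def write_code_point(cp: int, out: bytearray) -> None:
    """Append one code point in the 1-, 2-, or 3-byte compact form."""
    if cp < 0:
        raise ValueError("negative value")
    if cp < 0x80:                      # 0xxxxxxx
        out.append(cp)
    elif cp < 0x4000:                  # 10xxxxxx xyyyyyyy (14 payload bits)
        out.append(0x80 | (cp >> 8))
        out.append(cp & 0xFF)
    elif cp < 0x200000:                # 110xxxxx xxyyyyyy yzzzzzzz (21 bits)
        out.append(0xC0 | (cp >> 16))
        out.append((cp >> 8) & 0xFF)
        out.append(cp & 0xFF)
    else:
        raise ValueError("value does not fit in 21 bits")


def read_code_point(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position) for the group starting at pos."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:
        return ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise ValueError("first byte starts with 111, which this sketch never writes")


def write_string(s: str, out: bytearray) -> None:
    # Assumed framing: a code point count, then the code points themselves.
    write_code_point(len(s), out)
    for ch in s:
        write_code_point(ord(ch), out)


def read_string(buf: bytes, pos: int) -> Tuple[str, int]:
    count, pos = read_code_point(buf, pos)
    chars = []
    for _ in range(count):
        cp, pos = read_code_point(buf, pos)
        chars.append(chr(cp))
    return "".join(chars), pos
```

The only property this sketch aims for is the one described above:
read_string returns the original string and a position that has advanced
by exactly the number of bytes write_string appended. It offers no way to
start decoding in the middle of the buffer.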

Mark

— Il meglio è l’inimico del bene —

On Fri, Jun 4, 2010 at 06:35, Otto Stolz <Otto.Stolz@uni-konstanz.de> wrote:

> Hello,
>
> On 2010-06-03 07:07, Kannan Goundan wrote:
>
>> This is currently what I do (I was referring to this as the "compact
>> UTF-8-like encoding"). The one difference is that I put all the
>> marker bits in the first byte (instead of in the high bit of every
>> byte):
>> 0xxxxxxx
>> 10xxxxxx xyyyyyyy
>> 110xxxxx xxyyyyyy yzzzzzzz
>>
>
> The problem with this encoding is that the trailing bytes
> are not clearly marked: they may start with any of
> '0', '10', or '110'; only '111' would mark a byte
> unambiguously as a trailing one.
>
> In contrast, in UTF-8 every single byte carries a marker
> that unambiguously identifies it as a single ASCII byte,
> a leading byte, or a continuation byte; hence you do not have
> to go back to the beginning of the whole data stream to
> recognize and decode a group of bytes.
>
> Best wishes,
> Otto Stolz
>
>
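
The difference Otto describes can be illustrated with a short sketch. The
helper name utf8_char_start is hypothetical, and the compact bytes are
written out by hand from the bit layout quoted above.

```python
def utf8_char_start(buf: bytes, pos: int) -> int:
    """Back up from pos to the first byte of the UTF-8 sequence containing it.

    UTF-8 continuation bytes always have the form 10xxxxxx, so the start
    can be found by looking only at nearby bytes.
    """
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# "a", U+0141, "b" in UTF-8: 61 C5 81 62
utf8_data = "a\u0141b".encode("utf-8")
# Index 2 holds the continuation byte 0x81; the sequence starts at index 1.
start = utf8_char_start(utf8_data, 2)

# The same text in the compact layout (10xxxxxx xyyyyyyy for U+0141):
# 61 81 41 62. The trailing byte 0x41 is bit-for-bit the ASCII letter "A",
# so a reader landing on index 2 cannot tell it is in the middle of a group.
compact_data = bytes([0x61, 0x81, 0x41, 0x62])
looks_like_ascii = compact_data[2] < 0x80
```

In the UTF-8 case the start of the sequence is recoverable from local
context alone; in the compact case the only reliable option is to decode
forward from a known group boundary.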


This archive was generated by hypermail 2.1.5 : Fri Jun 04 2010 - 10:37:03 CDT