Re: Least used parts of BMP.

From: Mark Davis ☕ (mark@macchiato.com)
Date: Fri Jun 04 2010 - 10:34:28 CDT


    In a compression format, that doesn't matter; you can't expect random
    access, nor many of the other features of UTF-8.

    The minimal expectation for this kind of simple compression is that when
    you write a string with a particular *write* method, and then read it back
    with the corresponding *read* method, you get exactly the original string
    contents back, and you consume exactly as many bytes as you had written.
    There are really no other guarantees.
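
A minimal sketch of that round-trip contract, using the three-form layout quoted further down in this message (0xxxxxxx / 10xxxxxx xyyyyyyy / 110xxxxx xxyyyyyy yzzzzzzz). The function names, the code-point-count prefix on each string, and the choice to always emit the shortest form are assumptions made for illustration; none of them come from the thread.

```python
# Hypothetical helper names; the bit layout follows the quoted message.
MAX_VALUE = 0x1FFFFF  # 21 payload bits is the most the three-byte form holds


def write_uint(value, out):
    """Append one value to the bytearray `out`; return bytes written."""
    if not 0 <= value <= MAX_VALUE:
        raise ValueError("value does not fit in 21 bits")
    if value < 0x80:                     # 0xxxxxxx
        out.append(value)
        return 1
    if value < 0x4000:                   # 10xxxxxx xyyyyyyy
        out.append(0x80 | (value >> 8))
        out.append(value & 0xFF)
        return 2
    out.append(0xC0 | (value >> 16))     # 110xxxxx xxyyyyyy yzzzzzzz
    out.append((value >> 8) & 0xFF)
    out.append(value & 0xFF)
    return 3


def read_uint(buf, pos):
    """Decode one value starting at `pos`; return (value, next position)."""
    lead = buf[pos]
    if lead < 0x80:
        return lead, pos + 1
    if lead < 0xC0:
        return ((lead & 0x3F) << 8) | buf[pos + 1], pos + 2
    if lead < 0xE0:
        value = ((lead & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        return value, pos + 3
    raise ValueError("lead bytes of the form 111xxxxx are unused in this layout")


def write_string(s, out):
    """Write the code-point count (an assumption), then each code point."""
    written = write_uint(len(s), out)
    for ch in s:
        written += write_uint(ord(ch), out)
    return written


def read_string(buf, pos):
    """Read back what write_string wrote; return (string, next position)."""
    count, pos = read_uint(buf, pos)
    chars = []
    for _ in range(count):
        cp, pos = read_uint(buf, pos)
        chars.append(chr(cp))
    return "".join(chars), pos


out = bytearray()
n = write_string("a\u00e9\u20ac\U0001F600", out)
out += b"tail"                           # unrelated data following the string
text, end = read_string(bytes(out), 0)
assert text == "a\u00e9\u20ac\U0001F600" and end == n
```

The reader stops exactly where the writer stopped, so whatever follows the string in the stream is left untouched. Nothing here supports starting in the middle of the buffer, which is the trade-off being accepted.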

    Mark

    — Il meglio è l’inimico del bene —

    On Fri, Jun 4, 2010 at 06:35, Otto Stolz <Otto.Stolz@uni-konstanz.de> wrote:

    > Hello,
    >
    > On 2010-06-03 07:07, Kannan Goundan wrote:
    >
    >> This is currently what I do (I was referring to this as the "compact
    >> UTF-8-like encoding"). The one difference is that I put all the
    >> marker bits in the first byte (instead of in the high bit of every
    >> byte):
    >> 0xxxxxxx
    >> 10xxxxxx xyyyyyyy
    >> 110xxxxx xxyyyyyy yzzzzzzz
    >>
    >
    > The problem with this encoding is that the trailing bytes
    > are not clearly marked: they may start with any of
    > '0', '10', or '110'; only '111' would mark a byte
    > unambiguously as a trailing one.
    >
    > In contrast, in UTF-8 every single byte carries a marker
    > that unambiguously marks it as either a single ASCII byte,
    > a starting, or a continuation byte; hence you do not have to
    > go back to the beginning of the whole data stream to recognize
    > and decode a group of bytes.
    >
    > Best wishes,
    > Otto Stolz
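
A short sketch of the property Otto describes: in UTF-8 the top bits of any single byte say what role it plays, so a reader dropped at an arbitrary offset can find the next character boundary without rescanning. The function names are hypothetical, and the classifier only looks at one byte; it does not validate whole sequences.

```python
def classify_utf8_byte(b):
    """Classify a single byte value (0..255) by its UTF-8 role."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    if 0xC2 <= b <= 0xF4:
        return "lead"           # 110xxxxx, 1110xxxx, 11110xxx
    return "invalid"            # 0xC0, 0xC1, 0xF5..0xFF never occur in UTF-8


def next_boundary(buf, pos):
    """Skip continuation bytes; return the offset of the next character start."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")        # bytes 61 E2 82 AC 62
start = next_boundary(data, 2)           # offset 2 is inside the euro sign
assert data[start:].decode("utf-8") == "b"

# In the compact layout quoted above, U+20AC would be the bytes A0 AC.
# The second byte, 0xAC, begins with the bits '10', the same prefix as a
# two-byte lead byte, so it cannot be identified without decoding from the start.
```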


