Re: Least used parts of BMP.

From: Asmus Freytag
Date: Fri Jun 04 2010 - 12:22:42 CDT

    On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
    > In a compression format, that doesn't matter; you can't expect random
    > access, nor many of the other features of UTF-8.
    > The minimal expectation for these kinds of simple compression is that
    > when you write a string with a particular /write/ method, and then
    > read it back with the corresponding /read/ method, you get exactly the
    > original string contents back, and you consume exactly as many bytes
    > as you had written. There are really no other guarantees.
    Actually, SCSU makes an additional guarantee, which is that you can edit
    the compressed string. In other words, you can insert a substring such
    that the new string remains a valid compressed string and the parts
    preceding and following the insertion, when read, match the
    corresponding portion of the original after decoding. I remember that
    this was an important design criterion for the precursor RCSU. Their
    implementation required the ability to deliver a "patch" to a compressed
    string, something that isn't possible with many other compression formats.

    So there is a sliding scale in features, each compression method being
    designed to address the specific requirements of a given application.

    > Mark
    > — Il meglio è l’inimico del bene —
    > On Fri, Jun 4, 2010 at 06:35, Otto Stolz wrote:
    > Hello,
    > Am 2010-06-03 07:07, schrieb Kannan Goundan:
    > This is currently what I do (I was referring to this as the
    > "compact UTF-8-like encoding"). The one difference is that I put all the
    > marker bits in the first byte (instead of in the high bit of every
    > byte):
    > 0xxxxxxx
    > 10xxxxxx xyyyyyyy
    > 110xxxxx xxyyyyyy yzzzzzzz
    > The problem with this encoding is that the trailing bytes
    > are not clearly marked: they may start with any of
    > '0', '10', or '110'; only '111' would mark a byte
    > unambiguously as a trailing one.
    > In contrast, in UTF-8 every single byte carries a marker
    > that unambiguously marks it as either a single ASCII byte,
    > a starting byte, or a continuation byte; hence you do not have to
    > go back to the beginning of the whole data stream to recognize
    > and decode a group of bytes.
    > Best wishes,
    > Otto Stolz
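
The layout quoted above can be sketched in Python to make Otto Stolz's objection concrete. This is only a minimal sketch, not Kannan Goundan's actual code: the names `encode_cp`, `encode`, and `decode` are hypothetical, and it assumes the payload bits hold the plain code point with no offset between the one-, two-, and three-byte forms (the thread does not say either way).

```python
# Sketch of the "compact UTF-8-like encoding" from the thread.
# Assumption: payload bits are the raw code point, with no offsetting.
# All function names here are hypothetical.

def encode_cp(cp: int) -> bytes:
    """Encode one code point; all marker bits live in the first byte."""
    if cp < 0x80:            # 0xxxxxxx                   (7 payload bits)
        return bytes([cp])
    if cp < 0x4000:          # 10xxxxxx xyyyyyyy          (14 payload bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:        # 110xxxxx xxyyyyyy yzzzzzzz (21 payload bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point out of range")


def encode(text: str) -> bytes:
    return b"".join(encode_cp(ord(ch)) for ch in text)


def decode(data: bytes) -> str:
    """Decode from the start of data; only the lead byte is inspected."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:      # single byte
            cp, size = lead, 1
        elif lead < 0xC0:    # two-byte form
            cp, size = ((lead & 0x3F) << 8) | data[i + 1], 2
        elif lead < 0xE0:    # three-byte form
            cp, size = ((lead & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2], 3
        else:
            raise ValueError("unused lead byte")
        out.append(chr(cp))
        i += size
    return "".join(out)


# Round trip from the start of the stream works as expected.
sample = "a\u0141\U0001F600"
assert decode(encode(sample)) == sample

# U+0141 encodes as 0x81 0x41. Its trailing byte 0x41 is indistinguishable
# from ASCII 'A', so a reader that starts one byte late silently misdecodes.
compact = encode("\u0141")
print(compact.hex(), repr(decode(compact[1:])))

# In UTF-8 the second byte carries the 10xxxxxx continuation marker,
# so a reader landing on it can tell it is not the start of a character.
utf8 = "\u0141".encode("utf-8")
print(utf8.hex(), (utf8[1] & 0xC0) == 0x80)
```

Reading from the first byte satisfies the write/read round trip that Mark Davis describes as the minimal expectation; the misdecoding appears only when a reader enters the stream at an arbitrary offset, which is the property UTF-8's per-byte markers protect.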
