From: Mark Davis ☕ (email@example.com)
Date: Wed Jun 02 2010 - 22:12:16 CDT
An alternative that I've used is:
- Serialize every unsigned integer as a sequence of 7-bit groups, one per
byte, with the top bit off for all but the last one.
- For signed integers, shift left by 1 bit, then invert if the original
was negative, then serialize as unsigned.
- Serialize a string as an integer length followed by a sequence of code
points expressed as integer deltas.
- For the deltas, set Previous=0 and loop, where each
delta = current - (Previous with the last 6 bits set to 0x40).
- Serialize floats/doubles as an integer exponent, then the sign+mantissa
(but in reverse byte order, e.g. most significant first).
This tends to produce pretty reasonable compression given that it is very
simple code and a fast transform.
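[The integer scheme above can be sketched in a few lines of Python. This is my own illustration, not Mark's code; it assumes the low 7-bit group is emitted first, and follows the stated convention that the top bit is off for all but the last byte.]

```python
def write_uvarint(n: int) -> bytes:
    """Emit 7 bits per byte, low group first; top bit off for all but the last byte."""
    assert n >= 0
    out = []
    while n > 0x7F:
        out.append(n & 0x7F)   # continuation byte: top bit clear
        n >>= 7
    out.append(n | 0x80)       # final byte: top bit set
    return bytes(out)

def write_svarint(n: int) -> bytes:
    """Shift left by 1 bit, invert if negative, then serialize as unsigned."""
    u = n << 1
    if n < 0:
        u = ~u                 # maps -1 -> 1, -2 -> 3, ...
    return write_uvarint(u)

def read_uvarint(data: bytes) -> int:
    """Inverse of write_uvarint."""
    n = 0
    shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:           # top bit set: this was the last byte
            break
        shift += 7
    return n
```

[The signed transform is the "zigzag" mapping: small magnitudes of either sign become small unsigned values, so they serialize in one byte.]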
— Il meglio è l’inimico del bene — (“The best is the enemy of the good”)
On Wed, Jun 2, 2010 at 18:11, Kannan Goundan <firstname.lastname@example.org> wrote:
> Thanks to everyone for the detailed responses. I definitely
> appreciate the feedback on the broader issue (even though my question
> was very narrow).
> I should clarify my use case a little. I'm creating a generic data
> serialization format similar to Google Protocol Buffers and Apache
> Thrift. Other than Unicode strings, the format supports many other
> data types -- all of which are serialized in a custom format. Some
> data types will contain a lot of string data while others will contain
> very little. As is the case with other tools in this area, standard
> compression techniques can be applied to the entire payload as a
> separate pass (e.g. gzip).
> I can see how there are benefits to using one of the standard
> encodings. However, at this point, my goals are basically fast
> serialization/deserialization and small size. I might eventually see
> the error in my ways (and feel like an idiot for ignoring your
> advice), but in the interest of not wasting your time any more than I
> already have, I should mention that suggestions to stick to a standard
> encoding will fall on mostly deaf ears.
> For my current use case, I don't need to perform random accesses in
> serialized data so I don't see a need to make the space-usage
> compromises that UTF-8 and UTF-16 make. A more compact UTF-8-like
> encoding will get you ASCII with one byte, the first 1/4 of the BMP
> with two bytes, and everything else with three bytes. A more compact
> UTF-16-like format gets the BMP in 2 bytes (minus some PUA) and
> everything else in 3. Maybe not huge savings, but if you're of the
> opinion that sticking to a standard doesn't buy you anything... :-)
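[The byte counts Kannan describes fall out of a 0/10/11 lead-byte layout: 7 payload bits in one byte (ASCII), 6+8=14 bits in two bytes (exactly the first quarter of the BMP, U+0000..U+3FFF), and 6+16=22 bits in three bytes (enough for U+10FFFF). A sketch of one possible bit layout — my own, not his actual format:]

```python
def encode_cp(cp: int) -> bytes:
    """Compact UTF-8-like: 0xxxxxxx | 10xxxxxx xxxxxxxx | 11xxxxxx xxxxxxxx xxxxxxxx."""
    assert 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
    if cp < 0x80:                      # ASCII: 1 byte, 7 payload bits
        return bytes([cp])
    if cp < 0x4000:                    # first 1/4 of the BMP: 2 bytes, 14 bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    # everything else: 3 bytes, 22 payload bits (21 needed for U+10FFFF)
    return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])

def decode_cp(b: bytes) -> int:
    """Inverse: the lead byte's prefix determines the length unambiguously."""
    if b[0] < 0x80:
        return b[0]
    if b[0] < 0xC0:
        return ((b[0] & 0x3F) << 8) | b[1]
    return ((b[0] & 0x3F) << 16) | (b[1] << 8) | b[2]
```

[Unlike real UTF-8, this is not self-synchronizing — you cannot find a character boundary from mid-stream — which is the trade-off being accepted by giving up random access.]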
> I'll definitely take a closer look at SCSU. Hopefully the encoding
> speed is good enough. Most of the other serialization tools just
> blast out UTF-8, making them very fast on strings that contain mostly
> ASCII. I hope SCSU doesn't get me killed in ASCII-only encoding
> benchmarks (http://wiki.github.com/eishay/jvm-serializers/). I really
> do like the idea of making my format less ASCII-biased, though. And,
> like I said before, I don't care much about sticking to a standard
> encoding -- if stock SCSU ends up being too slow or complex, I might
> still be able to use techniques from SCSU in a custom encoding.
> (Philippe: when I said I needed 20 bits, I meant that I needed 20 bits
> for the stuff after the BMP. I fully intend for my encoding to handle
> every Unicode codepoint, minus surrogates.)
> Thanks again, everyone.
> -- Kannan
> On Wed, Jun 2, 2010 at 13:12, Asmus Freytag <email@example.com> wrote:
> > On 6/2/2010 12:25 AM, Kannan Goundan wrote:
> >> On Tue, Jun 1, 2010 at 23:30, Asmus Freytag <firstname.lastname@example.org> wrote:
> >>> Why not use SCSU?
> >>> You get the small size and the encoder/decoder aren't that
> >>> complicated.
> >> Hmm... I had skimmed the SCSU document a few days ago. At the time it
> >> seemed a bit more complicated than I wanted. What's nice about UTF-8
> >> and UTF-16-like encodings is that the space usage is predictable.
> >> But maybe I'll take a closer look. If a simple SCSU encoder can do
> >> better than more "standard" encodings 99% of the time, then maybe it's
> >> worth it...
> > It will, because it's designed to compress commonly used characters.
> > Start with the existing sample code and optimize it. Many features of
> > SCSU are optional; using them gives slightly better compression, but you
> > don't always have to use them and the result is still legal SCSU.
> > Sometimes leaving out a feature can make your encoder a tad simpler,
> > although I suspect that you can be pretty fast with decent performance.
> > A./
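[Asmus's point that a feature-reduced encoder still emits legal SCSU can be taken to an extreme. In SCSU's initial state, printable ASCII (plus NUL, tab, CR, LF) passes through as-is, bytes 0x80–0xFF address the default dynamic window at U+0080, and any other code unit can be quoted with the SQU tag (0x0E) followed by the UTF-16BE unit. A minimal sketch along those lines — my own, not the UTS #6 sample code:]

```python
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}  # control bytes SCSU passes through
SQU = 0x0E  # SCSU tag: quote the next two bytes as one UTF-16BE code unit

def scsu_encode(text: str) -> bytes:
    """Emit legal SCSU using only the initial state plus the SQU tag."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    out = bytearray()
    for u in units:
        if 0x20 <= u <= 0x7F or u in PASS_THROUGH:
            out.append(u)            # ASCII passes through unchanged
        elif 0x80 <= u <= 0xFF:
            out.append(u)            # default dynamic window 0: U+0080..U+00FF
        else:
            out += bytes([SQU, u >> 8, u & 0xFF])  # quote one UTF-16 unit
    return bytes(out)
```

[ASCII stays one byte per character with no per-character branching beyond a range check, which bears on Kannan's worry about ASCII-only benchmarks; non-Latin text degrades to 3 bytes per BMP character until window features are added back.]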
This archive was generated by hypermail 2.1.5 : Wed Jun 02 2010 - 22:15:42 CDT