Re: Least used parts of BMP.

From: Michael D'Errico ([email protected])
Date: Wed Jun 02 2010 - 22:15:46 CDT

Next message: Doug Ewell: "Re: Least used parts of BMP."

Previous message: Mark Davis ☕: "Re: Least used parts of BMP."
In reply to: Kannan Goundan: "Re: Least used parts of BMP."
Next in thread: Doug Ewell: "Re: Least used parts of BMP."
Reply: Doug Ewell: "Re: Least used parts of BMP."

If you want a really fast alternate encoding, you could encode all of
Unicode in at most 3 bytes. Use the high bit as a "continuation" bit
and the lower 7 bits as the data.

ASCII gets passed through unchanged.

For code points between U+0080 and U+3FFF, split the value into its
high 7 bits and low 7 bits. Emit the high 7 bits first with the high
bit of the byte set, then the low 7 bits with the high bit cleared
(so the second byte will look like ASCII).

For the rest of Unicode, from U+4000 through U+10FFFF, split the value
into three 7-bit values, set the high bit on the first two bytes, and
leave the high bit cleared on the lowest byte.

So if you have a code point with binary value 00xx xxxx xyyy yyyy,
encode it as 1xxx xxxx followed by 0yyy yyyy.

And a code point with binary value 000x xxxx xxyy yyyy yzzz zzzz is
encoded as 1xxx xxxx 1yyy yyyy 0zzz zzzz.
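
The scheme above could be sketched in Python roughly as follows. This
is only an illustration, not a reference implementation: the function
names (encode_cp, encode, decode) are made up for this example, and
the decoder makes its own assumption that a trailing byte with the
continuation bit set is an error. It does not reject over-long forms
(for example, an ASCII value padded out to two bytes).

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point as 1 to 3 bytes, high 7-bit group first."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        # ASCII passes through unchanged.
        return bytes([cp])
    if cp < 0x4000:
        # 14 bits: 1xxx xxxx 0yyy yyyy
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    # 21 bits: 1xxx xxxx 1yyy yyyy 0zzz zzzz
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def encode(text: str) -> bytes:
    """Encode a whole string by concatenating the per-code-point bytes."""
    return b"".join(encode_cp(ord(ch)) for ch in text)


def decode(data: bytes) -> str:
    """Decode bytes produced by encode() back into a string."""
    out = []
    value = 0
    pending = False  # True while we are in the middle of a sequence
    for b in data:
        # Shift in the next 7 data bits.
        value = (value << 7) | (b & 0x7F)
        if b & 0x80:
            # Continuation bit set: more bytes follow.
            pending = True
        else:
            # High bit clear: this byte ends the code point.
            out.append(chr(value))
            value = 0
            pending = False
    if pending:
        raise ValueError("truncated sequence")
    return "".join(out)
```

With this sketch, encode("A") stays the single byte 0x41, U+00E9
becomes the two bytes 0x81 0x69, and U+10FFFF becomes the three bytes
0xC3 0xFF 0x7F.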

This is essentially the encoding used for tags in Abstract Syntax
Notation One (ASN.1) which has been around more than 20 years, so
there should be no IP claims to it.

Mike

Kannan Goundan wrote:
> Thanks to everyone for the detailed responses. I definitely
> appreciate the feedback on the broader issue (even though my question
> was very narrow).
>
> I should clarify my use case a little. I'm creating a generic data
> serialization format similar to Google Protocol Buffers and Apache
> Thrift. Other than Unicode strings, the format supports many other
> data types -- all of which are serialized in a custom format. Some
> data types will contain a lot of string data while others will contain
> very little. As is the case with other tools in this area, standard
> compression techniques can be applied to the entire payload as a
> separate pass (e.g. gzip).
>
> I can see how there are benefits to using one of the standard
> encodings. However, at this point, my goals are basically fast
> serialization/deserialization and small size. I might eventually see
> the error in my ways (and feel like an idiot for ignoring your
> advice), but in the interest of not wasting your time any more than I
> already have, I should mention that suggestions to stick to a standard
> encoding will fall on mostly deaf ears.
>
> For my current use case, I don't need to perform random accesses in
> serialized data so I don't see a need to make the space-usage
> compromises that UTF-8 and UTF-16 make. A more compact UTF-8-like
> encoding will get you ASCII with one byte, the first 1/4 of the BMP
> with two bytes, and everything else with three bytes. A more compact
> UTF-16-like format gets the BMP in 2 bytes (minus some PUA) and
> everything else in 3. Maybe not huge savings, but if you're of the
> opinion that sticking to a standard doesn't buy you anything... :-)
>
> I'll definitely take a closer look at SCSU. Hopefully the encoding
> speed is good enough. Most of the other serialization tools just
> blast out UTF-8, making them very fast on strings that contain mostly
> ASCII. I hope SCSU doesn't get me killed in ASCII-only encoding
> benchmarks (http://wiki.github.com/eishay/jvm-serializers/). I really
> do like the idea of making my format less ASCII-biased, though. And,
> like I said before, I don't care much about sticking to a standard
> encoding -- if stock SCSU ends up being too slow or complex, I might
> still be able to use techniques from SCSU in a custom encoding.
>
> (Philippe: when I said I needed 20 bits, I meant that I needed 20 bits
> for the stuff after the BMP. I fully intend for my encoding to handle
> every Unicode codepoint, minus surrogates.)
>
> Thanks again, everyone.
> -- Kannan
>
> On Wed, Jun 2, 2010 at 13:12, Asmus Freytag <[email protected]> wrote:
>> On 6/2/2010 12:25 AM, Kannan Goundan wrote:
>>> On Tue, Jun 1, 2010 at 23:30, Asmus Freytag <[email protected]> wrote:
>>>
>>>> Why not use SCSU?
>>>>
>>>> You get the small size and the encoder/decoder aren't that
>>>> complicated.
>>>>
>>> Hmm... I had skimmed the SCSU document a few days ago. At the time it
>>> seemed a bit more complicated than I wanted. What's nice about UTF-8
>>> and UTF-16-like encodings is that the space usage is predictable.
>>>
>>> But maybe I'll take a closer look. If a simple SCSU encoder can do
>>> better than more "standard" encodings 99% of the time, then maybe it's
>>> worth it...
>>>
>>>
>> It will, because it's designed to compress commonly used characters.
>>
>> Start with the existing sample code and optimize it. Many features of SCSU
>> are optional, using them gives slightly better compression, but you don't
>> always have to use them and the result is still legal SCSU. Sometimes
>> leaving out a feature can make your encoder a tad simpler, although I found
>> that you can be pretty fast with decent performance.
>>
>> A./
>>
>
>
>


This archive was generated by hypermail 2.1.5 : Wed Jun 02 2010 - 22:18:08 CDT