From: Kannan Goundan (kannan@cakoose.com)
Date: Tue Jun 01 2010 - 22:04:24 CDT
I'm trying to come up with a compact encoding for Unicode strings for
data serialization purposes. The goals are fast read/write and small
size.
The plan:
1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
2. Non-BMP code points are encoded as three bytes:
- The first two bytes form a single code point from the BMP's UTF-16
surrogate range (11 bits of data).
- The next byte provides an additional 8 bits of data.
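A quick check of the bit budget for the 3-byte form (back-of-the-envelope
Python; the variable names are just mine):

  # 3-byte form as planned: one surrogate-range code point (2 bytes),
  # then one extra byte.
  leads = 0xDFFF - 0xD800 + 1        # 2048 surrogate code points = 11 bits
  sequences = leads * 256            # plus 8 bits from the extra byte: 2**19
  non_bmp = 0x10FFFF - 0x10000 + 1   # 2**20 non-BMP code points to cover
  print(sequences, non_bmp)          # 524288 1048576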
Unfortunately, this doesn't quite work because it only gives me 19
bits to encode non-BMP code points, but I need 20 bits. To solve this
problem, I'm planning on stealing a bit of code space from the BMP's
private-use area. If I did, then:
- I'd get the bits needed to encode non-BMP code points in 3 bytes.
- The stolen code points of the private-use area would now have to be
encoded using 3 bytes.
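To make this concrete, here's a rough sketch of the encoder/decoder I
have in mind. The specific stolen range below is a placeholder I picked
for illustration, not a settled choice. One wrinkle: stealing exactly
2048 private-use code points wouldn't quite be enough, because 2048
surrogate leads plus 2048 stolen leads give 4096 * 256 = 2^20 three-byte
sequences, which covers the non-BMP range exactly but leaves no room for
the stolen code points' own 3-byte encodings. Stealing a bit more, say
0xF000-0xF8FF (2304 code points), leaves room for them:

  # Sketch only. Assumed choices (mine, for illustration):
  #   - stolen private-use range: 0xF000-0xF8FF (2304 code points)
  #   - leads: the surrogates 0xD800-0xDFFF, then the stolen range
  #   - a 3-byte sequence = lead (2 bytes, big-endian) + 1 trailing byte
  #   - 3-byte index space: non-BMP values first, then the stolen points
  SURR_FIRST, SURR_LAST = 0xD800, 0xDFFF      # 2048 surrogate leads
  STOLEN_FIRST, STOLEN_LAST = 0xF000, 0xF8FF  # 2304 stolen code points
  NUM_SURR = SURR_LAST - SURR_FIRST + 1

  def index_to_lead(i):
      # First 2048 lead indices map into the surrogate block, the
      # rest into the stolen private-use block.
      return SURR_FIRST + i if i < NUM_SURR else STOLEN_FIRST + (i - NUM_SURR)

  def lead_to_index(cp):
      if SURR_FIRST <= cp <= SURR_LAST:
          return cp - SURR_FIRST
      if STOLEN_FIRST <= cp <= STOLEN_LAST:
          return NUM_SURR + (cp - STOLEN_FIRST)
      return None  # not a lead: an ordinary 2-byte code point

  def encode(text):
      # Assumes the input contains no lone surrogates.
      out = bytearray()
      for ch in text:
          cp = ord(ch)
          if cp >= 0x10000:                        # non-BMP: 3 bytes
              v = cp - 0x10000                     # a 20-bit value
          elif STOLEN_FIRST <= cp <= STOLEN_LAST:  # stolen: also 3 bytes
              v = 0x100000 + (cp - STOLEN_FIRST)
          else:                                    # ordinary BMP: 2 bytes
              out += cp.to_bytes(2, 'big')
              continue
          out += index_to_lead(v >> 8).to_bytes(2, 'big')
          out.append(v & 0xFF)
      return bytes(out)

  def decode(data):
      chars, i = [], 0
      while i < len(data):
          cp = int.from_bytes(data[i:i + 2], 'big')
          i += 2
          idx = lead_to_index(cp)
          if idx is None:                          # plain 2-byte code point
              chars.append(chr(cp))
              continue
          v = (idx << 8) | data[i]                 # recombine the 20 bits
          i += 1
          if v < 0x100000:
              chars.append(chr(v + 0x10000))       # non-BMP
          else:
              chars.append(chr(STOLEN_FIRST + (v - 0x100000)))  # stolen
      return ''.join(chars)

A round trip like decode(encode(s)) == s is the property I'm after;
everything in the BMP outside the stolen range still costs 2 bytes.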
I chose the private-use area because I assumed it would be the least
commonly used, so having these code points require 3 bytes instead of
2 bytes wouldn't be that big a deal. Does this sound reasonable?
Would people suggest a different section of the BMP to steal from, or
a different encoding altogether?
Thanks for reading.
-- Kannan
P.S. I actually have two encodings. One is similar to UTF-8 in that
it's ASCII-biased. The encoding described above is intended for
non-ASCII-biased data. The programmer selects which encoding to use
based on what he thinks the data looks like.