From: Kannan Goundan (kannan@cakoose.com)
Date: Tue Jun 01 2010 - 22:04:24 CDT
I'm trying to come up with a compact encoding for Unicode strings for
data serialization purposes. The goals are fast read/write and small
size.
The plan:
1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
2. Non-BMP code points are encoded as three bytes:
- The first two bytes form a single code point from the BMP's UTF-16
surrogate range (11 bits of data).
- The next byte provides an additional 8 bits of data.
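A quick check of the bit budget for the 3-byte form (back-of-the-envelope
Python; the variable names are just mine):

  # 3-byte form as planned: one surrogate-range code point (2 bytes),
  # then one extra byte.
  leads = 0xDFFF - 0xD800 + 1        # 2048 surrogate code points = 11 bits
  sequences = leads * 256            # plus 8 bits from the extra byte: 2**19
  non_bmp = 0x10FFFF - 0x10000 + 1   # 2**20 non-BMP code points to cover
  print(sequences, non_bmp)          # 524288 1048576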
Unfortunately, this doesn't quite work because it only gives me 19
bits to encode non-BMP code points, but I need 20 bits. To solve this
problem, I'm planning on stealing a bit of code space from the BMP's
private-use area. If I did, then:
- I'd get the bits needed to encode non-BMP code points in 3 bytes.
- The stolen code points of the private-use area would now have to be
encoded using 3 bytes.
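To make this concrete, here's a rough sketch of the encoder/decoder I
have in mind. The specific stolen range below is a placeholder I picked
for illustration, not a settled choice. One wrinkle: stealing exactly
2048 private-use code points wouldn't quite be enough, because 2048
surrogate leads plus 2048 stolen leads give 4096 * 256 = 2^20 three-byte
sequences, which covers the non-BMP range exactly but leaves no room for
the stolen code points' own 3-byte encodings. Stealing a bit more, say
0xF000-0xF8FF (2304 code points), leaves room for them:

  # Sketch only. Assumed choices (mine, for illustration):
  #   - stolen private-use range: 0xF000-0xF8FF (2304 code points)
  #   - leads: the surrogates 0xD800-0xDFFF, then the stolen range
  #   - a 3-byte sequence = lead (2 bytes, big-endian) + 1 trailing byte
  #   - 3-byte index space: non-BMP values first, then the stolen points
  SURR_FIRST, SURR_LAST = 0xD800, 0xDFFF      # 2048 surrogate leads
  STOLEN_FIRST, STOLEN_LAST = 0xF000, 0xF8FF  # 2304 stolen code points
  NUM_SURR = SURR_LAST - SURR_FIRST + 1

  def index_to_lead(i):
      # First 2048 lead indices map into the surrogate block, the
      # rest into the stolen private-use block.
      return SURR_FIRST + i if i < NUM_SURR else STOLEN_FIRST + (i - NUM_SURR)

  def lead_to_index(cp):
      if SURR_FIRST <= cp <= SURR_LAST:
          return cp - SURR_FIRST
      if STOLEN_FIRST <= cp <= STOLEN_LAST:
          return NUM_SURR + (cp - STOLEN_FIRST)
      return None  # not a lead: an ordinary 2-byte code point

  def encode(text):
      # Assumes the input contains no lone surrogates.
      out = bytearray()
      for ch in text:
          cp = ord(ch)
          if cp >= 0x10000:                        # non-BMP: 3 bytes
              v = cp - 0x10000                     # a 20-bit value
          elif STOLEN_FIRST <= cp <= STOLEN_LAST:  # stolen: also 3 bytes
              v = 0x100000 + (cp - STOLEN_FIRST)
          else:                                    # ordinary BMP: 2 bytes
              out += cp.to_bytes(2, 'big')
              continue
          out += index_to_lead(v >> 8).to_bytes(2, 'big')
          out.append(v & 0xFF)
      return bytes(out)

  def decode(data):
      chars, i = [], 0
      while i < len(data):
          cp = int.from_bytes(data[i:i + 2], 'big')
          i += 2
          idx = lead_to_index(cp)
          if idx is None:                          # plain 2-byte code point
              chars.append(chr(cp))
              continue
          v = (idx << 8) | data[i]                 # recombine the 20 bits
          i += 1
          if v < 0x100000:
              chars.append(chr(v + 0x10000))       # non-BMP
          else:
              chars.append(chr(STOLEN_FIRST + (v - 0x100000)))  # stolen
      return ''.join(chars)

A round trip like decode(encode(s)) == s is the property I'm after;
everything in the BMP outside the stolen range still costs 2 bytes.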
I chose the private-use area because I assumed it would be the least
commonly used, so having these code points require 3 bytes instead of
2 bytes wouldn't be that big a deal. Does this sound reasonable?
Would people suggest a different section of the BMP to steal from, or
a different encoding altogether?
Thanks for reading.
-- Kannan
P.S. I actually have two encodings. One is similar to UTF-8 in that
it's ASCII-biased. The encoding described above is intended for
non-ASCII-biased data. The programmer selects which encoding to use
based on what he thinks the data looks like.