From: David Starner (prosfilaes@gmail.com)
Date: Wed Jun 02 2010 - 08:14:48 CDT
On Tue, Jun 1, 2010 at 11:04 PM, Kannan Goundan <kannan@cakoose.com> wrote:
>
> I'm trying to come up with a compact encoding for Unicode strings for
> data serialization purposes. The goals are fast read/write and small
> size.
>
> The plan:
> 1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
> 2. Non-BMP code points are encoded as three bytes
>    - The first two bytes are code points from the BMP's UTF-16 surrogate
>      range (11 bits of data)
>    - The next byte provides an additional 8 bits of data.
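
For concreteness, the quoted plan can be sketched roughly as follows. This is an illustration only, not the original poster's code: the function names, the big-endian byte order, and the way the 11 + 8 bits are split are all assumptions. Note that 11 + 8 = 19 bits covers offsets only up to 0x7FFFF, so this sketch rejects code points above U+8FFFF.

```python
SURROGATE_START = 0xD800  # first code unit of the UTF-16 surrogate range
SURROGATE_END = 0xDFFF    # last code unit of the surrogate range
OFFSET_BITS = 19          # 11 bits in the surrogate unit + 8 bits in the extra byte


def encode(text: str) -> bytes:
    """Encode text as 2 bytes per BMP code point, 3 per non-BMP code point."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if SURROGATE_START <= cp <= SURROGATE_END:
            raise ValueError(f"lone surrogate U+{cp:04X} cannot be encoded")
        if cp <= 0xFFFF:
            out += cp.to_bytes(2, "big")  # assumed byte order
        else:
            offset = cp - 0x10000
            if offset >= 1 << OFFSET_BITS:
                raise ValueError(f"U+{cp:X} does not fit in {OFFSET_BITS} bits")
            # High 11 bits go into a unit from the surrogate range,
            # low 8 bits go into the trailing byte.
            out += (SURROGATE_START | (offset >> 8)).to_bytes(2, "big")
            out.append(offset & 0xFF)
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode(); does no validation of truncated input."""
    chars = []
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if SURROGATE_START <= unit <= SURROGATE_END:
            offset = ((unit - SURROGATE_START) << 8) | data[i]
            i += 1
            chars.append(chr(0x10000 + offset))
        else:
            chars.append(chr(unit))
    return "".join(chars)
```

With these assumptions, encode("A\U00013000") yields five bytes: two for "A" and three for the hieroglyph.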
Why? I can't imagine any use case where you're dealing with enough
data outside the BMP to make using this instead of UTF-16 a real win.
Do you really have a case where you're dealing with a large amount of
Egyptian hieroglyphs or obscure Chinese characters, where it's worth
adding the complexity to go from four bytes to three in some cases, but
not worth using SCSU or a standard compression scheme like zlib's?
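
One way to check this on your own data is to compare the sizes directly. The sketch below is only a rough measuring aid: the function name and the synthetic sample are made up for illustration, the three-byte figure is computed by counting code points rather than by running a real encoder, and SCSU is left out because Python's standard library has no SCSU codec.

```python
import zlib


def compare_sizes(text: str) -> dict:
    """Return byte counts for UTF-16, the 2/3-byte scheme, and zlib over UTF-16."""
    utf16 = text.encode("utf-16-be")  # no BOM, 4 bytes per non-BMP code point
    non_bmp = sum(1 for ch in text if ord(ch) > 0xFFFF)
    bmp = len(text) - non_bmp
    return {
        "utf16": len(utf16),
        "three_byte_scheme": 2 * bmp + 3 * non_bmp,
        "utf16_zlib": len(zlib.compress(utf16)),
    }


# Synthetic worst case for UTF-16: text drawn entirely from the
# Egyptian Hieroglyphs block (starting at U+13000). Real text will differ.
sample = "".join(chr(0x13000 + (i * 7) % 0x400) for i in range(2000))

for name, size in compare_sizes(sample).items():
    print(f"{name}: {size} bytes")
```

Run it on a representative sample of the actual data before drawing conclusions; synthetic input like the above says little about how zlib behaves on real text.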
-- Where there is life, there is hope.