Re: Least used parts of BMP.

From: David Starner (prosfilaes@gmail.com)
Date: Wed Jun 02 2010 - 08:14:48 CDT

Next message: Philippe Verdy: "re: Least used parts of BMP."

Previous message: Doug Ewell: "Re: Least used parts of BMP."
In reply to: Kannan Goundan: "Least used parts of BMP."
Next in thread: Philippe Verdy: "re: Least used parts of BMP."

On Tue, Jun 1, 2010 at 11:04 PM, Kannan Goundan <kannan@cakoose.com> wrote:
>
> I'm trying to come up with a compact encoding for Unicode strings for
> data serialization purposes. The goals are fast read/write and small
> size.
>
> The plan:
> 1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
> 2. Non-BMP code points are encoded as three bytes
> - The first two bytes are code points from the BMP's UTF-16 surrogate
> range (11 bits of data)
> - The next byte provides an additional 8 bits of data.

Why? I can't imagine any use case where you're dealing with enough
data outside the BMP to make using this instead of UTF-16 a real win.
Do you really have a case where you're dealing with so many Egyptian
hieroglyphs or obscure Chinese characters that it's worth adding the
complexity to go from four bytes to three in some cases, but not worth
using SCSU or a standard compression library like zlib?
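
To put rough numbers on that, here is a minimal sketch in Python. It only counts bytes using the sizes stated in the quoted plan (two per BMP code point, three per non-BMP code point); it is not an implementation of the proposed encoding. The function names are made up for illustration, and the sample text is an assumption chosen to be as favorable to the three-byte scheme as possible.

```python
import zlib


def utf16_size(text: str) -> int:
    """Bytes needed for plain UTF-16 (no BOM)."""
    return len(text.encode("utf-16-be"))


def proposed_size(text: str) -> int:
    """Bytes under the quoted plan: 2 per BMP code point, 3 otherwise.

    Only the byte counts from the plan are modelled here; no actual
    bit layout is assumed.
    """
    return sum(2 if ord(ch) <= 0xFFFF else 3 for ch in text)


def zlib_utf16_size(text: str) -> int:
    """Bytes after running the UTF-16 form through zlib at default level."""
    return len(zlib.compress(text.encode("utf-16-be")))


if __name__ == "__main__":
    # Hypothetical best case for the plan: nothing but non-BMP text.
    # U+13000..U+13063 are Egyptian hieroglyphs; the repetition makes
    # this far more compressible than real text would be.
    sample = "".join(chr(0x13000 + i % 100) for i in range(5000))
    print("UTF-16:       ", utf16_size(sample))
    print("3-byte plan:  ", proposed_size(sample))
    print("UTF-16 + zlib:", zlib_utf16_size(sample))
```

Even for text that is entirely outside the BMP, the plan saves at most 25% over UTF-16, and for mostly-BMP text the saving approaches zero. How zlib fares depends heavily on the data, so the sample above should be swapped for real input before drawing conclusions.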

--
Kie ekzistas vivo, ekzistas espero.

This archive was generated by hypermail 2.1.5 : Wed Jun 02 2010 - 08:15:55 CDT