Re: Least used parts of BMP.

From: David Starner
Date: Wed Jun 02 2010 - 08:14:48 CDT

    On Tue, Jun 1, 2010 at 11:04 PM, Kannan Goundan wrote:
    > I'm trying to come up with a compact encoding for Unicode strings for
    > data serialization purposes.  The goals are fast read/write and small
    > size.
    > The plan:
    > 1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
    > 2. Non-BMP code points are encoded as three bytes
    > - The first two bytes are code points from the BMP's UTF-16 surrogate
    > range (11 bits of data)
    > - The next byte provides an additional 8 bits of data.

    Why? I can't imagine any use case where you're dealing with enough
    data outside the BMP to make using this instead of UTF-16 a real win.
    Do you really have a case where you're dealing with a large amount of
    Egyptian hieroglyphs or obscure Chinese characters, where it's worth
    adding the complexity to go from four bytes to three in some cases,
    but not worth using SCSU or a standard compression scheme like zlib's?
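
To make the size comparison concrete, here is a rough Python sketch of the quoted plan next to UTF-16 and zlib. The quoted message does not specify byte order or how the 19 data bits are split, so the big-endian layout and the helper names (encode, decode, sizes) are assumptions made for illustration only. One consequence of 11 + 8 = 19 bits is that this layout can only reach U+10000 through U+8FFFF, so the sketch rejects anything above that.

```python
import zlib

SURROGATE_BASE = 0xD800   # start of the UTF-16 surrogate range (0xD800-0xDFFF)
MAX_OFFSET = 1 << 19      # 11 bits from the surrogate range + 8 bits from the extra byte


def encode(text: str) -> bytes:
    """Hypothetical encoder for the quoted plan (big-endian layout is an assumption)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("lone surrogates cannot be encoded")
        if cp <= 0xFFFF:
            # BMP code point: two bytes, as in the plan's step 1.
            out += cp.to_bytes(2, "big")
        else:
            offset = cp - 0x10000
            if offset >= MAX_OFFSET:
                raise ValueError("19 bits only reach U+8FFFF")
            # High 11 bits go into a value from the surrogate range,
            # low 8 bits go into the trailing byte.
            out += (SURROGATE_BASE | (offset >> 8)).to_bytes(2, "big")
            out.append(offset & 0xFF)
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode() under the same assumed layout."""
    chars = []
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if 0xD800 <= unit <= 0xDFFF:
            offset = ((unit - SURROGATE_BASE) << 8) | data[i]
            i += 1
            chars.append(chr(0x10000 + offset))
        else:
            chars.append(chr(unit))
    return "".join(chars)


def sizes(text: str) -> dict:
    """Byte counts for the sketch, plain UTF-16, and zlib-compressed UTF-16."""
    utf16 = text.encode("utf-16-be")
    return {
        "plan": len(encode(text)),
        "utf16": len(utf16),
        "utf16_zlib": len(zlib.compress(utf16)),
    }


# Synthetic sample: 1000 code points cycling through the first 64
# Egyptian Hieroglyphs (U+13000 onward). Real text will compress differently.
sample = "".join(chr(0x13000 + i % 64) for i in range(1000))
sample_sizes = sizes(sample)
```

For text that is entirely outside the BMP, the sketch saves exactly one byte per character over UTF-16 (three bytes instead of four), and nothing at all for BMP text. How that compares with zlib or SCSU depends heavily on the actual data, so the sample above should be read as a way to measure, not as a benchmark.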

