From: D. Starner (shalesller@writeme.com)
Date: Wed Nov 26 2003 - 16:10:26 EST
> Use Base64 - it is stable through all normalisation forms.
The problem with Base64 (and, worse yet, PUA characters for bytes) is that
it's inefficient. Base64 carries 6 bits per 8 (75%) in UTF-8 and 6 bits per 16 (37%)
in UTF-16. You can get 15 bits per 16 (93%) in UTF-16 and 15 bits per 24 (62%)
in UTF-8 with the CJK15 scheme below, and the only normalization that touches it
is Hangul decomposition, which is at least algorithmic.
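To illustrate what "algorithmic" means here, the following is a minimal sketch, assuming Python's standard unicodedata module: NFD splits a precomposed Hangul syllable into jamo by plain arithmetic, and NFC puts it back together. The helper name decompose_hangul is made up for this example; the constants are the usual ones from the Unicode Hangul composition algorithm.

```python
import unicodedata

# Constants from the Unicode Hangul syllable (de)composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose_hangul(cp):
    """Arithmetic NFD decomposition of a precomposed Hangul syllable."""
    s = cp - S_BASE
    lead = L_BASE + s // (V_COUNT * T_COUNT)
    vowel = V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + s % T_COUNT
    # trail == T_BASE means "no trailing consonant".
    return [lead, vowel] if trail == T_BASE else [lead, vowel, trail]

# Check the Hangul range used by the scheme (AC00h..C3FFh).
for cp in range(0xAC00, 0xC400):
    nfd = unicodedata.normalize("NFD", chr(cp))
    assert [ord(c) for c in nfd] == decompose_hangul(cp)
    assert unicodedata.normalize("NFC", nfd) == chr(cp)
```

So a receiver that sees NFD text can recover the original characters by applying NFC before decoding.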
You could avoid the normalization issue and improve efficiency in UTF-8
(at a cost in the SCSU case) by using low code points that neither decompose
nor compose, but then you have to carry long lists of usable characters.
(The numbers get tricky, so I haven't run them.)
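You could also remove the normalization cost by using Plane 2 instead of
Hangul (the CJ15 variant below), but any character on Plane 2 is larger:
4 bytes instead of 3 in UTF-8, and a surrogate pair in UTF-16.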
(Assuming your binary data is uniformly distributed, the numbers are
easy, except for SCSU. I think astral windows in SCSU have little
effect when used on data like this.)
          Base64   CJK15   CJ15
UTF-8      75%      62%     59%
UTF-16     37%      93%     78%
SCSU       75%      93%     78%(?)
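The UTF-8 and UTF-16 rows can be rechecked with a few lines of arithmetic. This is only a sketch under the same assumption of uniformly distributed data; it ignores Base64 padding and line breaks, and it leaves SCSU out because that row is uncertain anyway. The names efficiency, mean_bits and ASTRAL are made up for this example.

```python
def efficiency(payload_bits, encoded_bits):
    """Payload bits carried per encoded bit, as a percentage."""
    return 100.0 * payload_bits / encoded_bits

# CJ15 sends chunks 6800h..7FFFh to Plane 2; the rest stay in the BMP.
ASTRAL = (0x8000 - 0x6800) / 0x8000

def mean_bits(bmp_bits, astral_bits):
    """Average encoded size of one CJ15 character, in bits."""
    return (1 - ASTRAL) * bmp_bits + ASTRAL * astral_bits

# Columns: Base64, CJK15, CJ15.
table = {
    "UTF-8": (efficiency(6, 8), efficiency(15, 24),
              efficiency(15, mean_bits(24, 32))),
    "UTF-16": (efficiency(6, 16), efficiency(15, 16),
               efficiency(15, mean_bits(16, 32))),
}

for name, row in table.items():
    print(f"{name:7}" + "".join(f"{value:8.1f}%" for value in row))
```

The computed values agree with the table to within one percentage point; the table's figures are a mix of rounded and truncated values.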
CJK15:
Break the byte stream into 15-bit chunks. Let a be a 15-bit chunk and U
be the resulting Unicode character. Then
if a < 1800h then U = a + 3400h
else if a < 6800h then U = a - 1800h + 4E00h /* a + 3600h */
else U = a - 6800h + AC00h /* a + 4400h */
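A possible rendering of CJK15 in Python, as a sketch rather than a finished codec. The three offsets come straight from the rules above. Everything else is an assumption of this example: the function names, the big-endian bit order, the zero-bit padding of the last chunk, and passing the original byte count to the decoder out of band (the scheme above does not say how to handle a trailing partial chunk).

```python
def chunk_to_cp(a):
    """Map a 15-bit value to a code point using the CJK15 rules."""
    if not 0 <= a < 0x8000:
        raise ValueError("chunk must fit in 15 bits")
    if a < 0x1800:
        return a + 0x3400   # CJK Extension A
    if a < 0x6800:
        return a + 0x3600   # CJK Unified Ideographs
    return a + 0x4400       # Hangul Syllables

def cp_to_chunk(u):
    """Inverse of chunk_to_cp."""
    if 0x3400 <= u < 0x4C00:
        return u - 0x3400
    if 0x4E00 <= u < 0x9E00:
        return u - 0x3600
    if 0xAC00 <= u < 0xC400:
        return u - 0x4400
    raise ValueError("not a CJK15 character")

def encode(data):
    """Encode bytes as CJK15 text, zero-padding the last chunk (assumption)."""
    nbits = 8 * len(data)
    pad = -nbits % 15
    bits = int.from_bytes(data, "big") << pad
    nchunks = (nbits + pad) // 15
    return "".join(
        chr(chunk_to_cp((bits >> (15 * (nchunks - 1 - i))) & 0x7FFF))
        for i in range(nchunks)
    )

def decode(text, nbytes):
    """Decode CJK15 text; nbytes is the original length, sent out of band."""
    bits = 0
    for ch in text:
        bits = (bits << 15) | cp_to_chunk(ord(ch))
    pad = 15 * len(text) - 8 * nbytes
    return (bits >> pad).to_bytes(nbytes, "big")

payload = bytes(range(256))
assert decode(encode(payload), len(payload)) == payload
```

If the text may have passed through NFD, apply NFC before calling decode so that any decomposed Hangul is recomposed first.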
CJ15:
replace the last else with
else U = a - 6800h + 20000h /* a + 19800h */
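A small sketch of the CJ15 mapping, again in Python with the standard unicodedata module, showing the trade-off: the Plane 2 characters survive every normalization form unchanged, but each one costs 4 bytes in both UTF-8 and UTF-16. The function name cj15_chunk_to_cp and the sample values are made up for this example.

```python
import unicodedata

def cj15_chunk_to_cp(a):
    """Map a 15-bit value to a code point using the CJ15 rules."""
    if not 0 <= a < 0x8000:
        raise ValueError("chunk must fit in 15 bits")
    if a < 0x1800:
        return a + 0x3400    # CJK Extension A
    if a < 0x6800:
        return a + 0x3600    # CJK Unified Ideographs
    return a + 0x19800       # Plane 2 (CJK Extension B)

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

for a in (0x0000, 0x1800, 0x6800, 0x7FFF):
    ch = chr(cj15_chunk_to_cp(a))
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))
    stable = all(unicodedata.normalize(form, ch) == ch for form in FORMS)
    print(f"a={a:04X}h U+{ord(ch):05X} utf8={utf8_len} "
          f"utf16={utf16_len} stable={stable}")
```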
This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 17:07:08 EST