RE: Compression through normalization

From: D. Starner (shalesller@writeme.com)
Date: Wed Nov 26 2003 - 16:10:26 EST

Next message: Timothy Partridge: "Re: What is a process?"

Previous message: Peter Constable: "RE: Definitions"
Maybe in reply to: Philippe Verdy: "RE: Compression through normalization"
Next in thread: Doug Ewell: "Re: Compression through normalization"

> Use Base64 - it is stable through all normalisation forms.

The problem with Base64 (and, worse yet, PUA characters for bytes) is that
it's inefficient. Base64 offers 6 bits per 8 (75%) on UTF-8 and 6 bits per 16
(37%) on UTF-16. You can get 15 bits per 16 (93%) on UTF-16 and 15 bits per 24
(62%) on UTF-8 with the following scheme, and the only normalization that
affects it is Hangul, which is at least algorithmic.

You could avoid the normalization issue and improve efficiency in UTF-8
(at a cost in the SCSU case) by using low characters that don't decompose
or compose, but then you have to carry long lists of usable characters.
(The numbers get tricky, so I haven't run them.)
You could also avoid the normalization cost by using Plane 2 instead of
Hangul, but every character on Plane 2 takes more space (four bytes in both
UTF-8 and UTF-16).
(Assuming your binary data is uniformly distributed, the numbers are
easy, except for SCSU. I think astral windows in SCSU have little
effect when used on data like this.)

          Base64   CJK15   CJ15
UTF-8     75%      62%     59%
UTF-16    37%      93%     78%
SCSU      75%      93%     78%(?)
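
The UTF-8 and UTF-16 figures for CJK15 and CJ15 can be re-derived by averaging
the encoded size over all 32768 possible chunk values, assuming uniformly
distributed input. The following is only a sketch in Python: the function
names are made up for illustration, the mappings are the CJK15 and CJ15 rules
given later in this message, and SCSU is left out.

```python
from fractions import Fraction

def cjk15(a):
    """15-bit chunk value -> code point, per the CJK15 rules in this message."""
    if a < 0x1800:
        return a + 0x3400
    if a < 0x6800:
        return a + 0x3600
    return a + 0x4400

def cj15(a):
    """Same as cjk15, but the top range goes to Plane 2 (the CJ15 variant)."""
    return a + 0x19800 if a >= 0x6800 else cjk15(a)

def utf8_bytes(cp):
    """Number of bytes UTF-8 needs for one code point."""
    return 1 if cp < 0x80 else 2 if cp < 0x800 else 3 if cp < 0x10000 else 4

def utf16_bytes(cp):
    """Number of bytes UTF-16 needs for one code point."""
    return 2 if cp < 0x10000 else 4

def efficiency(mapping, size):
    """Payload bits divided by encoded bits, summed over all 2**15 chunks."""
    encoded_bits = 8 * sum(size(mapping(a)) for a in range(0x8000))
    return Fraction(15 * 0x8000, encoded_bits)

for name, mapping in (("CJK15", cjk15), ("CJ15", cj15)):
    for form, size in (("UTF-8", utf8_bytes), ("UTF-16", utf16_bytes)):
        print(f"{name} {form}: {float(efficiency(mapping, size)):.1%}")
```

Under those assumptions the exact ratios are 5/8 and 15/16 for CJK15, and
10/17 and 15/19 for CJ15, which agree with the table to within a percentage
point.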

CJK15:
Break the byte stream into 15-bit chunks. Let a be a 15-bit chunk and U
be the resulting Unicode character. Then
  if a < 1800h      then U = a + 3400h          /* U+3400..U+4BFF */
  else if a < 6800h then U = a - 1800h + 4E00h  /* a + 3600h, U+4E00..U+9DFF */
  else                   U = a - 6800h + AC00h  /* a + 4400h, U+AC00..U+C3FF */
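
The CJK15 mapping can be sketched as a small encoder and decoder in Python.
This is one possible reading, not a reference implementation. The message
does not say how to handle a final partial chunk, so the sketch assumes it is
zero-padded and that the original byte length is passed to the decoder
separately. All function names are hypothetical.

```python
def cjk15_char(a):
    """Map a 15-bit chunk value to a code point (the three CJK15 ranges)."""
    if a < 0x1800:
        return a + 0x3400
    if a < 0x6800:
        return a + 0x3600
    return a + 0x4400

def cjk15_value(u):
    """Inverse of cjk15_char; rejects code points outside the three ranges."""
    if 0x3400 <= u < 0x4C00:
        return u - 0x3400
    if 0x4E00 <= u < 0x9E00:
        return u - 0x3600
    if 0xAC00 <= u < 0xC400:
        return u - 0x4400
    raise ValueError(f"U+{u:04X} is not a CJK15 character")

def encode(data):
    """Pack bytes into 15-bit chunks, most significant bit first."""
    out, acc, nbits = [], 0, 0
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        if nbits >= 15:                      # at most one chunk per byte
            nbits -= 15
            out.append(chr(cjk15_char((acc >> nbits) & 0x7FFF)))
            acc &= (1 << nbits) - 1          # keep only the leftover bits
    if nbits:                                # assumption: zero-pad the tail
        out.append(chr(cjk15_char((acc << (15 - nbits)) & 0x7FFF)))
    return "".join(out)

def decode(text, length):
    """Unpack characters to bytes; `length` trims the padding (assumption)."""
    out, acc, nbits = bytearray(), 0, 0
    for ch in text:
        acc = (acc << 15) | cjk15_value(ord(ch))
        nbits += 15
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    return bytes(out[:length])

payload = bytes(range(256))
text = encode(payload)
print(len(payload), "bytes ->", len(text), "characters")
print(decode(text, len(payload)) == payload)
```

Passing the length out of band keeps the sketch short; a real format would
need some in-band way to mark how many padding bits the last chunk carries.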

CJ15:
Replace the last else with
  else U = a - 6800h + 20000h  /* a + 19800h, U+20000..U+217FF */
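
The normalization behaviour of the two variants can be checked with Python's
standard unicodedata module. This is a sketch with made-up function names. It
builds a string holding every possible output character of each scheme and
compares it with its NFC and NFD forms. The results depend on the Unicode data
shipped with the Python in use.

```python
import unicodedata

def cjk15(a):
    """15-bit chunk value -> code point, CJK15 (top range is Hangul)."""
    if a < 0x1800:
        return a + 0x3400
    if a < 0x6800:
        return a + 0x3600
    return a + 0x4400

def cj15(a):
    """15-bit chunk value -> code point, CJ15 (top range is Plane 2)."""
    return a + 0x19800 if a >= 0x6800 else cjk15(a)

# One string per scheme containing every character the scheme can emit.
all_cjk15 = "".join(chr(cjk15(a)) for a in range(0x8000))
all_cj15 = "".join(chr(cj15(a)) for a in range(0x8000))

def survives(text, form):
    """True if normalizing to `form` leaves the text unchanged."""
    return unicodedata.normalize(form, text) == text

for form in ("NFC", "NFD"):
    print(form,
          "CJK15 unchanged:", survives(all_cjk15, form),
          "CJ15 unchanged:", survives(all_cj15, form))

# NFD splits the Hangul syllables into jamo; NFC puts them back together.
restored = unicodedata.normalize(
    "NFC", unicodedata.normalize("NFD", all_cjk15))
print("CJK15 restored by NFC after NFD:", restored == all_cjk15)
```

CJ15 output should come through both forms untouched. CJK15 output should
change only under NFD, and only in the Hangul range, and renormalizing to NFC
should bring it back.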

