From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 16 2007 - 15:01:17 CDT
Another issue is, what subset of Unicode are you going to use,
and how badly do you need to compress? If you only need certain
ranges, you may be able to find an ad hoc compression scheme
that saves a lot of space. For example, if you need a range
that encodes as three or four bytes in UTF-8 and otherwise only
ASCII, you might save a lot of space simply by subtracting
the base codepoint of the range from each codepoint and adding
it again on decompression. Depending on the case, the fact that
a particular code represents ASCII or the upper range could be
indicated either by markup or by downshifting by the base codepoint
minus 128 rather than by the full base, so that any code above 127
after the shift would be in the non-ASCII range.
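
As a rough illustration of the downshifting idea, here is a minimal
Python sketch. It assumes the only non-ASCII range needed is a single
block of 128 codepoints, so that every shifted value fits in one byte.
The Devanagari block starting at U+0900 is used purely as an example,
and the names BASE, SIZE, SHIFT, compress and decompress are
hypothetical, not part of any existing library.

```python
# Assumed range for illustration: the Devanagari block, U+0900..U+097F.
BASE = 0x0900          # first codepoint of the range
SIZE = 0x80            # number of codepoints in the range (assumed 128)
SHIFT = BASE - 128     # downshift so the range lands on 128..255


def compress(text: str) -> bytes:
    """Map ASCII to itself and the chosen range onto byte values 128..255."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(cp)                 # ASCII is stored unchanged
        elif BASE <= cp < BASE + SIZE:
            out.append(cp - SHIFT)         # upper range becomes 128..255
        else:
            raise ValueError(f"codepoint U+{cp:04X} is outside the supported subset")
    return bytes(out)


def decompress(data: bytes) -> str:
    """Add the shift back to every value above 127."""
    return "".join(chr(b if b < 128 else b + SHIFT) for b in data)
```

Under these assumptions each character costs one byte, where UTF-8
needs three bytes for every codepoint in that block. A range wider
than 128 codepoints would not fit this one-byte layout and would need
some multi-byte encoding of the shifted values, or markup, instead.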
The general point is, how to compress Unicode in general, where
any character might occur, is one question. How to compress a
particular subset may have a very different answer.
Bill