Amount of Space

From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 16 2007 - 15:01:17 CDT


    Another issue is what subset of Unicode you are going to use
    and how badly you need to compress. If you only need certain
    ranges, you may be able to find an ad hoc compression scheme
    that saves a lot of space. For example, if you need a range
    that encodes as three or four bytes in UTF-8 and otherwise only
    ASCII, you might save a lot of space simply by subtracting
    the base codepoint of the range from each codepoint and adding
    it again on decompression. Depending on the case, the fact that
    a particular code represents ASCII or the upper range could be
    indicated either by markup or by downshifting by the base
    codepoint minus 128 instead, so that any code above 127 would
    be in the non-ASCII range. If the range spans no more than 128
    codepoints, every character then fits in a single byte.
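
    As a rough illustration of the downshifting variant, here is a
    minimal Python sketch. It assumes, purely for the example, that
    the one non-ASCII range needed is the Devanagari block
    (U+0900 to U+097F, three bytes per character in UTF-8) and that
    nothing else outside ASCII occurs. The names BASE, SHIFT,
    compress and decompress are made up for this sketch.

```python
# Assumed for illustration: the only non-ASCII range in use is the
# Devanagari block, U+0900..U+097F (128 codepoints).
BASE = 0x0900
SHIFT = BASE - 128  # downshift so the range lands on byte values 128..255


def compress(text: str, shift: int = SHIFT) -> bytes:
    """Encode ASCII as-is and the chosen range as one byte per character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(cp)  # ASCII passes through unchanged
        else:
            code = cp - shift
            if not 128 <= code <= 255:
                raise ValueError(f"U+{cp:04X} is outside the supported range")
            out.append(code)
    return bytes(out)


def decompress(data: bytes, shift: int = SHIFT) -> str:
    """Invert compress(): bytes above 127 are shifted back up."""
    return "".join(chr(b) if b < 128 else chr(b + shift) for b in data)


sample = "namaste \u0928\u092e\u0938\u094d\u0924\u0947"
packed = compress(sample)
print(len(sample.encode("utf-8")), len(packed))  # 26 14
print(decompress(packed) == sample)              # True
```

    This only works because the assumed range has 128 codepoints.
    A larger range would need wider codes or the markup approach.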

    The broader point is that how to compress Unicode in general,
    where any character might occur, is one question. How to compress
    a particular subset may have a very different answer.

    Bill


