From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 09:28:14 CDT
On 21 Sep 2006, at 15:34, Doug Ewell wrote:
>> Another method, which enables compressing both characters (code
>> points) and natural language words (sequences of code points),
>> might be to make modified UTF-8, where the leading byte admits
>> indicating two categories of numbers. (Continued below.)
>
> Whatever you do, do NOT call it "UTF-anything."
Don't worry. :-)
> I'm currently compressing names in the Unicode character list using
> a variable-length byte-based scheme that encodes common words like
> LETTER in 1 byte and rare words like SPATHI in two bytes.
So then, why not (if this is not what you are already doing) just take
a large body of English text and compute the frequency of each word in
it? Then sort the list, putting the more frequent words first, and
give each word the number of its position in this list. Then apply
UTF-8 (or some other variable-length encoding) to those numbers. If
infrequent words are not given numbers at all, but are just spelled
out character by character, use a combined character/word
modification instead, and you have your variable-length word
encoding. (The modification of UTF-8 that comes to my mind, giving
separate numbers to words and characters, is that the leading byte is
given the form 1...10nx..., where n is a flag bit: say 0 = character,
1 = word. The points are that small non-negative integers are given
shorter binary representations, and that the character and word
numberings are kept separate. So it is easy to play around with
other modifications.)
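
A rough sketch of the above in Python, under assumptions of my own:
it uses a little-endian base-128 variable-length integer rather than
the 1...10nx... lead-byte layout, and puts the character/word flag in
the lowest bit of the encoded number. The names rank_words,
encode_number, decode_stream and encode_text are made up for
illustration, and spaces between words are not preserved.

```python
from collections import Counter


def rank_words(corpus):
    """Map each word to its rank; the most frequent word gets rank 0."""
    counts = Counter(corpus.split())
    # Descending count, then alphabetical, so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}


def encode_number(n, is_word):
    """Encode n as a base-128 varint; low bit 1 = word, 0 = character."""
    value = (n << 1) | (1 if is_word else 0)
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_stream(data):
    """Return the list of (number, is_word) pairs stored in data."""
    results = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            results.append((value >> 1, bool(value & 1)))
            value = 0
            shift = 0
    return results


def encode_text(text, ranks):
    """Encode known words by rank and unknown words character by character."""
    out = bytearray()
    for token in text.split():
        if token in ranks:
            out += encode_number(ranks[token], True)
        else:
            for ch in token:
                out += encode_number(ord(ch), False)
    return bytes(out)
```

With this layout the 64 most frequent words fit in one byte each,
and a word missing from the list falls back to one number per
character.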
> The range of trail bytes is allowed to overlap the range of lead
> bytes, since backward parsing doesn't matter for this specific
> application.
UTF-8's design goal of keeping the trail-byte range from overlapping
the lead-byte range probably isn't important in these compression
schemes. So more bit-efficient encodings might be developed. For
example, one of variable byte length, and one of variable bit length.
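
To put a number on what overlapping ranges buy, here is a
hypothetical one-or-two-byte code (not Doug's actual scheme, which I
have not seen). The split point ONE_BYTE_CODES = 128 and the function
names are assumptions. Lead bytes below the split stand alone; lead
bytes at or above it are followed by a trail byte that may take any
value from 0 to 255.

```python
ONE_BYTE_CODES = 128  # assumed split: lead bytes 0..127 are complete codes

# Total numbers representable in at most two bytes.
CAPACITY = ONE_BYTE_CODES + (256 - ONE_BYTE_CODES) * 256


def encode_overlap(n):
    """Encode n in one or two bytes; the trail byte may be any value."""
    if n < 0 or n >= CAPACITY:
        raise ValueError("number out of range for this two-byte scheme")
    if n < ONE_BYTE_CODES:
        return bytes([n])
    lead, trail = divmod(n - ONE_BYTE_CODES, 256)
    return bytes([ONE_BYTE_CODES + lead, trail])


def decode_overlap(data):
    """Decode a forward-parsed byte string back into a list of numbers."""
    numbers = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < ONE_BYTE_CODES:
            numbers.append(lead)
            i += 1
        else:
            trail = data[i + 1]
            numbers.append(ONE_BYTE_CODES + (lead - ONE_BYTE_CODES) * 256 + trail)
            i += 2
    return numbers
```

This gives 128 + 128 * 256 = 32,896 numbers in at most two bytes,
against the 2,048 code points UTF-8 reaches with up to two bytes. The
price is that a byte seen in isolation cannot be classified as lead
or trail, so the stream can only be parsed from the start.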
> It has some characteristics in common with UTFs, but it isn't a UTF
> and I pledge not to call it one.
I'm not in the naming business. :-)
Hans Aberg