This is in two parts: Part one is on compression, Part two is about
a Unicode string compare function that I am looking for. (So don't
be too fast with that delete key. :)
Part 1)
Most compression algorithms use a buffer to check for recurring patterns.
Since text from a single code page of Unicode will typically have the same
initial byte for every character, you effectively have half the number of
characters in your buffer that you would have if the text were plain ASCII,
for example. So, to compress material to (approximately) the same size, you
will need to double your buffer size. If you are using a hash table, tree,
or other such lookup device in your compression algorithm, you shouldn't
need to change its size, as it is indexing the same number of characters as
before. (You might need to switch the chars in your buffer to shorts,
however.)
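To make that concrete, here is a rough C sketch of the sizing change,
assuming an LZ-style window; the names and sizes (WINDOW_CHARS, HASH_SIZE,
hash_pair) are just for illustration:

    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW_CHARS 4096   /* characters, not bytes: double the bytes */
    #define HASH_SIZE    4096   /* unchanged: it still indexes characters  */

    static uint16_t window[WINDOW_CHARS];  /* shorts now, not chars */
    static size_t   head[HASH_SIZE];       /* most recent match candidates */

    /* Hash a pair of adjacent 16-bit code units; the table can stay the
       same size as in the byte version because it still maps character
       positions, not bytes. */
    static size_t hash_pair(uint16_t a, uint16_t b)
    {
        return ((size_t)a * 31u + b) % HASH_SIZE;
    }

    /* Remember the latest window position whose two-character prefix
       hashes to this slot (pos + 1 must stay inside the window). */
    static void index_position(size_t pos)
    {
        head[hash_pair(window[pos], window[pos + 1])] = pos;
    }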
The other option is to use run-length encoding to represent the Unicode
code page that is being used (this would effectively encode every other
byte). You could take the output of this, together with the remaining
bytes, and run it through some sort of general compression algorithm, such
as one of the LZ variants. If you have any questions, please feel free to
e-mail me directly.
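As a rough illustration of that split (the function names are mine, and it
assumes UTF-16BE input, so the high bytes sit at even offsets):

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Run-length encode the high bytes as (count, value) pairs; within
       one code page they are nearly constant, so the runs are long. */
    static void rle_high_bytes(const uint8_t *utf16be, size_t nbytes,
                               FILE *out)
    {
        size_t i = 0;
        while (i < nbytes) {
            uint8_t value = utf16be[i];
            size_t  run   = 0;
            while (i < nbytes && utf16be[i] == value && run < 255) {
                run++;
                i += 2;          /* high bytes live at even offsets */
            }
            fputc((int)run,   out);
            fputc((int)value, out);
        }
    }

    /* Emit the low bytes untouched; this stream is what you would hand
       to the general-purpose (LZ-variant) compressor. */
    static void copy_low_bytes(const uint8_t *utf16be, size_t nbytes,
                               FILE *out)
    {
        size_t i;
        for (i = 1; i < nbytes; i += 2)
            fputc((int)utf16be[i], out);
    }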
Part 2)
I am looking for some sort of string / character compare function that
will work with Unicode. I realize that the sort orders for some of the
various character sets get kind of weird, so it would be nice if a compare
function that handles them were available. If you have one that you would
be willing to share, I would love to hear from you.
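(Just to illustrate what I mean: a naive comparison like the sketch below,
which orders by raw UTF-16 code-unit value, is trivial to write; the hard
part is the language-specific collation, which is what I am hoping someone
has. The function name is just for illustration.)

    #include <stdint.h>

    /* Naive baseline only: compares NUL-terminated UTF-16 strings by raw
       code-unit value and knows nothing about locale sort rules. */
    int u16_compare_naive(const uint16_t *a, const uint16_t *b)
    {
        while (*a && *a == *b) {
            a++;
            b++;
        }
        if (*a < *b) return -1;
        if (*a > *b) return  1;
        return 0;
    }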
Thanks,
Art