From: Elliotte Rusty Harold (elharo@metalab.unc.edu)
Date: Tue Jan 20 2004 - 14:41:13 EST
At 10:26 AM -0800 1/20/04, Mike Ayers wrote:
> BZZZT! Sorry, thanks for playing. You can't get the
> advantages of both with no drawbacks. Given the octets 0x5B5B, how
> would you know if you had "[[" or a Chinese character?
Actually, it looks like SCSU may do exactly that. If I'm
understanding the algorithms, it actually encodes most BMP characters
in a single byte, compressing quite a bit better than my naive idea
to switch between UTF-8 and UTF-16.
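The 0x5B5B ambiguity raised above is easy to see directly. This is only a minimal Python sketch; the big-endian byte order for UTF-16 is an assumption, since the message does not name one.

```python
# The two octets from the example above.
octets = bytes([0x5B, 0x5B])

# Read as ASCII (or UTF-8), each octet is a separate character.
as_ascii = octets.decode("ascii")

# Read as big-endian UTF-16 (assumed byte order), the pair is one code unit.
as_utf16 = octets.decode("utf-16-be")

print(repr(as_ascii), len(as_ascii))
print(hex(ord(as_utf16)), len(as_utf16))
```

Without some out-of-band or in-band signal, a decoder cannot tell which of the two readings was intended.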
All schemes I've seen do involve some sort of flag characters in the
data stream to switch between different code ranges. As long as you
can keep the number of flag characters added down below the savings,
you're good to go. My original idea was to simply use a null to
switch between ASCII and UTF-16. SCSU looks a lot more sophisticated.
Of course, neither of those schemes will compress truly random data,
but most data isn't random.
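For illustration, the null-switching idea might look roughly like the Python sketch below. The details are assumptions, not something spelled out in this thread: the stream starts in ASCII mode, a single 0x00 octet enters UTF-16 mode, a 0x0000 code unit leaves it, UTF-16 is big-endian, and U+0000 itself may not appear in the text. The names `encode` and `decode` are hypothetical.

```python
def encode(text: str) -> bytes:
    """Encode text, flipping between ASCII and UTF-16BE with null flags."""
    out = bytearray()
    wide = False  # False: ASCII mode, True: UTF-16 mode
    for ch in text:
        if ch == "\0":
            raise ValueError("U+0000 is reserved as the switch flag")
        need_wide = ord(ch) > 0x7F
        if need_wide != wide:
            # One null octet enters UTF-16 mode; a null code unit leaves it.
            out += b"\x00\x00" if wide else b"\x00"
            wide = need_wide
        out += ch.encode("utf-16-be") if wide else ch.encode("ascii")
    return bytes(out)


def decode(data: bytes) -> str:
    """Reverse encode(); assumes well-formed input produced by encode()."""
    out = []
    run = bytearray()  # pending UTF-16 code units, decoded when the run ends
    wide = False
    i = 0
    while i < len(data):
        if not wide:
            if data[i] == 0:
                wide = True
            else:
                out.append(chr(data[i]))
            i += 1
        else:
            unit = data[i:i + 2]
            if unit == b"\x00\x00":
                out.append(run.decode("utf-16-be"))
                run.clear()
                wide = False
            else:
                run += unit
            i += 2
    out.append(run.decode("utf-16-be"))
    return "".join(out)
```

Each switch into UTF-16 costs one octet and each switch back costs two, so text that alternates scripts character by character comes out larger than plain UTF-16. That is the "flag characters below the savings" condition in practice.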
> > However, I would like the translation into and out of this format to
> > be at least as fast as the translation between UTF-8 and UTF-16 the
> > class is currently performing on every call to setValue and getValue,
> > ideally faster.
> Hmmm - again, this may be asking for too much. The
> UTF-8/UTF-16 transform is pretty simple. Is it bogging you down?
It is a noticeable point in my profiling. I really did have to make a
choice between speed and space here. According to
http://www.unicode.org/notes/tn6/#Performance it looks like SCSU is
faster for a lot of languages but 10-25% slower for English, French
and Japanese than the UTF-8/UTF-16 conversion.
> If your application will use much more of European or
> non-European languages, then just use UTF-8 or UTF-16 respectively,
> as you won't really lose much space that way.
This is a class library which is relatively language neutral. If a
Chinese programmer uses it, I'd expect they'd have a lot of data in
Chinese. So far most of the adoption that I know about is in the
Americas and Europe, but there's no reason it has to stay that way,
especially if I can reduce the footprint for CJK text.
> If space usage is random/indeterminate/evenly distributed, then,
> assuming that any given string is primarily in a single language, a
> TLV type discriminating between UTF-8 and UTF-16 should do nicely.
> Precede each string with an OR of the MSB (0 for UTF-8, 1 for UTF-16)
> and the length, in octets, of the string (therefore max of 32,767
> octets per string, which shouldn't ordinarily be a problem).
That would be a problem. I definitely cannot rule out long strings,
where long is quite a bit larger than 32K.
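The TLV proposal quoted above, and the length limit it runs into, can be sketched in Python as follows. The 15-bit length field follows from the 32,767-octet maximum; everything else is an assumption made for illustration: a big-endian header, big-endian UTF-16, picking whichever encoding is shorter, and the hypothetical names `pack` and `unpack`.

```python
import struct

MAX_LEN = 0x7FFF     # 15 bits of length: at most 32,767 octets
UTF16_FLAG = 0x8000  # MSB of the header: 0 for UTF-8, 1 for UTF-16


def pack(text: str) -> bytes:
    """Prefix the shorter of the two encodings with a 16-bit header."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    use_utf16 = len(utf16) < len(utf8)
    payload = utf16 if use_utf16 else utf8
    if len(payload) > MAX_LEN:
        raise ValueError("string too long for a 15-bit length field")
    header = (UTF16_FLAG if use_utf16 else 0) | len(payload)
    return struct.pack(">H", header) + payload


def unpack(record: bytes) -> str:
    """Read the header, then decode the payload it describes."""
    (header,) = struct.unpack(">H", record[:2])
    length = header & MAX_LEN
    payload = record[2:2 + length]
    return payload.decode("utf-16-be" if header & UTF16_FLAG else "utf-8")
```

With this layout, `pack("a" * 40000)` raises `ValueError`, which is the long-string case that rules the scheme out here unless the length field is widened.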
--
Elliotte Rusty Harold
elharo@metalab.unc.edu
Effective XML (Addison-Wesley, 2003)
http://www.cafeconleche.org/books/effectivexml
http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA