From: Mike Ayers (mike.ayers@tumbleweed.com)
Date: Tue Jan 20 2004 - 13:26:30 EST
> Last night it occurred to me it might be possible to design an
> internal storage format for this class which had better memory usage
> characteristics. In particular I'd like ASCII data to occupy only a
> single byte, and all other BMP characters from 128 to 65535 to occupy
> only two bytes. Non-BMP characters could be stored in surrogate pairs.
BZZZT! Sorry, thanks for playing. You can't get the advantages of
both with no drawbacks. Given the octets 0x5B5B, how would you know if you
had "[[" or a Chinese character?
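
To make the ambiguity concrete, here is a minimal Java sketch that decodes the same two octets under both assumptions. The class and method names are made up for illustration, and big-endian byte order is assumed for the UTF-16 reading.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same two octets read two different ways.
class AmbiguityDemo {
    static final byte[] OCTETS = {0x5B, 0x5B};

    // Read as one-byte-per-character ASCII/UTF-8: two '[' characters.
    static String asUtf8() {
        return new String(OCTETS, StandardCharsets.UTF_8);
    }

    // Read as a single big-endian 16-bit code unit: the CJK ideograph U+5B5B.
    static String asUtf16() {
        return new String(OCTETS, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8().length());   // number of chars in the UTF-8 reading
        System.out.println(asUtf16().length());  // number of chars in the UTF-16 reading
    }
}
```

Without some out-of-band marker, nothing in the octets themselves says which reading was intended.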
> 3. This is all completely private to one class. No data in this form
> will be passed on the wire. None will be exposed via the public API
> which is completely based on Java strings (that is, UTF-16).
Good idea. We have too many external encodings anyway.
> However, I would like the translation into and out of this format to
> be at least as fast as the translation between UTF-8 and UTF-16 the
> class is currently performing on every call to setValue and getValue,
> ideally faster.
Hmmm - again, this may be asking for too much. The UTF-8/UTF-16
transform is pretty simple. Is it bogging you down?
> Has anyone done any work on Unicode formats for this use-case? Does
> anyone have any references or ideas to share?
If your application will mostly handle European languages, just use UTF-8;
if mostly non-European languages, just use UTF-16. Either way you won't
really lose much space. If language usage is random/indeterminate/evenly
distributed, then, assuming that any given string is primarily in a single
language, a TLV (type-length-value) structure discriminating between UTF-8
and UTF-16 should do nicely. Precede each string with a 16-bit header: the
MSB is the type flag (0 for UTF-8, 1 for UTF-16), ORed with the length of
the string, in octets, in the remaining 15 bits (therefore a max of 32,767
octets per string, which shouldn't ordinarily be a problem). Then encode
the string in whichever of the two formats is shorter. Since you have a length, you
can skip the terminator. The resulting structure is at most one byte longer
than the string would have been had it been encoded as straight UTF-8 or
UTF-16, and is double octet aligned, so native UTF-16 functions can be used
if they exist.
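
The header scheme above could be sketched in Java roughly as follows. This is only an illustration under a few assumptions: the class name `TaggedString` and its methods are hypothetical, the header is stored big-endian, UTF-16 means UTF-16BE without a BOM, and ties go to UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: 16-bit header (type flag in the MSB, length in the
// low 15 bits) followed by the payload in the shorter of UTF-8 or UTF-16BE.
class TaggedString {
    private static final int UTF16_FLAG = 0x8000; // MSB set means UTF-16
    private static final int MAX_LENGTH = 0x7FFF; // 32,767 octets

    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        // Pick whichever encoding is shorter; ties go to UTF-8.
        boolean useUtf16 = utf16.length < utf8.length;
        byte[] payload = useUtf16 ? utf16 : utf8;
        if (payload.length > MAX_LENGTH) {
            throw new IllegalArgumentException("string too long for a 15-bit length");
        }
        int header = (useUtf16 ? UTF16_FLAG : 0) | payload.length;
        // ByteBuffer is big-endian by default, so the flag lands in the first octet.
        return ByteBuffer.allocate(2 + payload.length)
                .putShort((short) header)
                .put(payload)
                .array();
    }

    static String decode(byte[] stored) {
        int header = ByteBuffer.wrap(stored).getShort() & 0xFFFF;
        Charset charset = (header & UTF16_FLAG) != 0
                ? StandardCharsets.UTF_16BE
                : StandardCharsets.UTF_8;
        int length = header & MAX_LENGTH;
        // The payload starts right after the two header octets; no terminator needed.
        return new String(stored, 2, length, charset);
    }
}
```

Calling `TaggedString.decode(TaggedString.encode(s))` should give back `s` for any well-formed string that fits in the 15-bit length. Encoding both ways just to compare lengths is wasteful; a real implementation would probably scan the characters once to decide.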
HTH,
/|/|ike