From: Doug Ewell (dewell@adelphia.net)
Date: Sun Sep 17 2006 - 22:03:45 CDT
Mark Davis wrote:
> Frankly, I think the reason why SCSU and BOCU never got a lot of
> traction is related to #1 on my list. That is, in the vast majority of
> cases UTF-16 or UTF-8 have storage characteristics that are good
> enough -- it's just not really worth taking extra steps to squeeze out
> more.
UTF-8 is practically always good enough for me, but then I'm not the one
writing articles complaining about size "penalties" or ASCII
compatibility. Apparently at least some people either have different
storage needs, or haven't overcome the myths.
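
To put rough numbers on the size "penalties" in question, here is a minimal sketch that counts the bytes each encoding form needs for a few short strings. The sample words and the helper name `encoded_sizes` are illustrative assumptions, not anything from the thread.

```python
# Compare byte counts for the same text in UTF-8 and UTF-16 (no BOM).
# The sample strings are arbitrary illustrations, not benchmark data.
samples = {
    "English": "Unicode",
    "Russian": "Юникод",
    "Japanese": "ユニコード",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

ASCII text is half the size in UTF-8, Cyrillic comes out the same in both, and kana take three bytes per character in UTF-8 against two in UTF-16.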
> The only small-string compression scheme to gain fairly wide
> acceptance, for different reasons, is Punycode.
I'm actually quite impressed with how elegantly and efficiently Punycode
encodes IDNs under the numerous constraints that that implies. But if I
remember correctly, it's not suitable for arbitrary text, such as this
e-mail.
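
For a concrete look at what Punycode does to a single label, here is a small sketch using the `punycode` and `idna` codecs that ship with Python. The example label and the helper names `to_punycode` and `from_punycode` are assumptions chosen for illustration.

```python
# Encode one domain-name label with Python's built-in "punycode" codec.
# The label "bücher" is only an example.


def to_punycode(label):
    """Return the bare Punycode form of a label as an ASCII string."""
    return label.encode("punycode").decode("ascii")


def from_punycode(encoded):
    """Reverse to_punycode()."""
    return encoded.encode("ascii").decode("punycode")


label = "bücher"
encoded = to_punycode(label)
print(encoded)                 # bcher-kva
print(from_punycode(encoded))  # bücher

# The "idna" codec adds the "xn--" prefix used in actual domain names.
print(label.encode("idna"))    # b'xn--bcher-kva'
```

The ASCII letters are kept in place and the non-ASCII character is moved into a suffix after the hyphen, which is why the result stays legal in a host name.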
> Of course, ZIP and related compressions do a pretty good job on any of
> these languages encoded in Unicode, so they can be applied to reduce
> sizes for any and all of them, in appropriate circumstances.
The usual problem with general-purpose compression is that the output is
no longer "text," but some sort of compressed blob that must be
explicitly operated upon before it is usable as text. SCSU or BOCU-1
text can be interpreted directly, without passing it through a separate
decompressor, and I can even open and save SCSU-encoded text files
directly in SC UniPad (thanks to the encoder and decoder I gave them
years ago :).
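
The "compressed blob" point can be sketched with Python's `zlib` module, a deflate implementation in the same family as ZIP. The sample text and the helper name `restore` are assumptions for illustration. As far as I know the Python standard library has no SCSU or BOCU-1 codec, so only the general-purpose side is shown.

```python
import zlib

# Arbitrary, repetitive sample text; real text will compress differently.
text = "Unicode text stays readable until it is compressed. " * 20

raw = text.encode("utf-8")   # still text: any UTF-8 aware tool can read it
blob = zlib.compress(raw)    # smaller, but now opaque binary data


def restore(blob):
    """Decompress first, then decode; the blob is not text by itself."""
    return zlib.decompress(blob).decode("utf-8")


print(len(raw), "bytes as UTF-8")
print(len(blob), "bytes after zlib")
print(restore(blob) == text)
```

The saving is real, but every consumer of `blob` has to run the explicit decompression step before it can treat the data as text.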
-- Doug Ewell Fullerton, California, USA http://users.adelphia.net/~dewell/ RFC 4645 * UTN #14