Re: Unicode & space in programming & l10n

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Sep 17 2006 - 22:03:45 CDT

Next message: Don Osborn: "RE: Unicode & space in programming & l10n"

Previous message: Steve Summit: "Re: Unicode & space in programming & l10n"
In reply to: Mark Davis: "Re: Unicode & space in programming & l10n"
Next in thread: Asmus Freytag: "Re: Unicode & space in programming & l10n"
Reply: Asmus Freytag: "Re: Unicode & space in programming & l10n"

Mark Davis wrote:

> Frankly, I think the reason why SCSU and BOCU never got a lot of
> traction is related to #1 on my list. That is, in the vast majority of
> cases UTF-16 or UTF-8 have storage characteristics that are good
> enough -- it's just not really worth taking extra steps to squeeze out
> more.

UTF-8 is practically always good enough for me, but then I'm not the one
writing articles complaining about size "penalties" or ASCII
compatibility. Apparently at least some people either have different
storage needs, or haven't overcome the myths.
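For anyone who wants to see what the size "penalty" actually amounts to, here is a minimal sketch in Python. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not taken from any of the articles in question; the UTF-16 figure excludes the byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }

# Hypothetical sample strings chosen only for illustration.
samples = {
    "English": "hello",
    "Russian": "Привет",
    "Hindi": "नमस्ते",
}

for language, text in samples.items():
    print(language, encoded_sizes(text))
```

ASCII text is half the size in UTF-8, Cyrillic comes out even, and Devanagari takes three bytes per character in UTF-8 against two in UTF-16.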

> The only small-string compression scheme to gain fairly wide
> acceptance, for different reasons, is PunyCode.

I'm actually quite impressed with how elegantly and efficiently Punycode
encodes internationalized domain names under the numerous constraints
that implies. But if I remember correctly, it's not suitable for
arbitrary text, such as this e-mail.
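As a rough illustration of what Punycode does to a single label, here is a sketch using the `punycode` codec that ships with Python. The helper names `to_punycode` and `from_punycode` are my own, and this covers only the raw encoding step, not the full IDNA processing (no "xn--" prefix, no normalization).

```python
def to_punycode(label: str) -> str:
    """Encode one label with Python's built-in punycode codec."""
    return label.encode("punycode").decode("ascii")

def from_punycode(encoded: str) -> str:
    """Decode a Punycode string back to the original label."""
    return encoded.encode("ascii").decode("punycode")

label = "münchen"
encoded = to_punycode(label)
print(encoded)                 # mnchen-3ya
print(from_punycode(encoded))  # münchen
```

The ASCII letters are copied to the front and the non-ASCII characters are encoded after the hyphen as insertion positions, which works well for a short label but is not something one would want to read or edit in running text.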

> Of course, ZIP and related compressions do a pretty good job on any of
> these languages encoding in Unicode, so they can be applied to reduce
> sizes for any and all of them, in appropriate circumstances.

The usual problem with general-purpose compression is that the output is
no longer "text," but some sort of compressed blob that must be
explicitly operated upon before it is usable as text. SCSU or BOCU-1
text can be interpreted directly, without passing it through a separate
decompressor, and I can even open and save SCSU-encoded text files
directly in SC UniPad (thanks to the encoder and decoder I gave them
years ago :).
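To make the "compressed blob" point concrete, here is a minimal sketch using Python's standard `zlib` module as a stand-in for ZIP-style compression. The sample string and the helper names are assumptions for illustration only, and nothing here implements SCSU or BOCU-1.

```python
import zlib

def compress_text(text: str) -> bytes:
    """Encode text as UTF-8, then compress it with zlib (deflate)."""
    return zlib.compress(text.encode("utf-8"))

def decompress_text(blob: bytes) -> str:
    """Explicitly decompress the blob and decode it back to text."""
    return zlib.decompress(blob).decode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes can be read directly as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = "Привет, мир"  # hypothetical short string
raw = sample.encode("utf-8")
blob = compress_text(sample)

print(len(raw), len(blob))       # sizes before and after compression
print(is_valid_utf8(blob))       # the blob is not usable as text as-is
print(decompress_text(blob) == sample)
```

The blob has to go through `decompress_text` before anything can treat it as text, and for a string this short the zlib framing makes the output larger than the input.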

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14

This archive was generated by hypermail 2.1.5 : Sun Sep 17 2006 - 22:06:55 CDT