From: David Starner (prosfilaes@gmail.com)
Date: Sun Jan 21 2007 - 07:26:19 CST
On 1/20/07, Ruszlan Gaszanov <ruszlan@ather.net> wrote:
> Why would we need a new UTF? Well, of all currently available encoding schemes for Unicode, only UTF-32 is fixed-length. However, while it might be convenient for internal processing on 32/64-bit platforms, 11 spare bits per code unit is much too wasteful for long-term storage and interchange. Again if we have spare bits, why not just as well make them useful for, let+IBk-s say, error detection or avoiding undesired sequences (like NUL).
You don't need fixed-length for long-term storage and interchange.
Frankly, any long-term storage and interchange that doesn't use a
general purpose compression scheme is wasteful; bzip compression runs
about 3 bits per character for alphabetic text and less than 7 bits
per character for ideographic text. Bzip also includes some degree of
error detection, but there are many better tools for serious error
detection.
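A rough way to check the bits-per-character figures on your own text is sketched below in Python, using the standard bz2 module. The function name bits_per_char and the sample string are made up for illustration, and characters are counted as code points. A short or repetitive sample compresses far better than real prose, so it says nothing about the figures above; use a long, real document for a meaningful number.

```python
import bz2


def bits_per_char(text: str, encoding: str = "utf-8") -> float:
    """Compressed size in bits divided by the number of code points.

    Hypothetical helper for illustration only; `encoding` is the
    encoding form the text is serialized in before compression.
    """
    raw = text.encode(encoding)
    compressed = bz2.compress(raw)
    return 8 * len(compressed) / len(text)


if __name__ == "__main__":
    # Assumed sample: highly repetitive, so the result is unrealistically
    # low. Replace it with the contents of a long real document.
    sample = "the quick brown fox jumps over the lazy dog " * 1000
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, round(bits_per_char(sample, enc), 3))
```

Running it with several encodings on the same text shows how much of the size difference between UTF-8, UTF-16 and UTF-32 survives once a general purpose compressor has been applied.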
For avoiding undesired sequences, UTF-8 does that quite well. Many
tools that need undesired sequences avoided also tend to assume that
0x00-0x7f is ASCII, which UTF-8 supports. I think it notable that
UTF-7, which was designed to avoid undesired sequences for email, tends
to be poorly supported; for example, Google mail seems to have mangled
the UTF-7 in your post. Instead, a general purpose encoding, usually
Base64, is used to encode both the text and the attachments without
concern for the details of the contents.
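These points can be checked with a small Python sketch using only the standard library. The sample string and the helper names contains_nul and ascii_view are assumptions for illustration. The "+IBk-" that shows up in the quoted post is the UTF-7 form of U+2019, the curly apostrophe.

```python
import base64

# Assumed sample text; U+2019 is the curly apostrophe.
sample = "let\u2019s say"

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")
utf32 = sample.encode("utf-32-le")


def contains_nul(data: bytes) -> bool:
    """True if the byte string contains a 0x00 byte."""
    return b"\x00" in data


def ascii_view(data: bytes) -> str:
    """Keep only bytes in 0x00-0x7f, as an ASCII-only tool would read them."""
    return bytes(b for b in data if b < 0x80).decode("ascii")


# UTF-7 keeps everything in mail-safe ASCII, but only helps if the
# receiving software actually decodes it.
utf7 = sample.encode("utf-7")

# Base64 wraps arbitrary bytes (here the UTF-8 form) without caring
# what they contain.
wrapped = base64.b64encode(utf8)

if __name__ == "__main__":
    print("NUL in UTF-8: ", contains_nul(utf8))
    print("NUL in UTF-16:", contains_nul(utf16))
    print("NUL in UTF-32:", contains_nul(utf32))
    print("ASCII bytes in the UTF-8 form:", ascii_view(utf8))
    print("UTF-7 form:", utf7)
    print("Base64 of UTF-8:", wrapped)
```

In the UTF-8 form every byte below 0x80 stands for the ASCII character it looks like, and no 0x00 byte appears unless the text itself contains U+0000. The UTF-16 and UTF-32 forms of the same text are full of 0x00 bytes.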
To call for a new UTF requires evidence that someone will actually use
it. As pointed out above, UTF-7, which avoids non-mail-safe characters,
is rarely used. Likewise, current encodings designed with an extreme
concern for size, like SCSU and BOCU, frequently aren't used, because
UTF-8 or UTF-16 combined with a general purpose compression scheme
works much better for any long text. As for fixed length encodings,
again, the existing UTF-32 tends to play second fiddle to UTF-8 and
UTF-16. I don't see enough demand for the existing fixed length
encoding to justify introducing a second one.