From: David Starner (prosfilaes@gmail.com)
Date: Mon Jan 22 2007 - 05:06:20 CST
On 1/21/07, Ruszlan Gaszanov <ruszlan@ather.net> wrote:
> David Starner wrote:
>
> > Frankly, any long-term storage and interchange that doesn't use a
> > general purpose compression scheme is wasteful; bzip compression runs
> > about 3 bits per character for alphabetic text and less than 7 bits
> > per character for ideographic text. Bzip also includes some degree of
> > error detection in that, but there are many better tools for serious
> > error detection.
>
> Tell the designers of plain-text processing tools that they should internally support *all* compression algorithms ever designed. Or tell users that they should install every compression tool ever made in order to be able to read plain-text data. This kind of defeats the idea of plain text as such.
Where does "*all* compression algorithms" come from? Zip support is
ubiquitous on personal computers; gzip isn't far behind. (zlib
support is ubiquitous because it's part of the HTTP standard; given
zlib, supporting gzip-compressed files is trivial.)
On Unix, many plain text processing tools will call gzip to decompress
gzipped data. Long-term storage by definition isn't something you're
using every day; it's not too much to ask to run a decompression
utility over the data before using it.
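(To put numbers on that: a minimal Python sketch, assuming a Python
runtime is at hand; the filename is a placeholder for any large
plain-text file. Python's zlib module wraps the same zlib library, so
undoing gzip framing really is one extra parameter, and the bz2 module
makes the bits-per-character figure quoted above easy to check.)

    import bz2, gzip, zlib

    def gunzip(data: bytes) -> bytes:
        # wbits = 16 + MAX_WBITS tells zlib to expect gzip framing
        # (header and CRC trailer) around the raw DEFLATE stream.
        return zlib.decompress(data, wbits=16 + zlib.MAX_WBITS)

    # Round trip: output of the gzip module decodes with plain zlib.
    sample = b"Well, Prince, so Genoa and Lucca are now just family estates."
    assert gunzip(gzip.compress(sample)) == sample

    # Rough bits-per-character check for bzip2; this needs a real
    # book-sized text to be meaningful (the path is a placeholder).
    text = open("war_and_peace.txt", "rb").read()
    print(8 * len(bz2.compress(text)) / len(text), "bits per character")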
For interchange, people download compressed text all the time, for two
reasons. First, text rarely appears as one lone text file;
collections of text files, text files with associated illustrations,
HTML with external CSS files, etc. are more common. Second, when
people download uncompressed text, it's usually because the text is so
small that size doesn't matter. War and Peace is 3 MB of text, and
that's not much against DVD storage or a cable download. If size
matters, people use compression; if it doesn't, they aren't going to
switch from their current UTFs to UTF-24.
> > As for fixed length encodings,
> > again, the existing UTF-32 tends to play second fiddle to UTF-8 and
> > UTF-16.
>
> That's because we do not have a general-purpose fixed-length encoding scheme for Unicode. UTF-32 is only feasible for internal processing on 32/64-bit architectures, but way too wasteful to be of any practical use for data storage or interchange.
For internal processing on a 16-bit architecture, UTF-32 would be
easier than UTF-24, since each character fits in 2 words instead of
1-1/2. You're the only person on this thread who seems to care about
the distinction between 32 bits per character and 24 bits per
character. The vast majority of my data is written in the Latin
script--English, German, French, Esperanto, IPA, etc. Turning text
that averages 8-12 bits per character in UTF-8 into a format that uses
more than double that strikes me as wasteful; past that point, the
gap between UTF-24 and UTF-32 makes little difference to me.
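(The size comparison is easy to check; a quick Python sketch follows.
No UTF-24 codec exists, so its size is simulated as a flat 3 bytes per
code point, which is what the proposal amounts to.)

    # Encoded size, in bits per character, of typical Latin-script text.
    s = "Grüße! Ĉu vi parolas Esperanton? C'est déjà vu."
    n = len(s)  # code points
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, 8 * len(s.encode(enc)) / n, "bits/char")
    print("utf-24 (simulated)", 8 * 3.0, "bits/char")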
> Besides, as pointed out in another post, the proposed UTF-24 would perform much better than UTF-8/16 on texts making extensive use of characters outside the BMP, and would even be more compact than UTF-8 for East-Asian text.
Who cares about data outside the BMP? For UTF-24 to beat UTF-16, a
text needs at least as many characters outside the BMP as inside it: a
BMP character costs 2 bytes in UTF-16 against 3 in UTF-24, while a
supplementary character costs 4 against 3. Texts like that are
horribly rare.
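(The break-even arithmetic, as a small sketch: with a BMP characters
and b supplementary characters, UTF-16 takes 2a + 4b bytes against
UTF-24's 3(a + b), so UTF-24 only wins when b exceeds a.)

    # Bytes per 100 characters as the share of non-BMP characters grows.
    for b in (0, 25, 50, 75, 100):  # supplementary chars per 100
        a = 100 - b                 # BMP chars
        utf16 = 2 * a + 4 * b
        utf24 = 3 * (a + b)
        print(f"{b}% non-BMP: UTF-16 {utf16} bytes, UTF-24 {utf24} bytes")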