Re: Proposing UTF-21/24

From: David Starner (prosfilaes@gmail.com)
Date: Mon Jan 22 2007 - 05:06:20 CST

  • Next message: Karl Pentzlin: "Proposing a DOUBLE HYPHEN punctuation mark"

    On 1/21/07, Ruszlan Gaszanov <ruszlan@ather.net> wrote:
    > David Starner wrote:
    >
    > > Frankly, any long-term storage and interchange that doesn't use a
    > > general purpose compression scheme is wasteful; bzip compression runs
    > > about 3 bits per character for alphabetic text and less than 7 bits
    > > per character for ideographic text. Bzip also includes some degree of
    > > error detection in that, but there are many better tools for serious
    > > error detection.
    >
    > Tell plain-text processing tool designers that they should support *all* compression algorithms ever designed internally. Or tell the users they should install all compression tools ever made on their system in order to be able to read plain text data. This rather defeats the idea of plain text as such.

    Where does "*all* compression algorithms" come from? Zip support is
    ubiquitous on personal computers; gzip isn't far behind. (zlib
    support is ubiquitous because it's part of the HTTP standard, and
    given zlib support, reading gzip-compressed files is trivial.)
    On Unix, many plain text processing tools will call gzip to decompress
    gzipped data. Long-term storage by definition isn't something you're
    using every day; it's not too much to ask to run a decompression
    utility over the data before using it.
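    The bits-per-character figure is easy to check with Python's standard
    bz2 module. A minimal sketch (the sample text and repetition count
    are illustrative, not a rigorous benchmark):

    ```python
    import bz2

    # Rough check of the bits-per-character claim for alphabetic text.
    # Repeating a short sample overstates compressibility, so treat the
    # result as a lower bound rather than a benchmark.
    sample = ("It is a truth universally acknowledged, that a single man in "
              "possession of a good fortune, must be in want of a wife. ") * 200

    raw = sample.encode("utf-8")
    packed = bz2.compress(raw)
    print(f"{8 * len(packed) / len(sample):.2f} bits per character after bzip2")
    ```

    Running a whole book through the same measurement gives a more honest
    number, since bzip2's block sorting works best on long, varied input.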

    For interchange, people download compressed text all the time, for two
    reasons. First, text rarely appears as a lone text file; collections
    of text files, text files with associated illustrations, HTML with
    external CSS files, and so on are more common. Secondly, when people
    download uncompressed text, it's frequently so small that size
    doesn't matter. War and Peace is 3 MB of text, and that's not much
    for DVD storage or cable download. If size matters, people use
    compression; if size doesn't matter, they aren't going to switch
    from their current UTFs to UTF-24.
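    The arithmetic behind that is simple; a back-of-the-envelope sketch
    (the character count is an estimate for an English-language book of
    that size, not an exact figure):

    ```python
    # Back-of-the-envelope sizes for a mostly-ASCII book of roughly
    # 3,000,000 characters.
    chars = 3_000_000
    utf8_mb = chars * 1 / 1_000_000   # ASCII letters take 1 byte in UTF-8
    utf24_mb = chars * 3 / 1_000_000  # a fixed 3-byte code triples that
    print(utf8_mb, "MB in UTF-8 vs", utf24_mb, "MB in UTF-24")
    ```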

    > > As for fixed length encodings,
    > > again, the existing UTF-32 tends to play second fiddle to UTF-8 and
    > > UTF-16.
    >
    > That's because we do not have a general-purpose fixed-length encoding scheme for Unicode. UTF-32 is only feasible for internal processing on 32/64-bit architectures, but way too wasteful to be of any practical use for data storage or interchange.

    For internal processing on a 16-bit architecture, UTF-32 would be
    easier than UTF-24, since each character fits in 2 words instead of
    1-1/2. You're the only person on this thread who seems to care about
    the distinction between 32 bits per character and 24 bits per
    character. The vast majority of my data is written in the Latin
    script--English, German, French, Esperanto, IPA, etc. Turning text
    that averages 8-12 bits per character in UTF-8 into a format that uses
    more than double the bits strikes me as wasteful; the distinction
    between UTF-24 and UTF-32 hardly matters to me at that point.
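    That average is easy to measure directly; a minimal sketch, with a
    made-up sample sentence mixing ASCII, diacritics, and IPA:

    ```python
    # Average bits per character in UTF-8 for Latin-script text with some
    # non-ASCII letters (the sample string is purely illustrative).
    sample = "Straße, café, ĉiuĵaŭde, and IPA like ˈfoʊnɛtɪks: mostly ASCII"
    bits = 8 * len(sample.encode("utf-8")) / len(sample)
    print(f"{bits:.1f} bits per character in UTF-8, vs a flat 24 in UTF-24")
    ```

    ASCII letters cost 8 bits each and the accented and IPA letters here
    cost 16, so the average stays well under 24 for Latin-script text.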

    > Besides, as pointed out in another post, the proposed UTF-24 would perform much better than UTF-8/16 on texts making extensive use of characters outside the BMP, and would even be more compact than UTF-8 for East-Asian text.

    Who cares about data outside the BMP? For UTF-24 to beat UTF-16 on
    texts with characters outside the BMP, it needs to have as much text
    outside the BMP as inside. That's horribly rare.
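    The break-even point falls out of the per-character costs; a small
    sketch (function names are mine, for illustration):

    ```python
    # UTF-16 spends 2 bytes on a BMP character and 4 on a supplementary
    # one; a fixed 3-byte code spends 3 either way. Setting totals equal:
    #   2*bmp + 4*supp = 3*(bmp + supp)  =>  supp = bmp
    def utf16_bytes(bmp, supp):
        return 2 * bmp + 4 * supp

    def utf24_bytes(bmp, supp):
        return 3 * (bmp + supp)

    print(utf16_bytes(100, 100), utf24_bytes(100, 100))  # 600 600: break-even
    print(utf16_bytes(190, 10), utf24_bytes(190, 10))    # 420 600: UTF-16 wins
    ```

    So UTF-24 only pulls ahead when more than half the characters are
    supplementary, which essentially never happens in running text.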



    This archive was generated by hypermail 2.1.5 : Mon Jan 22 2007 - 05:09:14 CST