RE: Proposing UTF-21/24

From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Sun Jan 21 2007 - 09:37:56 CST

    David Starner wrote:

    > Frankly, any long-term storage and interchange that doesn't use a
    > general purpose compression scheme is wasteful; bzip compression runs
    > about 3 bits per character for alphabetic text and less than 7 bits
    > per character for ideographic text. Bzip also includes some degree of
    > error detection in that, but there are many better tools for serious
    > error detection.

    Tell the designers of plain-text processing tools that they should internally support *all* compression algorithms ever designed. Or tell users that they should install every compression tool ever made in order to be able to read plain-text data. That rather defeats the idea of plain text as such.

    > I think it notable that
    > UTF-7, which was designed to avoid undesired sequences for email tends
    > to be poorly supported; for example, Google mail seems to have mangled
    > the UTF-7 in your post. Instead, a general purpose encoding, usually
    > Base64, is used to encode both the text and the attachments without
    > concern for the details of the contents.

    Well, either *all* mail applications should support UTF-7, or *all* mail applications should encode non-ASCII message bodies as base64/quoted-printable, or SMTP gateways that are not 8-bit clean should be outlawed. Until that happens, US-ASCII will remain the only 100% compatible format for e-mail message headers and body text, and users will be stuck with a dilemma: send as UTF-8 (which some mail clients refuse to encode), send as UTF-7 (which other clients may not support), or send as an attachment in some widely supported format (like RTF or PDF).
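
    To make the three 7-bit-safe options concrete, here is a minimal sketch using only Python's standard library. The helper name `seven_bit_forms` is made up for illustration, and the sketch only encodes a body string; it does not build MIME headers or model what any particular mail client does.

```python
import base64
import quopri


def seven_bit_forms(text):
    """Return three ASCII-only byte representations of the same text.

    Hypothetical helper for illustration: it covers the body encodings
    discussed above, not a complete MIME message.
    """
    utf8 = text.encode("utf-8")
    return {
        # UTF-7: the character encoding itself stays within 7-bit ASCII.
        "utf-7": text.encode("utf-7"),
        # UTF-8 wrapped in a base64 transfer encoding.
        "base64": base64.b64encode(utf8),
        # UTF-8 wrapped in a quoted-printable transfer encoding.
        "quoted-printable": quopri.encodestring(utf8),
    }


if __name__ == "__main__":
    for name, data in seven_bit_forms("Ruszl\u00e1n").items():
        print(name, data)
```

    All three results survive a gateway that is not 8-bit clean, but the receiving client has to know which one was used in order to decode it.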

    > Likewise, current encodings designed with an extreme
    > concern for size, like SCSU and BOCU, frequently aren't used, because
    > UTF-8 or UTF-16 combined with a general purpose compression scheme
    > works much better for any long text.

    Well, SCSU and BOCU are too complex to be considered plain-text encodings, and they offer no significant advantage over general-purpose compression formats while being much more specialized. Therefore, their usefulness is questionable.

    > As for fixed length encodings,
    > again, the existing UTF-32 tends to play second fiddle to UTF-8 and
    > UTF-16.

    That's because we do not have a general-purpose fixed-length encoding scheme for Unicode. UTF-32 is only feasible for internal processing on 32/64-bit architectures, but is way too wasteful to be of any practical use for data storage or interchange. Besides, as pointed out in another post, the proposed UTF-24 would perform much better than UTF-8/16 on texts making extensive use of characters outside the BMP, and would even be more compact than UTF-8 for East-Asian text.
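
    A rough way to compare sizes is sketched below in Python. It assumes, as a simplification, that UTF-24 stores every code point in exactly three bytes; the function name `encoded_sizes` and the sample strings are illustrative only, and byte order marks are left out by using the explicit little-endian codecs.

```python
def encoded_sizes(text):
    """Return the encoded size in bytes of `text` under several schemes.

    The "UTF-24" entry is an assumption: a fixed three bytes per code
    point, with no BOM. The other sizes come from Python's own codecs.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        # len() of a Python 3 str counts code points, not UTF-16 units.
        "UTF-24 (assumed)": 3 * len(text),
        "UTF-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    samples = {
        "ASCII": "hello",
        "BMP ideographs": "\u65e5\u672c\u8a9e",
        "Gothic (outside BMP)": "\U00010330\U00010331\U00010332",
    }
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

    With these samples, the three Gothic letters come to 9 bytes in the assumed UTF-24 against 12 in each of UTF-8, UTF-16 and UTF-32.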

    Ruszlán


