RE: UTF-c

From: Doug Ewell (doug@ewellic.org)
Date: Tue Feb 22 2011 - 09:31:20 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> Both the complexity of SCSU, and the importance of the complexity of
    >> SCSU, continue to be highly overrated.
    >
    > It is complex because of the codepage switching mechanism of SCSU and
    > its internal "magic" codepage tables.

    OK, at least that's a different "complexity" argument from the usual
    ones. But Cropley's "Alphabet" table is certainly no improvement over
    the SCSU tables in this regard.

    >> Part of the apparent simplicity of Cropley's algorithm, as viewed
    >> from his "Preliminary Proposal" HTML page, is that it omits a proper
    >> description of the code-page switching mechanism, as well as the
    >> "magic number" definitions of the code pages and the control bytes
    >> needed to introduce them. These are present in the sample code, but
    >> to see them, you have to paw through the UTF-8 conversion code and
    >> UI.
    >
    > Yes, but he did not implement any codepage switching mechanism at all;
    > the only thing is that he in fact created a single program that
    > produces dozens of distinct encodings, each one requiring its own
    > distinctive "BOM-like" prefix (ill-designed, in my opinion).

    So a given document in this encoding can encode only one additional
    64-block with one byte per character? Then it's not a replacement for
    anything. Even ISO 2022 lets you switch blocks.
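
    To illustrate the limitation in code: here is a minimal sketch, in
    Python, of what decoding such a fixed-window scheme amounts to. The
    byte layout and the function name are my own reconstruction from the
    description above, not Cropley's actual format.

        # Hypothetical one-window scheme: a "BOM-like" prefix selects a single
        # 64-character block up front, and bytes 0x80..0xBF then map into it.
        # The byte layout is an assumption for illustration, not Cropley's spec.

        def decode_fixed_window(data: bytes, block_base: int) -> str:
            """Decode bytes: 0x00..0x7F is ASCII, 0x80..0xBF is the one
            selected 64-block; everything else would need a longer form."""
            out = []
            for b in data:
                if b < 0x80:                    # plain ASCII passes through
                    out.append(chr(b))
                elif b <= 0xBF:                 # the single selected block
                    out.append(chr(block_base + (b - 0x80)))
                else:
                    raise ValueError("no single-byte form outside the block")
            return "".join(out)

        # With the block fixed at U+0400, Cyrillic gets single bytes, but any
        # second script in the same document is out of luck:
        print(decode_fixed_window(b"\x90\x91", 0x0400))  # -> 'АБ'

    Once the prefix has picked the block, every other script in the
    document is back to longer forms, which is exactly the problem
    code-page switching exists to solve.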

    >>> and it is still very highly compressible with standard compression
    >>> algorithms while still allowing very fast processing in memory in
    >>> its decompressed encoded form:
    >>
    >> I see no metrics or sample data to back this up.
    >
    > I've played with the code myself. It's easy to read, but lots of
    > improvements could be made to it (including securing it, because it's
    > not safe the way it is implemented).

    I'm focusing on Cropley's algorithm, partially defined as it is by his
    sample code (which is always a red flag for a specification), not his
    coding skills. What sort of numbers do your tests show for compression
    speed and size?

    > Forget the FS/GS/RS/US hack for his "BOM"; it's not even needed in his
    > proposal (and it would also make his encoding incompatible with MIME
    > restrictions and with lots of transport protocols). The same goes for
    > the magic table, which looks more like a sample and is not extensible
    > enough to be truly self-adaptive to various texts and to newer versions
    > of the Unicode standard and new scripts. (SCSU shares that last caveat,
    > which has also long been known in ISO 2022 with its similar magic
    > values for code page switching, one good reason why it became
    > unmaintainable.)

    "Unmaintainable," at least in the case of SCSU, is not the same as "we
    choose not to maintain it." And indeed, for all its problems, ISO 2022
    was maintained continuously through 2004 via its International Register.

    > A good candidate for replacement of UTF-8 should not need any magic
    > table, and should be self-adaptive.

    Which is one of many reasons why Cropley's algorithm cannot be
    considered as a replacement for UTF-8, if such a thing is even possible.
    It can be considered as a new compression scheme, but then it has to
    measure up favorably to the existing ones, and I don't see any real
    improvements there.
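
    For comparison, the baseline any such candidate has to beat is trivial
    to measure with nothing but the Python standard library: UTF-8 plus
    deflate. (This is just a measurement harness, not a claim about
    Cropley's numbers, which we still haven't seen.)

        import time
        import zlib

        def baseline(text: str) -> dict:
            """Size and speed of the obvious competitor: UTF-8 + deflate."""
            raw = text.encode("utf-8")
            t0 = time.perf_counter()
            packed = zlib.compress(raw, 9)   # maximum compression level
            elapsed = time.perf_counter() - t0
            return {"utf8_bytes": len(raw),
                    "deflated_bytes": len(packed),
                    "compress_seconds": elapsed}

    A new byte-based scheme that cannot undercut those numbers on real
    multilingual text offers no measurable improvement.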

    > It is possible to do that, but code switching also has its own caveats,
    > notably in fast search/compare algorithms such as those used in
    > "diff" programs and in versioning systems for code repositories,
    > because of the multiple and unpredictable binary representations of
    > the same characters; that alone immediately disqualifies SCSU if
    > code switching can occur at any place in encoded texts.

    Section 1 of the SCSU spec says, "It is not intended as processing
    format or as general purpose interchange format." There's little value
    in beating up SCSU for something it is not meant to do.
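
    Still, to make the nondeterminism concrete: in single-byte mode,
    SCSU's default dynamic window 0 sits at offset U+0080, so the byte
    0xE9 decodes to U+00E9, while the SQU tag (0x0E) quotes the same
    character as a UTF-16BE pair. A toy Python decoder, handling only
    those two forms (nothing close to a full UTS #6 implementation),
    shows the effect:

        # Two valid SCSU encodings of "café". A conformant decoder must accept
        # both, so byte-level comparison (diff, dedup) cannot assume a unique
        # encoded form. Toy decoder: ASCII, default window 0, and SQU only.

        SQU = 0x0E  # tag: quote one Unicode character as two big-endian bytes

        def toy_scsu_decode(data: bytes) -> str:
            out, i = [], 0
            while i < len(data):
                b = data[i]
                if b == SQU:               # quoted UTF-16BE code unit
                    out.append(chr((data[i + 1] << 8) | data[i + 2]))
                    i += 3
                elif b < 0x80:             # ASCII range, literal
                    out.append(chr(b))
                    i += 1
                else:                      # default window 0 at offset U+0080
                    out.append(chr(0x0080 + (b - 0x80)))
                    i += 1
            return "".join(out)

        compact = b"caf\xE9"                         # é via default window 0
        quoted = b"caf" + bytes([SQU, 0x00, 0xE9])   # same é, quoted with SQU

        assert toy_scsu_decode(compact) == toy_scsu_decode(quoted) == "café"
        assert compact != quoted          # same text, different bytes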

    (That said, I've been playing with an implementation I call "line-safe
    SCSU" that is fully conformant to UTS #6, but adds the constraint that
    the state of the SCSU machine (modes and windows) must be reset at the
    end of each line. That removes at least some of the nondeterminism.)
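
    A sketch of what that constraint buys, assuming a hypothetical
    scsu_encode() that starts from the default SCSU state and, per the
    line-safe rule, ends each line back in it (the function is a
    stand-in, not a real library API):

        # Under the line-safe constraint, every line is encoded from the
        # default state, so identical lines yield identical bytes no matter
        # what precedes them: exactly what diff tools and version control need.

        def line_safe_scsu_encode(text: str, scsu_encode) -> bytes:
            # Encode each line independently; because each piece begins and
            # ends in the default single-byte state, the pieces concatenate
            # into one conformant SCSU stream with literal 0x0A separators.
            return b"\n".join(scsu_encode(line) for line in text.split("\n"))

    The extra tag bytes needed to restore the default state at each line
    end are the price; independently comparable lines are the payoff.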

    Now if Cropley's algorithm is being presented as a replacement or
    alternative to UTF-8, then it does need to be evaluated on criteria like
    these, and Suzuki-san's observations become very relevant.

    Some readers know that I created lots of encodings like this, about 10
    years ago. Since that time, UTF-8 has extended its lead as the dominant
    Unicode interchange format, and the usage profile for compressing
    Unicode text has continued to move toward general-purpose compression,
    with SCSU still available as the only unencumbered, byte-based Unicode
    compression format. An initiative like Cropley's needs to be "better
    enough" than either UTF-8 or SCSU to displace either one, and this one
    very simply is not.

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

