Re: [unicode] UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Feb 22 2011 - 08:32:07 CST


    2011/2/21 Doug Ewell <doug@ewellic.org>:
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >
    >> And anyway it is also much simpler to understand and easier to
    >> implement correctly (not like the sample code given here) than SCSU,
    >
    > I don't buy this.  A simple SCSU encoder, which achieves most of the
    > benefits of a complex one, is nearly as simple as Cropley's algorithm.
    > Both the complexity of SCSU, and the importance of the complexity of
    > SCSU, continue to be highly overrated.

    It is complex because of the codepage switching mechanism of SCSU and
    its internal "magic" codepage tables.

    > Part of the apparent simplicity of Cropley's algorithm, as viewed from
    > his "Preliminary Proposal" HTML page, is that it omits a proper
    > description of the code-page switching mechanism, as well as the "magic
    > number" definitions of the code pages and the control bytes needed to
    > introduce them.  These are present in the sample code, but to see them,
    > you have to paw through the UTF-8 conversion code and UI.

    Yes, but he did not implement any codepage switching mechanism at
    all; the only thing he did was, in effect, write a single piece of
    code that produces dozens of distinct encodings, each one requiring
    its own distinctive "BOM-like" prefix (which is ill-designed in my
    opinion).

    >> and it is still very highly compressible with standard compression
    >> algorithms while still allowing very fast processing in memory in its
    > decompressed encoded form:
    >
    > I see no metrics or sample data to back this up.

    I've experimented with the code myself on this. It's easy to read,
    but many improvements could be made to it (including securing it,
    because it is not safe as currently implemented).

    > How does Cropley's
    > algorithm perform with mixed scripts (say Greek and Cyrillic), with
    > embedded punctuation in the U+2000 block, with Deseret and other
    > alphabets omitted from the Alphabet table, with larger alphabets where
    > multiple 64-blocks are needed, with Han and Hangul?

    Forget the FS/GS/RS/US hack for his "BOM"; it's not even needed in
    his proposal (and it would also make his encoding incompatible with
    MIME restrictions and with lots of transport protocols), just like
    the magic table, which looks more like a sample and is not
    extensible enough to be really self-adaptive to various texts and
    to newer versions of the Unicode standard with new scripts (SCSU
    has the same latter caveat, which has long been known in ISO 2022
    with its similar magic values for code page switching, one good
    reason why it became unmaintainable).

    You can do exactly the same thing without this hack, because there
    are more than enough unused scalar values in his representation to
    support the selection of code pages, and those values could even be
    used to implement dynamic switching. (In fact you could do this on
    top of UTF-8 as well, using its existing unused bit patterns.)
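
    As a rough illustration (not part of Cropley's proposal, and the
    names below are mine), the lead bytes 0xC0, 0xC1 and 0xF5..0xFF can
    never occur in well-formed UTF-8, so a hypothetical extension could
    reserve them as code-page switch introducers. A minimal sketch:

        # Sketch only: bytes that can never start a well-formed UTF-8
        # sequence (0xC0/0xC1 would be overlong encodings, 0xF5..0xFF
        # would encode values above U+10FFFF), and which a hypothetical
        # UTF-8 extension could reuse for out-of-band signalling.
        UNUSED_UTF8_LEAD_BYTES = {0xC0, 0xC1} | set(range(0xF5, 0x100))

        def could_signal_codepage_switch(byte: int) -> bool:
            """True if this byte value is free for such signalling."""
            return byte in UNUSED_UTF8_LEAD_BYTES

        assert could_signal_codepage_switch(0xC0)      # overlong lead byte
        assert not could_signal_codepage_switch(0xE2)  # valid 3-byte lead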

    (These are the codes for which his decoder returns a "NULL" value,
    but his code is not complete enough: it also forgets to check for
    scalar values reserved for UTF-16 surrogate code points, which no
    UTF should *ever* allow to be stored, as Cropley's permits here.)
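
    For reference, the missing check is a one-liner on the decoded
    value; a minimal sketch (the function name is mine, not from his
    sample code):

        def is_unicode_scalar_value(cp: int) -> bool:
            """A UTF may only encode Unicode scalar values:
            U+0000..U+10FFFF minus the surrogate range U+D800..U+DFFF."""
            return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

        assert not is_unicode_scalar_value(0xD800)   # lone surrogate: reject
        assert is_unicode_scalar_value(0x1F600)      # supplementary plane: OK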

    A UTF should not have to make assumptions about other code points
    (not even about their normalization or the fact that they may be
    assigned to non-characters), nor about the specific reassignment of
    FS/GS/RS/US (which is needed here to "compress" non-Latin scripts
    correctly and offer a significant improvement over UTF-8).

    A good candidate for replacing UTF-8 should not need any magic
    table, and should be self-adaptive. It is possible to do that, but
    code switching also has its own caveats (notably in fast
    search/compare algorithms, such as those used in "diff" programs
    and in versioning systems for code repositories, because of the
    multiple and unpredictable binary representations of the same
    characters: that alone disqualifies SCSU if code switching can
    occur at any place in encoded texts).
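
    To make the pitfall concrete: in SCSU the letter "A" can be encoded
    either as the literal byte 0x41 or quoted with the SQU tag (0x0E) as
    0x0E 0x00 0x41, so a byte-wise comparison reports a difference even
    though the decoded text is identical. A sketch:

        # Two valid SCSU encodings of the single character "A" (U+0041):
        # - the literal single-byte form in the default window
        # - the form quoted via the SQU tag (0x0E), which takes the next
        #   two bytes as one UTF-16BE code unit
        scsu_literal = bytes([0x41])
        scsu_quoted  = bytes([0x0E, 0x00, 0x41])

        # A byte-wise diff/compare sees two different strings even though
        # the decoded text is the same -- which is what makes SCSU
        # unusable as the stable base encoding for diffing.
        assert scsu_literal != scsu_quoted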

    In practice, a single stable encoding is used, without any code
    switching system, for representing the texts to diff; the diffs
    themselves then use another, specific encoding for
    insertions/deletions (with position and length), optionally
    followed by a generic binary compressor. In those situations, any
    existing UTF can be used as the base encoding for computing diffs,
    including UTF-8 or even UTF-32, and it does not matter whether it
    is space-efficient or not.
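
    A minimal sketch of that pipeline (the op format and helper names
    here are hypothetical, only to illustrate the layering): decode
    both versions to code points, compute insertion/deletion ops with
    positions and lengths, then hand the serialized ops to a generic
    compressor such as zlib.

        import difflib, json, zlib

        def text_diff_ops(old: str, new: str):
            """Compute insertion/deletion ops over code points (a stable
            base encoding), independently of how the texts were stored
            on the wire."""
            sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
            ops = []
            for tag, i1, i2, j1, j2 in sm.get_opcodes():
                if tag in ("delete", "replace"):
                    ops.append({"op": "del", "pos": i1, "len": i2 - i1})
                if tag in ("insert", "replace"):
                    ops.append({"op": "ins", "pos": i1, "text": new[j1:j2]})
            return ops

        def pack_diff(old: str, new: str) -> bytes:
            """Serialize the ops, then apply a generic binary compressor."""
            return zlib.compress(json.dumps(text_diff_ops(old, new)).encode("utf-8"))

        packed = pack_diff("héllo world", "héllo, Unicode world")
        print(len(packed), "bytes of compressed diff")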


