Re: [unicode] UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Feb 22 2011 - 08:32:07 CST


    2011/2/21 Doug Ewell <doug@ewellic.org>:
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >
    >> And anyway it is also much simpler to understand and easier to
    >> implement correctly (not like the sample code given here) than SCSU,
    >
    > I don't buy this.  A simple SCSU encoder, which achieves most of the
    > benefits of a complex one, is nearly as simple as Cropley's algorithm.
    > Both the complexity of SCSU, and the importance of the complexity of
    > SCSU, continue to be highly overrated.

    It is complex because of the codepage switching mechanism of SCSU and
    its internal "magic" codepage tables.

    > Part of the apparent simplicity of Cropley's algorithm, as viewed from
    > his "Preliminary Proposal" HTML page, is that it omits a proper
    > description of the code-page switching mechanism, as well as the "magic
    > number" definitions of the code pages and the control bytes needed to
    > introduce them.  These are present in the sample code, but to see them,
    > you have to paw through the UTF-8 conversion code and UI.

    Yes, but he did not implement any codepage switching mechanism at
    all; the only thing he did was, in effect, write a single piece of
    code that produces dozens of distinct encodings, each one requiring
    its own distinctive "BOM-like" prefix (which is ill-designed in my
    opinion).

    >> and it is still very highly compressible with standard compression
    >> algorithms while still allowing very fast processing in memory in its
    > decompressed encoded form:
    >
    > I see no metrics or sample data to back this up.

    I've experimented with the code myself on this. It's easy to read,
    but many improvements could be made to it (including securing it,
    because it is not safe as currently implemented).

    > How does Cropley's
    > algorithm perform with mixed scripts (say Greek and Cyrillic), with
    > embedded punctuation in the U+2000 block, with Deseret and other
    > alphabets omitted from the Alphabet table, with larger alphabets where
    > multiple 64-blocks are needed, with Han and Hangul?

    Forget the FS/GS/RS/US hack for his "BOM"; it's not even needed in
    his proposal (and it would also make his encoding incompatible with
    MIME restrictions and with lots of transport protocols), just like
    the magic table, which looks more like a sample and is not
    extensible enough to be really self-adaptive to various texts and
    to newer versions of the Unicode standard with new scripts (SCSU
    has the same latter caveat, which has long been known in ISO 2022
    with its similar magic values for code page switching, one good
    reason why it became unmaintainable).

    You can do exactly the same thing without this hack, because there
    are more than enough unused scalar values in his representation to
    support the selection of code pages, and those values could even be
    used to implement dynamic switching. (In fact you could do this on
    top of UTF-8 as well, using its existing unused bit patterns.)
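
    As a rough illustration (not part of Cropley's proposal, and the
    names below are mine), the lead bytes 0xC0, 0xC1 and 0xF5..0xFF can
    never occur in well-formed UTF-8, so a hypothetical extension could
    reserve them as code-page switch introducers. A minimal sketch:

        # Sketch only: bytes that can never start a well-formed UTF-8
        # sequence (0xC0/0xC1 would be overlong encodings, 0xF5..0xFF
        # would encode values above U+10FFFF), and which a hypothetical
        # UTF-8 extension could reuse for out-of-band signalling.
        UNUSED_UTF8_LEAD_BYTES = {0xC0, 0xC1} | set(range(0xF5, 0x100))

        def could_signal_codepage_switch(byte: int) -> bool:
            """True if this byte value is free for such signalling."""
            return byte in UNUSED_UTF8_LEAD_BYTES

        assert could_signal_codepage_switch(0xC0)      # overlong lead byte
        assert not could_signal_codepage_switch(0xE2)  # valid 3-byte lead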

    (These are the codes for which his decoder returns a "NULL" value,
    but his code is not complete enough: it also forgets to check for
    scalar values reserved for UTF-16 surrogate code points, which no
    UTF should *ever* allow to be stored, as Cropley's permits here.)
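
    For reference, the missing check is a one-liner on the decoded
    value; a minimal sketch (the function name is mine, not from his
    sample code):

        def is_unicode_scalar_value(cp: int) -> bool:
            """A UTF may only encode Unicode scalar values:
            U+0000..U+10FFFF minus the surrogate range U+D800..U+DFFF."""
            return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

        assert not is_unicode_scalar_value(0xD800)   # lone surrogate: reject
        assert is_unicode_scalar_value(0x1F600)      # supplementary plane: OK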

    A UTF should not have to make assumptions about other code points
    (not even about their normalization or the fact that they may be
    assigned to non-characters), nor about the specific reassignment of
    FS/GS/RS/US (which is needed here to "compress" non-Latin scripts
    correctly and offer a significant improvement over UTF-8).

    A good candidate for replacing UTF-8 should not need any magic
    table, and should be self-adaptive. It is possible to do that, but
    code switching also has its own caveats (notably in fast
    search/compare algorithms, such as those used in "diff" programs
    and in versioning systems for code repositories, because of the
    multiple and unpredictable binary representations of the same
    characters: that alone disqualifies SCSU if code switching can
    occur at any place in encoded texts).
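
    To make the pitfall concrete: in SCSU the letter "A" can be encoded
    either as the literal byte 0x41 or quoted with the SQU tag (0x0E) as
    0x0E 0x00 0x41, so a byte-wise comparison reports a difference even
    though the decoded text is identical. A sketch:

        # Two valid SCSU encodings of the single character "A" (U+0041):
        # - the literal single-byte form in the default window
        # - the form quoted via the SQU tag (0x0E), which takes the next
        #   two bytes as one UTF-16BE code unit
        scsu_literal = bytes([0x41])
        scsu_quoted  = bytes([0x0E, 0x00, 0x41])

        # A byte-wise diff/compare sees two different strings even though
        # the decoded text is the same -- which is what makes SCSU
        # unusable as the stable base encoding for diffing.
        assert scsu_literal != scsu_quoted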

    In practice, a single stable encoding is used, without any code
    switching system, for representing the texts to diff; the diffs
    themselves then use another, specific encoding for
    insertions/deletions (with position and length), optionally
    followed by a generic binary compressor. In those situations, any
    existing UTF can be used as the base encoding for computing diffs,
    including UTF-8 or even UTF-32, and it does not matter whether it
    is space-efficient or not.
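
    A minimal sketch of that pipeline (the op format and helper names
    here are hypothetical, only to illustrate the layering): decode
    both versions to code points, compute insertion/deletion ops with
    positions and lengths, then hand the serialized ops to a generic
    compressor such as zlib.

        import difflib, json, zlib

        def text_diff_ops(old: str, new: str):
            """Compute insertion/deletion ops over code points (a stable
            base encoding), independently of how the texts were stored
            on the wire."""
            sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
            ops = []
            for tag, i1, i2, j1, j2 in sm.get_opcodes():
                if tag in ("delete", "replace"):
                    ops.append({"op": "del", "pos": i1, "len": i2 - i1})
                if tag in ("insert", "replace"):
                    ops.append({"op": "ins", "pos": i1, "text": new[j1:j2]})
            return ops

        def pack_diff(old: str, new: str) -> bytes:
            """Serialize the ops, then apply a generic binary compressor."""
            return zlib.compress(json.dumps(text_diff_ops(old, new)).encode("utf-8"))

        packed = pack_diff("héllo world", "héllo, Unicode world")
        print(len(packed), "bytes of compressed diff")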


