From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Jan 21 2007 - 18:47:10 CST
This has the very significant problem of ASCII incompatibility: the key
advantage of UTF-8 is that byte values 0..127 never occur as part of a
multibyte character. That is one of the reasons why the simple approach of
just using 7 bits of content plus a bit to say "has continuation", while
considered, never got any traction. (That mechanism is, on the other hand,
fairly common for compressing integers or arrays of them.)
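To make the incompatibility concrete, here is a minimal sketch of the 7-bits-plus-continuation scheme as Frank describes it in the quoted message (high bit set means another octet follows, high bit clear marks the last octet). The function names sdnv_encode and sdnv_decode are made up for illustration and are not taken from the SDNV draft; treat this as an approximation of the idea, not a reference implementation.

```python
def sdnv_encode(cp: int) -> bytes:
    """Encode one non-negative integer, 7 bits per octet.

    Every octet except the last has its high bit set; the last octet
    has its high bit clear.
    """
    if cp < 0:
        raise ValueError("value must be non-negative")
    groups = [cp & 0x7F]                   # terminating octet, high bit clear
    cp >>= 7
    while cp:
        groups.append((cp & 0x7F) | 0x80)  # leading octets, high bit set
        cp >>= 7
    return bytes(reversed(groups))


def sdnv_decode(data: bytes) -> list[int]:
    """Decode a concatenation of encoded values back into integers."""
    values = []
    current = 0
    pending = False                        # True while inside a multi-octet value
    for octet in data:
        current = (current << 7) | (octet & 0x7F)
        if octet & 0x80:
            pending = True
        else:
            values.append(current)
            current = 0
            pending = False
    if pending:
        raise ValueError("input ends in the middle of a value")
    return values


# U+00AF encodes as 0x81 0x2F, so a byte-level search for "/" (0x2F)
# finds a false match inside a multibyte sequence.
assert sdnv_encode(0xAF) == b"\x81/"
# UTF-8 never does this: every octet of a multibyte sequence is >= 0x80.
assert all(b >= 0x80 for b in "\u00af".encode("utf-8"))
```

Any code that scans bytes for ASCII delimiters such as "/" or NUL would therefore have to decode the text first, which is exactly what UTF-8 avoids.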
IMO, the whole discussion of "UTF-24" is of only academic interest; both
UTF-8 and UTF-16 have better storage characteristics (remember that 4-byte
characters have, and will have, extremely low frequency of usage), and for
in-memory handling "UTF-24" doesn't buy much.
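The storage point can be checked with a rough sketch like the one below. It assumes "UTF-24" means a fixed 3 octets per code point, which is my reading of the thread rather than anything specified; the helper name sizes is hypothetical.

```python
def sizes(text: str) -> dict[str, int]:
    """Return the encoded size in octets of text under each scheme."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
        "utf-24": 3 * len(text),  # assumption: fixed 3 octets per code point
    }


# ASCII text: UTF-8 is a third of the size of the fixed 3-octet form.
assert sizes("hello") == {"utf-8": 5, "utf-16": 10, "utf-24": 15}
# BMP text beyond U+07FF: UTF-16 is smaller, UTF-8 ties.
assert sizes("\u65e5\u672c\u8a9e") == {"utf-8": 9, "utf-16": 6, "utf-24": 9}
# Only supplementary characters come out ahead, by one octet each.
assert sizes("\U0001d11e") == {"utf-8": 4, "utf-16": 4, "utf-24": 3}
```

Under that assumption the fixed-width form only wins on supplementary characters, which is the case with very low frequency of usage.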
Mark
On 1/21/07, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
>
> David Starner wrote:
>
> > current encodings designed with an extreme concern for size, like
> > SCSU and BOCU, frequently aren't used, because UTF-8 or UTF-16
> > combined with a general purpose compression scheme works much
> > better for any long text.
>
> Yes, but the 3*7 approach is still fascinating because it's so
> simple. When UTF-8 was invented they couldn't do this, they
> needed something for 31 bits.
>
> With 3*7 it's (in theory) possible to replace UTF-8 by "UTF-24"
> using the "self delimiting numeric values" (SDNV) proposed in
> <http://tools.ietf.org/html/draft-eddy-dtn-sdnv>
>
> Each octet transports 7 bits ?1234567. If the most significant
> bit is a 0 it's the terminating octet, otherwise another octet
> follows. With that you'd get:
>
> 1x 1y 0z => 21 bits (for 1x different from 1000 0000)
> 1x 0y => 14 bits (for 1x different from 1000 0000)
> 0x => 7 bits (the ASCII range)
>
> Of course the 0y or 0z in multibyte sequences could cause havoc,
> especially for 0000 0000, but in theory it's simpler than UTF-8.
>
> Frank
>