Re: Proposing UTF-21/24

From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Jan 21 2007 - 18:47:10 CST

Next message: vunzndi@vfemail.net: "Re: Regulating PUA."

Previous message: Mark Davis: "Re: Regulating PUA."
In reply to: Frank Ellermann: "Re: Proposing UTF-21/24"
Next in thread: Philippe Verdy: "Re: Proposing UTF-21/24"
Reply: Philippe Verdy: "Re: Proposing UTF-21/24"
Reply: Frank Ellermann: "Re: Proposing UTF-21/24"

This has the very significant problem of ASCII incompatibility: the key
advantage of UTF-8 is that byte values 0..127 never occur inside a multibyte
sequence. That is one of the reasons why the simple approach of just using
7 bits of content with a bit to say "has continuation", while considered,
never got any traction. (That mechanism for compressing integers or arrays
of them, on the other hand, is fairly common.)
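
To make the incompatibility concrete, here is a minimal Python sketch of
the 7-bits-plus-continuation scheme described in the quoted message
below. The function name sdnv_encode and the sample code points are
illustrative assumptions, not taken from the SDNV draft or from any
library.

```python
def sdnv_encode(cp: int) -> bytes:
    """Encode a code point as 7-bit groups, most significant group first.

    Every octet except the last has its high bit set (hypothetical helper
    following the 1x 1y 0z layout quoted in this thread).
    """
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not fit in 21 bits")
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


# ASCII stays a single octet, as in UTF-8.
print(sdnv_encode(0x41))            # b'A'

# U+0100 ends in a NUL octet, and U+012F ends in the octet for '/'.
print(sdnv_encode(0x100).hex())     # 8200
print(sdnv_encode(0x12F).hex())     # 822f

# UTF-8 encodes the same character with no octet below 0x80.
print("\u012f".encode("utf-8").hex())  # c4af
```

Under this sketch, a byte-oriented search for "/" or a C-style string
routine that stops at NUL would misfire on the final octet of such
sequences, which cannot happen with UTF-8.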

IMO, the whole discussion of "UTF-24" is of only academic interest; both
UTF-8 and UTF-16 have better storage characteristics (remember that 4-byte
characters have, and will have, extremely low frequency of usage), and for
in-memory handling "UTF-24" doesn't buy much.
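
As a rough illustration of the storage point, the following Python
sketch compares encoded sizes for a few sample strings. It assumes
"UTF-24" means a fixed three octets per code point, which is only one
possible reading of the proposal; the function name sizes and the
sample strings are made up for the example.

```python
def sizes(text: str) -> dict:
    """Return the storage size in octets of text under three schemes."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte order mark
        "fixed-3": 3 * len(text),  # assumed: 3 octets per code point
    }


print(sizes("Hello, world"))  # {'utf-8': 12, 'utf-16': 24, 'fixed-3': 36}
print(sizes("日本語"))         # {'utf-8': 9, 'utf-16': 6, 'fixed-3': 9}
print(sizes("\U0001D11E"))    # {'utf-8': 4, 'utf-16': 4, 'fixed-3': 3}
```

In these samples the fixed three-octet form is smaller only for the
supplementary character U+1D11E, the kind of 4-byte character that is
rare in practice.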

Mark

On 1/21/07, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
>
> David Starner wrote:
>
> > current encodings designed with an extreme concern for size, like
> > SCSU and BOCU, frequently aren't used, because UTF-8 or UTF-16
> > combined with a general purpose compression scheme works much
> > better for any long text.
>
> Yes, but the 3*7 approach is still fascinating because it's so
> simple. When UTF-8 was invented they couldn't do this, they
> needed something for 31 bits.
>
> With 3*7 it's (in theory) possible to replace UTF-8 by "UTF-24"
> using the "self delimiting numeric values" (SDNV) proposed in
> <http://tools.ietf.org/html/draft-eddy-dtn-sdnv>
>
> Each octet transports 7 bits ?1234567. If the most significant
> bit is a 0 it's the terminating octet, otherwise another octet
> follows. With that you'd get:
>
> 1x 1y 0z => 21 bits (for 1x different from 1000 0000)
> 1x 0y => 14 bits (for 1x different from 1000 0000)
> 0x => 7 bits (the ASCII range)
>
> Of course the 0y or 0z in multibyte sequences could cause havoc,
> especially for 0000 0000, but in theory it's simpler than UTF-8.
>
> Frank
>

-- 
Mark


This archive was generated by hypermail 2.1.5 : Sun Jan 21 2007 - 18:49:21 CST