From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Sun Jan 21 2007 - 16:30:24 CST
David Starner wrote:
> current encodings designed with an extreme concern for size, like
> SCSU and BOCU, frequently aren't used, because UTF-8 or UTF-16
> combined with a general purpose compression scheme works much
> better for any long text.
Yes, but the 3*7 approach is still fascinating because it's so
simple. When UTF-8 was invented they couldn't do this; they
needed something for 31 bits.
With 3*7 it's (in theory) possible to replace UTF-8 by "UTF-24"
using the "self delimiting numeric values" (SDNV) proposed in
<http://tools.ietf.org/html/draft-eddy-dtn-sdnv>
Each octet transports 7 payload bits, ?1234567, where ? is the
most significant bit. If that bit is 0, it's the terminating
octet; otherwise another octet follows. With that you'd get:
1x 1y 0z => 21 bits (for 1x different from 1000 0000)
1x 0y    => 14 bits (for 1x different from 1000 0000)
0x       =>  7 bits (the ASCII range)
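To make the layout above concrete, here is a minimal Python sketch of
such an encoder and decoder. It is only an illustration under my own
assumptions: the 7-bit groups are taken most significant first, and the
names encode_cp, encode, decode and MAX_CP are made up for this example;
they are not from the SDNV draft or from any library.

```python
MAX_CP = (1 << 21) - 1  # three octets carry at most 3*7 = 21 bits


def encode_cp(cp):
    """Encode one code point as 1 to 3 octets (hypothetical "UTF-24")."""
    if not 0 <= cp <= MAX_CP:
        raise ValueError("code point does not fit in 21 bits")
    octets = [cp & 0x7F]  # terminating octet: MSB 0
    cp >>= 7
    while cp:
        octets.append(0x80 | (cp & 0x7F))  # leading octets: MSB 1
        cp >>= 7
    # Groups were collected low to high; emit the most significant first.
    return bytes(reversed(octets))


def encode(text):
    """Encode a whole string, one code point at a time."""
    return b"".join(encode_cp(ord(ch)) for ch in text)


def decode(data):
    """Decode a byte string into a list of code points."""
    result, value, count = [], 0, 0
    for octet in data:
        # A leading 1000 0000 adds no payload, so it is not shortest form.
        if count == 0 and octet == 0x80:
            raise ValueError("leading octet 1000 0000 is not shortest form")
        value = (value << 7) | (octet & 0x7F)
        count += 1
        if octet < 0x80:  # MSB 0: terminating octet
            result.append(value)
            value, count = 0, 0
        elif count == 3:
            raise ValueError("sequence longer than three octets")
    if count:
        raise ValueError("truncated sequence")
    return result
```

With this sketch, encode("A") is the single octet 41, while U+20AC
comes out as the two octets C1 2C and U+10FFFF as C3 FF 7F.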
Of course the 0y or 0z in multibyte sequences could cause havoc,
because such a trailing octet looks exactly like an ASCII octet,
especially for 0000 0000 (NUL), but in theory it's simpler than UTF-8.
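A small Python sketch of that havoc, again only as an illustration:
sdnv is a hypothetical helper that packs 7-bit groups most significant
first, and the two code points are simply examples I picked.

```python
def sdnv(cp):
    """Hypothetical helper: 7-bit groups, MSB set on all but the last octet."""
    octets = [cp & 0x7F]
    cp >>= 7
    while cp:
        octets.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(octets))


euro = sdnv(0x20AC)      # octets C1 2C: the trailing octet is an ASCII comma
a_macron = sdnv(0x0100)  # octets 82 00: the trailing octet is NUL

# A byte-wise search for "," matches inside the euro sign ...
comma_found = b"," in euro
# ... and a NUL-terminated C string would be cut off inside U+0100.
nul_found = b"\x00" in a_macron

# UTF-8 avoids this: every octet of a multibyte sequence has the MSB set.
utf8_clean = all(octet >= 0x80 for octet in "\u20ac".encode("utf-8"))
```

Here comma_found, nul_found and utf8_clean are all True.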
Frank