From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 08:34:13 CDT
Hans Aberg <haberg at math dot su dot se> wrote:
> Another method, which enables compressing both characters (code
> points) and natural language words (sequences of code points), might
> be to make modified UTF-8, where the leading byte admits indicating
> two categories of numbers. (Continued below.)
Whatever you do, do NOT call it "UTF-anything."
I'm currently compressing names in the Unicode character list using a
variable-length byte-based scheme that encodes common words like LETTER
in one byte and rare words like SPATHI in two bytes. The range of trail
bytes is allowed to overlap the range of lead bytes, since backward
parsing doesn't matter for this specific application. It has some
characteristics in common with UTFs, but it isn't a UTF and I pledge not
to call it one.
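
A minimal sketch of a scheme along these lines might look like the
following. It is not the actual implementation: the split point of 200,
the frequency-ranked word table, and the names build_tables, encode, and
decode are all assumptions made for illustration. Lead bytes below the
split are complete one-byte codes; lead bytes at or above it start a
two-byte code whose trail byte may take any value, so trail bytes overlap
the lead-byte range and the stream can only be parsed forward.

```python
from collections import Counter

# Hypothetical split point: lead bytes 0..199 are complete one-byte codes,
# lead bytes 200..255 introduce a two-byte code.
ONE_BYTE_CODES = 200


def build_tables(names, one_byte_codes=ONE_BYTE_CODES):
    """Rank words by frequency so the most common get the smallest indices."""
    counts = Counter(word for name in names for word in name.split())
    # Descending count, then alphabetical, for a deterministic ordering.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    capacity = one_byte_codes + (256 - one_byte_codes) * 256
    if len(words) > capacity:
        raise ValueError("too many distinct words for this scheme")
    index = {word: i for i, word in enumerate(words)}
    return words, index


def encode(name, index, one_byte_codes=ONE_BYTE_CODES):
    """Encode a space-separated name as a sequence of word codes."""
    out = bytearray()
    for word in name.split():
        i = index[word]
        if i < one_byte_codes:
            out.append(i)  # common word: one byte
        else:
            hi, lo = divmod(i - one_byte_codes, 256)
            out.append(one_byte_codes + hi)  # lead byte
            out.append(lo)  # trail byte: any value 0..255, overlaps lead range
    return bytes(out)


def decode(data, words, one_byte_codes=ONE_BYTE_CODES):
    """Decode by scanning forward; the lead byte determines the code length."""
    result = []
    pos = 0
    while pos < len(data):
        lead = data[pos]
        pos += 1
        if lead < one_byte_codes:
            i = lead
        else:
            i = one_byte_codes + (lead - one_byte_codes) * 256 + data[pos]
            pos += 1
        result.append(words[i])
    return " ".join(result)
```

With this split the table holds up to 200 one-byte words plus 56 * 256 =
14336 two-byte words. Because a trail byte can look exactly like a lead
byte, a decoder dropped into the middle of the data cannot resynchronize,
which is the trade-off accepted above.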
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645 * UTN #14