Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 06 2004 - 15:21:41 CST


    ----- Original Message -----
    From: "Arcane Jill" <arcanejill@ramonsky.com>
    > Probably a dumb question, but how come nobody's invented "UTF-24" yet? I
    > just made that up, it's not an official standard, but one could easily
    > define UTF-24 as UTF-32 with the most-significant byte (which is always
    > zero) removed, hence all characters are stored in exactly three bytes and
    > all are treated equally. You could have UTF-24LE and UTF-24BE variants,
    > and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly
    > brilliant idea, but I just wonder why no-one's suggested it before.

    UTF-24 already exists as an encoding form (it is identical to UTF-32), if
    you consider that an encoding form only needs to be able to represent the
    valid code point range within a single code unit.
    UTF-32 is not meant to be restricted to 32-bit representations.

    However, it's true that UTF-24BE and UTF-24LE could be useful as encoding
    schemes for serialization to byte-oriented streams, dropping one
    unnecessary byte per code point.
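    For illustration only, here is a minimal Python sketch of what such a
    UTF-24BE/LE scheme could look like (the names utf24_encode/utf24_decode
    are invented for this example, nothing standard):

        # Hypothetical sketch of a UTF-24 encoding scheme: each code point is
        # serialized as exactly three bytes, big- or little-endian.
        def utf24_encode(text, byteorder="big"):
            out = bytearray()
            for ch in text:
                cp = ord(ch)
                if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                    raise ValueError("not a Unicode scalar value: U+%04X" % cp)
                out += cp.to_bytes(3, byteorder)
            return bytes(out)

        def utf24_decode(data, byteorder="big"):
            if len(data) % 3:
                raise ValueError("truncated UTF-24 stream")
            return "".join(chr(int.from_bytes(data[i:i+3], byteorder))
                           for i in range(0, len(data), 3))

        # Example: the supplementary-plane code point U+1D11D
        assert utf24_encode("\U0001D11D") == b"\x01\xd1\x1d"
        assert utf24_decode(b"\x01\xd1\x1d") == "\U0001D11D"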

    > (And then of course, there's UTF-21, in which blocks of 21 bits are
    > concatenated, so that eight Unicode characters will be stored in every 21
    > bytes - and not to mention UTF-20.087462841250343, in which a plain text
    > document is simply regarded as one very large integer expressed in radix
    > 1114112, and whose UTF-20.087462841250343 representation is simply that
    > number expressed in binary. But now I'm getting /very/ silly - please
    > don't take any of this seriously.) :-)

    I don't think that UTF-21 would be useful as an encoding form, but possibly
    as an encoding scheme where the 3 always-zero bits would be stripped,
    providing a tiny amount of compression that would only be justified for
    transmission over serial or network links.
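    Just to make the bit arithmetic concrete, a rough Python sketch of such a
    "UTF-21" packing could look like this (8 code points of 21 bits each fill
    exactly 21 bytes; the names are made up for the example):

        # Hypothetical "UTF-21" packing: concatenate 21-bit code points into a
        # continuous bit stream, padded with zero bits to a whole byte.
        def utf21_pack(text):
            bits = "".join(format(ord(ch), "021b") for ch in text)
            bits += "0" * (-len(bits) % 8)
            return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

        def utf21_unpack(data, count):
            bits = "".join(format(b, "08b") for b in data)
            return "".join(chr(int(bits[i*21:(i+1)*21], 2)) for i in range(count))

        s = "Unicode!"                       # 8 characters
        packed = utf21_pack(s)
        assert len(packed) == 21             # 8 * 21 bits = 168 bits = 21 bytes
        assert utf21_unpack(packed, len(s)) == s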

    However, I do think that such an "optimization" would have the effect of
    removing the byte alignment on which more powerful compressors rely. If
    you really need more effective compression, use SCSU or apply deflate or
    bzip2 compression to UTF-8, UTF-16, or UTF-24/32... (there's not much
    difference between compressing UTF-24 and UTF-32 with generic compression
    algorithms like deflate or bzip2).
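    Anyone curious can check this for themselves with a quick, unscientific
    Python experiment (the exact numbers depend entirely on the sample text;
    the point is only that generic compressors largely erase the size
    differences between the raw encoding forms):

        import zlib, bz2

        text = "Unicode text sample " * 500

        for name in ("utf-8", "utf-16-le", "utf-32-le"):
            raw = text.encode(name)
            print("%-10s raw=%6d deflate=%6d bzip2=%6d"
                  % (name, len(raw),
                     len(zlib.compress(raw, 9)), len(bz2.compress(raw, 9))))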

    > The "UTF-24" thing seems a reasonably sensible question though. Is it just
    > that we don't like it because some processors have alignment restrictions
    > or something?

    There still exist, even today, 4-bit processors and 1-bit processors,
    where the smallest addressable memory unit is smaller than 8 bits. They
    are used in low-cost micro-devices, notably to build automated robots for
    industry, or even many home/kitchen appliances. I don't know whether they
    need Unicode to represent international text, given that they often have a
    very limited user interface, incapable of inputting or outputting text,
    but who knows? Maybe they are used in some mobile phones, or within
    "smart" keyboards, tablets, or other input devices connected to PCs...

    There also exist systems where the smallest addressable memory cell is a
    9-bit byte. This is more of an issue here, because the Unicode standard
    does not specify whether encoding schemes (which serialize code points to
    bytes) should set the 9th bit of each byte to 0, or should fill every bit
    of memory, even if this means that the 8-bit bytes of UTF-8 will not stay
    aligned with the 9-bit memory bytes.

    Somebody already introduced UTF-9 in the past for 9-bit systems.
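    To illustrate the choice described above, here is a small Python sketch of
    the two serialization options on a machine whose smallest addressable unit
    is a 9-bit byte (memory is simulated as a list of integers below 512;
    Unicode defines neither option, this is purely hypothetical):

        def store_zero_padded(octets):
            # Option 1: one 8-bit byte per 9-bit cell, 9th (top) bit always 0.
            return [b for b in octets]

        def store_packed(octets):
            # Option 2: pack the 8-bit stream tightly into 9-bit cells, so the
            # UTF-8 byte boundaries no longer line up with memory cells.
            bits = "".join(format(b, "08b") for b in octets)
            bits += "0" * (-len(bits) % 9)
            return [int(bits[i:i+9], 2) for i in range(0, len(bits), 9)]

        utf8 = "héllo wörld".encode("utf-8")   # 13 octets
        print(store_zero_padded(utf8))         # 13 cells, top bit unused
        print(store_packed(utf8))              # 104 bits -> 12 cells of 9 bits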

    A 36-bit processor could just as well address memory in cells of 36 bits,
    where the 4 highest bits would either be used as CRC control bits
    (generated and checked automatically by the processor or a memory bus
    interface, in memory regions where this behavior is enabled), or be used
    to store supplementary bits of actual data (in unchecked regions that live
    in reliable and fast memory, such as the internal CPU cache or static CPU
    registers).

    For such systems, the impact of translating addressable memory widths
    across interfaces is not currently discussed in Unicode, which assumes
    that internal memory is necessarily addressed in units that are a power of
    2 and a multiple of 8 bits, and is then interchanged or stored using that
    byte unit.

    Today we are witnessing the constant expansion of bus widths to allow
    parallel processing instead of multiplying the clock frequency (and the
    energy spent and the heat generated, which creates other environmental
    problems), so why would the 8-bit byte remain the most efficient universal
    unit? If you look at IEEE floating-point formats, they are often
    implemented in FPUs working on 80-bit units, and an 80-bit memory cell
    could tomorrow just as well become a standard (compatible with the
    increasingly common 64-bit architectures of today) that would no longer be
    a power of 2 (even if it stays a multiple of 8 bits).

    On an 80-bit system, the easiest solution for handling UTF-32 without
    using too much space would be a 40-bit unit (i.e. two code points per
    80-bit memory cell). But if you consider that only 21 bits are used by
    Unicode, then each 80-bit memory cell could store three code points,
    leaving 17 bits unused in each addressable memory cell.

    Note that 64-bit systems could do the same: 3 code points per 64-bit unit
    require only 63 bits, which can be stored in a single positive 64-bit
    integer (the remaining bit would be the sign bit, always set to 0,
    avoiding problems related to sign extension). And even today's systems
    could use such a representation, given that most current 32-bit processors
    also have the internal capability to handle 64-bit integers natively.
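    The packing itself is trivial; a Python sketch (the same arithmetic also
    covers the 80-bit cell above, since 3 x 21 = 63 bits with 17 bits left
    over, and the helper names are invented for the example):

        def pack3(c0, c1, c2):
            # Three 21-bit code points in one 64-bit word, sign bit left clear.
            assert all(c <= 0x10FFFF for c in (c0, c1, c2))
            return (c0 << 42) | (c1 << 21) | c2

        def unpack3(word):
            return ((word >> 42) & 0x1FFFFF,
                    (word >> 21) & 0x1FFFFF,
                    word & 0x1FFFFF)

        # 'A', EURO SIGN, and a supplementary-plane code point
        w = pack3(ord("A"), 0x20AC, 0x10348)
        assert w < 2**63                     # still a positive 64-bit integer
        assert unpack3(w) == (0x41, 0x20AC, 0x10348)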

    Strings could also be encoded using 64-bit code units that would each
    store 1 to 3 code points, the unused positions being filled with an
    invalid code point outside the Unicode space (for example by setting all
    21 bits to 1, producing the out-of-range value 0x1FFFFF, used as a filler
    for missing code points, notably when the string to encode is not an exact
    multiple of 3 code points). Then these 64-bit code units could be
    serialized to byte streams as well, multiplying the number of
    possibilities: UTF-64BE and UTF-64LE? One advantage of such a scheme is
    that it would be more compact than UTF-32, because this UTF-64 encoding
    scheme would waste only 1 bit per 3 code points, instead of 1 byte and 3
    bits per code point with UTF-32!
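    Again for illustration only, a Python sketch of this hypothetical "UTF-64"
    encoding scheme could look like this (function names invented; nothing of
    the sort is standardized):

        FILLER = 0x1FFFFF                    # out-of-range value used as padding

        def utf64be_encode(text):
            cps = [ord(ch) for ch in text]
            while len(cps) % 3:
                cps.append(FILLER)
            out = bytearray()
            for i in range(0, len(cps), 3):
                unit = (cps[i] << 42) | (cps[i+1] << 21) | cps[i+2]
                out += unit.to_bytes(8, "big")
            return bytes(out)

        def utf64be_decode(data):
            cps = []
            for i in range(0, len(data), 8):
                unit = int.from_bytes(data[i:i+8], "big")
                cps += [(unit >> 42) & 0x1FFFFF,
                        (unit >> 21) & 0x1FFFFF,
                        unit & 0x1FFFFF]
            return "".join(chr(c) for c in cps if c != FILLER)

        s = "ABCD"                           # 4 code points -> 2 units -> 16 bytes
        assert len(utf64be_encode(s)) == 16
        assert utf64be_decode(utf64be_encode(s)) == s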

    You can imagine many other encoding schemes, depending on your architecture
    choices and constraints...


