Re: Towards a classification system for uses of the Private Use Area

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 29 2002 - 14:59:22 EDT


> I am fond of precision and try to be precise,

though not concise. ;-)

> so, if my statement is wrong I
> will happily change it. Yet what should I change it to become? As far as I
> know, unicode is a 21 bit system. ... ...
> Let us go into this as precisely as we in this
> discussion group can go.
>
> So, is the unicode system today a 21 bit system?
>
> Is there anyone prepared to state that it is not a 21 bit system, stating
> reasons for so saying?

Yes, me.

First, the Unicode Standard is not a "system", except in the most general
senses of system. It is, to be precise, a character encoding.

That character encoding uses the codespace 0..0x10FFFF. That range of
positive integers can be expressed using 21 bits, to wit:
0 0000 0000 0000 0000 0000 .. 1 0000 1111 1111 1111 1111
but that does not make it a "21 bit system" per se. People *have*, in fact,
talked about Unicode as a 21-bit character encoding because of the
codespace size, but even that isn't quite accurate, because the range
is 0..0x10FFFF (about 20.1 bits), not 0..0x1FFFFF (a full 21 bits).
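
If you want to check that arithmetic for yourself, here is a quick
sketch in Python (any language would do):

    import math

    # Number of values in the Unicode codespace vs. a full 21-bit range.
    print(math.log2(0x10FFFF + 1))  # ~20.09 bits to address 0..0x10FFFF
    print(math.log2(0x1FFFFF + 1))  # exactly 21.0 bits for 0..0x1FFFFF
    print((0x10FFFF).bit_length())  # still 21 bits to *store* the top value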

Note also that the actual entropy* of Unicode is quite low, since a)
most of the codespace is unencoded, and b) most of the encoded
characters are actually quite rare in usage. This accounts for why
Unicode data compresses so well. I'd estimate that for most Unicode
data, the entropy is more on the order of 6 bits per character, possibly
lower.

*entropy: sum over 0 <= i <= 0x10FFFF of -Prob(S_i) * log2 Prob(S_i)
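
To make that estimate concrete, here is a minimal sketch of the same
calculation in Python, applied per character to a sample string (the
sample is arbitrary; real corpora will vary):

    import math
    from collections import Counter

    def entropy_bits_per_char(text):
        # Shannon entropy of the observed character distribution,
        # in bits per character.
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Plain English text typically comes out well under 6 bits per character.
    print(entropy_bits_per_char("the quick brown fox jumps over the lazy dog"))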

The Unicode Standard uses 3 encoding forms: UTF-8, UTF-16, UTF-32,
expressed, respectively, in terms of 8-bit byte code units, 16-bit wyde
code units, and 32-bit word code units. *Those* are the significant
units for all computer implementations that actually handle the characters.
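
For a concrete example (a minimal Python sketch; U+10330 is just a
convenient supplementary-plane character):

    # U+10330 (GOTHIC LETTER AHSA), expressed in each encoding form.
    s = "\U00010330"
    print(s.encode("utf-8").hex(" "))      # f0 90 8c b0 -> four 8-bit code units
    print(s.encode("utf-16-be").hex(" "))  # d8 00 df 30 -> two 16-bit code units
    print(s.encode("utf-32-be").hex(" "))  # 00 01 03 30 -> one 32-bit code unit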

If, on the other hand, one starts referring to the Unicode Standard
as a "21 bit system", one may immediately go astray trying to invent
uses for "unused" bits:

> It
> seems to me that, unicode using 21 bits, and whereas computer storage media
> such as hard discs is oriented to 8 bit bytes, that there is scope to
> develop a file coding format that is essentially of a plain text nature
> where characters are treated as being 24 bits long, so that each character
> is stored as 3 bytes and code points starting, at 24 bits, expressed in
> hexadecimal, having the first hexadecimal character as 0 or 1 is reserved
> for Unicode, 2 is unused, and the rest are used for inputting and
> manipulating data.

Such a scheme is a) non-conformant with the Unicode Standard, and b) misses
the distinction between the encoding per se and the three encoding forms
that are used for actual handling of characters.
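
One way to see the conformance problem: values above 0x10FFFF are simply
not Unicode scalar values, so no conformant encoding form (and no
Unicode-aware API) will represent them. A sketch in Python:

    # The last code point in the codespace is fine...
    last = chr(0x10FFFF)
    # ...but anything beyond it is rejected outright.
    try:
        chr(0x110000)
    except ValueError as err:
        print(err)  # chr() arg not in range(0x110000)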

> Such a special file format and coding system might be
> very useful for encoding physical data about cuneiform tablets and Unicode
> character codes together in one essentially plain text file,

This would not be "plain text" in any sense promoted by the Unicode Standard.

> I have it in mind that the various sections would be as follows.
>
> 000000 to 10FFFF Unicode
> 110000 to 2FFFFF reserved
> 300000 to 3FFFFD control, though only a few of these code points are going
> to be used.
> 400000 to 4FFFFD obey the current x0p process, load 18 bits of data into
> register x0, obey the current x0q process.

> The 24 processes x0p, x0q and so on all have default actions. ...

You are, of course, free to go off and invent such schemes, but this
is not Unicode, nor is it plain text. It also runs counter to most modern
practice in software design, which avoids such mixing of layers in favor
of clean, modular, layered architectures.

The days when assembly programmers had to reuse bits and write self-modifying
code are long gone (except, I suppose, in chip microcode), blown away by the
hardware advances that have made memory and storage resources orders of
magnitude cheaper than the costs of software development and maintenance.

> Yet I feel that such application possibilities to be able to use Unicode
> characters in conjunction with graphic data with everything encoded together
> in an open format file are an important possibility for the future.

Um. Have you heard of HTML and XML?

--Ken


