From: Mark E. Shoulson (mark@kli.org)
Date: Tue Oct 02 2007 - 06:55:23 CST
Has anyone mentioned the information-theoretic implications of this
scheme? By using a whole *character* to mark the following one as
capitalized, it essentially devotes one whole character, say eight or
sixteen bits or whatever it takes, to transmit *ONE* bit of information
(is it capital or not?). The other variants, "abbreviation" or whatever,
don't change this much; it's still a whole character to handle two or
three bits of information.
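
To put rough numbers on that, here is a minimal sketch of the arithmetic. The marker code point (U+E000, from the Private Use Area) is only a stand-in chosen for illustration; the proposal under discussion does not name one, and the helper names are hypothetical too.

```python
import math

# Hypothetical stand-in for the proposed "capitalize the next letter"
# character. U+E000 is a private-use code point picked for illustration.
CAP_MARKER = "\uE000"

def marker_cost_bits(encoding: str) -> int:
    """Bits actually stored for one marker character in the given encoding."""
    return len(CAP_MARKER.encode(encoding)) * 8

def information_bits(num_states: int) -> float:
    """Upper bound on the information in a choice among num_states options."""
    return math.log2(num_states)

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {marker_cost_bits(enc)} bits stored, "
          f"at most {information_bits(2)} bit conveyed (capital or not)")
```

A marker with a few variants (capital, abbreviation, and so on) only raises the upper bound to two or three bits, e.g. information_bits(8) for eight variants, while the storage cost stays the same.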
What's more, the burden of the extra information falls on the text
itself, not on the encoding tables. We would hope that the volume of
text encoded in Unicode is hundreds of times larger than the tables,
even counting every copy of the tables in all the software that uses them.
After all, the point of Unicode is to encode the information, not the
other way around. So the extra bits are repeated in every document that
uses capital letters, rather than just living in the much smaller
encoding tables and software. Admittedly the scheme would be an
advantage, I guess, for documents in scripts that don't use capitals,
but (a) so what, and (b) let's face it, Latin is far and away the most
used writing system in computer storage.
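
As a rough illustration of how the cost lands on every document, the sketch below rewrites a string under an assumed marker scheme and counts the extra bytes. The marker code point (U+E000) and the function names are made up for the example, and the conversion is deliberately naive: it only handles letters whose lowercase form is a single character.

```python
# Hypothetical stand-in for the proposed capitalization marker.
CAP_MARKER = "\uE000"

def to_marker_scheme(text: str) -> str:
    """Replace each capital letter with marker + lowercase letter."""
    out = []
    for ch in text:
        lower = ch.lower()
        # Only convert simple one-to-one case pairs; leave the rest alone.
        if ch.isupper() and lower != ch and len(lower) == 1:
            out.append(CAP_MARKER + lower)
        else:
            out.append(ch)
    return "".join(out)

def overhead_bytes(text: str, encoding: str = "utf-8") -> int:
    """Extra bytes the marker scheme needs for this text in this encoding."""
    converted = to_marker_scheme(text)
    return len(converted.encode(encoding)) - len(text.encode(encoding))

sample = "Has anyone mentioned Unicode?"
print(overhead_bytes(sample, "utf-8"))      # extra bytes in UTF-8
print(overhead_bytes(sample, "utf-16-le"))  # extra bytes in UTF-16
```

The overhead grows with the number of capitals in each document, whereas a case mapping kept in the tables is paid for once per implementation.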
As regards Unicode II, it should be noted that there are lots of things
that really are wrong with Unicode, mistakes that shouldn't have been
made but that can't be changed (as opposed to capitalization, which
isn't a mistake). There's the famous case of FHTORA, a character name
known to be misspelled (it should be FTHORA) that cannot be changed. Or
the annoyance of the Hebrew vowel combining classes being set up wrong.
And nobody is seriously proposing a
Next-Generation Unicode in which the so-called "Cleanicode" (Unicode
where everything is done *right*) is implemented from scratch. Such a
radical change would not be worth the pain of implementing it.
~mark