[unicode] Re: removing compromises from unicode ("WCode")

From: Jonathan Coxhead (jonathan@doves.demon.co.uk)
Date: Fri Mar 23 2001 - 19:14:45 EST


> WTF-8 could potentially be as compact or more compact than UTF-8 (for
> Greek, Arabic ...), since much of the Latin-1 and Latin Extended A blocks
> aren't needed in WCode. If you moved the other characters down to
> fill that space, you might win what you lost to C1 compatibility.

   A while ago, I tried to perform a similar exercise: work out which
characters in Unicode are "atomic", and which are compositions of them. Since
it was more of an engineering "jeu d'esprit" than something that might see the
light of day in any actual product, I was utterly ruthless: I even decomposed
'i' into 'dotless i' + 'combining dot above'. (That's not the whole story,
either: 'combining dot above' is not primitive, as it consists of a 'dot' and a
notion of "combination".)
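
   A minimal sketch of this kind of decomposition, for illustration only: it
leans on Unicode's canonical decomposition (NFD) via Python's standard
unicodedata module, then applies one extra rule for 'i'. The EXTRA_RULES table
and the atoms() function are hypothetical names made up for this example; they
are not taken from the page mentioned below, and a real table would be far
larger.

```python
import unicodedata

# Hypothetical extra rules, beyond Unicode's own canonical decompositions.
# Only the 'i' example from the text is included here.
EXTRA_RULES = {
    "i": "\u0131\u0307",  # dotless i + combining dot above
}

def atoms(text):
    """Split text into 'atoms': NFD first, then the extra rules."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        # Characters without an extra rule pass through unchanged.
        out.extend(EXTRA_RULES.get(ch, ch))
    return "".join(out)

def atom_set(chars):
    """Collect the distinct atoms needed to spell every character in chars."""
    return set(atoms("".join(chars)))
```

   For example, atoms("\u00e9") gives 'e' followed by U+0301 (combining acute
accent), and atoms("i") gives U+0131 followed by U+0307. Calling atom_set() on
a whole block of characters gives a rough count of how many atoms that block
needs under these assumed rules.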

   The result is at <http://www.doves.demon.co.uk/atomic.html>. It has been
mentioned on this list before, but it has been extended and ramified since
then. It doesn't take Unicode 3 into account.

   You may find it interesting in the context of WCode as it has some of the
same goals. Your acronym (WTF) is much better though. :-)

   It would be very entertaining to do the same job with the ideographs (down
to the radical level) and count the number of atoms. I suspect the resulting
"character set" would contain fewer than 2000 atoms altogether.

   Please do feel free to share any thoughts on the "Atomic Theory" with me!

        /|
 o o o (_|/
        /|
       (_/


