[unicode] Re: removing compromises from unicode ("WCode")

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Mar 21 2001 - 16:16:05 EST


John Cowan wrote:
> > The result is a back-to-the-principles "WCode", nicely streamlined:
> > - no compatibility or precomposed characters
>
> But less compact. Without precomposed characters, the overhead of
> conversion from old character sets grows considerably.

True. Compactness was not a goal with "WCode", just ease of processing.
Anyway, this is no different from the current Unicode policy for new characters.
Also, I do not think that average strings would become much longer: maybe 10-20% for Western European text.
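
A rough way to check a figure like this is to compare the length of a string in precomposed form against its fully decomposed form. The sketch below is only an approximation: it assumes that Unicode canonical decomposition (NFD) is a reasonable stand-in for a "WCode" with no precomposed characters, which the proposal itself does not specify. It counts code points, not bytes, and the function name and sample sentence are made up for illustration.

```python
import unicodedata


def decomposition_growth(text: str) -> float:
    """Relative growth in code points when text goes from NFC to NFD.

    Assumption: NFD approximates an encoding without precomposed
    characters. Returns 0.0 for empty input.
    """
    composed = unicodedata.normalize("NFC", text)
    decomposed = unicodedata.normalize("NFD", text)
    if not composed:
        return 0.0
    return len(decomposed) / len(composed) - 1.0


if __name__ == "__main__":
    # Hypothetical sample; real figures depend heavily on language and corpus.
    sample = "Die Straßenbahn fährt über die Brücke."
    print(f"growth: {decomposition_growth(sample):.1%}")
```

Running this over a representative corpus per language, not a single sentence, would be needed before trusting any percentage.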

> > + 8-bit form encodes the BMP uniquely and C1 controls as single bytes
>
> WTF-8 uses 2 bytes for ASCII, C0 Latin

No. It uses single bytes for 00..9f, which includes US-ASCII, C0 & C1 controls.

> , Latin A and B, Modifiers, Marks,
> IPA, and part of Greek; the rest of Greek (the alphabet is actually
> split down the middle), Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana
> are now 3 bytes.
>
> So typical Greek or Cyrillic text now requires 3 bytes per letter in WTF-8,
> 2 bytes in WTF-16. WTF-8 is only really useful for ASCII-compatibility.

Right. You get something - C1 controls as single bytes, as many Unix fans would like to see - and you lose something - compactness of encoding.
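
For comparison, the baseline that this trade-off is measured against can be shown with standard UTF-8, where C1 controls take two bytes and Greek and Cyrillic letters also take two. The sketch below only reports UTF-8 lengths; it does not implement WTF-8, whose byte layout is not given in this thread, and the helper name and sample characters are chosen for illustration.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes standard UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    # Illustrative samples only; UTF-8 figures, not WTF-8.
    samples = {
        "ASCII 'A' (U+0041)": 0x0041,
        "C1 control NEL (U+0085)": 0x0085,
        "Greek alpha (U+03B1)": 0x03B1,
        "Cyrillic zhe (U+0436)": 0x0436,
    }
    for name, cp in samples.items():
        print(f"{name}: {utf8_len(cp)} byte(s) in UTF-8")
```

Per the discussion above, WTF-8 would make the C1 case one byte and the Greek and Cyrillic cases three.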

markus


