[unicode] Re: removing compromises from unicode ("WCode")

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Mar 21 2001 - 16:16:05 EST

Next message: Richard Cook: "[unicode] Re: Spam being sent to the list?"
Previous message: Marco Cimarosti: "[unicode] Re: UCS-2 Files"

John Cowan wrote:
> > The result is a back-to-the-principles "WCode", nicely streamlined:
> > - no compatibility or precomposed characters
>
> But less compact. Without precomposed characters, the overhead of
> conversion from old character sets grows considerably.

True. Compactness was not a goal with "WCode", just ease of processing.
Anyway, this is no different from the current Unicode policy for new characters.
Also, I do not think that average strings would become much longer, maybe 10-20% for Western European text.
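A rough way to check an estimate like this on one's own text is to compare its length in composed and fully decomposed form. The sketch below is only an illustration: it assumes Unicode normalization form NFD is a fair stand-in for "no precomposed characters" (this message does not define WCode's rules), it counts code points rather than bytes, and `decomposition_growth` is a made-up helper name.

```python
import unicodedata


def decomposition_growth(text: str) -> float:
    """Relative growth in code points when `text` is fully decomposed.

    Assumption: NFD approximates an encoding without precomposed
    characters. The helper name is hypothetical.
    """
    composed = unicodedata.normalize("NFC", text)
    decomposed = unicodedata.normalize("NFD", text)
    if not composed:
        return 0.0  # avoid dividing by zero on empty input
    return (len(decomposed) - len(composed)) / len(composed)


# "café": 4 code points composed, 5 decomposed (e + U+0301).
print(decomposition_growth("café"))
# Plain ASCII has nothing to decompose.
print(decomposition_growth("abc"))
```

A single accented word overstates the effect; running the function over a larger body of running text gives a more meaningful figure.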

> > + 8-bit form encodes the BMP uniquely and C1 controls as single bytes
>
> WTF-8 uses 2 bytes for ASCII, C0 Latin

No. It uses single bytes for 00..9f, which includes US-ASCII, C0 & C1 controls.

> , Latin A and B, Modifiers, Marks,
> IPA, and part of Greek; the rest of Greek (the alphabet is actually
> split down the middle), Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana
> are now 3 bytes.
>
> So typical Greek or Cyrillic text now requires 3 bytes per letter in WTF-8,
> 2 bytes in WTF-16. WTF-8 is only really useful for ASCII-compatibility.

Right. You gain something (C1 controls as single bytes, as many Unix fans would like to see) and you lose something: compactness of encoding.
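For comparison, the sizes that standard UTF-8 gives the same kinds of characters can be checked directly. This sketch covers standard UTF-8 only; the WTF-8 byte layout is not spelled out in this message, so it is not implemented here, and the WTF-8 figures in the comments are just the ones quoted in this thread. The helper name `utf8_len` is made up for illustration.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes standard UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


samples = {
    "U+0041 LATIN CAPITAL LETTER A": 0x0041,      # thread: 1 byte in WTF-8
    "U+0085 NEXT LINE (a C1 control)": 0x0085,    # thread: 1 byte in WTF-8
    "U+03B1 GREEK SMALL LETTER ALPHA": 0x03B1,    # thread: 3 bytes in WTF-8
    "U+0436 CYRILLIC SMALL LETTER ZHE": 0x0436,   # thread: 3 bytes in WTF-8
}

for name, cp in samples.items():
    print(f"{name}: {utf8_len(cp)} byte(s) in UTF-8")
```

In standard UTF-8 the C1 control costs two bytes and the Greek and Cyrillic letters cost two bytes each, which is the trade-off described above seen from the other side.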

markus

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:14 EDT