From: Addison Phillips (addison@yahoo-inc.com)
Date: Tue Jan 23 2007 - 11:11:11 CST
> UTF-8 and UTF-16 provide better compression for *some* ranges at the expense of
> *others* - based on the authors' preferences for certain scripts. This is comparable
> to a 24bpp graphics encoding format, which would encode blue component with 8 bit at
> the expense of encoding red with 32 bit, simply because the author likes shades of
> blue and dislikes shades of red.
No, it isn't. Colors are evenly distributed and occur with equal frequency.
If the distribution of characters within Unicode and the distribution of
characters, languages, and documents were even, your assertion might be
true. But they aren't.
The Basic Multilingual Plane contains, by design, the vast preponderance
of the useful characters from the vast preponderance of living
languages. Scripts that are measurably rare, used for fictional or
extinct languages, or which represent historical variations are found in
the (lower) supplemental planes.
One might think that all the unassigned code points in Unicode will
eventually fill up, perhaps with more common characters than the current
supplemental collection. But the reality is that there just aren't that
many scripts in the world, that scripts change extremely slowly, and
that the currently unencoded scripts will always be rare. Further, the
distribution of characters in the real world will *never* favor UTF-24
for efficiency, even if we all switched to using a mix of cuneiform and
Bliss symbols tomorrow. For example, some measurements of the Web show
that fully 50% of the text consists of (ASCII) markup.
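A rough sketch of the arithmetic behind that point, in Python. It assumes "UTF-24" means a fixed three bytes per code point (the thread does not define it precisely), and the sample strings are made up for illustration, not taken from any measurement:

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte counts for text under UTF-8, UTF-16, and a hypothetical UTF-24."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # utf-16-le avoids counting a byte order mark
        "utf-16": len(text.encode("utf-16-le")),
        # Assumption: UTF-24 spends exactly 3 bytes on every code point.
        "utf-24": 3 * len(text),
    }


if __name__ == "__main__":
    cuneiform = "\U00012000" * 4  # four supplementary-plane code points
    samples = {
        "ascii markup": '<p class="note">hello</p>',
        "cuneiform only": cuneiform,
        "cuneiform in markup": "<p>" + cuneiform + "</p>",
    }
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

Under these assumptions the cuneiform-only string is the one case where the fixed 3-byte form comes out smallest. Wrapping the same four characters in even a minimal `<p>` element gives 23 bytes in UTF-8, 30 in UTF-16, and 33 in the 3-byte form, so a small amount of ASCII markup already erases the advantage.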
Addison
--
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.