Re: Proposing UTF-21/24

From: Addison Phillips (addison@yahoo-inc.com)
Date: Tue Jan 23 2007 - 11:11:11 CST


    > UTF-8 and UTF-16 provide better compression for *some* ranges at the expense of
    > *others* - based on the authors' preferences for certain scripts. This is comparable
    > to a 24bpp graphics encoding format, which would encode blue component with 8 bit at
    > the expense of encoding red with 32 bit, simply because the author likes shades of
    > blue and dislikes shades of red.

    No, it isn't. Colors are evenly distributed and occur with equal frequency.

    If the distribution of characters within Unicode and the distribution of
    characters, languages, and documents were even, your assertion might be
    true. But they aren't.

    The Basic Multilingual Plane contains, by design, the vast preponderance
    of the useful characters from the vast preponderance of living
    languages. Scripts that are measurably rare, used for fictional or
    extinct languages, or which represent historical variations are found in
    the (lower) supplementary planes.

    One might think that all the unassigned code points in Unicode will
    eventually fill up, perhaps with more common characters than the current
    supplemental collection. But the reality is that there just aren't that
    many scripts in the world, that scripts change extremely slowly, and
    that the currently unencoded scripts will always be rare. Further, the
    distribution of characters in the real world will *never* favor UTF-24
    for efficiency, even if we all switched to using a mix of cuneiform and
    Bliss symbols tomorrow. For example, some measurements of the Web show
    that fully 50% of the text consists of (ASCII) markup.
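
To make the size argument concrete, here is a rough sketch in Python that counts bytes for a few short samples. It assumes the proposed UTF-24 is a fixed three bytes per code point (my reading of the proposal, not a published spec). The helper name encoded_sizes and the sample strings are made up for illustration, and the "half markup" sample takes "50%" to mean half of the code points.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of `text` in three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # "-le" avoids counting a BOM
        # Assumption: UTF-24 stores every code point in exactly 3 bytes.
        "utf-24": 3 * len(text),  # len() counts code points in Python 3
    }


samples = {
    "ASCII markup": '<p class="x">',
    "CJK (BMP)": "\u65e5\u672c\u8a9e",
    "cuneiform (supplementary)": "\U00012000\U00012001",
    # 7 ASCII markup characters wrapped around 7 cuneiform signs
    "half markup, half cuneiform": "<p>" + "\U00012000" * 7 + "</p>",
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

Under those assumptions the fixed-width form only comes out ahead on the pure supplementary-plane sample (6 bytes against 8). Once half of the code points are ASCII markup, UTF-8 needs 35 bytes where both UTF-16 and the fixed-width form need 42.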

    Addison

    -- 
    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Internationalization is an architecture.
    It is not a feature.
    

