Unicode 6.0 Sorting

From: announcements@unicode.org
Date: Fri Oct 29 2010 - 20:06:49 CDT

    Mountain View, CA, USA – October 29, 2010 – Unicode Technical Standard #10,
    Unicode Collation Algorithm (UCA), has been updated for Unicode Version 6.0,
    adding support for 2,088 characters in sorting, searching, and matching.
    This release also includes new data files to support the Unicode Common
    Locale Data Repository (CLDR), which provides customization for different
    languages.

    Reorderable Categories. The data files for CLDR order characters strictly by
    certain major categories. This allows programmers to parametrically reorder
    these groups of characters to put them in the desired order for different
    languages. For example, numbers can be ordered after letters, or Cyrillic
    before Latin. The reorderable categories are:

    whitespace, punctuation, general symbols, currency symbols, and numbers,
    then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally,
    CJK.
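
    The idea of parametric reordering can be sketched in a few lines of
    Python. This is a toy model only, not the UCA or CLDR implementation: the
    group names, the helpers group_of and make_key, and the use of general
    categories and character names to guess a group are all assumptions made
    for illustration.

```python
import unicodedata

# Hypothetical group names, listed in a default order modeled on the text above.
DEFAULT_ORDER = ["space", "punct", "symbol", "currency", "digit",
                 "Latin", "Greek", "Cyrillic", "other"]

def group_of(ch):
    """Guess a reorderable group from the general category or character name."""
    cat = unicodedata.category(ch)
    if cat.startswith("Z"):
        return "space"
    if cat.startswith("P"):
        return "punct"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    if cat == "Nd":
        return "digit"
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "other"

def make_key(order=DEFAULT_ORDER):
    """Build a sort key that ranks each character by its group first."""
    rank = {group: i for i, group in enumerate(order)}
    def key(s):
        return [(rank[group_of(ch)], ch.casefold()) for ch in s]
    return key

words = ["zebra", "123", "яблоко", "apple"]

cyrillic_first = ["space", "punct", "symbol", "currency", "digit",
                  "Cyrillic", "Latin", "Greek", "other"]
digits_after_letters = ["space", "punct", "symbol", "currency",
                        "Latin", "Greek", "Cyrillic", "digit", "other"]

default_sorted = sorted(words, key=make_key())
cyrillic_sorted = sorted(words, key=make_key(cyrillic_first))
digits_last_sorted = sorted(words, key=make_key(digits_after_letters))

print(default_sorted)      # ['123', 'apple', 'zebra', 'яблоко']
print(cyrillic_sorted)     # ['123', 'яблоко', 'apple', 'zebra']
print(digits_last_sorted)  # ['apple', 'zebra', 'яблоко', '123']
```

    Only the order list changes between the three calls; the per-character
    comparison inside each group stays the same, which is the point of
    keeping the groups strictly separated in the data.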

    Distinguishing Symbols from Punctuation. UCA provides an option for ignoring
    certain characters when comparing strings. By default, these are whitespace,
    punctuation, and general symbols. The data files for CLDR modify that
    default so that symbols are compared significantly, while still ignoring
    whitespace and punctuation. Thus, for example, "I♥NY" is not sorted the same
    as "I☠NY".
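
    The difference between the two defaults can be illustrated with a rough
    Python sketch. It assumes that "ignoring" can be modeled by simply
    dropping characters before comparison and that general categories
    approximate the UCA groups; the real algorithm uses variable weighting
    instead. The function comparison_key and its ignore_symbols parameter
    are hypothetical names.

```python
import unicodedata

def comparison_key(s, ignore_symbols):
    """Drop whitespace and punctuation; optionally drop general symbols too.

    ignore_symbols=True  models the UCA default described above.
    ignore_symbols=False models the CLDR default described above.
    Currency symbols (category Sc) are always kept.
    """
    kept = []
    for ch in s:
        cat = unicodedata.category(ch)
        if cat.startswith(("Z", "P")):
            continue  # whitespace and punctuation are ignored in both models
        if ignore_symbols and cat.startswith("S") and cat != "Sc":
            continue  # general symbols are ignored only in the UCA-style model
        kept.append(ch)
    return "".join(kept)

a, b = "I♥NY", "I☠NY"

uca_equal = comparison_key(a, True) == comparison_key(b, True)
cldr_equal = comparison_key(a, False) == comparison_key(b, False)

print(uca_equal)   # True: both reduce to "INY"
print(cldr_equal)  # False: the two symbols are compared
```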

    Special Database Values. The data files for CLDR provide special weights for
    two noncharacters:

    1. A special noncharacter <HIGH> (U+FFFF) for specification of a range in a
    database, allowing "Sch" ≤ X ≤ "Sch<HIGH>" to pick all strings starting with
    "Sch" plus those that sort equivalently.

    2. A special noncharacter <LOW> (U+FFFE) for merged database fields,
    allowing "Disílva<LOW>John" to sort next to "Disilva<LOW>John".
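
    Both uses can be approximated in Python without a collation library.
    The sketch below is an assumption-laden stand-in, not the CLDR data: the
    helper primary fakes a primary-level key by casefolding and stripping
    combining marks, the range query relies on U+FFFF comparing above the
    other characters in the sample data, and the field separator is modeled
    by comparing lists of fields rather than by a real <LOW> weight. The
    names primary, prefix_range, and merged_key are hypothetical.

```python
import bisect
import unicodedata

HIGH = "\uffff"  # noncharacter used here as an upper bound for a range

def primary(s):
    """Toy primary key: casefold and drop combining marks (not real UCA)."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def prefix_range(sorted_keys, prefix):
    """Return the keys X with prefix <= X <= prefix + HIGH."""
    low = bisect.bisect_left(sorted_keys, primary(prefix))
    high = bisect.bisect_right(sorted_keys, primary(prefix) + HIGH)
    return sorted_keys[low:high]

def merged_key(fields):
    """Compare all fields at the toy primary level before any accent differences.

    A list of fields compares field by field, so a field boundary sorts
    below every character, which is the role <LOW> plays in the text above.
    """
    first_level = [primary(f) for f in fields]
    second_level = [unicodedata.normalize("NFD", f) for f in fields]
    return (first_level, second_level)

names = ["Sachs", "Schmidt", "schön", "Schultz", "Scott", "Sczepanski"]
keys = sorted(primary(n) for n in names)
sch_matches = prefix_range(keys, "Sch")
print(sch_matches)  # ['schmidt', 'schon', 'schultz']

records = [("Disilva", "Zack"), ("Disílva", "John"), ("Disilva", "John")]
merged_sorted = sorted(records, key=merged_key)
print(merged_sorted)
# [('Disilva', 'John'), ('Disílva', 'John'), ('Disilva', 'Zack')]
```

    In the second result the two "John" records end up adjacent, because the
    accent on the family name is only consulted after both fields have
    compared equal at the first level.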

    The version of CLDR using these new data files is planned for release at the
    start of December, 2010.

    The text of the UCA standard has been clarified in several areas.
    Implementers should pay special attention to the changes regarding
    ill-formed sequences, noncharacters, and unassigned code points in CJK blocks.

    For more information, see:

    * The UCA Standard 6.0.0: http://www.unicode.org/reports/tr10/
    * The UCA charts: http://unicode.org/charts/collation/
    * The UCA data: http://unicode.org/Public/UCA/6.0.0/
    * Merged database fields: http://unicode.org/reports/tr10/#Interleaved_Levels

    About The Unicode Consortium

    The Unicode Consortium is a non-profit organization founded to develop,
    extend and promote use of the Unicode Standard and related globalization
    standards. The membership of the consortium represents a broad spectrum of
    corporations and organizations in the computer and information processing
    industry.

    Members are: Adobe, Apple, Google, Government of Bangladesh, Government of
    India, IBM, Microsoft, Monotype Imaging, Oracle, The Society for Natural
    Language Technology Research, SAP, The University of California (Berkeley),
    The University of California (Santa Cruz), Yahoo!, plus well over a hundred
    Associate, Liaison, and Individual members.

    For more information, please contact the Unicode Consortium.
    http://www.unicode.org/contacts.html


