From: announcements@unicode.org
Date: Fri Oct 29 2010 - 20:06:49 CDT
Mountain View, CA, USA – October 29, 2010 – Unicode Technical Standard #10,
the Unicode Collation Algorithm (UCA), has been updated for Unicode
Version 6.0, adding support for 2,088 characters in sorting, searching, and
matching. This release also includes new data files supporting the Unicode
Common Locale Data Repository (CLDR), which provides customization for
different languages.
Reorderable Categories. The data files for CLDR order characters strictly by
certain major categories. This allows programmers to parametrically reorder
these groups of characters to put them in the desired order for different
languages. For example, numbers can be ordered after letters, or Cyrillic
before Latin. The reorderable categories are:
whitespace, punctuation, general symbols, currency symbols, and numbers,
then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally,
CJK.
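As a rough illustration of parametric reordering, the Python sketch below
sorts strings by a caller-supplied group order. The group labels, the
group_of classifier, and make_key are hypothetical names invented for this
example; a real implementation would use the weights in the CLDR data files
rather than Unicode general categories and character names.

```python
import unicodedata

# Hypothetical group labels; the real groups and their weights come from
# the CLDR data files, not from this list.
DEFAULT_ORDER = ["whitespace", "punctuation", "symbol", "currency", "digit",
                 "Latin", "Greek", "Cyrillic"]

def group_of(ch):
    """Classify one character into a toy reorderable group."""
    cat = unicodedata.category(ch)
    if ch.isspace():
        return "whitespace"
    if cat.startswith("P"):
        return "punctuation"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    if cat == "Nd":
        return "digit"
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "other"

def make_key(order):
    """Build a sort key that ranks characters by their group's position."""
    rank = {group: i for i, group in enumerate(order)}
    def key(s):
        # Unlisted groups sort after all listed ones; the code point is
        # only a stand-in for real collation weights within a group.
        return [(rank.get(group_of(ch), len(order)), ord(ch))
                for ch in s.casefold()]
    return key

words = ["apple", "42", "яблоко"]

# Default order: numbers, then Latin, then Cyrillic.
default_sorted = sorted(words, key=make_key(DEFAULT_ORDER))

# Reordered: Cyrillic before Latin, numbers after letters.
custom_order = ["whitespace", "punctuation", "symbol", "currency",
                "Cyrillic", "Latin", "Greek", "digit"]
custom_sorted = sorted(words, key=make_key(custom_order))
```

Only the order list changes between the two calls; the classification of
each character stays the same, which is the point of keeping the groups
strictly separated in the data.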
Distinguishing Symbols from Punctuation. UCA provides an option for ignoring
certain characters when comparing strings. By default, these are whitespace,
punctuation, and general symbols. The data files for CLDR modify that
default so that symbols are compared significantly, while still ignoring
whitespace and punctuation. Thus, for example, "I♥NY" is not sorted the same
as "I☠NY".
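The difference between the two defaults can be sketched in Python as
follows. This is a simplified model, not the UCA itself: it approximates
"general symbols" by the Unicode general categories Sk, Sm, and So, and the
function names is_ignored and comparison_key are made up for the example.

```python
import unicodedata

def is_ignored(ch, ignore_symbols):
    """Return True if the character is skipped when comparing strings."""
    cat = unicodedata.category(ch)
    # Whitespace and punctuation are ignored under both defaults.
    if ch.isspace() or cat.startswith("P"):
        return True
    # Assumption: general symbols are approximated by Sk, Sm, and So.
    # Currency symbols (Sc) are kept significant in this sketch.
    return ignore_symbols and cat in ("Sk", "Sm", "So")

def comparison_key(s, ignore_symbols):
    """Drop ignored characters; what remains is what gets compared."""
    return "".join(ch for ch in s if not is_ignored(ch, ignore_symbols))

heart = "I\u2665NY"   # I♥NY
skull = "I\u2620NY"   # I☠NY

# UCA default: symbols are ignored, so both strings reduce to "INY".
uca_equal = comparison_key(heart, True) == comparison_key(skull, True)

# CLDR default: symbols stay significant, so the strings differ.
cldr_equal = comparison_key(heart, False) == comparison_key(skull, False)
```

With ignore_symbols set to False, "I-NY" and "I NY" still reduce to the
same key, because punctuation and whitespace remain ignored.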
Special Database Values. The data files for CLDR provide special weights for
two noncharacters:
1. A special noncharacter <HIGH> (U+FFFF) for specification of a range in a
database, allowing "Sch" ≤ X ≤ "Sch<HIGH>" to pick all strings starting with
"sch" plus those that sort equivalently.
2. A special noncharacter <LOW> (U+FFFE) for merged database fields,
allowing "Disílva<LOW>John" to sort next to "Disilva<LOW>John".
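Both special values can be illustrated with a toy key function in Python.
Everything here is an assumption made for the sketch: toy_primary stands in
for a real primary-level collation key by stripping accents and case, the
weights -1 and 0x110000 are arbitrary values placed below and above every
code point, and the original string is used as a crude tie-breaker in place
of the secondary and tertiary levels.

```python
import unicodedata

LOW = "\uFFFE"    # sorts below every other character in this sketch
HIGH = "\uFFFF"   # sorts above every other character in this sketch

def toy_primary(s):
    """Toy primary key: ignores accents and case, special-cases LOW/HIGH."""
    weights = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            continue                      # accents do not count at this level
        if ch == LOW:
            weights.append(-1)
        elif ch == HIGH:
            weights.append(0x110000)
        else:
            weights.extend(ord(c) for c in ch.casefold())
    return weights

def toy_key(s):
    # The original string breaks ties between primary-equal strings.
    return (toy_primary(s), s)

# 1. Range query: "Sch" <= X <= "Sch<HIGH>".
names = ["Scanlon", "Sch", "schmidt", "Sch\u00e4fer", "Schulz", "Sci", "Tanaka"]
lower, upper = toy_primary("Sch"), toy_primary("Sch" + HIGH)
matches = [n for n in names if lower <= toy_primary(n) <= upper]

# 2. Merged fields: last name and first name joined with <LOW>.
records = [("Disilvas", "Adam"), ("Dis\u00edlva", "John"),
           ("Disilva", "John"), ("Disilva", "Adam")]
merged = sorted((last + LOW + first for last, first in records), key=toy_key)
readable = [m.replace(LOW, "<LOW>") for m in merged]
```

In this sketch the range query returns "Sch", "schmidt", "Schäfer", and
"Schulz". The merged list places "Disilva<LOW>John" directly before
"Disílva<LOW>John", and "Disilvas<LOW>Adam" comes last because <LOW> sorts
below the letter "s".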
The version of CLDR using these new data files is planned for release at the
start of December, 2010.
The text of the UCA standard has also been clarified in several areas.
Implementers should pay special attention to the changes regarding
ill-formed sequences, noncharacters, and unassigned code points in CJK blocks.
For more information, see:
* The UCA Standard 6.0.0: http://www.unicode.org/reports/tr10/
* The UCA charts: http://unicode.org/charts/collation/
* The UCA data: http://unicode.org/Public/UCA/6.0.0/
* Merged database fields: http://unicode.org/reports/tr10/#Interleaved_Levels
About The Unicode Consortium
The Unicode Consortium is a non-profit organization founded to develop,
extend and promote use of the Unicode Standard and related globalization
standards. The membership of the consortium represents a broad spectrum of
corporations and organizations in the computer and information processing
industry.
Members are: Adobe, Apple, Google, Government of Bangladesh, Government of
India, IBM, Microsoft, Monotype Imaging, Oracle, The Society for Natural
Language Technology Research, SAP, The University of California (Berkeley),
The University of California (Santa Cruz), Yahoo!, plus well over a hundred
Associate, Liaison, and Individual members.
For more information, please contact the Unicode Consortium.
http://www.unicode.org/contacts.html