BETA Unicode 5.2.0
The next version of the Unicode Standard will be Version 5.2.0.
The beta version of the summary for Unicode 5.2.0 is located
at:
http://www.unicode.org/versions/Unicode5.2.0/
This version is planned for release in
October 2009. A beta version of the 5.2.0 Unicode Character Database
files is also available for public comment. We strongly encourage
implementers to download these files and test them with their
programs well before the end of the beta period. These files are located in:
http://www.unicode.org/Public/5.2.0/ or
ftp://www.unicode.org/Public/5.2.0/
Draft code charts are also available for beta review. Please
check the code charts carefully to verify correctness of
the new characters added to Unicode 5.2 and to ensure that
there are no regressions for previously encoded characters.
The draft code charts are located in:
http://www.unicode.org/Public/5.2.0/charts/ or
ftp://www.unicode.org/Public/5.2.0/charts/
For guidance on how to focus your review, see the section
Notable Issues for Beta Testers below.
Any comments on the beta Unicode 5.2.0, the UCD 5.2.0, or the
5.2.0 UAXes should be reported using the Unicode reporting form.
The comment period ended August 3, 2009.
All substantive comments had to be received by that date for
consideration at the August UTC meeting. Editorial comments (typos,
etc.) may still be submitted after that date for consideration in the final
editorial work.
Note: All beta files may be updated, replaced, or
superseded by other files at any time. The beta files will be
discarded once Unicode 5.2.0 is final. It is inappropriate to cite
these files as anything other than a work in progress. No
products or implementations should be released based on the beta
UCD data files; use only the final, approved Version 5.2.0 data
files, expected in October 2009.
The Unicode Consortium provides early access to updated versions of the data files
and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of
Version 5.2.0.
The assignment of characters for Unicode 5.2.0 is now stable. There will be no further additions or modifications of code points.
One of the main purposes of the beta review period, however, is to verify and
correct the preliminary character property assignments in the Unicode Character
Database. Reviewers should check for property changes to existing Unicode 5.1.0
characters, as well as the property values for the new Unicode 5.2.0 character
additions. To facilitate verification of the property changes and additions,
diffable XML versions of the Unicode Character Database are available. These XML
files are dated, so that people can check the details of changes that occurred
during the beta review period. The XML files are in the
http://www.unicode.org/Public/5.2.0/diffs/ directory. For more information,
see the diffs.readme.txt file.
The beta review period is a good opportunity to add support for the new
Unicode 5.2.0 characters in internal versions of software, so that software can
be tested to verify that the new characters and property assignments don't cause
problems when upgraded to Version 5.2.0 of Unicode.
Notable Issues for Beta Testers
All Unicode Standard Annexes are being modified in
Unicode 5.2.0, often in coordination with changes in properties. To see the
current proposed updates to particular UAXes, see
Proposed Updates for Standard Annexes.
Particular issues in the UAXes are also the focus of specific
Public Review Issues.
Each proposed change in a UAX is highlighted, so that you can focus
your review on those sections if you have limited time. The changes
are also listed in each Modifications section (linked from the table
of contents), so you can check on those areas that might be of most
interest. Some links between beta documents and the proposed
updates for UAXes will not work correctly during the
beta review period. This is a known problem that does
not need to be reported: such links point to
the eventual final names or revision numbers of the
released versions.
The documentation for the UCD has been consolidated into the
Proposed Update for UAX #44. Please review this carefully; there have
been extensive changes involved in this consolidation.
Please check the following carefully:
- There is a new BidiTest file with test cases for
assessing conformance to the Unicode Bidirectional Algorithm. The format
and data need careful review.
- There are three new characters in the newly encoded Kaithi script that
will require changes in implementations which make hard-coded assumptions
about composition during normalization. Most new characters added to
the standard with decompositions cannot be generated by the operations
toNFC() or toNFKC(), but these three can. Implementers should check their
code carefully to ensure that it handles these three characters correctly.
- U+1109A KAITHI LETTER DDDHA
- U+1109C KAITHI LETTER RHA
- U+110AB KAITHI LETTER VA
- Any hard-coded range assumptions about Unified CJK Ideographs in
implementations may need fixing, because the end range for those has changed
from U+9FC3 to U+9FCB in this version. There is also an entirely new block of
CJK Unified Ideographs: CJK Unified Ideographs Extension C (U+2A700..U+2B73F),
with characters encoded in the range U+2A700 to U+2B734.
- There is now an assigned Hangul jamo character at U+11A7. This may interfere with some
implementations' boundary testing for Hangul decomposition.
- There are new case-related properties in DerivedCoreProperties.txt and
DerivedNormalizationProps.txt that should be reviewed carefully. The new case-related
derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased,
Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped,
Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
- New uppercase parenthesized symbols have been added. Unlike the circled letter symbols,
there are no uppercase/lowercase relationships for these new characters, to
match the existing treatment of the lowercase parenthesized letter symbols.
- Contributory is considered to be a distinct status for a
Unicode character property. Contributory properties are neither normative
nor informative. This distinct status is marked in the property
table.
- Two new joining groups, FARSI YEH and NYA, were added.
- There is a new file, CJKRadicals.txt. Unlike
other files, the first field is not a code point number.
- The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category.
See the Proposed Update for UAX #38 for details.
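
As an illustration of the Kaithi item above, the following is a minimal sketch of a round-trip check for the three characters. It assumes a Python runtime whose unicodedata module is based on Unicode 5.2 or later; the helper name round_trips is hypothetical. U+0958 DEVANAGARI LETTER QA is included only as a contrasting example of a character with a canonical decomposition that toNFC() does not generate.

```python
import unicodedata

# The three Kaithi characters named in the list above.
KAITHI_COMPOSING = ["\U0001109A", "\U0001109C", "\U000110AB"]

def round_trips(ch):
    """Return True if ch decomposes under NFD and NFC recomposes it.

    Hypothetical helper for illustration; assumes the runtime's
    unicodedata tables are Unicode 5.2 or later.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) > 1 and unicodedata.normalize("NFC", decomposed) == ch

for ch in KAITHI_COMPOSING:
    print("U+%04X %s: %s" % (ord(ch), unicodedata.name(ch), round_trips(ch)))

# Contrast: U+0958 has a canonical decomposition but is excluded from composition.
print("U+0958:", round_trips("\u0958"))
```

An implementation that builds its composition table from a hard-coded list, rather than from the UCD data files, would fail a check of this kind for the three Kaithi characters.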
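
For the CJK range item above, one way to avoid hard-coded assumptions is to keep the ranges in a table keyed by version. The sketch below is illustrative only: the function and table names are hypothetical, the start of the main range (U+4E00) is supplied as an assumption not stated in the list, and other ideograph ranges (such as Extensions A and B) are deliberately omitted.

```python
# Main CJK Unified Ideographs range; the end point differs by version (see the list above).
URO_START = 0x4E00  # assumption: not stated in the list above
URO_END = {"5.1.0": 0x9FC3, "5.2.0": 0x9FCB}

# Assigned characters in CJK Unified Ideographs Extension C, new in 5.2.0.
EXT_C_ASSIGNED = (0x2A700, 0x2B734)

def unified_ranges(version):
    """Return the (first, last) ranges covered by this sketch for a version."""
    ranges = [(URO_START, URO_END[version])]
    if version == "5.2.0":
        ranges.append(EXT_C_ASSIGNED)
    return ranges

def is_unified_ideograph(cp, version="5.2.0"):
    """Partial check: covers only the ranges discussed in the list above."""
    return any(first <= cp <= last for first, last in unified_ranges(version))

# U+9FC4 is outside the 5.1.0 range but inside the 5.2.0 range.
print(is_unified_ideograph(0x9FC4, "5.1.0"), is_unified_ideograph(0x9FC4, "5.2.0"))
```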
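
For the Hangul item above, the concern arises because the arithmetic Hangul syllable decomposition conventionally uses U+11A7 as the trailing consonant base, with index 0 meaning "no trailing consonant". The following sketch uses the usual constants for that algorithm; the function names are hypothetical. It shows the two places where U+11A7 must be treated as a boundary rather than as a composing jamo.

```python
# Conventional constants for arithmetic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = L_COUNT * N_COUNT   # 11172

def decompose_hangul(cp):
    """Decompose a precomposed Hangul syllable into its jamo code points."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]  # not a precomposed Hangul syllable
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    result = [leading, vowel]
    # Index 0 means no trailing consonant: never emit T_BASE (U+11A7) itself.
    if t_index != 0:
        result.append(T_BASE + t_index)
    return result

def is_composing_trailing_jamo(cp):
    """True only for trailing consonants that take part in syllable composition.

    The lower bound is exclusive, so the newly assigned U+11A7 is rejected.
    """
    return T_BASE < cp < T_BASE + T_COUNT

print([hex(c) for c in decompose_hangul(0xAC01)])
print(is_composing_trailing_jamo(0x11A7))
```

Code that tested the boundary with "assigned or not" logic, rather than with an explicit exclusive bound, is the kind of implementation this item asks reviewers to recheck.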
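
To review the new case-related derived properties listed above, it can help to load the data files into a simple table and compare versions. The following is a minimal sketch of a reader for the semicolon-delimited UCD layout; the function name is hypothetical, and the sample lines are illustrative rather than copied from the beta files. Only the first two fields are kept, so any additional fields on a line are ignored here.

```python
# Illustrative lines in the UCD data file layout (not copied from the beta files).
SAMPLE = """\
# code point or range ; property name # comment
0041..005A    ; Cased # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0027          ; Case_Ignorable # APOSTROPHE
0041..005A    ; Changes_When_Lowercased
"""

def parse_ucd_properties(text):
    """Map each property name to a list of inclusive (first, last) ranges."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props.setdefault(fields[1], []).append((start, end))
    return props

props = parse_ucd_properties(SAMPLE)
for name in sorted(props):
    print(name, ["%04X..%04X" % r for r in props[name]])
```

Running the same reader over the 5.1.0 and beta 5.2.0 files and comparing the resulting tables is one way to spot unexpected property changes.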