BETA Unicode 6.1.0


The next version of the Unicode Standard will be Version 6.1.0, planned for release in February, 2012. A beta version of the 6.1.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 6.1.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. Beta code charts are also available for review. We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 6.1.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

The Version 6.1.0 draft of Chapter 3, Conformance, is also posted for review. Users of the Unicode Standard should take advantage of this opportunity to provide feedback on that text as well.

Summary description Unicode 6.1.0

Unicode character database (UCD) http, ftp

Summary of beta charts Readme.txt

Single-block charts with yellow highlighting for new characters delta charts

Single block charts for all of Unicode 6.1.0 http, ftp

Code charts - single download (89MB) http, ftp

Related Unicode Technical Standards

In addition to the Unicode Standard proper, two other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 6.1.0. Review of that text and data is also encouraged during the beta review period.

Unicode Collation Algorithm Data files: http, ftp

Unicode IDNA Compatibility Processing Data files: http, ftp

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 6.1.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #206. Comments on specific Version 6.1.0 UAXes and UTSes should refer to the respective Public Review Issue numbers for each document.

The comment period ends October 24, 2011. All substantive technical comments must be received by that date for consideration at the November UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 6.1.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files -- use only the final, approved Version 6.1.0 data files, expected in February, 2012.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 6.1.0.

The assignment of characters for Unicode 6.1.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 6.1.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 6.0.0 characters, as well as the property values for the new Unicode 6.1.0 character additions.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. The XML files are in the http://www.unicode.org/Public/6.1.0/diffs/ directory. For more information, see the diffs.readme.txt file.

The beta review period is a good opportunity to add support for the new Unicode 6.1.0 characters in internal versions of software. That software can then be tested to verify that the new characters and property assignments do not cause problems when it is upgraded to Version 6.1.0 of Unicode.

Notable Issues for Beta Reviewers

All Unicode Standard Annexes are being modified in Unicode 6.1.0, often in coordination with changes to character properties. For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest. Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem and does not need to be reported: those links point to the eventual final names or revision numbers of the released versions.

The following blocks are new in Unicode 6.1.0. Check implementations carefully for any range or property value assumptions regarding these new blocks.

Block Name                                Range
Arabic Extended-A                         U+08A0..U+08FF
Sundanese Supplement                      U+1CC0..U+1CCF
Meetei Mayek Extensions                   U+AAE0..U+AAFF
Meroitic Hieroglyphs                      U+10980..U+1099F
Meroitic Cursive                          U+109A0..U+109FF
Sora Sompeng                              U+110D0..U+110FF
Chakma                                    U+11100..U+1114F
Sharada                                   U+11180..U+111DF
Takri                                     U+11680..U+116CF
Miao                                      U+16F00..U+16F9F
Arabic Mathematical Alphabetic Symbols    U+1EE00..U+1EEFF
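One way to check range assumptions is to put the new block ranges into a small lookup table and probe the boundaries. The following is a minimal sketch only: the names NEW_BLOCKS_6_1 and new_block_for are hypothetical, and the ranges are copied from the table above rather than read from the UCD Blocks.txt file, which a real implementation would normally parse.

```python
# Block ranges copied from the table above (Unicode 6.1.0 beta).
# NEW_BLOCKS_6_1 and new_block_for are illustrative names, not part of any API.
NEW_BLOCKS_6_1 = [
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0x1CC0, 0x1CCF, "Sundanese Supplement"),
    (0xAAE0, 0xAAFF, "Meetei Mayek Extensions"),
    (0x10980, 0x1099F, "Meroitic Hieroglyphs"),
    (0x109A0, 0x109FF, "Meroitic Cursive"),
    (0x110D0, 0x110FF, "Sora Sompeng"),
    (0x11100, 0x1114F, "Chakma"),
    (0x11180, 0x111DF, "Sharada"),
    (0x11680, 0x116CF, "Takri"),
    (0x16F00, 0x16F9F, "Miao"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]


def new_block_for(cp):
    """Return the name of the new 6.1.0 block containing code point cp, or None."""
    for start, end, name in NEW_BLOCKS_6_1:
        if start <= cp <= end:
            return name
    return None
```

Probing the first and last code point of each range, and the code points just outside, is a quick way to catch off-by-one errors in an implementation's own block table.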

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.

Please also check the following specific items carefully:

Two new Chakma characters, U+1112E and U+1112F, have canonical decompositions. This is unusual for characters off the BMP, and may break certain assumptions used in optimization of implementations of Unicode Normalization. Check that any hard-coded assumptions about normalization take these characters into account, and that the characters correctly recompose for NFC.
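A round-trip test is one simple way to exercise this case. The sketch below uses Python's standard unicodedata module as a stand-in for the implementation under test; it assumes the module was built against Unicode 6.1.0 or later, which the helper supports_6_1 checks. The helper names are hypothetical.

```python
import unicodedata

# The two Chakma characters named above, which have canonical decompositions.
CHAKMA_COMPOSITES = ["\U0001112E", "\U0001112F"]


def supports_6_1():
    """True if this Python's Unicode data is at least Version 6.1.0."""
    version = tuple(int(part) for part in unicodedata.unidata_version.split("."))
    return version >= (6, 1, 0)


def decomposes_and_recomposes(ch):
    """True if ch changes under NFD and NFC restores the original character."""
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", nfd)
    return nfd != ch and nfc == ch


if supports_6_1():
    for ch in CHAKMA_COMPOSITES:
        print("U+%04X" % ord(ch), decomposes_and_recomposes(ch))
```

The same two checks can be pointed at any normalization routine that maps a string to a string, which makes it easy to spot fast paths that assume supplementary-plane characters never decompose.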

The default Bidi_Class for two ranges, U+08A0..U+08FF and U+1EE00..U+1EEFF, has been changed from bc=R to bc=AL, because the new blocks for those ranges now contain Arabic characters. Check that default Bidi_Class settings for those ranges are updated accordingly in property tables and in implementations of the Unicode Bidirectional Algorithm.
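As an illustration of where the default matters, the sketch below falls back to a default Bidi_Class when a code point is unassigned. It is deliberately partial: it lists only the two ranges named above, while a complete table contains other default ranges that are not shown here. The names DEFAULT_AL_RANGES_6_1, default_bidi_class and bidi_class are hypothetical, and Python's unicodedata module stands in for an implementation's own property table.

```python
import unicodedata

# Only the two ranges whose default changed from R to AL in 6.1.0.
# A complete default table has further ranges that are omitted here.
DEFAULT_AL_RANGES_6_1 = [(0x08A0, 0x08FF), (0x1EE00, 0x1EEFF)]


def default_bidi_class(cp):
    """Return "AL" if cp lies in one of the two changed ranges, else None."""
    for start, end in DEFAULT_AL_RANGES_6_1:
        if start <= cp <= end:
            return "AL"
    return None


def bidi_class(cp):
    """Bidi_Class of an assigned character, else the partial default above."""
    # unicodedata.bidirectional returns "" when no value is defined.
    assigned = unicodedata.bidirectional(chr(cp))
    return assigned if assigned else default_bidi_class(cp)
```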

A new line break class has been added for Hebrew letters: lb=HL. This is used in the definition of a new rule, LB21a, in UAX #14, for handling line breaking for Hebrew characters next to hyphens. Implementations of Unicode line breaking should check that they can correctly handle this additional line break class.
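The check can be sketched as a predicate over line break classes. This assumes that the rule has the form HL (HY | BA) ×, that is, no break after a hyphen or break-after character that follows a Hebrew letter; reviewers should confirm the exact wording against the proposed update of UAX #14. The function name is hypothetical, and the caller is assumed to have already resolved each character to its line break class.

```python
def lb21a_prohibits_break(class_before_prev, class_prev):
    """Sketch of LB21a, assumed to read: HL (HY | BA) x.

    class_before_prev and class_prev are the line break classes of the two
    characters that precede the candidate break position.
    """
    return class_before_prev == "HL" and class_prev in ("HY", "BA")
```

An implementation that has no HL class at all will never take this path, so a test with a Hebrew letter followed by a hyphen is a useful smoke test for the new class.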

An additional unified ideograph has been added to the main BMP block of CJK unified ideographs: U+9FCC. This extends the range of those CJK unified ideographs by one value. Check implementations for any hard-coded assumptions about the ranges of CJK unified ideographs.
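A hard-coded range test of the kind described might look like the sketch below. The constant and function names are hypothetical; the start of the main block, U+4E00, and the previous last ideograph, U+9FCB, are stated here as assumptions consistent with the range being extended by one value.

```python
CJK_MAIN_FIRST = 0x4E00      # first ideograph in the main BMP block
CJK_MAIN_LAST_6_0 = 0x9FCB   # last assigned ideograph before this change
CJK_MAIN_LAST_6_1 = 0x9FCC   # last assigned ideograph in Unicode 6.1.0


def is_main_block_ideograph(cp, last=CJK_MAIN_LAST_6_1):
    """True if cp is an assigned ideograph in the main BMP CJK block."""
    return CJK_MAIN_FIRST <= cp <= last


# A table still using the 6.0.0 upper bound misclassifies U+9FCC.
print(is_main_block_ideograph(0x9FCC))                          # True
print(is_main_block_ideograph(0x9FCC, last=CJK_MAIN_LAST_6_0))  # False
```

Keeping the upper bound in one named constant, rather than repeating the literal, makes this kind of one-value extension a single-line change.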

Some characters have had their General_Category values changed from Symbol to either Other_Punctuation (Po) or Other_Number (No). This change does not affect the derivation of identifier-related properties, but may impact assumptions about those characters in some implementations. The change was made to simplify certain kinds of tailoring for the Unicode Collation Algorithm.

The list of scripts recommended for inclusion in or exclusion from identifiers has been updated in UAX #31. That list is not available in machine-readable form in the UCD, so implementations which tailor their identifier usage according to the UAX #31 recommendations will need to refer specifically to that annex for updates.

The Syriac shaping rules specified in Section 8.3, Syriac, of the core specification have been clarified, so that it is clear that the term "dalath or rish" refers to characters with Joining_Group=Dalath_Rish. Also "word breaking character" in the alaph joining rules has been corrected to "non-joining character". Implementers with Syriac shaping engines should check to ensure that their implementations are consistent with those clarifications.

The kTraditionalVariant and kSimplifiedVariant tags and their usage in the Unihan Database have been more fully specified. Implementations which use that data to do simplified/traditional mapping of CJK characters may need to be updated.

The meaning and content of the kMandarin tag in the Unihan Database have been more fully specified. Implementations which use that data to surface pronunciation data for CJK characters may need to be updated.

The meaning and content of the kTotalStrokes tag in the Unihan Database have also been more fully specified. This may impact implementations which use that data as the basis for stroke counts for CJK characters.