BETA Unicode® 9.0.0

Note: The beta review period for Unicode 9.0.0 has closed, as of May 13, 2016. Feedback received during the public review is accessible from PRI #323. This beta review page is left active, however, for convenience of access to the prepublication versions of the Unicode 9.0.0 data files and annexes, until the formal release planned for mid-June, 2016.

The next version of the Unicode Standard will be Version 9.0.0, planned for release in June, 2016. This version updates several annexes to deal with segmentation issues for sequences of characters displayed as emoji, and adds significant new repertoire. A total of 7,500 new characters are encoded, including numerous popular emoji symbols, 6 new scripts, and multiple additions to existing blocks.

A beta version of the 9.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 9.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 9.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 9.0.0 (in preparation)

Code charts - single download (101 MB)

Auxiliary HTML charts for beta review

Related Unicode Technical Standards

In addition to the Unicode Standard proper, three other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 9.0.0. Review of that text and data is also encouraged during the beta review period.

UTS #10, Unicode Collation Algorithm Data files

UTS #39, Unicode Security Mechanisms Data files

UTS #46, Unicode IDNA Compatibility Processing Data files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 9.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #323. Comments on specific Version 9.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends May 2, 2016. All substantive technical comments must have been received by that date for consideration at the May UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 9.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 9.0.0 data files, expected in June 2016.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 9.0.0.

The assignment of characters for Unicode 9.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 9.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 8.0.0 characters, as well as the property values for the new Unicode 9.0.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. For more information, see the diffs.readme.txt file.
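As a rough illustration of how the XML versions can be compared, the following Python sketch reports property values that differ between two snapshots. It assumes that each character is represented by a `char` element carrying a `cp` attribute and one attribute per property; the two sample documents are hand-written stand-ins, not real UCD data, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def char_properties(xml_text):
    """Map code point (hex string) -> attribute dict for every <char cp="..."> element."""
    root = ET.fromstring(xml_text)
    table = {}
    for elem in root.iter():
        # Strip any XML namespace prefix: '{ns}char' -> 'char'.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'char' and 'cp' in elem.attrib:
            props = dict(elem.attrib)
            table[props.pop('cp')] = props
    return table

def diff_properties(old_xml, new_xml):
    """Return {cp: {property: (old_value, new_value)}} for code points in both snapshots."""
    old, new = char_properties(old_xml), char_properties(new_xml)
    changes = {}
    for cp in old.keys() & new.keys():
        changed = {
            name: (old[cp].get(name), new[cp].get(name))
            for name in old[cp].keys() | new[cp].keys()
            if old[cp].get(name) != new[cp].get(name)
        }
        if changed:
            changes[cp] = changed
    return changes

# Hand-written samples for illustration only (not real UCD data).
OLD = ('<ucd><repertoire><char cp="1885" gc="Lo" bc="L"/>'
       '<char cp="0041" gc="Lu" bc="L"/></repertoire></ucd>')
NEW = ('<ucd><repertoire><char cp="1885" gc="Mn" bc="NSM"/>'
       '<char cp="0041" gc="Lu" bc="L"/></repertoire></ucd>')

print(diff_properties(OLD, NEW))
```

Entries that describe ranges rather than single code points are ignored by this sketch and would need separate handling.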

The beta review period is a good opportunity to add support for the new Unicode 9.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 9.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 9.0.0, often in coordination with changes to character properties. Most notably for Unicode 9.0.0:

UAX #14, Unicode Line Breaking Algorithm: New property values and new algorithm rules have been introduced. These changes ensure that the various types of character sequences that represent emoji are handled as indivisible units in line breaking.

UAX #29, Unicode Text Segmentation: New property values and new algorithm rules have been introduced. These also address sequences that represent emoji, to ensure they are handled as indivisible units in the formation of grapheme clusters and word segments.

UAX #31, Unicode Identifier and Pattern Syntax: Table 6, Aspirational Scripts and Table 7, Limited Use Scripts were updated by adding the 6 new scripts in Unicode 9.0.0. A recommended syntax for Unicode hashtags, including emoji, has been added. Furthermore, the text has been rewritten to emphasize the preference for XID_Start/XID_Continue over ID_Start/ID_Continue properties.

Core Specification Update

The core specification is undergoing extensive review, with numerous additions for Version 9.0.0. Although the draft text for Version 9.0.0 is not yet available, specific reports of any technical or editorial issues in the currently published core specification are also welcome during the beta review period. Such reports will be taken into consideration for corrections to the Version 9.0.0 draft. (Note: The Unicode Consortium has ongoing opportunities for subject-matter volunteers: experts interested in contributing to or editing relevant parts of the core specification or other Unicode specifications.)

Script-specific Issues

Six new scripts have been added in Unicode 9.0. All of these additions are on Plane 1. Some of these scripts have particular attributes which may cause issues for implementations. The more important of these attributes are summarized here.

Two of the newly encoded scripts, Osage and Adlam, are bicameral. This means that support will require addition of case mapping and case folding tables for them.
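A minimal Python sketch of what such an addition might look like for an implementation that keeps its own case tables. The ranges and offsets below are assumptions read off the beta code charts (capital letters followed by small letters at a fixed distance); UnicodeData.txt and CaseFolding.txt remain the authoritative sources, and the function names are hypothetical.

```python
# (first capital, last capital, offset from capital to small letter).
# Assumed from the beta code charts; verify against UnicodeData.txt.
BICAMERAL_RANGES = [
    (0x104B0, 0x104D3, 0x28),  # Osage
    (0x1E900, 0x1E921, 0x22),  # Adlam
]

def to_lower(ch):
    """Lowercase one character, consulting the Osage/Adlam ranges first."""
    cp = ord(ch)
    for first, last, offset in BICAMERAL_RANGES:
        if first <= cp <= last:
            return chr(cp + offset)
    return ch.lower()  # fall back to the runtime's own tables

def to_upper(ch):
    """Uppercase one character, consulting the Osage/Adlam ranges first."""
    cp = ord(ch)
    for first, last, offset in BICAMERAL_RANGES:
        if first + offset <= cp <= last + offset:
            return chr(cp - offset)
    return ch.upper()  # fall back to the runtime's own tables

print(hex(ord(to_lower(chr(0x104B0)))))
print(hex(ord(to_upper(chr(0x1E943)))))
```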

Adlam is also a right-to-left script with cursive joining, so it requires bidirectional support and has rendering issues similar to those of the Arabic script.

Tangut is a very large siniform ideographic script. It is the first siniform ideographic script encoded after the Han (CJK) script. Its implementation requires technology support similar to that used for CJK, including very large fonts and radical/stroke input methods. Special adjustments have also been made to the Unicode Collation Algorithm to account for the introduction of another large ideographic repertoire.

Casing-related Issues

A set of nine historic Cyrillic letter forms (U+1C80..U+1C88) used in Old Church Slavonic was added. These letters are lowercase and have asymmetric case mappings to existing uppercase letters, similar to the asymmetric case mapping of Greek final sigma to capital sigma. Case folding for these nine Cyrillic letters needs to be implemented with care.
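The asymmetry can be observed with a short Python sketch. It assumes a runtime whose built-in character database is at Unicode 9.0.0 or later (the version is printed so this can be checked); with older data the nine letters are unassigned and map to themselves.

```python
import unicodedata

def cyrillic_case_report(first=0x1C80, last=0x1C88):
    """Return (letter, uppercase, lowercase-of-uppercase, case-folded) per code point."""
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        upper = ch.upper()
        rows.append((ch, upper, upper.lower(), ch.casefold()))
    return rows

print('Runtime UCD version:', unicodedata.unidata_version)
for letter, upper, back, folded in cyrillic_case_report():
    print('U+%04X -> upper U+%04X -> lower U+%04X, casefold U+%04X'
          % (ord(letter), ord(upper), ord(back), ord(folded)))
```

Lowercasing the uppercase form does not return the original historic letter, so code that assumes `upper` and `lower` are inverses of each other will not round-trip these characters.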

An uppercase Latin letter was added, U+A7AE LATIN CAPITAL LETTER SMALL CAPITAL I, forming a case pair with an existing lowercase letter, U+026A LATIN LETTER SMALL CAPITAL I, for which a different uppercase counterpart had been recommended, but not formally mapped, prior to Unicode 9.0.

Numeric-related Issues

Some of the newly encoded Malayalam fractions have numeric values which are new in Unicode 9.0. Implementations that process numeric values should be prepared to handle new fractional values, such as 1/20 or 1/40.
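One way to stay robust against new fractional values is to parse the numeric field as an exact rational rather than matching it against a fixed list of known values. A minimal Python sketch, assuming the field text takes the forms `7`, `1/4`, or `-1/2` (the function name is hypothetical):

```python
from fractions import Fraction

def parse_numeric_value(field):
    """Parse a numeric-value field; an empty field means 'no numeric value'."""
    field = field.strip()
    if not field:
        return None
    return Fraction(field)  # accepts integers and 'numerator/denominator'

print(parse_numeric_value('1/40'))
print(parse_numeric_value('1/20') + parse_numeric_value('1/40'))
```

Keeping exact rationals also avoids the rounding that a conversion to binary floating point would introduce for such values.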

The newly added script Bhaiksuki has both script-specific decimal digits and non-decimal unit numerals.

Unihan-related Issues

The syntax of the kRS* fields, such as kRSUnicode and kRSKangXi, has been extended to allow for negative values of residual stroke counts. A negative value indicates that strokes which would normally constitute the indexing radical are intentionally missing. The kRSUnicode and kRSKangXi fields of a few CJK ideographs have been updated accordingly. Implementers should be prepared to handle negative values for residual stroke counts. In sorting, negative values should be replaced with zero to prevent characters with such values from sorting before the characters that represent the radical itself.
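The clamping rule for sorting can be sketched in Python as follows. The regular expression is an assumption about the extended field syntax (radical number, optional apostrophe for a simplified radical, a period, then an optionally negative residual stroke count), and the sample values are hypothetical.

```python
import re

# Assumed syntax: radical, optional "'" marker, ".", optionally negative stroke count.
_RS_PATTERN = re.compile(r"^([1-9][0-9]{0,2})('?)\.(-?[0-9]{1,2})$")

def parse_rs(value):
    """Split a radical-stroke value into (radical, simplified, residual_strokes)."""
    match = _RS_PATTERN.match(value)
    if match is None:
        raise ValueError('unrecognized radical-stroke value: %r' % value)
    radical, marker, residual = match.groups()
    return int(radical), marker == "'", int(residual)

def rs_sort_key(value):
    """Sort key that replaces negative residual stroke counts with zero."""
    radical, simplified, residual = parse_rs(value)
    return radical, simplified, max(residual, 0)

print(parse_rs('85.-1'))
print(sorted(['85.3', '85.0', '85.-1'], key=rs_sort_key))
```

Because the sort is stable and the clamped key equals that of a zero-stroke entry, a character with a negative count keeps its input position relative to the radical itself instead of moving ahead of it.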

Many kMandarin readings have been updated. Implementations which depend on the kMandarin readings, such as phonetic sortings of Chinese data, need to be checked against these changes.

Standardized Variation Sequences

The constraints on standardized variation sequences have been relaxed slightly, to allow a spacing combining mark (General_Category=Spacing_Mark) as the initial character of a variation sequence. Nonspacing combining marks and canonically decomposable characters continue to be disallowed in variation sequences. Implementations should be checked for any assumptions made regarding the allowed General_Category property values for the initial characters in variation sequences.

A full set of dotted forms of Myanmar letters for Khamti, Aiton, and Phake was added as standardized variation sequences, to distinguish them from the Burmese and Shan styles. One of these new standardized variation sequences has a spacing combining mark as the initial character of the sequence: <U+1031, U+FE00>.

A set of 278 variation sequences was added to complete the set of text and emoji presentations for all pictographic symbols identified as having a default text presentation. See UTR #51, Unicode Emoji.
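A minimal Python sketch of a check on the initial character that reflects the relaxed constraint. It tests only the two conditions stated above (not a nonspacing mark, not canonically decomposable); it does not consult StandardizedVariants.txt, and the function name is hypothetical.

```python
import unicodedata

def may_start_variation_sequence(ch):
    """True if ch passes the two constraints on the initial character."""
    if unicodedata.category(ch) == 'Mn':  # nonspacing marks remain disallowed
        return False
    # A single character with a canonical decomposition changes under NFD.
    if unicodedata.normalize('NFD', ch) != ch:
        return False
    return True

print(unicodedata.category('\u1031'), may_start_variation_sequence('\u1031'))
```

An implementation that instead required a letter or symbol in the initial position would reject <U+1031, U+FE00>, because U+1031 is a spacing mark.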

Code Charts

The code charts for Mongolian have changed significantly since the publication of Unicode 8.0. In addition to corrections for omitted glyphs, the charts have been updated to display more as they did in Unicode 7.0, with a summary of all Mongolian standardized variation sequences displayed at the end of the Mongolian block. The names list section now also shows contextual variant glyphs. These appear for each character that also has one or more standardized variation sequences associated with it.

Other Issues

Please also check the following specific items carefully:

There have been significant additions to the Script_Extensions property. Implementations that process script data or use script extensions should be checked carefully.

Tangut was added to UnicodeData.txt with a start line and end line, similar to the way that data file handles CJK unified ideographs. Parsers of UnicodeData.txt may need to be updated to handle this new range.
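A parser that recognizes the start and end lines generically, rather than by a fixed list of range names, would pick up the Tangut range without further changes. The Python sketch below assumes the usual semicolon-separated layout with `<Name, First>` and `<Name, Last>` markers; the sample lines are modeled on that layout and should be checked against the actual beta file.

```python
FIRST_SUFFIX = ', First>'
LAST_SUFFIX = ', Last>'

def iter_unicode_data(lines):
    """Yield (first_cp, last_cp, name, general_category), merging First/Last pairs."""
    pending = None  # (first_cp, range_name) while waiting for the matching Last line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(';')
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.startswith('<') and name.endswith(FIRST_SUFFIX):
            pending = (cp, name[1:-len(FIRST_SUFFIX)])
        elif name.startswith('<') and name.endswith(LAST_SUFFIX):
            range_name = name[1:-len(LAST_SUFFIX)]
            if pending is None or pending[1] != range_name:
                raise ValueError('unmatched range end: %s' % name)
            yield pending[0], cp, range_name, category
            pending = None
        else:
            yield cp, cp, name, category

# Sample lines modeled on the file format (assumed; verify against the beta data).
SAMPLE = [
    '16FE0;TANGUT ITERATION MARK;Lm;0;L;;;;;N;;;;;',
    '17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;',
    '187EC;<Tangut Ideograph, Last>;Lo;0;L;;;;;N;;;;;',
]

for first, last, name, category in iter_unicode_data(SAMPLE):
    print('%04X..%04X %s (%s)' % (first, last, name, category))
```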

Copyright and registered signs are now used in the data files. The file encoding did not change, and is still UTF-8, but there are now non-ASCII characters actually present in many files that formerly contained only ASCII characters.
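Tools that opened the data files as ASCII, or with a platform default encoding, may therefore start to fail. A minimal Python sketch of a reader that decodes explicitly as UTF-8 and drops comments; the sample header line is invented for illustration and the function name is hypothetical.

```python
import io

def data_lines(binary_stream):
    """Decode a data file as UTF-8 and return its non-comment, non-blank lines."""
    text_stream = io.TextIOWrapper(binary_stream, encoding='utf-8')
    result = []
    for line in text_stream:
        content = line.split('#', 1)[0].strip()  # '#' starts a comment
        if content:
            result.append(content)
    return result

# Invented sample: a comment line with non-ASCII signs, then one data line.
SAMPLE = '# © 2016 Unicode®, Inc.\n0041;LATIN CAPITAL LETTER A\n'.encode('utf-8')

print(data_lines(io.BytesIO(SAMPLE)))
```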

Two existing Mongolian characters, U+1885 and U+1886, were reclassified, in terms of General_Category and Bidi_Class, from letters to nonspacing combining marks. In order to keep identifier derivations stable, these two Mongolian combining marks have also been made Other_ID_Start=Y. Implementations of identifiers, in particular, should be checked to verify that they correctly accommodate these changes.
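The effect on identifier derivation can be sketched in Python. This is a simplified approximation that assumes identifier start characters are letters, letter numbers, and anything listed as Other_ID_Start; it omits the exclusion of pattern syntax and white space characters, and the set below is only an illustrative subset of the property.

```python
import unicodedata

# Illustrative subset of Other_ID_Start; the full list is in PropList.txt.
OTHER_ID_START = {0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C}

LETTER_CATEGORIES = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'}

def is_id_start(ch):
    """Approximate ID_Start: letters, letter numbers, plus Other_ID_Start."""
    if ord(ch) in OTHER_ID_START:
        return True
    return unicodedata.category(ch) in LETTER_CATEGORIES

for cp in (0x1885, 0x1886):
    print('U+%04X' % cp, unicodedata.category(chr(cp)), is_id_start(chr(cp)))
```

An implementation that derives identifier start characters from General_Category alone would silently drop these two characters once they become nonspacing marks.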

U+E007F CANCEL TAG has been un-deprecated for potential use in the proposed syntax for emoji customization.

The East_Asian_Width property values of 799 existing characters that have an emoji presentation were changed to Wide.
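Code that computes column widths from East_Asian_Width will see different results for the affected characters. A minimal Python sketch, assuming a terminal-style layout in which Wide and Fullwidth characters take two columns and everything else takes one; it ignores combining marks and control characters, and results for emoji depend on the version of the character database built into the runtime.

```python
import unicodedata

def display_width(text):
    """Approximate column count: 2 for Wide/Fullwidth characters, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )

print(display_width('A\u6f22'))
```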

The following blocks are new in Unicode 9.0.0. Check implementations carefully for any range or property value assumptions regarding these new blocks. See also the single-block delta charts.

Range            Block Name
1C80..1C8F       Cyrillic Extended-C
104B0..104FF     Osage
11400..1147F     Newa
11660..1167F     Mongolian Supplement
11C00..11C6F     Bhaiksuki
11C70..11CBF     Marchen
16FE0..16FFF     Ideographic Symbols and Punctuation
17000..187FF     Tangut
18800..18AFF     Tangut Components
1E000..1E02F     Glagolitic Supplement
1E900..1E95F     Adlam
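The ranges in the table above can be turned into a lookup for testing range assumptions. The Python sketch below covers only the blocks new in Unicode 9.0.0; a real implementation would normally load all blocks from Blocks.txt, and the function name is hypothetical.

```python
import bisect

# New blocks in Unicode 9.0.0, sorted by start code point (from the table above).
NEW_BLOCKS = [
    (0x1C80, 0x1C8F, 'Cyrillic Extended-C'),
    (0x104B0, 0x104FF, 'Osage'),
    (0x11400, 0x1147F, 'Newa'),
    (0x11660, 0x1167F, 'Mongolian Supplement'),
    (0x11C00, 0x11C6F, 'Bhaiksuki'),
    (0x11C70, 0x11CBF, 'Marchen'),
    (0x16FE0, 0x16FFF, 'Ideographic Symbols and Punctuation'),
    (0x17000, 0x187FF, 'Tangut'),
    (0x18800, 0x18AFF, 'Tangut Components'),
    (0x1E000, 0x1E02F, 'Glagolitic Supplement'),
    (0x1E900, 0x1E95F, 'Adlam'),
]
_STARTS = [start for start, _, _ in NEW_BLOCKS]

def new_block_of(cp):
    """Return the name of the new block containing cp, or None."""
    index = bisect.bisect_right(_STARTS, cp) - 1
    if index >= 0 and cp <= NEW_BLOCKS[index][1]:
        return NEW_BLOCKS[index][2]
    return None

print(new_block_of(0x17000))
print(new_block_of(0x0041))
```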

Some blocks have also had font updates to use better font designs, often picking up commercial font designs to better reflect current design practice for a script. Notable among these are font updates for the Cherokee, Kannada, and Malayalam blocks, as well as an improved design for the lari currency sign. In such cases, careful review of the blocks in question is advised, to ensure that there have not been any regressions in representative glyph display.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.