BETA Unicode 10.0.0

Note: The beta review period for Unicode 10.0.0 closed on May 1, 2017. Feedback received during the public review can be accessed from PRI #350. This beta review page is left active, however, for convenience of access to the prepublication versions of the Unicode 10.0.0 data files and annexes, until the formal release planned for mid-June 2017.

The next version of the Unicode Standard will be Version 10.0.0, planned for release in June, 2017. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 8,518 new characters are encoded, including 56 new emoji characters, 4 new scripts, and multiple additions to existing blocks. Another major CJK extension is also included in this version.

A beta version of the 10.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 10.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 10.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 10.0.0 (in preparation)

Code charts - single download (109 MB)

Auxiliary HTML charts for beta review

Related Unicode Technical Standards

In addition to the Unicode Standard proper, three other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 10.0.0. Review of that text and data is also encouraged during the beta review period.

UTS #10, Unicode Collation Algorithm Data files

UTS #39, Unicode Security Mechanisms Data files

UTS #46, Unicode IDNA Compatibility Processing Data files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 10.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #350. Comments on specific Version 10.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends May 1, 2017. All substantive technical comments must have been received by that date for consideration at the May UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 10.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 10.0.0 data files, expected in June 2017.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 10.0.0.

The assignment of characters for Unicode 10.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 10.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 9.0.0 characters, as well as the property values for the new Unicode 10.0.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. For more information, see the diffs.readme.txt file.

The beta review period is a good opportunity to add support for the new Unicode 10.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 10.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 10.0.0, often in coordination with changes to character properties. Most notably for Unicode 10.0.0:

UAX #14, Unicode Line Breaking Algorithm

UAX #29, Unicode Text Segmentation

UAX #31, Unicode Identifier and Pattern Syntax

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

The core specification is undergoing extensive review, with numerous additions for Version 10.0.0. Although the draft text for Version 10.0.0 is not yet available, specific reports of any technical or editorial issues in the currently published core specification are also welcome during the beta review period. Such reports will be taken into consideration for corrections to the Version 10.0.0 draft. (Note: The Unicode Consortium has ongoing opportunities for subject-matter volunteers: experts interested in contributing to or editing relevant parts of the core specification or other Unicode specifications.)

Script-specific Issues

Four new scripts have been added in Unicode 10.0. All of these additions are on Plane 1. Some of these scripts have particular attributes which may cause issues for implementations. The more important of these attributes are summarized here.

Zanabazar Square and Soyombo are complex, historic abugidas. They were modeled on Tibetan, and used to write Mongolian, Tibetan, and Sanskrit. The implementation of these scripts poses particular challenges, especially for rendering. Implementers should check the proposal documents, which contain substantial details regarding rendering and other aspects of the text model for these scripts.

Masaram Gondi is another newly added complex script, inspired by the Brahmi model, but created more or less de novo, and with its own, distinct rendering issues.

Unicode 10.0 includes another large Unified CJK addition: CJK Extension F. This extension contains mostly rare characters, but also includes a number of personal name and place name characters important for government specifications, in Japan in particular.

21 more CJK ideographs were added at the end of the URO. Implementations often have hard-coded ranges for CJK ideographs, and these should be checked carefully to ensure they pick up the new end of the range (U+9FEA).

A large collection of Japanese hentaigana has been added. These are effectively historic variants of Hiragana syllables.
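As a rough illustration of the hard-coded range concern for the URO noted above, the following Python sketch contrasts an older end point with the new one. The constant and function names are hypothetical, and the Unicode 9.0 end point U+9FD5 is an inference from the statement that 21 ideographs were added, ending at U+9FEA.

```python
# Hypothetical constants for a hard-coded URO range check.
URO_START = 0x4E00
URO_END_UNICODE_9 = 0x9FD5   # assumption: 21 code points before U+9FEA
URO_END_UNICODE_10 = 0x9FEA  # new end of the assigned range, per this page


def is_uro_ideograph(cp, uro_end=URO_END_UNICODE_10):
    """Return True if cp falls in the assigned part of the URO."""
    return URO_START <= cp <= uro_end


# Code points that an implementation with the old end point would miss.
newly_added = [
    cp
    for cp in range(URO_START, 0xA000)
    if is_uro_ideograph(cp) and not is_uro_ideograph(cp, URO_END_UNICODE_9)
]
```

Keeping the end point in a single named constant, rather than repeating the literal in several places, makes this kind of update easier to audit.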

New Data Files Added to the UCD

Several new data files have been added to the UCD:

NushuSources.txt. This file contains normative information on the source references for Nüshu characters. The file format is similar to the format of the Unihan data files and TangutSources.txt. Implementations which support that format for Unihan or Tangut data should be able to add support for Nüshu data in a similar manner.

VerticalOrientation.txt. Starting with Version 10.0.0 of the Unicode Standard, this data file, which lists the Vertical_Orientation property values, is formally included in the Unicode Character Database. The file format has not changed, but certain lines of data have been updated for consistency with other UCD files. Implementers are invited to report any issues that might have been inadvertently introduced during the migration of the file.

DerivedName.txt (in the "extracted/" subdirectory). This file provides a complete listing of the formal Name property values of characters. In the case of algorithmically derived names, only those names that follow a simple pattern of a prefix followed by a code point value are abbreviated. The names of Hangul syllable characters, as well as all other character names, are listed individually. Implementations can use this file to conveniently retrieve the formal character names instead of deriving them themselves.
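The following Python sketch shows one way an implementation might read DerivedName.txt. It assumes a layout of "code point or range ; name", with "#" starting a comment and a trailing "*" standing for the code point in abbreviated, algorithmically derived names. That layout, the sample lines, and the function names are assumptions made for illustration and should be checked against the actual file.

```python
SAMPLE_DERIVED_NAMES = """\
# illustrative lines only, not copied from the beta file
0020          ; SPACE
AC00          ; HANGUL SYLLABLE GA
1B170..1B2FB  ; NUSHU CHARACTER-*
"""


def parse_derived_names(lines):
    """Return (singles, patterns) parsed from DerivedName-style lines."""
    singles = {}   # code point -> full name
    patterns = []  # (first, last, prefix) for names ending in the code point
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        field, name = (part.strip() for part in line.split(";", 1))
        if ".." in field:
            first, last = (int(x, 16) for x in field.split(".."))
        else:
            first = last = int(field, 16)
        if name.endswith("*"):
            patterns.append((first, last, name[:-1]))
        else:
            for cp in range(first, last + 1):
                singles[cp] = name
    return singles, patterns


def lookup_name(cp, singles, patterns):
    """Look up a name, expanding abbreviated patterns on demand."""
    if cp in singles:
        return singles[cp]
    for first, last, prefix in patterns:
        if first <= cp <= last:
            return f"{prefix}{cp:04X}"
    return None


singles, patterns = parse_derived_names(SAMPLE_DERIVED_NAMES.splitlines())
```

Storing abbreviated ranges as patterns rather than expanding them keeps the table small for large ranges such as the CJK ideographs.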

General Property Issues

There are a number of issues related to particular character properties:

UCD properties which depend on emoji character properties have been synchronized with Emoji 5.0.

The line breaking properties of a number of emoji characters have been updated as a result of changes in emoji ZWJ sequences.

The enumerated property Vertical_Orientation has been incorporated in the UCD, as part of the progression from UTR to UAX of the Unicode Vertical Text Layout specification.

The characters of two newly encoded scripts, Soyombo and Zanabazar Square, as well as the unassigned code points in their blocks, have been assigned the Vertical_Orientation property value Upright (vo=U). In spite of the affinity of those scripts with Tibetan, that assignment was based on a few instances of text laid out vertically. Implementers are encouraged to provide feedback on the current Vertical_Orientation property values for those scripts.

A new normative binary property Regional_Indicator has been introduced. This property is referenced in the line breaking and text segmentation algorithms, to assist in the determination of correct text boundaries around emoji flag sequences.

The Script and Script_Extensions properties of U+061C ARABIC LETTER MARK (ALM) have been revised, so that the character now has the same effects on digit substitution as regular Arabic letters.

A set of new Arabic joining groups has been added for Malayalam Garshuni letters (in the Syriac script).

The derivation of the Word_Break property value ALetter was extended to include 36 modifier letters.

The following assignments of Line_Break property values deserve careful review. Implementers and specialists are invited to provide feedback on these assignments.

U+20BF BITCOIN SIGN has been assigned lb=PR, the default for currency symbols.

Soyombo cluster-initial letters U+11A86..U+11A89 have been assigned lb=AL, instead of the erroneous lb=CM in the proposal document.

The Soyombo and Zanabazar Square shad punctuation marks, U+11A9B..U+11A9C and U+11A42..U+11A43, have been assigned lb=BA, as proposed. However, corresponding shad characters in Tibetan, ’Phags-pa, and Marchen are lb=EX. The difference is that EX, unlike BA, would prohibit indirect line breaks.

Three symbols which occur as final elements of emoji ZWJ sequences have been given Emoji properties while preserving their current Line_Break values. These three symbols are U+2640 FEMALE SIGN, U+2642 MALE SIGN, and U+2695 STAFF OF AESCULAPIUS.
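To illustrate why the Regional_Indicator property noted above matters for boundaries around emoji flag sequences, here is a deliberately simplified Python sketch. It groups consecutive regional indicator symbols (U+1F1E6..U+1F1FF) into pairs, counted from the start of each run, and treats every other code point as its own unit. It is not an implementation of UAX #14 or UAX #29, which contain many more rules, and the function names are hypothetical.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_regional_indicator(ch):
    """True if ch is one of the 26 regional indicator symbols."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_flag_pairs(text):
    """Split text so that regional indicators stay together in pairs.

    Pairs are formed left to right within each run, so a run of five
    indicators yields two pairs followed by one unpaired indicator.
    """
    units = []
    i = 0
    while i < len(text):
        if (
            is_regional_indicator(text[i])
            and i + 1 < len(text)
            and is_regional_indicator(text[i + 1])
        ):
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


# F R D E J followed by a plain letter: two flags, one stray indicator, "a".
sample = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA\U0001F1EF" + "a"
units = split_flag_pairs(sample)
```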

Unihan-related Issues

Because a major new CJK extension is part of Unicode 10.0, all Unihan properties should be reviewed carefully. Additionally, the following deserve special attention:

A new full radical-stroke index is available, which includes CJK Extension F and the 21 new characters added at the end of the URO.

The addition of new CJK sources means that adjustments have been made to the regular expressions used to validate the kIRG_...Source tags in the Unihan database. See UAX #38 for details.

The newly added character U+9FEA is the result of a formal disunification from U+3E02.

Standardized Variation Sequences

There have been significant changes to StandardizedVariants.txt and to the documentation of variation sequences involving emoji, which are now known more specifically as emoji presentation sequences and text presentation sequences.

All of the emoji and text presentation sequences were moved from the UCD file StandardizedVariants.txt to the UTS #51 data file emoji-variation-sequences.txt. The latter is a new data file accompanying Version 5.0 of UTS #51, Unicode Emoji, whose emoji character repertoire corresponds to Unicode 10.0. New emoji and text presentation sequences are also included in emoji-variation-sequences.txt. Implementations should be prepared to consume such sequence data from the new file and, in general, to use Unicode Emoji Version 5.0 data in conjunction with UCD 10.0 data.

Other changes in StandardizedVariants.txt include corrections to the labels of a few Mongolian standardized variation sequences, but without changes to the actual character sequences.

Also, the documentation file StandardizedVariants.html has been removed altogether, as its function has been superseded by other documentation. Representative glyphs for the standardized variation sequences are still shown in the Unicode code charts, but emoji and text presentation sequences are now displayed in the emoji charts instead.
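For implementations that previously read emoji and text presentation sequences from StandardizedVariants.txt, a minimal Python sketch of a reader for emoji-variation-sequences.txt might look like the following. It assumes semicolon-separated fields, with the code point sequence first and a style description ("text style" or "emoji style") second, and "#" starting a comment. The sample lines and the function name are illustrative assumptions, so the layout should be confirmed against the actual data file.

```python
SAMPLE_SEQUENCES = """\
# illustrative lines only; the layout is an assumption
0023 FE0E  ; text style;  # NUMBER SIGN
0023 FE0F  ; emoji style; # NUMBER SIGN
"""


def parse_variation_sequences(lines):
    """Map each code point sequence (as a tuple of ints) to its style."""
    styles = {}
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        sequence = tuple(int(cp, 16) for cp in fields[0].split())
        styles[sequence] = fields[1]
    return styles


styles = parse_variation_sequences(SAMPLE_SEQUENCES.splitlines())
```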

Code Charts

As always, careful review of the updated code charts for Version 10.0 is advised, especially for all newly added scripts. Particular issues to take note of include:

Emoji and text presentation sequences are no longer displayed in the Unicode code charts. They are documented instead in the emoji charts area. For the emoji charts currently in beta review for Emoji 5.0, see Emoji 5.0 beta charts.

A number of representative glyphs for pictographic symbols in the Unicode code charts have been updated, as part of the ongoing updating of glyphs for emoji characters.

There is an outstanding glyph erratum approved by the UTC for a few Brahmi characters that is not yet reflected in the Unicode code charts.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 10.0 repertoire for UCA 10.0. For the most part, the additions for new scripts and other characters are unremarkable, but there are a couple of items that implementers of collation should be aware of:

The large hentaigana collection is simply tacked on to the end of the range of primary weights for Japanese syllabaries. Contrary to possible expectations, hentaigana are not interfiled with standard Hiragana syllables with the same sounds, in part because a significant proportion of the hentaigana characters have historic associations with more than one Japanese syllable. Also, the collation order of U+1B001 was modified, to ensure that it occurs in the slot for "e-1" in the hentaigana collection.

The addition of another ideographic script, Nüshu, necessitates the addition of another implicit weight base to the UCA algorithm. This is also reflected in a second @implicitweight line at the top of DUCET. Implementations of UCA will need to be updated to take this change in implicit weighting into account.
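The effect of a second implicit weight base can be sketched as follows. This Python fragment assumes that each such line in DUCET maps a code point range to a base value, and that the two primary weights are the base itself followed by the offset within the range with the high bit set. The specific ranges and base values shown (FB00 for Tangut, FB01 for Nüshu) are recalled from UTS #10 rather than taken from this page, so they should be verified against the DUCET header before use. Han ideographs and unassigned code points use different bases and are not covered here.

```python
# Assumed (range start, range end, base) entries; verify against DUCET.
IMPLICIT_BASES = [
    (0x17000, 0x18AFF, 0xFB00),  # Tangut and Tangut Components
    (0x1B170, 0x1B2FF, 0xFB01),  # Nushu
]


def siniform_implicit_weights(cp):
    """Return the two primary weights (AAAA, BBBB) for cp, or None.

    AAAA is the base for the range; BBBB is the offset of cp from the
    start of the range, with bit 15 set.
    """
    for first, last, base in IMPLICIT_BASES:
        if first <= cp <= last:
            return base, (cp - first) | 0x8000
    return None
```

Driving the computation from a table, rather than a hard-coded Tangut branch, means a future additional base only requires a new table entry.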

Other Issues

Please also check the following specific items carefully:

Four formal character name aliases of type correction have been assigned in NameAliases.txt to the jamos U+11EC..U+11EF, which contain yesieung rather than ieung components.

A formal name alias of type correction was added for the previously encoded archaic Hiragana syllable U+1B001. This addition was to ensure the identification of that earlier encoded character as formally being part of the set of hentaigana.

Nüshu was added to UnicodeData.txt with a start line and end line, similar to the way that data file handles CJK unified ideographs. Parsers of UnicodeData.txt may need to be updated to handle this new range.

The following blocks are new in Unicode 10.0.0. Check implementations carefully for any range or property value assumptions regarding these new blocks. See also the single-block delta charts.

Range            Block Name
0860..086F       Syriac Supplement
11A00..11A4F     Zanabazar Square
11A50..11AAF     Soyombo
11D00..11D5F     Masaram Gondi
1B100..1B12F     Kana Extended-A
1B170..1B2FF     Nushu
2CEB0..2EBEF     CJK Extension F

Some blocks have also had font updates; see the single-block delta charts for details. In such cases, careful review of the blocks in question is advised, to ensure that there have not been any regressions in representative glyph display.
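For the Nüshu start and end lines in UnicodeData.txt mentioned above, the following Python sketch shows a parser that collapses any "<Label, First>" / "<Label, Last>" pair into a single range instead of special-casing particular scripts. The sample lines, including the exact label text for Nüshu and its last assigned code point, are illustrative assumptions, and the function name is hypothetical.

```python
SAMPLE_UNICODE_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
1B170;<Nushu Character, First>;Lo;0;L;;;;;N;;;;;
1B2FB;<Nushu Character, Last>;Lo;0;L;;;;;N;;;;;
"""

FIRST_SUFFIX = ", First>"
LAST_SUFFIX = ", Last>"


def iter_unicode_data(lines):
    """Yield (first, last, name_or_label, general_category) tuples."""
    pending = None  # (start code point, label) of an open range
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            pending = (cp, name[1:-len(FIRST_SUFFIX)])
        elif name.startswith("<") and name.endswith(LAST_SUFFIX):
            if pending is None:
                raise ValueError(f"range end without start at {cp:04X}")
            first, label = pending
            pending = None
            yield first, cp, label, category
        else:
            # Ordinary single-character line (including <control>).
            yield cp, cp, name, category


entries = list(iter_unicode_data(SAMPLE_UNICODE_DATA.splitlines()))
```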

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.