
Unicode® 10.0.0 (DRAFT)

2017 June XX (Announcement)

This page summarizes the important changes for the Unicode Standard, Version 10.0.0. This version supersedes all previous versions of the Unicode Standard.

The Unicode Character Database, Code Charts, and Annexes for Version 10.0 will be released on June XX, 2017. The core specification (the PDF chapters) of Version 10.0 is still pending publication due to the extensive editorial work required for the new content additions. Until final publication, the links to individual chapters of the core specification will not be activated. An announcement will be made when the core specification for Version 10.0 is available. In the meantime, implementers can continue to reference the relevant sections of the most recent version of the core specification.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration

A. Summary

Unicode 10.0 adds 8,518 characters, for a total of 136,690 characters. These additions include 4 new scripts and 56 new emoji characters.

The new scripts and characters in Version 10.0 add support for lesser-used languages and unique writing requirements worldwide, including:

  • Masaram Gondi, used to write Gondi in Central and Southeast India
  • Nüshu, used by women in China to write poetry and other discourses until the late twentieth century
  • Soyombo and Zanabazar Square, used in historic Buddhist texts to write Sanskrit, Tibetan, and Mongolian
  • Syriac letters used for writing Suriyani Malayalam, also known as Garshuni and Syriac Malayalam
  • Gujarati signs used for the transliteration of the Arabic script into Gujarati by Ismaili Khoja communities
  • A set of 285 Hentaigana characters used in Japan (historic variants of Hiragana characters)
  • CJK Extension F (7,473 Han characters)

Important symbol additions include:

  • Bitcoin sign
  • 56 emoji characters (full list)
  • A set of Typicon marks and symbols

Synchronization

Several other important Unicode specifications have been updated for Version 10.0. The following three Unicode Technical Standards are versioned in synchrony with the Unicode Standard, because their data files cover the same repertoire. All have been updated to Version 10.0:

  • UTS #10, Unicode Collation Algorithm
  • UTS #39, Unicode Security Mechanisms
  • UTS #46, Unicode IDNA Compatibility Processing

Additionally, Version 10.0 of the Unicode Standard makes use of the emoji-related data and behavior specified in Version 5.0 of UTS #51:

  • UTS #51, Unicode Emoji

Some of the changes in Version 10.0 and associated Unicode Technical Standards may require modifications to implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTS #51.

This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2017, fifth edition, plus the following additions from Amendment 1 to the fifth edition:

  • 56 emoji characters
  • 285 hentaigana
  • 3 additional Zanabazar Square characters

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

B. Technical Overview

Version 10.0 of the Unicode Standard consists of:

  • The core specification
  • The code charts (delta and archival) for this version
  • The Unicode Standard Annexes
  • The Unicode Character Database (UCD)

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Core Specification

The core specification will be available as a single PDF for viewing (NN MB) in June 2017. Links to individual chapters and appendices of the core specification are also available in the navigation bar on the left of this page.

Code Charts

Several sets of code charts are available. They serve different purposes:

  • The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.

For Unicode 10.0.0 in particular, two additional sets of code chart pages are provided:

  • A set of delta code charts showing the new blocks and any blocks in which characters were added for Unicode 10.0.0. The new characters are visually highlighted in the charts.
  • A set of archival code charts that represents the entire set of characters, names and representative glyphs at the time of publication of Unicode 10.0.0.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Unicode Standard Annexes

Links to the individual Unicode Standard Annexes are available in the navigation bar on the left of this page. The list of significant changes in the content of the Unicode Standard Annexes for Version 10.0 can be found in Section G below.

Unicode Character Database

Data files for Version 10.0 of the Unicode Character Database are available. NOTE: During the beta review period for Unicode 10.0, these data files are not final, and only have draft status. The ReadMe.txt in that directory provides a roadmap to the functions of the various subdirectories. Zipped versions of the UCD for bulk download will be available, as well.

Version References

Version 10.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 10.0.0, (Mountain View, CA: The Unicode Consortium, 2017. ISBN 978-1-936213-16-0)
http://www.unicode.org/versions/Unicode10.0.0/

The terms “Version 10.0” or “Unicode 10.0” are abbreviations for the full version reference, Version 10.0.0.

The citation and permalink for the latest published version of the Unicode Standard are:

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

A complete specification of the contributory files for Unicode 10.0 is found on the page Components for 10.0.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.

Errata

Errata incorporated into Unicode 10.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 10.0, see the list of current Updates and Errata.

C. Stability Policy Update

There were no significant changes to the Stability Policy of the core specification between Unicode 9.0 and Unicode 10.0.

D. Textual Changes and Character Additions

Four new scripts were added with accompanying new block descriptions:

Script              Number of Characters
Masaram Gondi       75
Nushu               396
Soyombo             80
Zanabazar Square    72

Changes in the Unicode Standard Annexes are listed in Section G.

Character Assignment Overview

8,518 characters have been added. Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see Delta Code Charts.
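
The figures quoted on this page can be cross-checked with simple arithmetic. The sketch below uses only numbers stated above. The variable and function names are illustrative, and treating the listed categories as non-overlapping is an assumption.

```python
# Figures quoted on this page for Unicode 10.0.
ADDED = 8_518      # characters added in Version 10.0
TOTAL = 136_690    # total characters after the additions

# Per-script counts from the table in Section D.
NEW_SCRIPTS = {
    "Masaram Gondi": 75,
    "Nushu": 396,
    "Soyombo": 80,
    "Zanabazar Square": 72,
}

# Other additions whose counts are given in Section A.
OTHER_NAMED_ADDITIONS = {
    "CJK Extension F": 7_473,
    "Hentaigana": 285,
    "Emoji": 56,
}


def previous_total() -> int:
    """Character count implied for the version before 10.0."""
    return TOTAL - ADDED


def unlisted_additions() -> int:
    """Additions not covered by the counts listed on this page.

    Assumes the listed categories do not overlap.
    """
    listed = sum(NEW_SCRIPTS.values()) + sum(OTHER_NAMED_ADDITIONS.values())
    return ADDED - listed


print(previous_total())            # 128172
print(sum(NEW_SCRIPTS.values()))   # 623
print(unlisted_additions())        # 81
```

Under that assumption, the remaining characters are spread across the other additions mentioned in Section A, such as the Syriac letters, Gujarati signs, Bitcoin sign, and Typicon marks.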

E. Conformance Changes

A formal definition of "block" has been added to the Conformance chapter of the core specification for Unicode 10.0 as D10b.

F. Changes in the Unicode Character Database

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 10.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.

G. Changes in the Unicode Standard Annexes

In Version 10.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

  • UAX #9, Unicode Bidirectional Algorithm: Clarified the equivalence between directional formatting characters and HTML5 markup, pointing out the differences from HTML4.0. Updated the table in Section 2.7, Markup and Formatting Characters with explicit directional formatting characters and equivalent CSS.
  • UAX #11, East Asian Width: Referred to the new Regional_Indicator property. Updated references to UTS #51, Unicode Emoji, and terminology derived from that UTS.
  • UAX #14, Unicode Line Breaking Algorithm: Removed Section 7, Pair Table Based Implementation, and other references to it. Changed the definition of lb=Regional_Indicator to refer to the new Regional_Indicator property. Made corrections to descriptions of ID and NS classes.
  • UAX #15, Unicode Normalization Forms: No significant changes in this version.
  • UAX #24, Unicode Script Property: No significant changes in this version.
  • UAX #29, Unicode Text Segmentation: Replaced the start-of-line anchor (^) with start of text (sot) in rules GB12 and WB15. Changed the Word_Break property value of U+02D7 MODIFIER LETTER MINUS SIGN from MidLetter to ALetter in Table 3, Word_Break Property Values. Assigned the Word_Break property value ALetter to 34 other characters in Table 3, Word_Break Property Values.
  • UAX #31, Unicode Identifier and Pattern Syntax: Withdrew the table of aspirational use scripts, moving the contents to the table of limited use scripts, and added a note explaining the reason.
  • UAX #34, Unicode Named Character Sequences: No significant changes in this version.
  • UAX #38, Unicode Han Database (Unihan): Updated the regular expression for the kIRG_HSource field, updated terminology to reflect the difference between the IRG's U-source and the UTC-source, and added references to the CJK Unified Ideographs Extension F block.
  • UAX #41, Common References for Unicode Standard Annexes: Updated all references for Unicode 10.0.
  • UAX #42, Unicode Character Database in XML: Added new code point attributes, values, and patterns.
  • UAX #44, Unicode Character Database: Updated the description of the Name property value. Updated the discussion of immutable properties and the list of those properties in Table 19. Added new Section 5.13 Property APIs. Added discussion of new data file DerivedName.txt to Section 5.4, Derived Extracted Properties. Added new Section 2.1.3, Properties Dependent on External Specifications to discuss the dependency of UCD segmentation properties on the non-UCD emoji properties. Added new Section 5.14, Character Age to further explain the details of the Age property and its derivation.
  • UAX #45, U-Source Ideographs: Updated terminology to reflect the difference between the IRG's U-source and the UTC-source. Updates to contents and status values.
  • UAX #50, Unicode Vertical Text Layout: Newly added as an annex in 10.0, converted from an earlier, approved UTR.
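
To illustrate why the UAX #29 reclassification of U+02D7 from MidLetter to ALetter can change word boundaries, the sketch below implements a deliberately tiny subset of the word-break rules (WB5, WB6, and WB7 only, with ALetter standing in for the broader letter classes). The function `split_words` and its `word_break` mapping are hypothetical helpers, not part of any library, and a real implementation needs the full rule set and property data from UAX #29.

```python
def split_words(text, word_break):
    """Split text using a toy subset of UAX #29 rules (WB5, WB6, WB7).

    word_break maps individual characters to a Word_Break class name.
    Unmapped alphabetic characters are treated as ALetter, everything
    else as Other. This is an illustration, not a conformant segmenter.
    """
    def classify(ch):
        if ch in word_break:
            return word_break[ch]
        return "ALetter" if ch.isalpha() else "Other"

    classes = [classify(ch) for ch in text]
    pieces, start = [], 0
    for i in range(1, len(text)):
        before = classes[i - 2] if i >= 2 else None
        prev, cur = classes[i - 1], classes[i]
        nxt = classes[i + 1] if i + 1 < len(text) else None
        no_break = (
            (prev == "ALetter" and cur == "ALetter")                                # WB5
            or (prev == "ALetter" and cur == "MidLetter" and nxt == "ALetter")       # WB6
            or (before == "ALetter" and prev == "MidLetter" and cur == "ALetter")    # WB7
        )
        if not no_break:
            pieces.append(text[start:i])
            start = i
    if text:
        pieces.append(text[start:])
    return pieces


MINUS = "\u02d7"  # MODIFIER LETTER MINUS SIGN

# Classification before Version 10.0: a trailing U+02D7 is split off.
print(split_words("a" + MINUS, {MINUS: "MidLetter"}))  # two pieces
# Classification in Version 10.0: it stays attached to the letter.
print(split_words("a" + MINUS, {MINUS: "ALetter"}))    # one piece
```

In this toy model the two classifications agree when U+02D7 sits between two letters, and differ when it appears at the edge of a letter run.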

H. Changes in Synchronized Unicode Technical Standards

There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

  • UTS #10, Unicode Collation Algorithm: Major rewrite to add formal definitions and to clarify the statement of the main algorithm. Added Nüshu to the list of siniform ideographic scripts given implicit primary weights similar to Han ideographs.
  • UTS #39, Unicode Security Mechanisms: Removed references to aspirational use scripts because that category has been merged with limited use scripts. That change impacted the results from Section 5.2, Restriction-Level Detection, for the five affected scripts. Extensively reformulated the text in Section 4, Confusable Detection and Section 5, Detection Mechanisms, for clarity and precision. Removed subparts 4 through 6 of conformance clause C2.
  • UTS #46, Unicode IDNA Compatibility Processing: Added three new parameters which allow implementations to reflect current practice in browsers: CheckHyphens, CheckBidi, and CheckJoiners. Updated the counts in Table 4, IDNA Comparisons for Version 10.0, and improved the explanation of the divergence from IDNA2008.
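
As a rough illustration of the CheckHyphens parameter in UTS #46, the sketch below applies the two hyphen restrictions from that specification's validity criteria: a label must not contain U+002D HYPHEN-MINUS in both the third and fourth positions, and must neither begin nor end with it. The function name `passes_hyphen_checks` and its signature are hypothetical. The sketch covers neither the remaining validity criteria nor CheckBidi and CheckJoiners.

```python
def passes_hyphen_checks(label: str, check_hyphens: bool = True) -> bool:
    """Apply only the hyphen-related validity checks to a single label.

    The label is assumed to be one domain-name label on which any
    Punycode decoding has already been performed (an assumption of
    this sketch).
    """
    if not check_hyphens:
        # With the flag off, hyphen placement is not restricted here.
        return True
    if label.startswith("-") or label.endswith("-"):
        return False
    if label[2:4] == "--":
        # Hyphens in both the third and fourth positions.
        return False
    return True


for candidate in ["example", "-example", "example-", "ab--cd", "a-b"]:
    print(candidate, passes_hyphen_checks(candidate))
```

Passing `check_hyphens=False` skips both restrictions, which is how a caller of this sketch could mirror a more permissive configuration.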

M. Implications for Migration

A significant number of changes in Unicode 10.0 may affect implementations that are upgrading to Version 10.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.

Script-related Changes

Version 10.0 adds four new scripts, so implementations which process script data should be carefully checked. All of these additions are on Plane 1. Some of these scripts have particular attributes which may cause issues for implementations.
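
Because all four new scripts are on Plane 1, code that assumes every character fits in a single 16-bit unit is worth re-checking. In the minimal sketch below, the plane of a code point is its value shifted right by 16 bits, and any code point above U+FFFF takes two UTF-16 code units. The block start values come from the Unicode 10.0 code charts rather than from this page, and the helper names are illustrative.

```python
# First code point of each new script's block (from the 10.0 code charts).
FIRST_CODE_POINTS = {
    "Masaram Gondi": 0x11D00,
    "Nushu": 0x1B170,
    "Soyombo": 0x11A50,
    "Zanabazar Square": 0x11A00,
}


def plane(code_point: int) -> int:
    """Return the plane number (0 to 16) of a code point."""
    return code_point >> 16


def utf16_units(code_point: int) -> int:
    """Return how many 16-bit code units UTF-16 needs for a code point."""
    return len(chr(code_point).encode("utf-16-le")) // 2


for name, cp in FIRST_CODE_POINTS.items():
    print(f"{name}: U+{cp:04X} plane {plane(cp)}, {utf16_units(cp)} UTF-16 code units")
```

Each of these code points needs a surrogate pair in UTF-16, so length calculations and indexing that count code units rather than code points are a likely place for upgrade problems.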

TBD

Numeric-related Issues

TBD

Segmentation-related Changes

TBD

CJK/Unihan Changes

TBD

Standardized Variation Sequences

TBD

New Properties

TBD

Code Charts

TBD