BETA Unicode® 10.0.0
Note: The beta review period for Unicode 10.0.0 closed on May 1, 2017.
Feedback received during the public review is available from PRI #350.
This beta review page remains active, however, for convenient access to the
prepublication versions of the Unicode 10.0.0 data files and annexes until the
formal release, planned for mid-June 2017.
The next version of the Unicode Standard will be Version 10.0.0, planned for release in
June, 2017. This version updates several annexes to deal with
segmentation issues and adds significant new repertoire.
A total of 8,518 new characters are encoded, including
56 new emoji characters,
4 new scripts, and multiple additions to existing blocks. Another major CJK extension is also
included in this version.
A beta version of the 10.0.0 Unicode Character Database files is available for public review.
We strongly encourage implementers to review the summary description,
download the beta 10.0.0 Unicode Character Database files,
and test their programs with the new data, well before the end of the beta period. It is especially important
to review the Notable Issues for Beta Reviewers.
We encourage users to check the code charts carefully
to verify correctness of the new characters added to Unicode 10.0.0 and to ensure
that there are no regressions
in glyph shapes for previously encoded characters.
Related Unicode Technical Standards
In addition to the Unicode Standard proper, three other Unicode Technical
Standards have significant text and data file updates that are
correlated with the new additions for Unicode 10.0.0. Review of that text
and data is also encouraged during the beta review period.
Review and Feedback
For guidance on how to focus your review, see the section
Notable Issues for Beta Reviewers.
Any feedback should be
reported using the contact form.
Comments on the Unicode Standard Version 10.0.0
or the Unicode Character Database data files should refer to the beta review
Public Review Issue #350.
Comments on specific Version 10.0.0 UAXes and UTSes should refer to the respective
Public Review Issue Numbers
for each document, where available.
The comment period ends
May 1, 2017.
All substantive technical comments must have been received by that date for
consideration at the May UTC meeting. Editorial comments (typos,
etc.) may still be submitted after that date for consideration in the final
editorial work.
Note: All beta files may be updated, replaced, or
superseded by other files at any time. The beta files will be
discarded once Unicode 10.0.0 is final. It is inappropriate to cite
these files as other than a work in progress. No
products or implementations should be released based on the beta
UCD data files—use only the final, approved Version 10.0.0 data
files, expected in June 2017.
The Unicode Consortium provides early access to updated versions of the data files
and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of
Version 10.0.0.
The assignment of characters for Unicode 10.0.0 is
now stable. There will be no further
additions or modifications of code points and no further changes to character names.
Please do not submit feedback requesting changes to code points
or character names for Unicode 10.0.0, as such feedback is not actionable.
One of the main purposes of the beta review period is to verify and
correct the preliminary character property assignments in the Unicode Character
Database. Reviewers should check for property changes to existing Unicode 9.0.0
characters, as well as the property values for the new Unicode 10.0.0 character
additions. The Auxiliary
HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts
may be useful for reviewing information such as the default collation order,
Script property assignments, and so forth during beta review.
To facilitate verification of the property changes and additions,
diffable XML versions
of the Unicode Character Database are available. These XML
files are dated, so that people can check the details of changes that occurred
during the beta review period. For more information,
see the
diffs.readme.txt
file.
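As a rough illustration of how the dated XML files can be compared, the sketch below diffs the attributes of characters present in two snapshots. It assumes the UAX #42 layout (a `char` element with a `cp` attribute in the `http://www.unicode.org/ns/2003/ucd/1.0` namespace). The two inline fragments and their attribute values are toy placeholders, not real UCD data, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD in XML (UAX #42), in ElementTree's {uri}tag form.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

def load_chars(xml_text):
    """Map each single code point (hex string) to its attribute dict."""
    root = ET.fromstring(xml_text)
    chars = {}
    for elem in root.iter(UCD_NS + "char"):
        cp = elem.get("cp")
        if cp is not None:  # range entries (first-cp/last-cp) are skipped in this sketch
            chars[cp] = dict(elem.attrib)
    return chars

def property_changes(old, new):
    """Yield (cp, attribute, old_value, new_value) for characters in both snapshots."""
    for cp in sorted(old.keys() & new.keys(), key=lambda s: int(s, 16)):
        for attr in sorted(old[cp].keys() | new[cp].keys()):
            before, after = old[cp].get(attr), new[cp].get(attr)
            if before != after:
                yield cp, attr, before, after

# Toy fragments: the attribute values are invented for illustration only.
OLD = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>
<char cp="061C" sc="Zyyy" lb="CM"/>
</repertoire></ucd>"""
NEW = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>
<char cp="061C" sc="Arab" lb="CM"/>
<char cp="20BF" sc="Zyyy" lb="PR"/>
</repertoire></ucd>"""

for change in property_changes(load_chars(OLD), load_chars(NEW)):
    print(change)
```

Newly added characters (here the second `char` element) are deliberately excluded, so the output isolates property changes to previously encoded characters.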
The beta review period is a good opportunity to add support for the new
Unicode 10.0.0 characters in internal versions of software, so that software can
be tested to verify that the new characters and property assignments do not cause
problems when upgraded to Version 10.0.0 of Unicode.
Notable Issues for Beta Reviewers
Changes to Unicode Standard Annexes
Some of the Unicode Standard Annexes have modifications for
Unicode 10.0.0, often in coordination with changes to character properties.
Most notably for Unicode 10.0.0:
See the Modifications section of each Annex for details of the relevant changes.
Core Specification Update
The core specification is undergoing extensive review, with
numerous additions for Version 10.0.0. Although the draft text for Version 10.0.0
is not yet available, specific reports of any technical or editorial
issues in the currently published core specification
are also welcome during the beta review
period. Such reports will be taken into consideration for corrections
to the Version 10.0.0 draft. (Note: The Unicode Consortium has ongoing
opportunities for subject-matter volunteers: experts interested in contributing to or
editing relevant parts of the core specification or other Unicode specifications.)
Script-specific Issues
Four new scripts have been added in Unicode 10.0. All of these additions are
on Plane 1. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- Zanabazar Square and Soyombo are complex, historic abugidas. They were modeled
on Tibetan, and used to write Mongolian, Tibetan, and Sanskrit. The implementation
of these scripts poses particular challenges, in particular for rendering.
Implementers should check the proposal documents, which contain substantial
details regarding rendering and other aspects of the text model for these scripts.
- Masaram Gondi is another newly added complex script, inspired by the Brahmi model, but
created more or less de novo, and with its own, distinct rendering issues.
- Unicode 10.0 includes another large Unified CJK addition: CJK Extension F.
This extension contains mostly rare characters, but also includes a number of
personal-name and place-name characters important for government specifications,
in Japan in particular.
- 21 more CJK ideographs were added at the end of the URO. Implementations often have
hard-coded ranges for CJK ideographs, so they should be checked carefully to
ensure they pick up the new end of the range (U+9FEA).
- A large collection of Japanese hentaigana has been added. These are effectively
historic variants of Hiragana syllables.
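The hard-coded range issue for the URO can be sketched as follows. The start of the URO (U+4E00) is well known; the previous end, U+9FD5, is derived here by subtracting the 21 new ideographs from the new end U+9FEA. The constant and function names are hypothetical.

```python
URO_START = 0x4E00
URO_END_V10 = 0x9FEA           # new end of the range in Unicode 10.0
URO_END_V9 = URO_END_V10 - 21  # 0x9FD5, derived from the 21 additions

def in_uro(cp, uro_end=URO_END_V10):
    """Return True if cp falls in the URO, given the end of range in use."""
    return URO_START <= cp <= uro_end

def newly_covered(old_end=URO_END_V9, new_end=URO_END_V10):
    """Code points that a stale hard-coded range would miss."""
    return range(old_end + 1, new_end + 1)

# A stale table rejects the new last ideograph; an updated one accepts it.
print(in_uro(0x9FEA, URO_END_V9), in_uro(0x9FEA))
print(len(newly_covered()))
```

Keeping the end of the range in a single named constant, rather than repeating the literal throughout the code, makes this kind of update a one-line change.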
New Data Files Added to the UCD
Several new data files have been added to the UCD:
- NushuSources.txt. This file contains normative information on the source references
for Nüshu characters. The file format is similar to the format of the Unihan data
files and TangutSources.txt. Implementations which support that format for Unihan or
Tangut data should be able to add support for Nüshu data in a similar manner.
- VerticalOrientation.txt. Starting with Version 10.0.0 of the Unicode Standard, this
data file, which lists the Vertical_Orientation property values, is formally included
in the Unicode Character Database. The file format has not changed, but certain lines
of data have been updated for consistency with other UCD files.
Implementers are invited to report any issues that might have been inadvertently
introduced during the migration of the file.
- DerivedName.txt (in the "extracted/" subdirectory). This file provides a complete
listing of the formal Name
property values of characters. In the case of algorithmically derived names,
only those names that follow a simple pattern of a prefix followed by a code
point value are abbreviated. The names of Hangul syllable characters,
as well as all other character names, are listed individually.
Implementations can use this file to conveniently retrieve the formal character
names instead of deriving them themselves.
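A minimal sketch of reading DerivedName.txt follows. It assumes the usual UCD layout of `code point(s) ; value` with `#` comments, and assumes that abbreviated entries are written as a range whose name ends in `*` standing for the code point. The sample lines are illustrative rather than copied from the file, and the function names are hypothetical.

```python
def parse_derived_names(lines):
    """Return (singles, patterns): explicit names and (first, last, prefix) rules."""
    singles, patterns = {}, []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cps, name = (part.strip() for part in line.split(";", 1))
        if ".." in cps:
            first, last = (int(x, 16) for x in cps.split(".."))
        else:
            first = last = int(cps, 16)
        if name.endswith("*"):
            # Assumed abbreviation: prefix followed by the code point value.
            patterns.append((first, last, name[:-1]))
        else:
            for cp in range(first, last + 1):
                singles[cp] = name
    return singles, patterns

def lookup_name(cp, singles, patterns):
    """Look up a name, expanding an abbreviated pattern when needed."""
    if cp in singles:
        return singles[cp]
    for first, last, prefix in patterns:
        if first <= cp <= last:
            return prefix + format(cp, "04X")
    return None

SAMPLE = [
    "# illustrative lines only",
    "0041          ; LATIN CAPITAL LETTER A",
    "AC00          ; HANGUL SYLLABLE GA",
    "1B170..1B2FB  ; NUSHU CHARACTER-*",
]
singles, patterns = parse_derived_names(SAMPLE)
print(lookup_name(0x1B171, singles, patterns))
```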
General Property Issues
There are a number of issues related to particular character properties:
- UCD properties which depend on emoji character properties have been
synchronized with Emoji 5.0.
- The line breaking properties of a number of emoji characters have
been updated as a result of changes in emoji zwj sequences.
- The enumerated property Vertical_Orientation has been incorporated in the UCD,
as part of the progression from UTR to UAX of the Unicode Vertical Text Layout
specification.
- The characters of two newly encoded scripts, Soyombo and Zanabazar Square,
as well as the unassigned code points in their blocks, have been assigned the
Vertical_Orientation property value Upright (vo=U). In spite of the affinity of
those scripts with Tibetan, that assignment was based on a few instances of
text laid out vertically. Implementers are encouraged to provide feedback
on the current Vertical_Orientation property values for those scripts.
- A new normative binary property Regional_Indicator has been introduced.
This property is referenced in the line breaking and text segmentation algorithms,
to assist in the determination of correct text boundaries around emoji flag sequences.
- The Script and Script_Extensions properties of U+061C ARABIC LETTER MARK (ALM)
have been revised, so that the character now has the same effects on digit substitution
as regular Arabic letters.
- A set of new Arabic joining groups has been added for Malayalam
Garshuni letters (in the Syriac script).
- The derivation of the Word_Break property value ALetter was extended to
include 36 modifier letters.
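To show why a dedicated Regional_Indicator property helps with boundaries around flag sequences, here is a deliberately simplified sketch that groups regional indicator symbols (U+1F1E6..U+1F1FF) into pairs. It ignores every other segmentation rule, so it is not an implementation of the UAX #29 or UAX #14 algorithms; the function names are hypothetical.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z

def is_regional_indicator(ch):
    """Stand-in for a lookup of the binary Regional_Indicator property."""
    return RI_FIRST <= ord(ch) <= RI_LAST

def split_flags(text):
    """Split text into chunks, pairing regional indicators from the left."""
    chunks = []
    i = 0
    while i < len(text):
        if (is_regional_indicator(text[i])
                and i + 1 < len(text)
                and is_regional_indicator(text[i + 1])):
            chunks.append(text[i:i + 2])  # one flag: exactly two indicators
            i += 2
        else:
            chunks.append(text[i])        # unpaired indicator or other character
            i += 1
    return chunks

# Five indicators in a row: two complete pairs, then one left over.
run = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA\U0001F1FA"
print([len(chunk) for chunk in split_flags(run)])
```

The point of the pairing is that a boundary falls after every second indicator in a run, not between arbitrary adjacent indicators.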
The following assignments of Line_Break property values deserve careful review. Implementers and specialists are invited to provide feedback on these assignments.
- U+20BF BITCOIN SIGN has been assigned lb=PR, the default for currency symbols.
- Soyombo cluster-initial letters U+11A86..U+11A89 have been assigned lb=AL,
instead of the erroneous lb=CM in the proposal document.
- The Soyombo and Zanabazar Square shad punctuation marks, U+11A9B..U+11A9C and U+11A42..U+11A43,
have been assigned lb=BA, as proposed. However, corresponding shad characters in
Tibetan, ’Phags-pa, and Marchen are lb=EX.
The difference is that EX, unlike BA, would prohibit indirect line breaks.
- Three symbols which occur as final elements of emoji zwj sequences have been given
Emoji properties while preserving their current Line_Break values.
These three symbols are U+2640 FEMALE SIGN, U+2642 MALE SIGN, and U+2695 STAFF OF AESCULAPIUS.
Unihan-related Issues
Because a major new CJK extension is part of Unicode 10.0, all Unihan
properties should be reviewed carefully. Additionally, the following
deserve special attention:
- A new full radical-stroke
index is available, which includes CJK Extension F and the 21 new characters
added at the end of the URO.
- The addition of new CJK sources means that adjustments have been made
to the regular expressions used to validate the kIRG_...Source tags in
the Unihan database. See UAX #38 for details.
- The newly added character U+9FEA is the result of a formal disunification
from U+3E02.
Standardized Variation Sequences
There have been significant changes to StandardizedVariants.txt and to the
documentation of variation sequences involving emoji, which are now known more specifically as
emoji presentation sequences and text presentation sequences.
- All of the emoji and text presentation sequences were moved from the UCD file
StandardizedVariants.txt to the UTS #51 data file emoji-variation-sequences.txt.
The latter is a new data file accompanying Version 5.0 of UTS #51, Unicode Emoji,
whose emoji character repertoire corresponds to Unicode 10.0.
New emoji and text presentation sequences are also included in emoji-variation-sequences.txt.
Implementations should be prepared to consume such sequence data from the new file and,
in general, to use Unicode Emoji Version 5.0 data in conjunction with UCD 10.0 data.
- Other changes in StandardizedVariants.txt include corrections to the labels of a
few Mongolian standardized variation sequences, but without changes to the actual
character sequences.
- Also, the documentation file StandardizedVariants.html has been removed
altogether, as its function has been superseded by other documentation.
Representative glyphs for the standardized variation sequences are still shown
in the Unicode code charts, but emoji and text presentation sequences
are now displayed in the emoji charts, instead.
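Consuming the sequences from the new file might look like the sketch below. It assumes that emoji-variation-sequences.txt keeps the semicolon-delimited layout used by StandardizedVariants.txt (a base code point and a variation selector, then a description such as "text style" or "emoji style", then a `#` comment). The sample lines are illustrative, and the function name is hypothetical.

```python
def parse_variation_sequences(lines):
    """Map (base, selector) code point pairs to their style description."""
    result = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        base, selector = (int(x, 16) for x in fields[0].split())
        result[(base, selector)] = fields[1]
    return result

SAMPLE = [
    "# illustrative lines only",
    "0023 FE0E  ; text style;  # (1.1) NUMBER SIGN",
    "0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN",
]
sequences = parse_variation_sequences(SAMPLE)
print(sequences[(0x0023, 0xFE0F)])
```

An implementation that previously read these sequences from StandardizedVariants.txt could merge the two dictionaries, which keeps the rest of its lookup logic unchanged.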
Code Charts
As always, careful review of the updated code charts for Version 10.0 is advised,
especially for all newly added scripts.
Particular issues to take note of include:
- Emoji and text presentation sequences are no longer displayed in the Unicode code charts.
They are documented instead in the emoji charts area. For the emoji charts currently
in beta review for Emoji 5.0,
see Emoji 5.0 beta charts.
- A number of representative glyphs for pictographic symbols in the Unicode code charts have been
updated, as part of the ongoing updating of glyphs for emoji characters.
- There is an outstanding glyph erratum approved by the UTC for a few Brahmi characters
but not yet reflected in the Unicode code charts.
Collation-related Issues
The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 10.0
repertoire for UCA 10.0. For the most part, the additions for new scripts and other
characters are unremarkable, but there are a couple of items that implementers
of collation should be aware of:
- The large hentaigana collection is simply tacked on to the end of the
range of primary weights for Japanese syllabaries. Contrary to possible
expectations, hentaigana are not interfiled with standard Hiragana syllables
with the same sounds, in part because a significant proportion of the hentaigana
characters have historic associations with more than one Japanese syllable.
Also, the collation order of U+1B001 was modified, to ensure that it occurs
in the slot for "e-1" in the hentaigana collection.
- The addition of another ideographic script, Nüshu, necessitates the addition of
another implicit weight base to the UCA. This is also reflected in
a second @implicitweights line at the top of DUCET. Implementations of UCA
will need to be updated to take this change in implicit weighting into account.
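The change in implicit weighting can be sketched as a table-driven lookup. This assumes that each directive line at the top of DUCET has the form `<directive> <first>..<last>; <base>`, and that the two primary weights for such ranges are the base followed by `(code point - first) | 0x8000`, as described in UTS #10. The sample lines, including the specific ranges and bases, are given from memory for illustration and should be checked against the actual file. The function names are hypothetical.

```python
def parse_implicit_weights(lines):
    """Collect (first, last, base) triples from implicit-weight directive lines."""
    table = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line.startswith("@implicitweight"):
            continue  # every other line of the table is ignored in this sketch
        _directive, body = line.split(None, 1)
        cp_range, base = (part.strip() for part in body.split(";"))
        first, last = (int(x, 16) for x in cp_range.split(".."))
        table.append((first, last, int(base, 16)))
    return table

def siniform_primary(cp, table):
    """Return the (AAAA, BBBB) primary weight pair, or None if cp is not covered."""
    for first, last, base in table:
        if first <= cp <= last:
            return base, (cp - first) | 0x8000
    return None

SAMPLE = [
    "@version 10.0.0",
    "@implicitweights 17000..18AFF; FB00 # Tangut and Tangut Components",
    "@implicitweights 1B170..1B2FF; FB01 # Nushu",
]
table = parse_implicit_weights(SAMPLE)
aaaa, bbbb = siniform_primary(0x1B171, table)
print(format(aaaa, "04X"), format(bbbb, "04X"))
```

Reading the bases from the directive lines, rather than hard-coding one range, means a further ideographic script would need no code change in this part of an implementation.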
Other Issues
Please also check the following specific items carefully:
- Four formal character name aliases of type correction have been assigned in NameAliases.txt
to the jamos U+11EC..U+11EF, which contain yesieung rather
than ieung components.
- A formal name alias of type correction was added for the previously
encoded archaic Hiragana syllable U+1B001.
This addition was to ensure the identification of that earlier encoded character
as formally being part of the set of hentaigana.
- Nüshu was added to UnicodeData.txt with a start line and end line, similar to the way that data file handles CJK unified ideographs. Parsers of UnicodeData.txt may need to be updated to handle this new range.
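A parser that handles start and end lines might be sketched as follows. It assumes the established UnicodeData.txt convention of names written as `<Label, First>` and `<Label, Last>`; the sample lines are illustrative rather than copied from the beta file, and the function name is hypothetical.

```python
FIRST_SUFFIX, LAST_SUFFIX = ", First>", ", Last>"

def parse_unicode_data(lines):
    """Yield (first, last, name_or_label, general_category) per entry or range."""
    pending = None  # (first code point, label) of an open range, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            pending = (cp, name[1:-len(FIRST_SUFFIX)])
        elif name.startswith("<") and name.endswith(LAST_SUFFIX):
            label = name[1:-len(LAST_SUFFIX)]
            if pending is None or pending[1] != label:
                raise ValueError("range end without matching start: " + line)
            yield pending[0], cp, label, gc
            pending = None
        else:
            yield cp, cp, name, gc

SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "1B170;<Nushu Character, First>;Lo;0;L;;;;;N;;;;;",
    "1B2FB;<Nushu Character, Last>;Lo;0;L;;;;;N;;;;;",
]
for entry in parse_unicode_data(SAMPLE):
    print(entry)
```

Matching on the `First>` and `Last>` suffixes, instead of on a fixed list of labels, lets the same code pick up a new range without modification.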
The following blocks are new in Unicode 10.0.0. Check implementations
carefully for any range or property value assumptions regarding
these new blocks. See also the single-block delta charts.
Range | Block Name
0860..086F | Syriac Supplement
11A00..11A4F | Zanabazar Square
11A50..11AAF | Soyombo
11D00..11D5F | Masaram Gondi
1B100..1B12F | Kana Extended-A
1B170..1B2FF | Nushu
2CEB0..2EBEF | CJK Extension F
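One way to test range assumptions against the new blocks is a small lookup table built from the ranges listed above. This is a sketch only: it covers just the blocks new in 10.0.0, and the function name is hypothetical.

```python
import bisect

# (first, last, block name), sorted by first code point, from the table above.
NEW_BLOCKS = [
    (0x0860, 0x086F, "Syriac Supplement"),
    (0x11A00, 0x11A4F, "Zanabazar Square"),
    (0x11A50, 0x11AAF, "Soyombo"),
    (0x11D00, 0x11D5F, "Masaram Gondi"),
    (0x1B100, 0x1B12F, "Kana Extended-A"),
    (0x1B170, 0x1B2FF, "Nushu"),
    (0x2CEB0, 0x2EBEF, "CJK Extension F"),
]
_STARTS = [block[0] for block in NEW_BLOCKS]

def new_block_of(cp):
    """Return the name of the new block containing cp, or None."""
    i = bisect.bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= NEW_BLOCKS[i][1]:
        return NEW_BLOCKS[i][2]
    return None

print(new_block_of(0x11A50))
print(new_block_of(0x1B130))  # falls in the gap between two new blocks
```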
Some blocks have also had font updates; see the
single-block delta charts for details.
In such cases, careful review of the blocks in question
is advised, to ensure that there have not been any
regressions in representative glyph display.
General Issues
For current proposed updates to the particular UAXes, see
Proposed Updates for Standard Annexes
or use the links in the navigation bar on this page.
Particular issues in the UAXes may also be the focus of specific
Public Review Issues.
Each proposed textual change in a UAX is highlighted, so that you can focus
your review on those sections if you have limited time. The changes
are also listed in detail in the Modifications sections (linked from the table
of contents of each document), and are summarized in
UAX changes,
so you can check on those areas that might be of most
interest.
Some links between beta documents and the proposed
updates for UAXes will not work correctly during the
beta review period. This is a known problem which does
not need to be reported, as such links point to
the eventual final names or revision numbers for the
released versions.
Stability
Certain character properties for newly assigned characters cannot be
changed after the formal release of each version of the standard, because of the
Character Encoding Stability Policy.
Such character property values need special attention during the beta review process, as they
cannot be corrected after publication. These include:
- Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.
- The determination of whether a character is included in identifiers (XID_Start, XID_Continue).
- Case mappings and case foldings.