BETA Unicode® 11.0.0
Note: The beta review period for Unicode 11.0.0 closed on April 23, 2018.
Feedback received during the public review is accessible from PRI #372.
This beta review page is left active, however, for convenience of access to
the prepublication versions of the Unicode 11.0.0 data files and annexes,
until the formal release planned for June 5, 2018.
The next version of the Unicode Standard will be Version 11.0.0, planned for release on
June 5, 2018. This version updates several annexes to deal with
segmentation issues and adds significant new repertoire.
A total of 684 new characters are encoded, including
66 new emoji characters,
7 new scripts, and multiple additions to existing blocks.
A beta version of the 11.0.0 Unicode Character Database files is available for public review.
We strongly encourage implementers to review the summary description,
download the beta 11.0.0 Unicode Character Database files,
and test their programs with the new data, well before the end of the beta period. It is especially important
to review the Notable Issues for Beta Reviewers.
We encourage users to check the code charts carefully
to verify correctness of the new characters added to Unicode 11.0.0 and to ensure
that there are no regressions
in glyph shapes for previously encoded characters.
Related Unicode Technical Standards
In addition to the Unicode Standard proper, four other Unicode Technical
Standards have significant text and data file updates that are
correlated with the new additions for Unicode 11.0.0. Review of that text
and data is also encouraged during the beta review period.
Review and Feedback
For guidance on how to focus your review, see the section
Notable Issues for Beta Reviewers.
Any feedback should be
reported using the contact form.
Comments on the Unicode Standard Version 11.0.0
or the Unicode Character Database data files should refer to the beta review
Public Review Issue #372.
Comments on specific Version 11.0.0 UAXes and UTSes should refer to the respective
Public Review Issue Numbers
for each document, where available.
The comment period ends
April 23, 2018.
All substantive technical comments must have been received by that date for
consideration at the May UTC meeting. Editorial comments (typos,
etc.) may still be submitted after that date for consideration in the final
editorial work.
Note: All beta files may be updated, replaced, or
superseded by other files at any time. The beta files will be
discarded once Unicode 11.0.0 is final. It is inappropriate to cite
these files as other than a work in progress. No
products or implementations should be released based on the beta
UCD data files—use only the final, approved Version 11.0.0 data
files, expected on June 5, 2018.
The Unicode Consortium provides early access to updated versions of the data files
and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of
Version 11.0.0.
The assignment of characters for Unicode 11.0.0 is
now stable. There will be no further
additions or modifications of code points and no further changes to character names.
Please do not submit feedback requesting changes to code points
or character names for Unicode 11.0.0, as such feedback is not actionable.
One of the main purposes of the beta review period is to verify and
correct the preliminary character property assignments in the Unicode Character
Database. Reviewers should check for property changes to existing Unicode 10.0.0
characters, as well as the property values for the new Unicode 11.0.0 character
additions. The Auxiliary
HTML charts include the new characters highlighted in yellow, with names
appearing when hovering over a cell. These charts
may be useful for reviewing information such as the default collation order,
Script property assignments, and so forth during beta review.
To facilitate verification of the property changes and additions,
diffable XML versions
of the Unicode Character Database are available. These XML
files are dated, so that people can check the details of changes that occurred
during the beta review period. For more information,
see the
diffs.readme.txt
file.
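As a rough illustration of how the diffable XML files can be used, the sketch below compares one property between two versions of the XML data. It assumes the `<char cp="..." gc="..."/>` element layout and namespace described in UAX #42; the two sample documents are tiny hand-written stand-ins, not excerpts from the real files, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the UCD in XML (assumed from UAX #42).
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"


def load_property(xml_text, prop):
    """Return {code point: value} for one property attribute.

    Only single-code-point <char cp="..."> elements are read; range
    entries (first-cp/last-cp) are skipped in this sketch.
    """
    values = {}
    root = ET.fromstring(xml_text)
    for char in root.iter(UCD_NS + "char"):
        cp = char.get("cp")
        value = char.get(prop)
        if cp is not None and value is not None:
            values[int(cp, 16)] = value
    return values


def diff_property(old_xml, new_xml, prop):
    """List (code point, old value, new value) where the property differs."""
    old = load_property(old_xml, prop)
    new = load_property(new_xml, prop)
    changes = []
    for cp in sorted(old.keys() | new.keys()):
        before, after = old.get(cp), new.get(cp)
        if before != after:
            changes.append((cp, before, after))
    return changes


# Hand-written samples for illustration only.
OLD_SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>
<char cp="0041" gc="Lu"/>
<char cp="10D0" gc="Lo"/>
</repertoire></ucd>"""

NEW_SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>
<char cp="0041" gc="Lu"/>
<char cp="10D0" gc="Ll"/>
<char cp="1C90" gc="Lu"/>
</repertoire></ucd>"""

for cp, before, after in diff_property(OLD_SAMPLE, NEW_SAMPLE, "gc"):
    print(f"U+{cp:04X}: gc {before} -> {after}")
```

A value of `None` on either side means the code point is absent from that version, which is how newly encoded characters show up in this kind of comparison.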
The beta review period is a good opportunity to add support for the new
Unicode 11.0.0 characters in internal versions of software, so that software can
be tested to verify that the new characters and property assignments do not cause
problems when upgraded to Version 11.0.0 of Unicode.
Notable Issues for Beta Reviewers
Changes to Unicode Standard Annexes
Some of the Unicode Standard Annexes have modifications for
Unicode 11.0.0, often in coordination with changes to character properties.
Most notably for Unicode 11.0.0:
- UAX #29 handling of grapheme cluster boundary determination has undergone
a significant update, to better handle consonants linked by viramas, so as
to provide better segmentation of Indic phonological syllables. Implementers
of segmentation should carefully check their property classes and rules.
See the Modifications section of each Annex for details of the relevant changes.
Core Specification Update
The core specification is undergoing extensive review, with
numerous additions for Version 11.0.0. Although the draft text for Version 11.0.0
is not yet available, specific reports of any technical or editorial
issues in the currently published core specification
are also welcome during the beta review
period. Such reports will be taken into consideration for corrections
to the Version 11.0.0 draft. (Note: The Unicode Consortium has ongoing
opportunities for subject-matter volunteers: experts interested in contributing to or
editing relevant parts of the core specification or other Unicode specifications.)
Script-specific Issues
Seven new scripts have been added in Unicode 11.0.0. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- The Hanifi Rohingya script is a new RTL script, with numbers written LTR, as in Arabic.
- The tatweel (U+0640) has been extended for use in Hanifi Rohingya and Sogdian.
- There are two new sets of vigesimal (base 20) numerals, one for the Medefaidrin script, and another for Mayan. The Mayan numerals are added for specialty use, as for page numbers, in advance of the encoding of the full Mayan script.
- Indic Siyaq numerals have complex formatting requirements when combined to
represent large numbers.
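To make the vigesimal numerals concrete, here is a minimal sketch that converts a non-negative integer to base-20 digits and maps each digit to a character in the Mayan Numerals block. It assumes the twenty digits zero through nineteen are encoded contiguously starting at U+1D2E0; the function names are hypothetical, and the sketch says nothing about how the digits should be laid out visually.

```python
# Assumption: U+1D2E0 is the digit zero and the digits 1..19 follow contiguously.
MAYAN_ZERO = 0x1D2E0


def to_vigesimal_digits(n):
    """Return the base-20 digits of n, most significant digit first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    digits = []
    while True:
        n, remainder = divmod(n, 20)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]


def to_mayan_numerals(n):
    """Map each base-20 digit of n to a Mayan numeral character."""
    return "".join(chr(MAYAN_ZERO + d) for d in to_vigesimal_digits(n))


# 2018 = 5 * 400 + 0 * 20 + 18
print(to_vigesimal_digits(2018))
print([f"U+{ord(c):04X}" for c in to_mayan_numerals(2018)])
```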
New Data Files Added to the UCD
- A new data file has been added to the UCD: EquivalentUnifiedIdeograph.txt.
That data file contains the mapping values for the new property,
Equivalent_Unified_Ideograph (EqUIdeo).
Casing Issues
There has been a very significant change to casing behavior for the Georgian
script. A new set of Mtavruli capital letters (U+1C90..U+1CBA, U+1CBD..U+1CBF)
has been added to Unicode 11.0.0,
with case mappings to the existing Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF).
In prior versions of the Unicode Standard, Mkhedruli Georgian was considered to
be a monocameral (non-casing) script, and the Mkhedruli Georgian letters were gc=Lo.
Starting with Version 11.0.0, those Mkhedruli Georgian letters are now gc=Ll, and
have uppercase mappings to Mtavruli Georgian capital letters. This change will
have major implications for Georgian implementations, including changes for
input methods, fonts, casing, and string matching. Existing implementations
have treated Mtavruli headlines and other uses for textual emphasis as a text
style, so there will also be significant issues for document conversion and
upgrade.
Another complication for Georgian is that the primary orthography does not use
titlecasing, and the Mkhedruli Georgian letters do not have titlecase mappings to
Mtavruli letters. This is unique among bicameral systems in the Unicode Standard,
so casing implementations should be prepared for this exception.
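The Georgian casing behavior described above can be sketched as follows. This is an illustration only: it assumes that each Mkhedruli letter maps to the Mtavruli letter at the same position in the parallel range (a constant offset), and the function names are hypothetical. A real implementation should take its mappings from the final UCD data files.

```python
# Mkhedruli ranges with case mappings, as quoted in the text above.
MKHEDRULI_RANGES = ((0x10D0, 0x10FA), (0x10FD, 0x10FF))

# Assumption: the mapping is position-by-position, so one offset covers it.
MTAVRULI_OFFSET = 0x1C90 - 0x10D0


def _is_mkhedruli(cp):
    return any(lo <= cp <= hi for lo, hi in MKHEDRULI_RANGES)


def georgian_upper(text):
    """Uppercase Mkhedruli letters to Mtavruli; leave everything else alone."""
    return "".join(
        chr(ord(c) + MTAVRULI_OFFSET) if _is_mkhedruli(ord(c)) else c
        for c in text
    )


def georgian_lower(text):
    """Lowercase Mtavruli letters to Mkhedruli; leave everything else alone."""
    return "".join(
        chr(ord(c) - MTAVRULI_OFFSET)
        if _is_mkhedruli(ord(c) - MTAVRULI_OFFSET) else c
        for c in text
    )


def georgian_title(text):
    """Titlecasing leaves Mkhedruli unchanged: there is no titlecase
    mapping to Mtavruli, unlike other bicameral scripts."""
    return text


word = "\u10E5\u10D0\u10E0\u10D7\u10E3\u10DA\u10D8"  # Mkhedruli sample word
print([f"U+{ord(c):04X}" for c in georgian_upper(word)])
```

Note that the two code points in the gap of the Mkhedruli range, U+10FB and U+10FC, are deliberately left untouched by this sketch.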
General Character Property Issues
There are a number of issues related to particular character properties:
- New GCB and WB segmentation property values have been added for the revised algorithms, to better handle Indic phonological syllables (aksaras). (See also UAX #29.) A couple of emoji-related property values are no longer used for segmentation, as a consequence of the changes in UAX #29.
- GCB=Extend no longer matches Grapheme_Extend=Y, as a result of its partitioning to factor out a new class, GCB=Virama. WB=Extend and SB=Extend are unaffected.
- In prior versions of the UCD, cursive joining scripts which had
any Joining_Group values assigned included distinct values for all
characters that participate in cursive joining, including all of
the Joining_Group singletons (classes containing only a single
character). Starting with Unicode 11.0.0 and going forward,
explicit Joining_Group values are assigned only to characters which
do not constitute singleton classes. This new convention is applicable to
the two newly encoded cursive joining scripts: Hanifi Rohingya and Sogdian.
Implementations may need to take into account this discontinuity in how
Joining_Group values are assigned to cursive joining scripts.
- Bidi mirroring: Unicode 11.0.0 now adds formal recognition of a number of
previously encoded mathematical
characters as forming mirroring pairs. This means that there is now a further
deviation between the mappings defined in BidiMirroring.txt and
those defined in the OpenType mirroring list, which was frozen as of Unicode 5.1.
Note that this does not change bidirectional formatting: there is no
change to the Bidi_Mirrored binary property value here, but only to the listing
of which pairs of encoded characters have nominally mirroring glyphs.
- Some property values have been added to the Indic_Syllabic_Category property.
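For the bidi mirroring item in the list above, one way to see which pairs differ between two versions of BidiMirroring.txt is to parse both and compare the resulting maps. The sketch below assumes the usual `code; mirror # comment` line layout of that file. The sample text is a made-up stand-in (it pretends the angle-bracket pair is the addition) and does not reflect the actual Unicode 11.0.0 changes; the function names are hypothetical.

```python
def parse_bidi_mirroring(text):
    """Parse 'XXXX; YYYY # comment' lines into {code point: mirrored code point}."""
    pairs = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not data:
            continue
        source, mirror = (field.strip() for field in data.split(";"))
        pairs[int(source, 16)] = int(mirror, 16)
    return pairs


def added_or_changed(old_pairs, new_pairs):
    """Return the entries of new_pairs that are missing or different in old_pairs."""
    return {cp: m for cp, m in new_pairs.items() if old_pairs.get(cp) != m}


# Made-up samples for illustration only; not real file contents.
OLD_TEXT = """\
# sample header comment
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
"""

NEW_TEXT = OLD_TEXT + """\
003C; 003E # LESS-THAN SIGN
003E; 003C # GREATER-THAN SIGN
"""

delta = added_or_changed(parse_bidi_mirroring(OLD_TEXT),
                         parse_bidi_mirroring(NEW_TEXT))
for cp, mirror in sorted(delta.items()):
    print(f"U+{cp:04X} <-> U+{mirror:04X}")
```

The same comparison could be run against a frozen list, such as the OpenType mirroring list mentioned above, to find where the two sets of mappings deviate.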
The following assignments of Line_Break property values deserve careful review. Implementers and specialists are invited to provide feedback on these assignments.
- U+0C84 KANNADA SIGN SIDDHAM (lb=BB)
- Historic punctuation in the range U+2E43..U+2E4E (mostly lb=BA)
Additionally, implementers should take note of the following special Line_Break
property values associated with a subset of the emoji additions to UCD 11.0:
- New emoji base characters: U+1F9B5, U+1F9B6, U+1F9B8, U+1F9B9 (lb=EB). Note that
most new emoji characters have the value lb=ID.
Unihan-related Issues
All Unihan
properties should be reviewed carefully. Additionally, the following
deserve special attention:
- Additional CJK unified ideographs, which push the end of range for assigned characters in the main CJK block. (The same issue applies for Tangut, which also had a few new ideographs added at the end of the main Tangut block.)
- 5 new provisional Unihan properties have been added.
- In addition, the kHangul property values underwent a major revision.
Standardized Variation Sequences
One new standardized variation sequence has been added, to represent a short diagonal stroke form of U+FF10 FULLWIDTH DIGIT ZERO.
Code Charts
As always, careful review of the updated code charts for Version 11.0.0 is advised,
especially for all newly added scripts.
Particular issues to take note of include:
- The use of characters beyond the range of Latin-1 is now allowed in
annotations in the names list. (See NamesList.html for details.) Some
other adaptations have been made in the use of fonts in the names list
part of the code charts.
Collation-related Issues
The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 11.0
repertoire for UCA 11.0. For the most part, the additions for new scripts and other
characters are unremarkable, but implementations should be checked to ensure
the new additions do not cause problems.
Other Issues
Please also check the following specific items carefully:
- The versioning for the emoji data release associated with
UTS #51 was bumped from 5.0 directly to 11.0, to enable a
less confusing synchronization with the UCD proper.
- Data for the 66 new emoji characters associated with the
Unicode 11.0 repertoire was officially released on February 7,
in order to accommodate the long lead times involved in rolling out
new emoji support. UCD 11.0 beta reviewers should note that
property values for characters that depend in any direct way on the Emoji 11.0
data cannot now be changed for Unicode 11.0, because of
stability requirements.
The following blocks are new in Unicode 11.0.0. Check implementations
carefully for any range or property value assumptions regarding
these new blocks. See also the single-block delta charts.
Range | Block Name
1C90..1CBF | Georgian Extended
10D00..10D3F | Hanifi Rohingya
10F00..10F2F | Old Sogdian
10F30..10F6F | Sogdian
11800..1184F | Dogra
11D60..11DAF | Gunjala Gondi
11EE0..11EFF | Makasar
16E40..16E9F | Medefaidrin
1D2E0..1D2FF | Mayan Numerals
1EC70..1ECBF | Indic Siyaq Numbers
1FA00..1FA6F | Chess Symbols
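When checking range assumptions, it can help to have the new block ranges in machine-readable form. The following is a minimal sketch of a lookup over the ranges listed in the table above; the function and variable names are hypothetical, and the table covers only the blocks that are new in this version.

```python
from bisect import bisect_right

# Ranges and names copied from the table above, sorted by start code point.
NEW_BLOCKS = [
    (0x1C90, 0x1CBF, "Georgian Extended"),
    (0x10D00, 0x10D3F, "Hanifi Rohingya"),
    (0x10F00, 0x10F2F, "Old Sogdian"),
    (0x10F30, 0x10F6F, "Sogdian"),
    (0x11800, 0x1184F, "Dogra"),
    (0x11D60, 0x11DAF, "Gunjala Gondi"),
    (0x11EE0, 0x11EFF, "Makasar"),
    (0x16E40, 0x16E9F, "Medefaidrin"),
    (0x1D2E0, 0x1D2FF, "Mayan Numerals"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1FA00, 0x1FA6F, "Chess Symbols"),
]
_BLOCK_STARTS = [start for start, _, _ in NEW_BLOCKS]


def new_block_of(cp):
    """Return the name of the new block containing cp, or None."""
    index = bisect_right(_BLOCK_STARTS, cp) - 1
    if index >= 0:
        start, end, name = NEW_BLOCKS[index]
        if cp <= end:
            return name
    return None


for cp in (0x1C90, 0x10F30, 0x1FA6F, 0x0041):
    print(f"U+{cp:04X}: {new_block_of(cp)}")
```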
Some blocks have also had font updates; see the
single-block delta charts for details.
In such cases, careful review of the blocks in question
is advised, to ensure that there have not been any
regressions in representative glyph display.
General Issues
For current proposed updates to the particular UAXes, see
Proposed Updates for Standard Annexes
or use the links in the navigation bar on this page.
Particular issues in the UAXes may also be the focus of specific
Public Review Issues.
Each proposed textual change in a UAX is highlighted, so that you can focus
your review on those sections if you have limited time. The changes
are also listed in detail in the Modifications sections (linked from the table
of contents of each document), and are summarized in
UAX changes,
so you can check on those areas that might be of most
interest.
Some links between beta documents and the proposed
updates for UAXes will not work correctly during the
beta review period. This is a known problem which does
not need to be reported, as such links point to
the eventual final names or revision numbers for the
released versions.
Stability
Certain character properties for newly assigned characters cannot be
changed after the formal release of each version of the standard, because of the
Character Encoding Stability Policy.
Such character property values need special attention during the beta review process, as they
cannot be corrected after publication. These include:
- Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.
- The determination of whether a character is included in identifiers (XID_Start, XID_Continue).
- Case mappings and case foldings.
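For reviewers who want to spot-check the stability-sensitive properties listed above, a snapshot like the one below can be taken with two different data versions and compared. This is a sketch using Python's standard `unicodedata` module, which reflects whatever Unicode version the running interpreter ships with rather than the beta files. The identifier checks rely on `str.isidentifier()`, which is assumed here to approximate XID_Start and XID_Continue; the function names are hypothetical.

```python
import unicodedata


def stability_snapshot(ch):
    """Collect normalization, identifier, and casing data for one character."""
    return {
        # Normalization-related properties
        "decomposition": unicodedata.decomposition(ch),
        "ccc": unicodedata.combining(ch),
        "nfd": unicodedata.normalize("NFD", ch),
        "nfkc": unicodedata.normalize("NFKC", ch),
        # Identifier status (approximation via Python's identifier rules)
        "id_start": ch.isidentifier(),
        "id_continue": ("a" + ch).isidentifier(),
        # Case mappings and case folding
        "upper": ch.upper(),
        "lower": ch.lower(),
        "casefold": ch.casefold(),
    }


def changed_fields(old_snapshot, new_snapshot):
    """Return the names of fields whose values differ between two snapshots."""
    return sorted(k for k in old_snapshot if old_snapshot[k] != new_snapshot.get(k))


snapshot = stability_snapshot("\u00E9")  # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.unidata_version, snapshot["decomposition"], snapshot["ccc"])
```

Any non-empty result from `changed_fields` for a character encoded in an earlier version would point to something worth reporting during review.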