BETA Unicode® 11.0.0

Note: The beta review period for Unicode 11.0.0 closed on April 23, 2018. Feedback received during the public review is accessible from PRI #372. This beta review page remains active, however, for convenient access to the prepublication versions of the Unicode 11.0.0 data files and annexes until the formal release planned for June 5, 2018.

The next version of the Unicode Standard will be Version 11.0.0, planned for release on June 5, 2018. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 684 new characters are encoded, including 66 new emoji characters, 7 new scripts, and multiple additions to existing blocks.

A beta version of the 11.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 11.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 11.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 11.0.0

Code charts - single download (108 MB)

Auxiliary HTML charts for beta review

Related Unicode Technical Standards

In addition to the Unicode Standard proper, four other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 11.0.0. Review of that text and data is also encouraged during the beta review period.

UTS #10, Unicode Collation Algorithm Data files

UTS #39, Unicode Security Mechanisms Data files

UTS #46, Unicode IDNA Compatibility Processing Data files

UTS #51, Unicode Emoji Data files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 11.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #372. Comments on specific Version 11.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends April 23, 2018. All substantive technical comments must have been received by that date for consideration at the May UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 11.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 11.0.0 data files, expected on June 5, 2018.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 11.0.0.

The assignment of characters for Unicode 11.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 11.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 10.0.0 characters, as well as the property values for the new Unicode 11.0.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. For more information, see the diffs.readme.txt file.
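Besides diffing the XML files, a short script can flag property changes between two versions of a semicolon-delimited data file. The following is a minimal sketch, assuming input in the UnicodeData.txt layout (code point in field 0, General_Category in field 2). The function names are hypothetical, and range entries (the "First"/"Last" pairs) are treated as two ordinary lines rather than expanded.

```python
def parse_general_categories(text):
    """Map code point (int) -> General_Category from UnicodeData.txt-style lines."""
    categories = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        if len(fields) < 3:
            continue  # not a data line in the assumed layout
        categories[int(fields[0], 16)] = fields[2]
    return categories


def changed_categories(old_text, new_text):
    """Return {code point: (old gc, new gc)} for code points present in both inputs."""
    old = parse_general_categories(old_text)
    new = parse_general_categories(new_text)
    return {
        cp: (old[cp], new[cp])
        for cp in sorted(old.keys() & new.keys())
        if old[cp] != new[cp]
    }
```

Newly added code points are deliberately ignored here, so the result isolates changes to previously encoded characters, which is the case reviewers are asked to check.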

The beta review period is a good opportunity to add support for the new Unicode 11.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 11.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 11.0.0, often in coordination with changes to character properties. Most notably for Unicode 11.0.0:

The handling of grapheme cluster boundaries in UAX #29 has been significantly updated to better handle consonants linked by viramas, providing better segmentation of Indic phonological syllables. Implementers of segmentation should carefully check their property classes and rules.

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

The core specification is undergoing extensive review, with numerous additions for Version 11.0.0. Although the draft text for Version 11.0.0 is not yet available, specific reports of any technical or editorial issues in the currently published core specification are also welcome during the beta review period. Such reports will be taken into consideration for corrections to the Version 11.0.0 draft. (Note: The Unicode Consortium has ongoing opportunities for subject-matter volunteers: experts interested in contributing to or editing relevant parts of the core specification or other Unicode specifications.)

Script-specific Issues

Seven new scripts have been added in Unicode 11.0.0. Some of these scripts have particular attributes which may cause issues for implementations. The more important of these attributes are summarized here.

The Hanifi Rohingya script is a new RTL script, with numbers written LTR, as in Arabic.

The tatweel (U+0640) has been extended for use in Hanifi Rohingya and Sogdian.

There are two new sets of vigesimal (base 20) numerals, one for the Medefaidrin script, and another for Mayan. The Mayan numerals are added for specialty use, as for page numbers, in advance of the encoding of the full Mayan script.

Indic Siyaq numerals have complex formatting requirements when combined to represent large numbers.
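To illustrate what a vigesimal numeral set implies for number formatting, here is a minimal sketch that converts a non-negative integer to base-20 digits and maps each digit onto the Mayan Numerals block. It assumes the twenty digits zero through nineteen occupy U+1D2E0..U+1D2F3 in order, uses pure positional base 20, and emits the most significant digit first in a single line. Traditional layout and any historical deviations from pure base 20 are not addressed, and the function names are hypothetical.

```python
MAYAN_ZERO = 0x1D2E0  # assumed: MAYAN NUMERAL ZERO, with ONE..NINETEEN following in order


def to_base20(n):
    """Return the base-20 digits of n, most significant first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return [0]
    digits = []
    while n:
        n, digit = divmod(n, 20)
        digits.append(digit)
    return digits[::-1]


def mayan_numeral_string(n):
    """Map each base-20 digit of n to the corresponding Mayan numeral character."""
    return "".join(chr(MAYAN_ZERO + digit) for digit in to_base20(n))
```

For example, `to_base20(20)` yields the two digits one and zero, so `mayan_numeral_string(20)` is U+1D2E1 followed by U+1D2E0.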

New Data Files Added to the UCD

A new data file has been added to the UCD: EquivalentUnifiedIdeograph.txt. That data file contains the mapping values for the new property, Equivalent_Unified_Ideograph (EqUIdeo).

Casing Issues

There has been a very significant change to casing behavior for the Georgian script. A new set of Mtavruli capital letters (U+1C90..U+1CBA, U+1CBD..U+1CBF) has been added to Unicode 11.0.0, with case mappings to the existing Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF). In prior versions of the Unicode Standard, Mkhedruli Georgian was considered to be a monocameral (non-casing) script, and the Mkhedruli Georgian letters were gc=Lo. Starting with Version 11.0.0, those Mkhedruli Georgian letters are now gc=Ll, and have uppercase mappings to Mtavruli Georgian capital letters. This change will have major implications for Georgian implementations, including changes for input methods, fonts, casing, and string matching. Existing implementations have treated Mtavruli headlines and other uses for textual emphasis as a text style, so there will also be significant issues for document conversion and upgrade.

Another complication for Georgian is that the primary orthography does not use titlecasing, and the Mkhedruli Georgian letters do not have titlecase mappings to Mtavruli letters. This is unique among bicameral systems in the Unicode Standard, so casing implementations should be prepared for this exception.
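The Georgian casing behavior described above can be sketched without depending on the Unicode version of any particular runtime library. The example assumes that the two Mkhedruli ranges map onto the two Mtavruli ranges by a constant offset, which is consistent with the parallel ranges given above; the authoritative mappings are those in the UCD data files. The function names are hypothetical.

```python
# Offset between parallel ranges: U+1C90 - U+10D0 (assumed constant across both ranges).
MTAVRULI_OFFSET = 0x1C90 - 0x10D0

# Mkhedruli letters that gain uppercase mappings in Unicode 11.0.0.
MKHEDRULI_RANGES = ((0x10D0, 0x10FA), (0x10FD, 0x10FF))


def georgian_upper(text):
    """Uppercase Mkhedruli letters to Mtavruli; leave every other character alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in MKHEDRULI_RANGES):
            out.append(chr(cp + MTAVRULI_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def georgian_title(text):
    """Titlecasing leaves Mkhedruli text unchanged: there are no titlecase mappings to Mtavruli."""
    return text
```

Note that U+10FB and U+10FC fall in the gap between the two ranges and are left untouched by `georgian_upper`. Keeping titlecase as a separate identity function makes the exception explicit for callers that would otherwise derive titlecase from uppercase.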

General Character Property Issues

There are a number of issues related to particular character properties:

New GCB and WB segmentation property values have been added for the revised algorithms, to better handle Indic phonological syllables (aksaras). (See also UAX #29.) A couple of emoji-related property values are no longer used for segmentation, as a consequence of the changes in UAX #29.

GCB=Extend no longer matches Grapheme_Extend=Y, as a result of its partitioning to factor out a new class, GCB=Virama. WB=Extend and SB=Extend are unaffected.

In prior versions of the UCD, cursive joining scripts which had any Joining_Group values assigned included distinct values for all characters that participate in cursive joining, including all of the Joining_Group singletons (classes containing only a single character). Starting with Unicode 11.0.0 and going forward, explicit Joining_Group values are assigned only to characters which do not constitute singleton classes. This new convention is applicable to the two newly encoded cursive joining scripts: Hanifi Rohingya and Sogdian. Implementations may need to take into account this discontinuity in how Joining_Group values are assigned to cursive joining scripts.

Bidi mirroring: Unicode 11.0.0 now adds formal recognition of a number of previously encoded mathematical characters as forming mirroring pairs. This means that there is now a further deviation between the mappings defined in BidiMirroring.txt and those defined in the OpenType mirroring list, which was frozen as of Unicode 5.1. Note that this does not change bidirectional formatting: there is no change to the Bidi_Mirrored binary property value here, but only to the listing of which pairs of encoded characters have nominally mirroring glyphs.

Some property values have been added to the Indic_Syllabic_Category property.

The following assignments of Line_Break property values deserve careful review. Implementers and specialists are invited to provide feedback on these assignments.

U+0C84 KANNADA SIGN SIDDHAM (lb=BB)

Historic punctuation in the range U+2E43..U+2E4E (mostly lb=BA)

Additionally, implementers should take note of the following special Line_Break property values associated with a subset of the emoji additions to UCD 11.0:

New emoji base characters: U+1F9B5, U+1F9B6, U+1F9B8, U+1F9B9 (lb=EB). Note that most new emoji characters have the value lb=ID.

Unihan-related Issues

All Unihan properties should be reviewed carefully. Additionally, the following deserve special attention:

Additional CJK unified ideographs have been added, which push the end of range for assigned characters in the main CJK block. (The same issue applies for Tangut, which also had a few new ideographs added at the end of the main Tangut block.)

Five new provisional Unihan properties have been added.

In addition, the kHangul property values underwent a major revision.

Standardized Variation Sequences

One new standardized variation sequence has been added, to represent a short diagonal stroke form of U+FF10 FULLWIDTH DIGIT ZERO.

Code Charts

As always, careful review of the updated code charts for Version 11.0.0 is advised, especially for all newly added scripts. Particular issues to take note of include:

The use of characters beyond the range of Latin-1 is now allowed in annotations in the names list. (See NamesList.html for details.) Some other adaptations have been made in the use of fonts in the names list part of the code charts.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 11.0 repertoire for UCA 11.0. For the most part, the additions for new scripts and other characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

Other Issues

Please also check the following specific items carefully:

The versioning for the emoji data release associated with UTS #51 was bumped from 5.0 directly to 11.0, to enable a less confusing synchronization with the UCD proper.

Data for the 66 new emoji characters associated with the Unicode 11.0 repertoire was officially released on February 7, in order to accommodate the long lead times involved in rolling out new emoji support. UCD 11.0 beta reviewers should note that property values for characters that depend in any direct way on the Emoji 11.0 data cannot now be changed for Unicode 11.0, because of stability requirements.

The following blocks are new in Unicode 11.0.0. Check implementations carefully for any range or property value assumptions regarding these new blocks. See also the single-block delta charts.

Range           Block Name
1C90..1CBF      Georgian Extended
10D00..10D3F    Hanifi Rohingya
10F00..10F2F    Old Sogdian
10F30..10F6F    Sogdian
11800..1184F    Dogra
11D60..11DAF    Gunjala Gondi
11EE0..11EFF    Makasar
16E40..16E9F    Medefaidrin
1D2E0..1D2FF    Mayan Numerals
1EC70..1ECBF    Indic Siyaq Numbers
1FA00..1FA6F    Chess Symbols
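When checking an implementation for range assumptions, the block ranges listed above can be captured in a small lookup table. This is a minimal sketch: the table is transcribed from the list above, and the names `NEW_BLOCKS_11_0` and `new_block_for` are hypothetical.

```python
# (first code point, last code point, block name) for blocks new in Unicode 11.0.0.
NEW_BLOCKS_11_0 = [
    (0x1C90, 0x1CBF, "Georgian Extended"),
    (0x10D00, 0x10D3F, "Hanifi Rohingya"),
    (0x10F00, 0x10F2F, "Old Sogdian"),
    (0x10F30, 0x10F6F, "Sogdian"),
    (0x11800, 0x1184F, "Dogra"),
    (0x11D60, 0x11DAF, "Gunjala Gondi"),
    (0x11EE0, 0x11EFF, "Makasar"),
    (0x16E40, 0x16E9F, "Medefaidrin"),
    (0x1D2E0, 0x1D2FF, "Mayan Numerals"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1FA00, 0x1FA6F, "Chess Symbols"),
]


def new_block_for(code_point):
    """Return the name of the new 11.0.0 block containing code_point, or None."""
    for first, last, name in NEW_BLOCKS_11_0:
        if first <= code_point <= last:
            return name
    return None
```

A test suite could iterate over these ranges and confirm that code which previously treated them as unassigned still behaves sensibly. Note that a block range covers unassigned code points as well as assigned characters.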

Some blocks have also had font updates; see the single-block delta charts for details. In such cases, careful review of the blocks in question is advised, to ensure that there have not been any regressions in representative glyph display.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.