BETA Unicode® 14.0.0

Note: The beta review period for Unicode 14.0.0 closed on July 13, 2021. Feedback received during the public review is available from PRI #433. This beta review page remains active, however, for convenient access to the prepublication versions of the Unicode 14.0.0 data files and annexes, until the formal release planned for September 14, 2021.

The next version of the Unicode Standard will be Version 14.0.0, planned for release on September 14, 2021. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 838 new characters are encoded, including 37 new emoji characters, five new scripts, and multiple additions to existing blocks.

A beta version of the 14.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 14.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 14.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 14.0.0

Code charts - single download (100 MB)

Emoji charts for beta review

Auxiliary HTML charts for beta review

Related Unicode Technical Standards

In addition to the Unicode Standard proper, four other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 14.0.0. Review of that text and data is also encouraged during the beta review period.

Specification: Data Files

UTS #10, Unicode Collation Algorithm: DUCET and test files

UTS #39, Unicode Security Mechanisms: Identifier and confusables files

UTS #46, Unicode IDNA Compatibility Processing: IDNA mapping and test files

UTS #51, Unicode Emoji: Emoji data files (in UCD); emoji sequences and test files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 14.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #433. Comments on specific Version 14.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends July 13, 2021. All substantive technical comments must have been received by that date for consideration at the July UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 14.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 14.0.0 data files, expected on September 14, 2021.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 14.0.0.

The assignment of characters for Unicode 14.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 14.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 13.0.0 characters, as well as the property values for the new Unicode 14.0.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. For more information, see the diffs.readme.txt file.

The beta review period is a good opportunity to add support for the new Unicode 14.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 14.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 14.0.0, often in coordination with changes to character properties. Most notably for Unicode 14.0.0:

In UAX #38, Unicode Han Database (Unihan), there have been significant updates to the descriptions of many data fields.

In UAX #45, U-Source Ideographs, descriptions were added for new data fields (total strokes and first residual stroke) in the associated data file. The KangXi dictionary index field was obsoleted. New information was added about the submission process for U-Source ideographs.

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

The core specification is undergoing extensive review, with numerous additions for Version 14.0.0. Although the draft text for Version 14.0.0 is not yet available, specific reports of any technical or editorial issues in the currently published core specification are also welcome during the beta review period. Such reports will be taken into consideration for corrections to the Version 14.0.0 draft. (Note: The Unicode Consortium has ongoing opportunities for subject-matter volunteers: experts interested in contributing to or editing relevant parts of the core specification or other Unicode specifications.)

Script-specific Issues

Five new scripts have been added in Unicode 14.0.0. Some of these scripts have particular attributes which may cause issues for implementations. The more important of these attributes are summarized here.

Old Uyghur is an abjad, historically related to Sogdian. Representation of Old Uyghur text poses many significant issues. See the original proposal documentation in L2/20-191 for an extensive discussion.

Casing Issues

Four new Latin case pairs and one new Glagolitic case pair have been added in Version 14.0.0. In addition, one of the newly added scripts, Vithkuqi, is a bicameral script with casing. Implementations of case mapping and case folding should be checked to ensure they account correctly for the new case pairs.
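One way to review case pairs is to check that the simple mappings in UnicodeData.txt round-trip. The following is a minimal sketch, assuming the usual semicolon-delimited layout in which fields 12 and 13 (counting from zero) hold the simple uppercase and lowercase mappings. The function names are illustrative, and the sample rows are long-standing characters rather than the new Version 14.0.0 data; in practice the lines of the beta UnicodeData.txt would be passed in instead.

```python
# Sample rows in UnicodeData.txt layout (long-standing characters, not the new data).
SAMPLE_ROWS = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "006B;LATIN SMALL LETTER K;Ll;0;L;;;;;N;;;004B;;004B",
    "212A;KELVIN SIGN;Lu;0;L;004B;;;;N;;;;006B;",
]


def parse_simple_case(lines):
    """Return (lower_of, upper_of) dicts built from UnicodeData.txt-style lines."""
    lower_of, upper_of = {}, {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed rows
        cp = int(fields[0], 16)
        if fields[12]:  # Simple_Uppercase_Mapping
            upper_of[cp] = int(fields[12], 16)
        if fields[13]:  # Simple_Lowercase_Mapping
            lower_of[cp] = int(fields[13], 16)
    return lower_of, upper_of


def asymmetric_case_pairs(lines):
    """List code points whose lowercase mapping does not uppercase back to them."""
    lower_of, upper_of = parse_simple_case(lines)
    return [cp for cp, low in sorted(lower_of.items()) if upper_of.get(low) != cp]


if __name__ == "__main__":
    for cp in asymmetric_case_pairs(SAMPLE_ROWS):
        print(f"review U+{cp:04X}")
```

A reported code point is not necessarily an error: KELVIN SIGN, for example, intentionally lowercases to "k" without a reverse mapping. The output is a list of candidates to review, and a new case pair that shows up in it would merit a closer look.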

Numeric Property Issues

A new set of decimal digits, U+16AC0..U+16AC9, has been added for the Tangsa script. Implementations that parse or classify decimal digits will need to take them into account.
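For implementations that keep their own table of decimal digit ranges, the Tangsa addition might be handled as in the following sketch. It assumes that the ten digits run in ascending order from zero at U+16AC0, as is usual for decimal digit sets; the table and function names are hypothetical, and only two digit sets are listed for brevity.

```python
TANGSA_DIGIT_ZERO = 0x16AC0  # start of the range U+16AC0..U+16AC9

# Code point of digit zero for each supported set (assumed contiguous 0..9).
DIGIT_ZEROS = [
    (0x0030, "ASCII"),
    (TANGSA_DIGIT_ZERO, "Tangsa"),
]


def digit_value(ch):
    """Return 0-9 if ch is a decimal digit in a known set, else None."""
    cp = ord(ch)
    for zero, _label in DIGIT_ZEROS:
        if zero <= cp <= zero + 9:
            return cp - zero
    return None


def parse_int(text):
    """Parse a non-empty string of decimal digits from the known sets."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        d = digit_value(ch)
        if d is None:
            raise ValueError(f"not a known decimal digit: U+{ord(ch):04X}")
        value = value * 10 + d
    return value
```

This sketch accepts strings that mix digit sets; an implementation with stricter requirements would also check that all digits come from the same set.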

Unihan-related Issues

All Unihan properties should be reviewed carefully. Additionally, the following deserve special attention:

A new provisional property, kStrange, has been added to Unihan. This property is documented in detail in a new Unicode Technical Note, UTN #43.

The provisional kCantonese property was extensively refined. This work included 6,000 additional property values, as well as changing the property values for nearly 5,000 existing ideographs to reflect only one reading.

Over 1,000 kIRG_VSource property values with the "VU-" prefix were changed to use the "VN-" prefix.

See UAX #38 for further details on these changes, especially Section 4.2, Listing by Date of Addition to the Unicode Standard, and Section 4.3, Listing by Location within Unihan.zip. UAX #38 also has updated regex values for numerous Unihan properties.

Code Charts

As always, careful review of the updated code charts for Version 14.0.0 is advised, especially for all newly added scripts. Particular issues to take note of include:

There was a significant update in the fonts used for many CJK auxiliary blocks, to improve the design and consistency of glyphs. Details of the affected ranges of glyphs can be found in the Glyph and Variation Sequence Changes table on the single block delta charts page.

There have also been systematic updates to many glyphs in the Egyptian Hieroglyphs block, to more accurately reflect current practice.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 14.0.0 repertoire for UCA 14.0. For the most part, the additions for new scripts and other characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

Other Issues

Please also check the following specific items carefully:

37 new emoji characters have been added. In addition to those individual characters, many new emoji sequences have also been recognized. If your implementation supports emoji, be sure to carefully review UTS #51, Unicode Emoji (PRI #430).

WARNING: There are changes to the ends of three existing CJK unified ideograph ranges in Unicode 14.0.0. Because implementations often hard-code ideographic ranges to short-cut lookups and reduce table sizes, it is especially important that implementers pay close attention to the implications of range changes for Version 14.0.0. These extensions bump up the end ranges of the encoded ideographs by a few code points within each block:

3 code points for the URO: ending at U+9FFF [fills the block]

2 code points for Extension B: ending at U+2A6DF [fills the block]

4 code points for Extension C: ending at U+2B738

See Section 4.4, Listing of Characters Covered by the Unihan Database in UAX #38 for the version history of all these small CJK unified ideograph additions inside existing blocks.
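An implementation that hard-codes these ranges could audit its constants with something like the sketch below. The end points and counts come from the list above; the starting code points (U+4E00, U+20000, U+2A700) are the standard starts of those blocks and are not stated in this text, and all names are illustrative.

```python
# (first, last in 14.0.0, code points added in 14.0.0, label)
CJK_RANGE_UPDATES = [
    (0x4E00, 0x9FFF, 3, "URO"),
    (0x20000, 0x2A6DF, 2, "Extension B"),
    (0x2A700, 0x2B738, 4, "Extension C"),
]


def in_updated_range(cp):
    """True if cp lies in one of the three ranges as extended for 14.0.0."""
    return any(first <= cp <= last for first, last, _n, _label in CJK_RANGE_UPDATES)


def added_in_14(cp):
    """True if cp is one of the ideographs appended at the end of a range."""
    return any(last - added < cp <= last for _f, last, added, _label in CJK_RANGE_UPDATES)


def stale_end_points(hard_coded_ends):
    """Given {label: last code point} from an implementation, list outdated labels."""
    expected = {label: last for _f, last, _n, label in CJK_RANGE_UPDATES}
    return [label for label, last in hard_coded_ends.items()
            if label in expected and expected[label] != last]
```

For example, `stale_end_points({"URO": 0x9FFC})` flags the URO constant, because U+9FFC is three code points short of the new end at U+9FFF.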

The following blocks are new in Unicode 14.0.0. Check implementations carefully for any range or property value assumptions regarding these new blocks. See also the single-block delta charts.

Range Block Name

0870..089F Arabic Extended-B

10570..105BF Vithkuqi

10780..107BF Latin Extended-F

10F70..10FAF Old Uyghur

11AB0..11ABF Unified Canadian Aboriginal Syllabics Extended-A

12F90..12FFF Cypro-Minoan

16A70..16ACF Tangsa

1AFF0..1AFFF Kana Extended-B

1CF00..1CFCF Znamenny Musical Notation

1DF00..1DFFF Latin Extended-G

1E290..1E2BF Toto

1E7E0..1E7FF Ethiopic Extended-B
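To test range assumptions against the table above, the rows can be loaded into a small lookup structure. This is a minimal sketch that parses the table text as given and uses binary search over the block starts; the names `parse_blocks` and `new_block_of` are illustrative, and the lookup covers only the twelve new blocks, not the full block list.

```python
from bisect import bisect_right

NEW_BLOCKS_14 = """\
0870..089F Arabic Extended-B
10570..105BF Vithkuqi
10780..107BF Latin Extended-F
10F70..10FAF Old Uyghur
11AB0..11ABF Unified Canadian Aboriginal Syllabics Extended-A
12F90..12FFF Cypro-Minoan
16A70..16ACF Tangsa
1AFF0..1AFFF Kana Extended-B
1CF00..1CFCF Znamenny Musical Notation
1DF00..1DFFF Latin Extended-G
1E290..1E2BF Toto
1E7E0..1E7FF Ethiopic Extended-B
"""


def parse_blocks(text):
    """Parse 'FIRST..LAST Block Name' rows into a sorted list of tuples."""
    blocks = []
    for line in text.splitlines():
        if not line.strip():
            continue
        rng, name = line.split(None, 1)
        first, last = (int(part, 16) for part in rng.split(".."))
        blocks.append((first, last, name.strip()))
    blocks.sort()
    return blocks


BLOCKS = parse_blocks(NEW_BLOCKS_14)
STARTS = [first for first, _last, _name in BLOCKS]


def new_block_of(cp):
    """Return the name of the new 14.0.0 block containing cp, or None."""
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```

Running an implementation's existing block or script lookup over each of these ranges, and comparing the result with `new_block_of`, is one way to surface code points that were previously treated as unassigned.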

In addition to the new blocks, two existing blocks had slight adjustments to their end ranges. The Ahom block range was extended by one column to end at U+1174F, instead of U+1173F. And the block range for the Tangut Supplement block was changed to end at U+18D7F, corrected from the erroneous value of U+18D8F published in Unicode 13.0. Implementations should be checked carefully for any hard-coded assumptions about the end ranges of existing blocks.
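Changes like these can be detected mechanically by comparing two versions of Blocks.txt. The sketch below assumes the usual `FIRST..LAST; Block Name` line format with `#` comments; the embedded excerpts are abbreviated to three blocks for illustration, and the starting code points of Ahom and Tangut Supplement (U+11700 and U+18D00) are not stated in the text above.

```python
OLD_13 = """\
# Abbreviated excerpt in Blocks.txt format (13.0 values)
0000..007F; Basic Latin
11700..1173F; Ahom
18D00..18D8F; Tangut Supplement
"""

NEW_14 = """\
# Abbreviated excerpt in Blocks.txt format (14.0 values)
0000..007F; Basic Latin
11700..1174F; Ahom
18D00..18D7F; Tangut Supplement
"""


def parse_blocks_txt(text):
    """Map block name -> (first, last) from Blocks.txt-style text."""
    blocks = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        rng, name = line.split(";", 1)
        first, last = (int(part, 16) for part in rng.strip().split(".."))
        blocks[name.strip()] = (first, last)
    return blocks


def changed_ranges(old, new):
    """Return {name: (old_range, new_range)} for blocks whose range differs."""
    return {name: (old[name], new[name])
            for name in old
            if name in new and old[name] != new[name]}


if __name__ == "__main__":
    diff = changed_ranges(parse_blocks_txt(OLD_13), parse_blocks_txt(NEW_14))
    for name, ((_, old_last), (_, new_last)) in diff.items():
        print(f"{name}: end U+{old_last:04X} -> U+{new_last:04X}")
```

Note that a block range can shrink as well as grow, as the Tangut Supplement correction shows, so a check that only looks for extended ranges would miss it.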

Some blocks have also had font updates; see the single-block delta charts for details. In such cases, careful review of the blocks in question is advised, to ensure that there have not been any regressions in representative glyph display.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.
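For the normalization-related properties, one review approach is to run a normalizer against rows of the UCD file NormalizationTest.txt. The sketch below checks a subset of the conformance conditions (those for columns c1 to c3 under NFC and NFD) and uses Python's `unicodedata` module as the normalizer under test. That module reflects the Unicode version built into the Python runtime, not the beta data, so this only illustrates the shape of such a check; the helper names are hypothetical.

```python
import unicodedata

# One data row in NormalizationTest.txt layout: c1;c2;c3;c4;c5; # comment
SAMPLE_ROW = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"


def parse_row(line):
    """Return the five columns as strings, or None for comment and @Part lines."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]


def nfc_nfd_ok(row, normalize=unicodedata.normalize):
    """Check c2 == NFC(c1..c3) and c3 == NFD(c1..c3) for one parsed row."""
    c1, c2, c3, _c4, _c5 = row
    nfc_ok = all(normalize("NFC", c) == c2 for c in (c1, c2, c3))
    nfd_ok = all(normalize("NFD", c) == c3 for c in (c1, c2, c3))
    return nfc_ok and nfd_ok


if __name__ == "__main__":
    row = parse_row(SAMPLE_ROW)
    print("ok" if nfc_nfd_ok(row) else "mismatch")
```

The `normalize` parameter lets a different normalizer, for instance one built from the beta data files, be substituted for the runtime's own.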