BETA Unicode 16.0.0

The Unicode Standard


BETA Unicode® 16.0.0

Note: The beta review period for Unicode 16.0.0 closed on July 2, 2024. Feedback received during the public review is accessible from PRI #502. This beta review page remains active, however, for convenient access to the prepublication versions of the Unicode 16.0.0 data files and annexes, until the formal release planned for September 10, 2024.

The next version of the Unicode Standard will be Version 16.0.0, planned for release on September 10, 2024. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 5185 new characters are encoded.

A beta version of the 16.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 16.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 16.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 16.0.0

Code charts - single download (131 MB)

Emoji charts for beta review

Auxiliary HTML charts for beta review

Unicode Standard Annexes (proposed updates)

If a link is not active for an annex, no proposed update is available for review. This situation may occur when no significant change is planned for that annex for a particular release.

UAX #9, Unicode Bidirectional Algorithm

UAX #11, East Asian Width

UAX #14, Unicode Line Breaking Algorithm

UAX #15, Unicode Normalization Forms

UAX #24, Unicode Script Property

UAX #29, Unicode Text Segmentation

UAX #31, Unicode Identifier and Pattern Syntax

UAX #34, Unicode Named Character Sequences

UAX #38, Unicode Han Database (Unihan)

UAX #41, Common References for Unicode Standard Annexes

UAX #42, Unicode Character Database in XML

UAX #44, Unicode Character Database

UAX #45, U-Source Ideographs

UAX #50, Unicode Vertical Text Layout

UAX #53, Unicode Arabic Mark Rendering

UAX #57, Unicode Egyptian Hieroglyph Database (Unikemet)

Related Unicode Technical Standards (proposed updates)

In addition to the Unicode Standard proper, four other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 16.0.0. Review of that text and data is also encouraged during the beta review period.

Specification: Data Files

UTS #10, Unicode Collation Algorithm: DUCET and test files

UTS #39, Unicode Security Mechanisms: Identifier and confusables files

UTS #46, Unicode IDNA Compatibility Processing: IDNA mapping and test files; IDNA derived data

UTS #51, Unicode Emoji: Emoji data files (in UCD); Emoji sequences and test files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 16.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #502. Comments on specific Version 16.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends July 2, 2024. All substantive technical comments must have been received by that date for consideration at the July UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 16.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 16.0.0 data files, expected on September 10, 2024.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 16.0.0.

The assignment of characters for Unicode 16.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 16.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 15.1.0 characters, as well as the property values for the new Unicode 16.0.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions during beta review, diffable XML versions of the Unicode Character Database are available. For more information, see the diffs.readme.txt file.
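As a rough illustration of this kind of verification, the sketch below compares per-character attributes between two XML documents. It assumes `<char cp="...">` elements in the style described in UAX #42; the function names and the tiny inline inputs are hypothetical, the property values in them are made up, and range entries are ignored.

```python
import xml.etree.ElementTree as ET

def char_properties(xml_text):
    """Map each code point (hex string) to a dict of its other attributes.

    Assumes <char cp="..."> elements; range entries are skipped in this sketch.
    """
    props = {}
    for elem in ET.fromstring(xml_text).iter():
        # Strip any XML namespace so the match works with or without one.
        if elem.tag.rsplit("}", 1)[-1] == "char" and "cp" in elem.attrib:
            attrs = dict(elem.attrib)
            props[attrs.pop("cp")] = attrs
    return props

def property_changes(old_xml, new_xml):
    """Return (added code points, {cp: {attribute: (old, new)}})."""
    old, new = char_properties(old_xml), char_properties(new_xml)
    added = sorted(set(new) - set(old), key=lambda cp: int(cp, 16))
    changed = {}
    for cp in set(old) & set(new):
        diff = {key: (old[cp].get(key), new[cp].get(key))
                for key in set(old[cp]) | set(new[cp])
                if old[cp].get(key) != new[cp].get(key)}
        if diff:
            changed[cp] = diff
    return added, changed

# Hypothetical inputs; the attribute values are invented for illustration.
OLD = '<ucd><repertoire><char cp="0041" gc="Lu" lb="AL"/></repertoire></ucd>'
NEW = ('<ucd><repertoire><char cp="0041" gc="Lu" lb="XX"/>'
       '<char cp="0042" gc="Lu" lb="AL"/></repertoire></ucd>')
added, changed = property_changes(OLD, NEW)
print(added, changed)
```

For the full database files, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole document at once.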

The beta review period is a good opportunity to add support for the new Unicode 16.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 16.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 16.0.0, often in coordination with changes to character properties. Most notably for Unicode 16.0.0:

UAX #14: The proposed update includes numerous changes to improve line breaking and handling of numeric expressions, with modifications to various line breaking classes and rules. Changes include updates to the descriptions of line breaking classes to account for rule changes and specific script handling. Sections of the documentation have been moved, modified, or referenced differently for clarity and accuracy, with specific attention to feedback on previous versions of the Unicode Standard.

UAX #24: The proposed update documents a change in format of ScriptExtensions.txt.

UAX #29: The proposed update includes changes to the Grapheme_Cluster_Break property values and Grapheme Cluster Boundary Rules. The definition of GCB=V has been updated to include Kirat Rai vowel signs, and the descriptions of rules GB6–GB8 have been updated to account for the extension of conjoining behaviour beyond Hangul Jamo. Additionally, the definition of SB=STerm has been updated to subtract ATerm, ensuring the classes are disjoint.

UAX #38: The proposed update documents new Unihan character properties and updates the descriptions of others. It also describes the use of a third apostrophe in radical-stroke data to indicate a second non-Chinese simplified radical.

UAX #44: The proposed update includes an important clarification regarding the concept of stability for property value aliases, in relation to XML attributes.

UAX #53: This annex has been newly converted to a UAX, from a specification formerly published as a UTR.

UAX #57: This annex is new for Unicode 16.0. It documents the data formats and interpretation of the new Unikemet.txt data file for Egyptian hieroglyphs.

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

The tooling and formatting for the production of the core specification have changed significantly in this version.

The beta review draft core specification is available as per-chapter web pages. This is a first for the beta review period for a new version of the Unicode Standard.

Reviewers should carefully check for inadvertent changes in the text, in particular in glyph examples. However, certain styling choices are not final, for example, whether tables have grid lines or not, or contain empty cells. Please do not comment on table styling, but do comment if you spot any significant errors in table content.

The text still contains a number of editor's notes, indicating both general information for reviewers and spots in the text that are not yet complete for Unicode 16.0. Please use those notes as guidance, as there is no need for repeated feedback reports regarding omissions or defects that the editors already know about and are actively working on.

Normalization Behavior

Several characters have been added in Unicode 16.0 which have subtle implications for certain optimizations of normalization. These do not change the normalization algorithm, but have implications for the derivation and use of Quick_Check properties for optimization of normalization form detection. See the proposed update for UAX #15 for details.
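A minimal sketch of the detect-then-normalize pattern that such optimizations serve is shown below. It does not implement Quick_Check itself; it delegates detection to Python's `unicodedata.is_normalized` (available from Python 3.8), and its results reflect whatever Unicode version the interpreter bundles, which may predate 16.0. The helper name `to_nfc` is hypothetical.

```python
import unicodedata

def to_nfc(text):
    """Return text in NFC, skipping normalization when it is already NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

# The bundled data version determines which characters are known.
print(unicodedata.unidata_version)

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
composed = to_nfc(decomposed)
print(composed == "\u00e9")
```

When testing against new data, it is worth exercising both branches with the newly added characters, since an implementation with stale detection data may take the fast path incorrectly.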

Segmentation Issues

The line breaking behavior of U+2019 RIGHT SINGLE QUOTATION MARK (and similar directional quotation marks) has changed to deal with problems in simplified Chinese line breaking contexts.

There has been a complex set of line breaking rule changes. See the proposed update for UAX #14 and L2/24-064 section 5.13 (for Finnish hyphen). See also L2/24-064 section 5.15 (for the 123.abc bug/tailoring issue).

There has also been a change to the Grapheme_Cluster_Break property data, extending the use of GCB=V to apply to certain non-Hangul vowels, and in particular for Kirat Rai vowels. This change finesses the behavior of the segmentation of grapheme cluster breaks in such cases, while respecting normalization requirements and canonical equivalence. Implementations should take note that GCB=V and HST=V are no longer coextensive. See the proposed update of UAX #29 for details.
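One way to check whether an implementation silently assumes that the two sets coincide is to compute the difference directly. The sketch below parses UCD-style property lines and reports code points that are GCB=V but not HST=V. The sample lines are abbreviated and assumed for illustration (including the Kirat Rai code point); in practice the input would be the real GraphemeBreakProperty.txt and HangulSyllableType.txt files.

```python
def code_points_with_value(lines, value):
    """Collect code points whose property equals `value`.

    Data lines look like 'XXXX ; V # comment' or 'XXXX..YYYY ; V # comment'.
    """
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # comment or blank line
        rng, val = (field.strip() for field in data.split(";")[:2])
        if val != value:
            continue
        first, _, last = rng.partition("..")
        result.update(range(int(first, 16), int(last or first, 16) + 1))
    return result

# Abbreviated sample lines, assumed for illustration only.
GCB_SAMPLE = [
    "1160..11A7 ; V # HANGUL JUNGSEONG ...",
    "16D63 ; V # Kirat Rai vowel sign (assumed code point)",
    "1100..115F ; L # HANGUL CHOSEONG ...",
]
HST_SAMPLE = ["1160..11A7 ; V # HANGUL JUNGSEONG ..."]

gcb_v = code_points_with_value(GCB_SAMPLE, "V")
hst_v = code_points_with_value(HST_SAMPLE, "V")
only_gcb = sorted(gcb_v - hst_v)
print([f"U+{cp:04X}" for cp in only_gcb])
```

A non-empty result means that code which derives one property from the other needs to be revisited.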

Script-specific Issues

There are seven new scripts encoded in Unicode 16.0. Some of these scripts, such as Tulu-Tigalari, have complex layout.

There are 3,995 additional Egyptian hieroglyphs, particularly in support of Ptolemaic texts. There is a new data file, Unikemet.txt, with source data, function, and phonetic information for hieroglyphs, including the previously encoded repertoire. See UAX #57, Unicode Egyptian Hieroglyph Database (Unikemet) for details.

General Character Property Issues

The ScriptExtensions.txt file has had a format change for 16.0. Each entry is formatted as before, but the overall order of entries has been changed to code point order. To facilitate comparison with the previous version of the standard, a retroactively formatted 15.1 data file has been posted. That file puts the Unicode 15.1 scx data into the newly defined order of entries, so that the 15.1 and 16.0 files can be meaningfully diffed.
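For locally maintained copies or derived files in the old ordering, a similar reordering can be approximated with a short script. The sketch below sorts the data lines of a ScriptExtensions-style file by starting code point; the function name is hypothetical and the sample values are abbreviated for illustration, not copied from the real file.

```python
def entries_in_code_point_order(lines):
    """Return the data lines sorted by their starting code point."""
    entries = []
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # skip comments and blank lines
        start = data.split(";", 1)[0].split("..", 1)[0].strip()
        entries.append((int(start, 16), line.rstrip("\n")))
    return [text for _, text in sorted(entries)]

# Illustrative lines only; script lists are abbreviated.
SAMPLE = [
    "# comment line",
    "1CF7 ; Beng # ...",
    "0640 ; Adlm Arab # ...",
    "102E0..102FB ; Arab Copt # ...",
]
ordered = entries_in_code_point_order(SAMPLE)
print("\n".join(ordered))
```

This sketch drops header comments, so it is suited to producing diff input rather than a replacement data file.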

Numeric Property Issues

There are eight new sets of decimal digits added in Unicode 16.0. Five of these sets are for newly encoded scripts: Garay, Sunuwar, Gurung Khema, Kirat Rai, and Ol Onal. Two sets of digits constitute more region-specific digit sets for the Myanmar script. Finally, there is one additional set, consisting of stylistically outlined digits, intended for support of legacy computer symbol sets for terminal emulations. Implementations of numeric values and numeric formatting should take these new sets into account.
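As a hedged sketch of what taking digit sets into account can look like, the function below parses a run of decimal digits using the runtime's Unicode data and rejects input that mixes digit sets. It relies on decimal digit sets being encoded as contiguous runs from zero to nine. The function name is hypothetical, and the 16.0 digit sets are only recognized if the interpreter's bundled data already includes them.

```python
import unicodedata

def parse_decimal(text):
    """Parse decimal digits from a single digit set into an int.

    Raises ValueError for non-digits, empty input, or mixed digit sets.
    """
    value = 0
    zeros = set()
    for ch in text:
        digit = unicodedata.decimal(ch)  # ValueError if not a decimal digit
        zeros.add(ord(ch) - digit)       # code point of this set's zero
        value = value * 10 + digit
    if len(zeros) != 1:
        raise ValueError("empty input or digits mixed across sets")
    return value

print(parse_decimal("\u0663\u0664"))  # ARABIC-INDIC DIGIT THREE, FOUR
print(parse_decimal("\u1041\u1040"))  # MYANMAR DIGIT ONE, ZERO
```

Rejecting mixed sets is a design choice in this sketch, not a requirement stated here; some applications may prefer to accept them.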

Unihan-related Issues

All Unihan properties should be reviewed carefully. The following changes deserve special attention:

Some kRSUnicode values now include triple-apostrophe radicals.

One old provisional property has been removed.

Two new provisional properties have been added.

See UAX #38 for further details on these changes, especially Section 4.2, Listing by Date of Addition to the Unicode Standard, and Section 4.3, Listing by Location within Unihan.zip. For the triple-apostrophe radicals, see:

UAX #38: kRSUnicode

UAX #38: Section 3.6, Radical-Stroke Counts

UAX #38: Section 2.1.2, Sorting Algorithm Used by the Radical-Stroke Charts

Other CJK-related changes

CJKRadicals.txt: One radical number now ends with three apostrophe characters: 212'''

From the UAX #38 kRSUnicode description: “Three apostrophes after the radical indicates a second non-Chinese simplified version of the given radical.”

This radical is used in kRSUnicode but does not (currently) have data in CJKRadicals.txt. There is no character in the radicals blocks for it, and it has a unified ideograph that is outside the original Unicode 1.1 Unihan block (in fact, it has a supplementary code point).

212''' / U+31DE5
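Parsers that accept at most two apostrophes after the radical number will reject the new values. The sketch below shows a tolerant parser for a single radical-stroke value; the pattern is an assumption based on the kRSUnicode syntax in UAX #38 and should be checked against the proposed update, and the sample values are hypothetical.

```python
import re

# Assumed shape: radical number, up to three apostrophes, a dot,
# then a possibly negative residual stroke count.
RS_VALUE = re.compile(r"([1-9][0-9]{0,2})('{0,3})\.(-?[0-9]{1,2})")

def parse_rs(value):
    """Return (radical, apostrophe_count, residual_strokes) for one value."""
    match = RS_VALUE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a radical-stroke value: {value!r}")
    radical, marks, strokes = match.groups()
    return int(radical), len(marks), int(strokes)

# Hypothetical sample values, not taken from the Unihan data.
print(parse_rs("212'''.0"))
print(parse_rs("85.4"))
```

A kRSUnicode field may hold more than one space-separated value, so callers would typically split the field before calling `parse_rs`.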

Standardized Variation Sequences

Three unused Egyptian hieroglyph variation sequences have been removed from the data.

Eight variation sequences have been added for curly quotation marks (U+2018, U+2019, U+201C, U+201D) to deal with full-width layout considerations in Chinese text.

Code Charts

As always, careful review of the updated code charts for Version 16.0.0 is advised. Particular issues to take note of include:

There are a number of Han glyph updates, particularly for CJK Unified Ideographs Extension B.

Other glyph updates are listed explicitly in the delta charts index page.

There are also a very large number of J-Source (Japanese) additions to the CJK charts. These extensions are not individually highlighted in the code charts.

The two code charts for Egyptian hieroglyphs contain extensive functional and phonetic information derived from the new data file, Unikemet.txt.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 16.0.0 repertoire for UCA 16.0.0. For the most part, the additions for new characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

A significant new change for DUCET in Unicode 16.0 involves moving the non-decimal digits to sort after the main decimal digits. This change greatly reduces the superfluous differences between DUCET and the CLDR base tailoring of DUCET.

There has also been a small fix to correct the ordering for U+312C BOPOMOFO LETTER GN.

IDNA-related Issues

There are a number of significant changes for the proposed update for UTS #46 and its associated data files.

The text has been changed to simplify the base exclusion set and adjust the derivation of the mappings in IdnaMappingTable.txt. Previously, the base exclusion set had been derived from differences between IDNA2003 data and the principles of UTS #46. After review, it has been determined that it is no longer necessary to disallow characters on the basis of differences from IDNA2003, so the base exclusion set can be radically simplified. See L2/24-064 section 6.2 for more context.

In Section 4, Processing, if the label starts with “xn--”, and the conversion from Punycode yields either an empty label or an all-ASCII label, then an error is now recorded, consistent with IDNA2008.
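The new check can be sketched as follows, using Python's built-in `punycode` codec. This covers only the single step described above and is not a UTS #46 implementation; the function name and error strings are hypothetical.

```python
def check_ace_label(label):
    """Return (decoded_label, errors) for one label; errors is a list of strings."""
    errors = []
    if not label.lower().startswith("xn--"):
        return label, errors
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return label, ["invalid Punycode"]
    # An empty or all-ASCII result now records an error.
    if decoded == "" or decoded.isascii():
        errors.append("Punycode decodes to an empty or all-ASCII label")
    return decoded, errors

for sample in ("xn--bcher-kva", "xn--abc-", "xn--", "example"):
    print(sample, check_ace_label(sample))
```

Here "xn--bcher-kva" decodes to a label containing a non-ASCII letter and passes, while "xn--abc-" and "xn--" decode to an all-ASCII and an empty label respectively and are flagged.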

In the test data file, there is a small addition to the syntax: "" means an empty string. There are also other test data corrections and improvements. For details see Section 8, Conformance Testing, Migration.

New Data Files

There are two new data files in the UCD:

DoNotEmit.txt. This data file lists characters and sequences of characters that are not generally recommended for emission, for example, by a keyboard input process. In each case, the recommended representation of the entity in question is listed in a separate field in the data file. See the header text in the data file for more explanation.

Unikemet.txt. This data file provides property and other character information in support of Egyptian hieroglyphs. See UAX #57, Unicode Egyptian Hieroglyph Database (Unikemet) for details.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case foldings.

There are also strong constraints on additions and changes to case mappings.
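One hedged way to keep an eye on some of these values during testing is to snapshot them per character and compare snapshots across data versions. The sketch below uses Python's `unicodedata` module and covers only decomposition, combining class, and case folding; it does not cover Composition_Exclusion or the XID properties, and it reflects the Unicode version bundled with the interpreter. The function name is hypothetical.

```python
import unicodedata

def stability_snapshot(ch):
    """Return a few stability-sensitive values for one character."""
    return {
        "decomposition": unicodedata.decomposition(ch),
        "ccc": unicodedata.combining(ch),
        "casefold": ch.casefold(),
    }

for ch in ("\u00e9", "\u0301", "\u00df"):
    print(f"U+{ord(ch):04X}", stability_snapshot(ch))
```

Snapshots taken with beta data and with the final release can then be compared with an ordinary dictionary equality check.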