BETA Unicode 15.1.0

Note: The beta review period for Unicode 15.1.0 closed on July 4, 2023. Feedback received during the public review is accessible from PRI #480. This beta review page remains active, however, for convenient access to the prepublication versions of the Unicode 15.1.0 data files and annexes until the formal release planned for September 12, 2023.

The next version of the Unicode Standard will be Version 15.1.0, planned for release on September 12, 2023. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 608 new characters are encoded.

A beta version of the 15.1.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 15.1.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 15.1.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 15.1.0

Code charts - single download (109 MB)

Emoji charts for beta review

Auxiliary HTML charts for beta review

Related Unicode Technical Standards

In addition to the Unicode Standard proper, four other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 15.1.0. Review of that text and data is also encouraged during the beta review period.

Specification: Data Files

UTS #10, Unicode Collation Algorithm: DUCET and test files

UTS #39, Unicode Security Mechanisms: Identifier and confusables files

UTS #46, Unicode IDNA Compatibility Processing: IDNA mapping and test files

UTS #51, Unicode Emoji: Emoji data files (in UCD); Emoji sequences and test files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 15.1.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #480. Comments on specific Version 15.1.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends July 4, 2023. All substantive technical comments must be received by that date for consideration at the July UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 15.1.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 15.1.0 data files, expected on September 12, 2023.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 15.1.0.

The assignment of characters for Unicode 15.1.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 15.1.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 15.0.0 characters, as well as the property values for the new Unicode 15.1.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

The beta review period is a good opportunity to add support for the new Unicode 15.1.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 15.1.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 15.1.0, often in coordination with changes to character properties. Most notably for Unicode 15.1.0:

There has been a major update to UAX #14, Unicode Line Breaking Algorithm, to provide better default line breaking behavior for orthographic syllables in a significant number of scripts of Southeast Asia.

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

Note that there is no update of the core specification planned for this minor release.

General Character Property Issues

There are a number of issues related to particular character properties:

There are 5 new ideographic description characters. These extend the syntax of ideographic description sequences.

Two of the new ideographic description characters function as unary operators, which necessitated introduction of a new binary property: IDS_Unary_Operator.

There are two new properties, ID_Compat_Math_Start and ID_Compat_Math_Continue, for the new Mathematical Compatibility Notation Profile in UAX #31.

There is a new property, NFKC_Simple_Casefold, which defines a mapping analogous to NFKC_Casefold but built from Simple_Case_Folding mappings rather than full Case_Folding mappings. It is intended for use in systems that support case-insensitive identifiers based on simple (1:1) case folding mappings.

Five new values have been added to the Line_Break property, in support of new orthographic line breaking rules for a significant number of Southeast Asian scripts, as well as Brahmi.
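The distinction between simple and full case folding that motivates NFKC_Simple_Casefold can be illustrated with a few lines in the format of CaseFolding.txt. This is only a minimal sketch: the helper names are hypothetical, just four sample lines are embedded, and it does not implement NFKC_Simple_Casefold itself, which also involves NFKC normalization.

```python
# Sample lines in CaseFolding.txt format: code; status; mapping; # name
# Status C = common, F = full, S = simple. Status T (Turkic) is ignored here.
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def load_foldings(text):
    """Return (simple, full) dicts mapping a character to its folded string."""
    simple, full = {}, {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        src = chr(int(code, 16))
        dst = "".join(chr(int(cp, 16)) for cp in mapping.split())
        if status in ("C", "S"):   # simple folding uses C + S
            simple[src] = dst
        if status in ("C", "F"):   # full folding uses C + F
            full[src] = dst
    return simple, full


def fold(s, table):
    """Fold each character through the table, leaving unmapped ones alone."""
    return "".join(table.get(ch, ch) for ch in s)


SIMPLE, FULL = load_foldings(SAMPLE)
word = "\u1E9EA"                      # LATIN CAPITAL LETTER SHARP S + "A"
simple_folded = fold(word, SIMPLE)    # one output character per input character
full_folded = fold(word, FULL)        # may grow: the sharp s becomes "ss"
```

With simple folding the result always has the same number of code points as the input, which is the property that systems with 1:1 identifier folding rely on.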

Numeric Property Issues

There are several numeric issues among CJK ideographs to check in implementations.

There is one large new value in extracted/DerivedNumericValues.txt: 10000000000000000 (for U+4EAC)

U+5146 has two kPrimaryNumeric values: 1000000, 1000000000000

U+79ED has two kPrimaryNumeric values: 1000000000, 1000000000000
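The checks above can be sketched as follows. This is a hedged illustration rather than a reference parser: the function names are hypothetical, and the tab-separated line shape (code point, property name, space-separated values) is assumed to match the Unihan data files.

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1
# Every integer up to 2**53 is exactly representable in an IEEE 754 double.
DOUBLE_EXACT_MAX = 2**53


def parse_primary_numeric(line):
    """Parse one Unihan-style line into (code point, list of integer values).

    The value field may hold more than one number, so a list is returned
    instead of assuming a single value per code point.
    """
    code, prop, value = line.rstrip("\n").split("\t")
    if prop != "kPrimaryNumeric":
        raise ValueError(f"unexpected property: {prop}")
    return int(code[2:], 16), [int(v) for v in value.split()]


def storage_report(n):
    """Report which common numeric representations can hold n safely."""
    return {
        "int32": n <= INT32_MAX,
        "int64": n <= INT64_MAX,
        "below_2_53": n <= DOUBLE_EXACT_MAX,
    }


cp, values = parse_primary_numeric("U+5146\tkPrimaryNumeric\t1000000 1000000000000")
report = storage_report(10**16)   # the new value listed for U+4EAC
```

The value for U+4EAC fits in a signed 64-bit integer but not in 32 bits, and it lies beyond 2^53, so implementations that store numeric values as 32-bit integers or rely on doubles for exact integer arithmetic may need attention.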

Unihan-related Issues

All Unihan properties should be reviewed carefully. Note that the Unihan Database is currently frozen for 15.1.0. Beta feedback on Unihan properties will be dealt with in a future release. The following changes deserve special attention:

A new CJK unified ideograph block, Extension I, has been added, with 603 characters in the range U+2EBF0..U+2EE4A. Implementers should check carefully for any hard-coded assumptions about CJK ranges. Note that to keep the CJK block ranges as compact as possible, Extension I has been added to Plane 2, instead of directly after Extension H on Plane 3. Implementers should check that their code does not assume that the CJK extensions occur in code point order matching the alphabetic order of their extension letters.

Some kRSUnicode values now include double-apostrophe radicals, sometimes as the only values for a code point.

Seven old provisional properties have been removed.

Six new provisional properties have been added.
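One way to avoid hard-coded ordering assumptions about the CJK extensions is to look code points up in a table sorted by code point rather than by extension letter. The sketch below is deliberately partial and hypothetical: the Extension I range is the one given above, while the main block and Extension H block ranges are filled in as assumptions that should be verified against Blocks.txt, and a real table would list every CJK range.

```python
from bisect import bisect_right

# (first, last, label), sorted by first code point. Partial on purpose.
RANGES = sorted([
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),   # assumed block range
    (0x31350, 0x323AF, "Extension H"),            # assumed block range
    (0x2EBF0, 0x2EE4A, "Extension I"),            # range given in the beta notes
])
STARTS = [first for first, _, _ in RANGES]


def cjk_range_label(cp):
    """Return the label of the range containing cp, or None if not listed."""
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None


def plane(cp):
    """Return the Unicode plane number of a code point."""
    return cp >> 16


labels_in_code_point_order = [label for _, _, label in RANGES]
```

In code point order, Extension I sorts before Extension H, so any logic that walks the extensions by letter and expects increasing code points would misbehave.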

See UAX #38 for further details on these changes, especially Section 4.2, Listing by Date of Addition to the Unicode Standard, and Section 4.3, Listing by Location within Unihan.zip. UAX #38 also has updated regex values for numerous Unihan properties. For the double-apostrophe radicals, see:

UAX #38: kRSUnicode

UAX #38: Section 3.6, Radical-Stroke Counts

UAX #38: Section 2.1.2, Sorting Algorithm Used by the Radical-Stroke Charts

Other CJK-related changes

CJKRadicals.txt: Several radical numbers now end with two apostrophe characters. Example: 213''; 2EF2; 4E80

From the UAX #38 kRSUnicode docs: “Two apostrophes ('') after the radical indicates a non-Chinese simplified version of the given radical.”

Two such radicals are used in kRSUnicode but do not (currently) have data in CJKRadicals.txt, and there are no characters in the radicals blocks for them. One of them is the first radical whose unified ideograph lies outside the original Unicode 1.1 Unihan block (in fact, it has a supplementary code point).

182'' / U+322C4

208'' / U+9F21
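Code that converts the radical field directly to an integer will fail on values such as 213''. The following sketch shows one way to parse a CJKRadicals.txt data line while keeping the apostrophe count. The regular expression and function names are assumptions made for illustration; they are not the normative syntax from UAX #38.

```python
import re

# Assumed line shape: radical number with 0-2 apostrophes; radical
# character code point; unified ideograph code point.
LINE_RE = re.compile(
    r"^(\d{1,3})('{0,2});\s*([0-9A-F]{4,6});\s*([0-9A-F]{4,6})$"
)
RADICAL_RE = re.compile(r"^(\d{1,3})('{0,2})$")


def parse_radical_line(line):
    """Parse one data line into its radical number, marks, and code points."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    number, marks, radical_cp, ideograph_cp = m.groups()
    return {
        "radical": int(number),
        "apostrophes": len(marks),
        "radical_char": int(radical_cp, 16),
        "ideograph": int(ideograph_cp, 16),
    }


def split_radical(field):
    """Split a radical field such as 182'' into (number, apostrophe count)."""
    m = RADICAL_RE.match(field)
    if m is None:
        raise ValueError(f"unrecognized radical: {field!r}")
    return int(m.group(1)), len(m.group(2))


entry = parse_radical_line("213''; 2EF2; 4E80")
radical_182 = split_radical("182''")
```

Keeping the apostrophe count as a separate field lets existing code continue to treat the radical number as an integer.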

Code Charts

As always, careful review of the updated code charts for Version 15.1.0 is advised. Particular issues to take note of include:

The code chart for the main CJK Unified Ideographs block (U+4E00) has an updated format that uses 7 columns for source glyphs, instead of 6. The KP source glyphs have been explicitly added to the code charts.

The font used for the representative glyphs of the Alchemical Symbols block has been updated.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 15.1.0 repertoire for UCA 15.1.0. For the most part, the additions for new characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

There has been an additional update to DUCET regarding the weighting of quotation marks. Various single quotation marks are now weighted as secondary variants of U+0027 APOSTROPHE, and various double quotation marks are now weighted as secondary variants of U+0022 QUOTATION MARK. U+05F3 HEBREW PUNCTUATION GERESH is also weighted as a secondary variant of U+0027, and U+05F4 HEBREW PUNCTUATION GERSHAYIM is weighted as a secondary variant of U+0022. This change enables better behavior of geresh and gershayim for searching and sorting, and brings UCA more in line with the CLDR tailorings for quotation marks, geresh, and gershayim.
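What "secondary variant" means for matching and sorting can be sketched with a toy two-level sort key. The weights below are made up purely for illustration and are not DUCET values; the sketch also ignores variable weighting, which affects how punctuation is treated under some collation settings.

```python
# Hypothetical (primary, secondary) weights. Real values come from the DUCET.
WEIGHTS = {
    "\u0027": (0x0300, 0x0020),  # APOSTROPHE
    "\u05F3": (0x0300, 0x0021),  # HEBREW PUNCTUATION GERESH: same primary
    "\u0022": (0x0301, 0x0020),  # QUOTATION MARK
    "\u05F4": (0x0301, 0x0021),  # HEBREW PUNCTUATION GERSHAYIM: same primary
}


def sort_key(s, strength=2):
    """Build a simplified sort key: all primaries first, then all secondaries."""
    primaries = tuple(WEIGHTS[ch][0] for ch in s)
    if strength == 1:
        return (primaries,)
    secondaries = tuple(WEIGHTS[ch][1] for ch in s)
    return (primaries, secondaries)


# At primary strength a geresh matches an apostrophe; at secondary strength
# the two are distinguished, and the apostrophe sorts first.
primary_match = sort_key("\u05F3", strength=1) == sort_key("\u0027", strength=1)
secondary_match = sort_key("\u05F3") == sort_key("\u0027")
```

Under these assumptions, a primary-strength search for an apostrophe would also find a geresh, while sorting still yields a deterministic order between the two.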

Other Issues

Please also check the following specific items carefully:

Transitional processing (see conformance clause C1) has now been deprecated in UTS #46, Unicode IDNA Compatibility Processing.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.
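Reviewers who want to compare stability-sensitive values between two snapshots of the data could start from something like the sketch below. It reads lines in UnicodeData.txt format and extracts only the combining class, decomposition, and simple case mapping fields. The function names are hypothetical, and properties kept in other files (for example composition exclusions and the identifier properties) are not covered.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
"""


def stability_fields(text):
    """Map each code point to the fields that cannot change after release."""
    out = {}
    for line in text.splitlines():
        f = line.split(";")
        if len(f) != 15:          # skip anything that is not a full record
            continue
        out[int(f[0], 16)] = {
            "ccc": int(f[3]),         # Canonical_Combining_Class
            "decomposition": f[5],    # Decomposition_Type and _Mapping
            "upper": f[12],           # Simple_Uppercase_Mapping
            "lower": f[13],           # Simple_Lowercase_Mapping
            "title": f[14],           # Simple_Titlecase_Mapping
        }
    return out


def changed(old, new):
    """Return code points present in both snapshots whose fields differ."""
    return sorted(cp for cp in old.keys() & new.keys() if old[cp] != new[cp])


old = stability_fields(SAMPLE)
# Simulate a second snapshot in which one combining class was altered.
new = stability_fields(SAMPLE.replace(";230;", ";220;"))
differences = changed(old, new)
```

For newly assigned characters there is no earlier snapshot to diff against, so the extracted values would instead be reviewed directly before the release makes them immutable.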