BETA Unicode 13.0.0

Note: The beta review period for Unicode 13.0.0 closed on January 6, 2020. Feedback received during the public review is accessible from PRI #412. This beta review page remains active, however, for convenient access to the prepublication versions of the Unicode 13.0.0 data files and annexes, until the formal release planned for March 10, 2020.

The next version of the Unicode Standard will be Version 13.0.0, planned for release on March 10, 2020. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 5,930 new characters are encoded, including 55 new emoji characters, four new scripts, and multiple additions to existing blocks.

A beta version of the 13.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 13.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 13.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 13.0.0

Code charts - single download (110 MB)

Auxiliary HTML charts for beta review [not yet available]

Related Unicode Technical Standards

In addition to the Unicode Standard proper, four other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 13.0.0. Review of that text and data is also encouraged during the beta review period.

Specification: Data Files

UTS #10, Unicode Collation Algorithm: DUCET and test files

UTS #39, Unicode Security Mechanisms: Identifier and confusables files

UTS #46, Unicode IDNA Compatibility Processing: IDNA mapping and test files

UTS #51, Unicode Emoji: Emoji data files (in UCD); Emoji sequence and test files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 13.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #412. Comments on specific Version 13.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends January 6, 2020. All substantive technical comments must have been received by that date for consideration at the January UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 13.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 13.0.0 data files, expected on March 10, 2020.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 13.0.0.

The assignment of characters for Unicode 13.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 13.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 12.1.0 characters, as well as the property values for the new Unicode 13.0.0 character additions. The Auxiliary HTML charts [not yet available] include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. For more information, see the diffs.readme.txt file.

The beta review period is a good opportunity to add support for the new Unicode 13.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 13.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 13.0.0, often in coordination with changes to character properties. Most notably for Unicode 13.0.0:

UAX #14, Unicode Line Breaking Algorithm has significant changes for a couple of rules. LB22 was changed to disallow breaking before ellipsis. LB20 was changed to better account for break opportunities around East Asian opening and closing delimiters.

UAX #38, Unicode Han Database (Unihan) has significant updates to document new properties, and to correct regular expressions for many others.

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

The core specification is undergoing extensive review, with numerous additions for Version 13.0.0. Although the draft text for Version 13.0.0 is not yet available, specific reports of any technical or editorial issues in the currently published core specification are also welcome during the beta review period. Such reports will be taken into consideration for corrections to the Version 13.0.0 draft. (Note: The Unicode Consortium has ongoing opportunities for subject-matter volunteers: experts interested in contributing to or editing relevant parts of the core specification or other Unicode specifications.)

Script-specific Issues

Four new scripts have been added in Unicode 13.0.0. Some of these scripts have particular attributes which may cause issues for implementations. The more important of these attributes are summarized here.

Dives Akuru is a complex script of the Indic type.

Khitan Small Script has rules for stacking characters into phonogram clusters. One new, Khitan-specific format control character is used to distinguish between two patterns for phonogram clusters. In addition, the Khitan Small Script is traditionally laid out in vertical orientation.

New Data Files Added to the UCD

WARNING: Two of the emoji data files have been formally incorporated into the UCD for Version 13.0.0. These files are located in a new emoji/ subdirectory of the main ucd/ directory. See UTS #51 and UAX #44 for details.

emoji-data.txt specifies six emoji-related binary properties, which assist in the identification and parsing of emoji, and which are relevant to Unicode segmentation algorithms.

emoji-variation-sequences.txt specifies the emoji variation sequences, which enable control of emoji presentation versus text presentation of emoji characters. The format of this file is the same as that used for StandardizedVariants.txt.

Other data files related to emoji sequences, as well as the emoji test file, are located in the /Public/emoji/13.0/ directory associated with UTS #51. Implementations should be prepared to adapt to the new locations of some data files.

There have been no significant changes to the format of any of the normative data content of the emoji data files; however, in the comment section of the data lines, emoji version information has replaced the Unicode version information associated with characters and sequences.
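Because only the comment portion of the data lines has changed, a parser that discards everything after the first "#" should be unaffected. The following is a minimal sketch of such a parser, not a reference implementation. The SAMPLE lines are hand-written in the general style of emoji-data.txt for illustration only, and the function name parse_emoji_data is hypothetical.

```python
from collections import defaultdict

# Hand-written lines in the style of emoji-data.txt (illustrative only;
# use the real file from the emoji/ subdirectory in practice).
SAMPLE = """\
# illustrative excerpt
0023          ; Emoji                # E0.0   [1] number sign
231A..231B    ; Emoji_Presentation   # E0.6   [2] watch..hourglass done
"""


def parse_emoji_data(text):
    """Return {property name: [(first, last), ...]}, ignoring all comments."""
    props = defaultdict(list)
    for raw in text.splitlines():
        # Drop the comment part, so version annotations there never matter.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point or a range
        props[fields[1]].append((lo, hi))
    return dict(props)


if __name__ == "__main__":
    for name, spans in parse_emoji_data(SAMPLE).items():
        print(name, [f"{lo:04X}..{hi:04X}" for lo, hi in spans])
```

A parser written this way only needs its file path updated for the new emoji/ location.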

Casing Issues

Only three new Latin case pairs have been added in Version 13.0.0, and there are no changes for casing in other scripts. However, implementations of case mapping and case folding should be checked to ensure they account correctly for the new case pairs.

General Character Property Issues

There are a number of issues related to particular character properties:

A new Canonical_Combining_Class value of ccc=6 has been added for two Vietnamese Han reading marks. Implementations should be checked to ensure that their handling of combining class values does not fail when encountering this new value.

A new value of the Indic_Positional_Category property has been added: Top_And_Bottom_And_Left.
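For the new ccc=6 value noted above, one way to avoid failures is to treat Canonical_Combining_Class as a plain integer rather than a member of a hard-coded set of known values. The sketch below assumes the semicolon-separated layout of UnicodeData.txt, with the combining class in the fourth field. The sample rows are hand-written for illustration and should be checked against the actual beta file; the 0..254 bound is likewise an assumption taken from the usual description of the property.

```python
# Hand-written rows in the style of UnicodeData.txt (illustrative only).
SAMPLE_ROWS = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
16FF0;VIETNAMESE ALTERNATE READING MARK CA;Mc;6;L;;;;;N;;;;;
"""


def combining_classes(unicode_data_text):
    """Map code point -> non-zero combining class, accepting any value in range."""
    result = {}
    for line in unicode_data_text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        ccc = int(fields[3])
        # Range check only: do not reject values merely because they are new.
        if not 0 <= ccc <= 254:
            raise ValueError(f"unexpected combining class {ccc} in {line!r}")
        if ccc:
            result[int(fields[0], 16)] = ccc
    return result


if __name__ == "__main__":
    print(combining_classes(SAMPLE_ROWS))
```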

Numeric Property Issues

A new set of decimal digits has been added for the Dives Akuru script.

A new set of compatibility decimal digits has been added, for segmented (LED-like) digit display support for legacy computer graphic symbol sets.

No characters with unusual fractional numeric values or very large integer values have been added in this version.

Unihan-related Issues

All Unihan properties should be reviewed carefully. Additionally, the following deserve special attention:

Three obsolete provisional properties have been removed: kRSJapanese, kRSKanWa, kRSKorean.

Two new normative source properties have been added: kIRG_SSource and kIRG_UKSource, with values split off from kIRG_USource. These properties involve data for the CJK charts and have some impact on the distribution of sources in those charts.

A new informative property has been added: kUnihanCore2020. This is intended as a more useful indicator of the basic Han set to support, superseding the function of kIICore.

WARNING: One informative property, kTotalStrokes, has been moved from the Unihan subfile Unihan_DictionaryLikeData.txt to the subfile Unihan_IRGSources.txt. This change may impact implementations that parse for that particular Unihan property value.

There are large changes in the values for kSimplifiedVariant, kTraditionalVariant, and kZVariant, and many additions for the new kSpoofingVariant property.

See UAX #38 for further details on these changes, especially Section 4.2, Listing by Date of Addition to the Unicode Standard, and Section 4.3, Listing by Location within Unihan.zip. UAX #38 also has updated regex values for numerous Unihan properties.
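The relocation of kTotalStrokes is less disruptive for parsers that index Unihan data by property name rather than by subfile. The following is a rough sketch of that approach, assuming the tab-separated "U+XXXX, property, value" line layout of the Unihan subfiles. The function name index_unihan and the in-memory file mapping are hypothetical, and the sample line is illustrative.

```python
def index_unihan(files):
    """Build {property: {code point: value}} from a {subfile name: text} mapping.

    The subfile name is deliberately not used for lookup, so a property that
    moves between subfiles is found either way.
    """
    table = {}
    for text in files.values():
        for line in text.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            code, prop, value = line.split("\t", 2)
            table.setdefault(prop, {})[int(code[2:], 16)] = value
    return table


if __name__ == "__main__":
    line = "U+4E00\tkTotalStrokes\t1\n"  # illustrative sample line
    old_layout = {"Unihan_DictionaryLikeData.txt": line}
    new_layout = {"Unihan_IRGSources.txt": "# comment\n" + line}
    for layout in (old_layout, new_layout):
        print(index_unihan(layout)["kTotalStrokes"][0x4E00])
```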

Standardized Variation Sequences

Two new standardized variation sequences were added to emoji-variation-sequences.txt to distinguish text presentation and emoji presentation forms of U+26A7 MALE WITH STROKE AND MALE AND FEMALE SIGN. This results from the new use of U+26A7 in an emoji sequence defined for Version 13.0.0.
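As a small illustration of the two sequences, the sketch below builds them from U+26A7 and the variation selectors U+FE0E (text presentation) and U+FE0F (emoji presentation). The helper name strip_presentation_selectors is hypothetical; it shows one way to compare strings while ignoring the presentation choice.

```python
BASE = "\u26A7"       # the base character named above
TEXT_VS = "\uFE0E"    # VARIATION SELECTOR-15: request text presentation
EMOJI_VS = "\uFE0F"   # VARIATION SELECTOR-16: request emoji presentation

text_form = BASE + TEXT_VS
emoji_form = BASE + EMOJI_VS


def strip_presentation_selectors(s):
    """Remove U+FE0E and U+FE0F so both forms compare equal to the bare base."""
    return s.replace(TEXT_VS, "").replace(EMOJI_VS, "")


if __name__ == "__main__":
    for form in (text_form, emoji_form):
        print(" ".join(f"{ord(c):04X}" for c in form))
```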

Code Charts

As always, careful review of the updated code charts for Version 13.0.0 is advised, especially for all newly added scripts. Particular issues to take note of include:

The font for the Kangxi Radicals and CJK Radicals Supplement blocks has been updated, so that it more accurately represents the actual forms of Kangxi radicals and the variant radicals. This new font is also used for the indexing radical shown in the CJK unified ideograph blocks in the code charts, as well as in the updated radical-stroke indexes for Version 13.0.0.

The format for the Mongolian code chart has been substantially revised, removing all details about positional variants and standardized variation sequences. The old format, showing all the variant glyphs, is preserved in UTR #54, Unicode Mongolian 12.1 Baseline. Note that future updates to the Mongolian model and the rules for rendering and interpretation of variation sequences will be worked out in a separate specification, instead of being documented in the basic code chart for Mongolian.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 13.0.0 repertoire for UCA 13.0. For the most part, the additions for new scripts and other characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

The following issue is of particular note for collation implementations that parse allkeys.txt:

Because of the addition of a second, non-contiguous range of Tangut ideographs to the standard, there are now two @implicitweights statements for Tangut ranges at the top of allkeys.txt associated with the same FB00 base weight. Parsers must accumulate ranges associated with the same base weight, rather than clobbering a prior range assignment when encountering the second range.
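The accumulation requirement can be sketched as follows: each base weight maps to a list of ranges, and every @implicitweights line appends to that list instead of overwriting it. The sample statements are written in the style of the header of allkeys.txt for illustration and should be checked against the actual file. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative statements in the style of the top of allkeys.txt.
SAMPLE = """\
@implicitweights 17000..18AFF; FB00 # Tangut and Tangut Components
@implicitweights 18D00..18D8F; FB00 # Tangut Supplement
"""

PATTERN = re.compile(
    r"^@implicitweights\s+([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*([0-9A-F]+)"
)


def parse_implicit_weights(text):
    """Return {base weight: [(first, last), ...]}, accumulating all ranges."""
    ranges = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            first, last, base = (int(g, 16) for g in match.groups())
            ranges[base].append((first, last))  # append, never overwrite
    return dict(ranges)


def implicit_base(cp, ranges):
    """Return the base weight whose ranges contain cp, or None."""
    for base, spans in ranges.items():
        if any(lo <= cp <= hi for lo, hi in spans):
            return base
    return None


if __name__ == "__main__":
    table = parse_implicit_weights(SAMPLE)
    print(hex(implicit_base(0x17000, table)), hex(implicit_base(0x18D00, table)))
```

A parser that stored a single range per base weight would silently lose the first Tangut range here.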

Other Issues

Please also check the following specific items carefully:

55 new emoji characters have been added. However, in addition to those individual characters, many new emoji sequences have been recognized, as well. If your implementation supports emoji, be sure to carefully review UTS #51, Unicode Emoji (PRI #405).

WARNING: There are multiple new ideographic ranges defined for Version 13.0.0, as well as changes to the end of several existing CJK unified ideograph ranges. Because implementations often hard-code ideographic ranges to short-cut lookups and reduce table sizes, it is especially important that implementers pay close attention to the implications of range changes for Version 13.0.0. These ideographic range changes are noted individually here. See also Blocks.txt for details.

There is a second range defined for Tangut ideographs now, for the new Tangut Supplement block. This means that Tangut is the second ideographic script (after Han) which has multiple ranges defined in multiple blocks. The Tangut Supplement block, like the main Tangut block, has character names defined by rule based on code point: TANGUT IDEOGRAPH-<code point>.

The Khitan Small Script is a new ideographic script, encoded for the first time in Version 13.0.0. This is the fourth ideographic script (after Han, Tangut, and Nushu) to use the range notation in UnicodeData.txt and to have character names defined by rule based on code point: KHITAN SMALL SCRIPT CHARACTER-<code point>.

Three existing CJK unified ideographic blocks have small extensions added at the end of the blocks. These extensions bump up the end ranges by a few code points for each block: 13 code points for the URO, 10 code points for Extension A, and 7 code points for Extension B. Implementers expect these kinds of extension for the URO, because they have happened for multiple versions of the standard. However, these are the very first such small range additions for both Extension A and Extension B. Note that the addition for Extension A also happens to completely fill the CJK Unified Ideographs Extension A block. See Section 4.4, Listing of Characters Covered by the Unihan Database in UAX #38 for the version history of all these small CJK unified ideograph additions inside existing blocks.

Finally, the new CJK Unified Ideographs Extension G block is the first block of assigned characters in Plane 3, the Tertiary Ideographic Plane. Implementers should check their assumptions about valid ranges past U+2FFFF, to ensure that code points in the range U+30000..U+3134A are correctly handled.
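The rule-based names mentioned above can be derived with a small range table, as in the following sketch. The Khitan Small Script and Tangut Supplement ranges are the block ranges from the table of new blocks, which are wider than the sets of code points actually assigned, and the main Tangut block range is an assumption carried over from earlier versions. A real implementation should take the exact ranges from UnicodeData.txt. The function name derived_name is hypothetical.

```python
# (first, last, name prefix). Block-level ranges are used for simplicity;
# assigned ranges are narrower and should come from UnicodeData.txt.
RULE_RANGES = [
    (0x17000, 0x187FF, "TANGUT IDEOGRAPH-"),               # Tangut block (assumed)
    (0x18B00, 0x18CFF, "KHITAN SMALL SCRIPT CHARACTER-"),  # new in 13.0.0
    (0x18D00, 0x18D8F, "TANGUT IDEOGRAPH-"),               # Tangut Supplement
    (0x30000, 0x3134A, "CJK UNIFIED IDEOGRAPH-"),          # Extension G, Plane 3
]


def derived_name(cp):
    """Return the rule-derived name for cp, or None if no rule applies."""
    for first, last, prefix in RULE_RANGES:
        if first <= cp <= last:
            return f"{prefix}{cp:04X}"
    return None


if __name__ == "__main__":
    for cp in (0x18B00, 0x18D00, 0x30000, 0x0041):
        print(f"U+{cp:04X}", derived_name(cp))
```

Keeping the ranges in a table rather than scattered through conditionals makes the changes for a new version a matter of editing data.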

The following blocks are new in Unicode 13.0.0. Check implementations carefully for any range or property value assumptions regarding these new blocks. See also the single-block delta charts.

Range Block Name

10E80..10EBF Yezidi

10FB0..10FDF Chorasmian

11900..1195F Dives Akuru

11FB0..11FBF Lisu Supplement

18B00..18CFF Khitan Small Script

18D00..18D8F Tangut Supplement

1FB00..1FBFF Symbols for Legacy Computing

30000..3134F CJK Unified Ideographs Extension G
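For quick checks of range assumptions, the rows above can be loaded into a sorted table and searched with a binary search. This is only a sketch covering the new blocks listed here; a full implementation would load every block from Blocks.txt. The names NEW_BLOCKS and new_block_of are hypothetical.

```python
import bisect

# The new blocks listed above, sorted by starting code point.
NEW_BLOCKS = [
    (0x10E80, 0x10EBF, "Yezidi"),
    (0x10FB0, 0x10FDF, "Chorasmian"),
    (0x11900, 0x1195F, "Dives Akuru"),
    (0x11FB0, 0x11FBF, "Lisu Supplement"),
    (0x18B00, 0x18CFF, "Khitan Small Script"),
    (0x18D00, 0x18D8F, "Tangut Supplement"),
    (0x1FB00, 0x1FBFF, "Symbols for Legacy Computing"),
    (0x30000, 0x3134F, "CJK Unified Ideographs Extension G"),
]
STARTS = [first for first, _, _ in NEW_BLOCKS]


def new_block_of(cp):
    """Return the name of the new block containing cp, or None."""
    i = bisect.bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= NEW_BLOCKS[i][1]:
        return NEW_BLOCKS[i][2]
    return None


if __name__ == "__main__":
    for cp in (0x10E80, 0x11FB5, 0x3134A, 0x0041):
        print(f"U+{cp:04X}", new_block_of(cp))
```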

Some blocks have also had font updates; see the single-block delta charts for details. In such cases, careful review of the blocks in question is advised, to ensure that there have not been any regressions in representative glyph display.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.