BETA Unicode® 13.0.0
Note: The beta review period for Unicode 13.0.0 closed on
January 6, 2020. Feedback received during the public review is
accessible from PRI #412.
This beta review page remains
active, however, for convenient access to the prepublication versions
of the Unicode 13.0.0 data files and annexes until the formal release
planned for March 10, 2020.
The next version of the Unicode Standard will be Version 13.0.0, planned for release on
March 10, 2020. This version updates several annexes to deal with
segmentation issues and adds significant new repertoire.
A total of 5,930 new characters are encoded, including
55 new emoji characters,
four new scripts, and multiple additions to existing blocks.
A beta version of the 13.0.0 Unicode Character Database files is available for public review.
We strongly encourage implementers to review the summary description,
download the beta 13.0.0 Unicode Character Database files,
and test their programs with the new data, well before the end of the beta period. It is especially important
to review the Notable Issues for Beta Reviewers.
We encourage users to check the code charts carefully
to verify correctness of the new characters added to Unicode 13.0.0 and to ensure
that there are no regressions
in glyph shapes for previously encoded characters.
Related Unicode Technical Standards
In addition to the Unicode Standard proper, four other Unicode Technical
Standards have significant text and data file updates that are
correlated with the new additions for Unicode 13.0.0. Review of that text
and data is also encouraged during the beta review period.
Review and Feedback
For guidance on how to focus your review, see the section
Notable Issues for Beta Reviewers.
Any feedback should be
reported using the contact form.
Comments on the Unicode Standard Version 13.0.0
or the Unicode Character Database data files should refer to the beta review
Public Review Issue #412.
Comments on specific Version 13.0.0 UAXes and UTSes should refer to the respective
Public Review Issue Numbers
for each document, where available.
The comment period ends
January 6, 2020.
All substantive technical comments must have been received by that date for
consideration at the January UTC meeting. Editorial comments (typos,
etc.) may still be submitted after that date for consideration in the final
editorial work.
Note: All beta files may be updated, replaced, or
superseded by other files at any time. The beta files will be
discarded once Unicode 13.0.0 is final. It is inappropriate to cite
these files as other than a work in progress. No
products or implementations should be released based on the beta
UCD data files—use only the final, approved Version 13.0.0 data
files, expected on March 10, 2020.
The Unicode Consortium provides early access to updated versions of the data files
and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of
Version 13.0.0.
The assignment of characters for Unicode 13.0.0 is
now stable. There will be no further
additions or modifications of code points and no further changes to character names.
Please do not submit feedback requesting changes to code points
or character names for Unicode 13.0.0, as such feedback is not actionable.
One of the main purposes of the beta review period is to verify and
correct the preliminary character property assignments in the Unicode Character
Database. Reviewers should check for property changes to existing Unicode 12.1.0
characters, as well as the property values for the new Unicode 13.0.0 character
additions. The Auxiliary
HTML charts [not yet available] include the new characters highlighted in yellow, with names
appearing when hovering over a cell. These charts
may be useful for reviewing information such as the default collation order,
Script property assignments, and so forth during beta review.
To facilitate verification of the property changes and additions,
diffable XML versions
of the Unicode Character Database are available. These XML
files are dated, so that people can check the details of changes that occurred
during the beta review period. For more information,
see the
diffs.readme.txt
file.
The beta review period is a good opportunity to add support for the new
Unicode 13.0.0 characters in internal versions of software, so that software can
be tested to verify that the new characters and property assignments do not cause
problems when upgraded to Version 13.0.0 of Unicode.
Notable Issues for Beta Reviewers
Changes to Unicode Standard Annexes
Some of the Unicode Standard Annexes have modifications for
Unicode 13.0.0, often in coordination with changes to character properties.
Most notably for Unicode 13.0.0:
- UAX #14, Unicode Line Breaking Algorithm has significant changes for a couple
of rules. LB22 was changed to disallow breaking before ellipsis. LB20 was
changed to better account for break opportunities around East Asian
opening and closing delimiters.
- UAX #38, Unicode Han Database (Unihan) has significant updates to
document new properties, and to correct regular expressions for many
others.
See the Modifications section of each Annex for details of the relevant changes.
Core Specification Update
The core specification is undergoing extensive review, with
numerous additions for Version 13.0.0. Although the draft text for Version 13.0.0
is not yet available, specific reports of any technical or editorial
issues in the currently published core specification
are also welcome during the beta review
period. Such reports will be taken into consideration for corrections
to the Version 13.0.0 draft. (Note: The Unicode Consortium has ongoing
opportunities for subject-matter volunteers: experts interested in contributing to or
editing relevant parts of the core specification or other Unicode specifications.)
Script-specific Issues
Four new scripts have been added in Unicode 13.0.0. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- Dives Akuru is a complex script of the Indic type.
- Khitan Small Script has rules for stacking characters into
phonogram clusters. One new, Khitan-specific format control
character is used to distinguish between two patterns for
phonogram clusters. In addition, the Khitan Small Script is traditionally
laid out in vertical orientation.
New Data Files Added to the UCD
- WARNING: Two of the emoji data files have been formally incorporated into the UCD for
Version 13.0.0. These files are located in a new emoji/ subdirectory of
the main ucd/ directory. See UTS #51 and UAX #44 for details.
- emoji-data.txt specifies six emoji-related binary properties, which
assist in the identification and parsing of emoji, and
which are relevant to Unicode segmentation algorithms.
- emoji-variation-sequences.txt specifies the emoji variation sequences,
which enable control of emoji presentation versus text presentation of
emoji characters. The format of this file is the same as that used for
StandardizedVariants.txt.
- Other data files related to emoji sequences, as well as the emoji test
file, are located in the /Public/emoji/13.0/ directory associated
with UTS #51. Implementations should be prepared to adapt to the new
locations of some data files.
- There have been no significant changes to the format of any of the
normative data content of the emoji data files; however, in the comment
section of the data lines, emoji version information has replaced the
Unicode version information associated with characters and sequences.
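Because only the comment section of the data lines has changed, a parser that discards everything after the first `#` should be unaffected. The following is a minimal sketch of such a parser, assuming the usual UCD layout of `code point or range ; property # comment`. The sample lines are illustrative and not copied from the beta file, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative lines in the general style of emoji-data.txt (an assumption,
# not an excerpt from the beta file).
SAMPLE = """\
# comment-only line
0023          ; Emoji                # E0.0   [1] (#)  number sign
1F3FB..1F3FF  ; Emoji_Modifier       # E1.0   [5] skin tone modifiers
1F600         ; Emoji                # E1.0   [1] grinning face
"""

def parse_emoji_data(text):
    """Return {property name: [(first, last), ...]} for each data line."""
    props = defaultdict(list)
    for raw in text.splitlines():
        # Drop the comment section, which now carries emoji version info.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        cp_field, prop = fields[0], fields[1]
        first, _, last = cp_field.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        props[prop].append((lo, hi))
    return dict(props)

def has_property(props, name, cp):
    """True if code point cp falls in any range recorded for the property."""
    return any(lo <= cp <= hi for lo, hi in props.get(name, []))

props = parse_emoji_data(SAMPLE)
print(has_property(props, "Emoji_Modifier", 0x1F3FD))
```

Passing the file contents as a string keeps the sketch independent of whether the file is read from the new emoji/ subdirectory or from an older location.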
Casing Issues
Only three new Latin case pairs have been added in Version 13.0.0, and
there are no changes for casing in other scripts. However, implementations
of case mapping and case folding should be checked to ensure they account
correctly for the new case pairs.
General Character Property Issues
There are a number of issues related to particular character properties:
- A new Canonical_Combining_Class value of ccc=6 has been added for
two Vietnamese Han reading marks. Implementations should be checked to
ensure that their handling of combining class values does not fail when
encountering this new value.
- A new value of the Indic_Positional_Category property has been added:
Top_And_Bottom_And_Left.
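One way to avoid failures on a previously unseen combining class is to treat the value as a plain integer in the permitted range rather than as a member of a fixed list of known values. The sketch below assumes UnicodeData.txt-style input. The sample record is illustrative and should be checked against the real data file, and the function names are hypothetical.

```python
# Illustrative record in UnicodeData.txt format for one of the new reading
# marks; the exact field values are an assumption to verify against the file.
SAMPLE_LINE = "16FF0;VIETNAMESE ALTERNATE READING MARK CA;Mc;6;L;;;;;N;;;;;"

def parse_ccc(line):
    """Return (code point, ccc), accepting any value in 0..254."""
    fields = line.split(";")
    cp = int(fields[0], 16)
    ccc = int(fields[3])
    if not 0 <= ccc <= 254:
        raise ValueError(f"ccc out of range for U+{cp:04X}: {ccc}")
    return cp, ccc

def canonical_reorder(pairs):
    """Stable-sort each run of non-starters (ccc != 0) by ccc.

    pairs is a list of (code point, ccc); starters (ccc == 0) end a run.
    """
    out, run = [], []
    for cp, ccc in pairs:
        if ccc == 0:
            out.extend(sorted(run, key=lambda p: p[1]))  # sorted() is stable
            run = []
            out.append((cp, ccc))
        else:
            run.append((cp, ccc))
    out.extend(sorted(run, key=lambda p: p[1]))
    return out

mark = parse_ccc(SAMPLE_LINE)
print(canonical_reorder([(0x41, 0), (0x301, 230), mark]))
```

Because the ordering compares the numeric values directly, ccc=6 sorts ahead of larger classes without any special casing.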
Numeric Property Issues
- A new set of decimal digits has been added for the Dives Akuru script.
- A new set of compatibility decimal digits has been added, for segmented
(LED-like) digit display support for legacy computer graphic symbol sets.
- No characters with unusual fractional numeric values or very large integer
values have been added in this version.
Unihan-related Issues
All Unihan
properties should be reviewed carefully. Additionally, the following
deserve special attention:
- Three obsolete provisional properties have been removed: kRSJapanese, kRSKanWa, kRSKorean.
- Two new normative source properties have been added: kIRG_SSource and kIRG_UKSource, with
values split off from kIRG_USource. These
properties involve data for the CJK charts and have some impact on the distribution of
sources in those charts.
- A new informative property has been added: kUnihanCore2020. This is intended as a more
useful indicator of the basic Han set to support, superseding the function of kIICore.
- WARNING: One informative property, kTotalStrokes, has been moved from the Unihan
subfile Unihan_DictionaryLikeData.txt to the subfile Unihan_IRGSources.txt. This change
may impact implementations that parse for that particular Unihan property value.
- There are large changes in the values for kSimplifiedVariant, kTraditionalVariant, and
kZVariant, and many additions for the new kSpoofingVariant property.
See UAX #38 for further details on these changes, especially Section 4.2, Listing
by Date of Addition to the Unicode Standard, and Section 4.3, Listing by
Location within Unihan.zip.
UAX #38 also has updated regex values for numerous
Unihan properties.
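Implementations that look up a Unihan property by scanning every subfile, rather than hard-coding the subfile name, should be insensitive to moves such as the one for kTotalStrokes. The following is a minimal sketch, assuming the tab-separated `U+XXXX<TAB>property<TAB>value` line format. The sample records are illustrative, and the function name is hypothetical.

```python
def collect_unihan_property(files, prop):
    """Scan all subfiles for one property.

    files maps subfile name -> file text.
    Returns {code point: (value, subfile name)}.
    """
    found = {}
    for name, text in files.items():
        for line in text.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and header comments
            parts = line.split("\t", 2)
            if len(parts) != 3:
                continue  # not a data record
            cp_field, key, value = parts
            if key == prop:
                found[int(cp_field[2:], 16)] = (value, name)
    return found

# Illustrative records only (an assumption, not excerpts from Unihan.zip).
FILES = {
    "Unihan_DictionaryLikeData.txt": "# header\nU+4E00\tkCangjie\tM\n",
    "Unihan_IRGSources.txt": "U+4E00\tkTotalStrokes\t1\nU+4E8C\tkTotalStrokes\t2\n",
}

print(collect_unihan_property(FILES, "kTotalStrokes"))
```

Recording the subfile alongside each value also makes it easy to report where a property was actually found when comparing versions.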
Standardized Variation Sequences
Two new standardized variation sequences were added to emoji-variation-sequences.txt to
distinguish text presentation and emoji presentation forms of U+26A7 MALE WITH STROKE
AND MALE AND FEMALE SIGN. This results from the new use of U+26A7 in an emoji sequence
defined for Version 13.0.0.
Code Charts
As always, careful review of the updated code charts for Version 13.0.0 is advised,
especially for all newly added scripts.
Particular issues to take note of include:
- The font for the Kangxi Radicals and CJK Radicals Supplement blocks has
been updated, so that it more accurately represents the
actual forms of Kangxi radicals and the variant radicals. This new font is also used
for the indexing radical shown in the CJK unified ideograph blocks in
the code charts, as well as in the updated radical-stroke indexes for
Version 13.0.0.
- The format for the Mongolian code chart has been substantially revised,
removing all details about positional variants and standardized variation
sequences. The old format, showing all the variant glyphs, is preserved
in UTR #54, Unicode Mongolian 12.1 Baseline. Note that future updates
to the Mongolian model and the rules for rendering and interpretation of
variation sequences will be worked out in a separate specification,
instead of being documented in the basic code chart for Mongolian.
Collation-related Issues
The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 13.0.0
repertoire for UCA 13.0. For the most part, the additions for new scripts and other
characters are unremarkable, but implementations should be checked to ensure
the new additions do not cause problems.
The following issue is of particular note for collation implementations that
parse allkeys.txt:
- Because of the addition of a second, non-contiguous range of Tangut ideographs
to the standard, there are now two @implicitweights statements for Tangut
ranges at the top of allkeys.txt associated with the same FB00 base weight.
Parsers must accumulate ranges associated with the same base weight,
rather than clobbering a prior range assignment when encountering the second
range.
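The accumulation requirement can be sketched as follows: each base weight maps to a list of ranges, and every @implicitweights statement appends to that list instead of overwriting it. The statement syntax and the sample ranges below are assumptions for illustration and should be checked against the actual allkeys.txt. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative header lines; syntax and ranges are assumptions to verify
# against the real allkeys.txt.
SAMPLE = """\
@version 13.0.0
@implicitweights 17000..18AFF; FB00 # Tangut and Tangut Components
@implicitweights 18D00..18D8F; FB00 # Tangut Supplement
@implicitweights 1B170..1B2FF; FB01 # Nushu
"""

_IMPLICIT = re.compile(
    r"^@implicitweights\s+([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*([0-9A-F]+)"
)

def parse_implicit_weights(text):
    """Return {base weight: [(first, last), ...]}, accumulating ranges."""
    ranges = defaultdict(list)
    for line in text.splitlines():
        m = _IMPLICIT.match(line)
        if m:
            first, last, base = (int(g, 16) for g in m.groups())
            ranges[base].append((first, last))  # append, never overwrite
    return dict(ranges)

def implicit_base(ranges, cp):
    """Return the base weight whose ranges contain cp, or None."""
    for base, spans in ranges.items():
        if any(lo <= cp <= hi for lo, hi in spans):
            return base
    return None

ranges = parse_implicit_weights(SAMPLE)
print(hex(implicit_base(ranges, 0x18D00)))
```

A parser that stored a single range per base weight would lose the first Tangut range on reading the second statement, which is exactly the failure described above.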
Other Issues
Please also check the following specific items carefully:
- 55 new emoji characters have been added. However, in addition
to those individual characters, many new emoji sequences have been
recognized, as well. If your implementation supports emoji,
be sure to carefully review
UTS #51, Unicode Emoji
(PRI #405).
- WARNING: There are multiple new ideographic ranges defined for
Version 13.0.0, as well as changes to the end of several existing
CJK unified ideograph ranges. Because implementations often hard-code
ideographic ranges to short-cut lookups and reduce table sizes, it is
especially important that implementers pay close attention to the
implications of range changes for Version 13.0.0. These ideographic
range changes are noted individually here. See also Blocks.txt for
details.
- There is a second range defined for Tangut ideographs now, for the
new Tangut Supplement block. This means that Tangut is the second
ideographic script (after Han) which has multiple ranges defined in
multiple blocks. The Tangut Supplement block, like the main Tangut
block, has character names defined by rule based on code point:
TANGUT IDEOGRAPH-<code point>.
- The Khitan Small Script is a new ideographic script, encoded for
the first time in Version 13.0.0. This is the fourth ideographic
script (after Han, Tangut, and Nushu) to use the range notation
in UnicodeData.txt and to have character names defined by rule based
on code point: KHITAN SMALL SCRIPT CHARACTER-<code point>.
- Three existing CJK unified ideographic blocks have small extensions
added at the end of the blocks. These extensions bump up the end
ranges by a few code points for each block: 13 code points for the URO,
10 code points for Extension A, and 7 code points for Extension B.
Implementers expect these kinds of extension for the URO, because they
have happened for multiple versions of the standard. However, these are the very first
such small range additions for both Extension A and Extension B.
Note that the addition for Extension A also happens to completely fill
the CJK Unified Ideographs Extension A block.
See Section 4.4, Listing of Characters Covered by the Unihan Database
in UAX #38
for the version history of all these small CJK unified ideograph additions
inside existing blocks.
- Finally, the new CJK Unified Ideographs Extension G block is the
first block of assigned characters in Plane 3, the Tertiary Ideographic Plane.
Implementers should check their assumptions about valid ranges past
U+2FFFF, to ensure that code points in the range U+30000..U+3134A are correctly handled.
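One way to reduce the risk from hard-coded ideographic ranges is to derive them from the First/Last range notation in UnicodeData.txt when tables are built. The sketch below assumes the usual `<Label, First>` and `<Label, Last>` record pairs. The sample records are illustrative, using the Extension G range cited above, and the function names are hypothetical.

```python
# Illustrative records in UnicodeData.txt range notation (an assumption to
# verify against the real file); the range matches U+30000..U+3134A above.
SAMPLE = """\
30000;<CJK Ideograph Extension G, First>;Lo;0;L;;;;;N;;;;;
3134A;<CJK Ideograph Extension G, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_ranges(text):
    """Return [(first, last, label), ...] from First/Last record pairs."""
    ranges, start = [], None
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 2:
            continue
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            # Strip the leading "<" and the trailing ", First>".
            start = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>") and start is not None:
            ranges.append((start[0], cp, start[1]))
            start = None
    return ranges

def find_range(ranges, cp):
    """Return the label of the range containing cp, or None."""
    for first, last, label in ranges:
        if first <= cp <= last:
            return label
    return None

def derived_name(prefix, cp):
    """Build a rule-based name such as TANGUT IDEOGRAPH-<code point>."""
    return f"{prefix}{cp:04X}"

ranges = parse_ranges(SAMPLE)
print(find_range(ranges, 0x30000))
print(derived_name("TANGUT IDEOGRAPH-", 0x18D00))
```

Deriving the table this way means that extensions at the end of a range are picked up automatically when the data file is updated, with no code change needed.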
The following blocks are new in Unicode 13.0.0. Check implementations
carefully for any range or property value assumptions regarding
these new blocks. See also the single-block delta charts.
Range | Block Name
10E80..10EBF | Yezidi
10FB0..10FDF | Chorasmian
11900..1195F | Dives Akuru
11FB0..11FBF | Lisu Supplement
18B00..18CFF | Khitan Small Script
18D00..18D8F | Tangut Supplement
1FB00..1FBFF | Symbols for Legacy Computing
30000..3134F | CJK Unified Ideographs Extension G
Some blocks have also had font updates; see the
single-block delta charts for details.
In such cases, careful review of the blocks in question
is advised, to ensure that there have not been any
regressions in representative glyph display.
General Issues
For current proposed updates to the particular UAXes, see
Proposed Updates for Standard Annexes
or use the links in the navigation bar on this page.
Particular issues in the UAXes may also be the focus of specific
Public Review Issues.
Each proposed textual change in a UAX is highlighted, so that you can focus
your review on those sections if you have limited time. The changes
are also listed in detail in the Modifications sections (linked from the table
of contents of each document), and are summarized in
UAX changes,
so you can check on those areas that might be of most
interest.
Some links between beta documents and the proposed
updates for UAXes will not work correctly during the
beta review period. This is a known problem which does
not need to be reported, as such links point to
the eventual final names or revision numbers for the
released versions.
Stability
Certain character properties for newly assigned characters cannot be
changed after the formal release of each version of the standard, because of the
Character Encoding Stability Policy.
Such character property values need special attention during the beta review process, as they
cannot be corrected after publication. These include:
- Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.
- The determination of whether a character is included in identifiers (XID_Start, XID_Continue).
- Case mappings and case foldings.