BETA Unicode® 15.0.0
Note: The beta review period for Unicode 15.0.0 has closed,
as of
July 12, 2022. Feedback received during the public review can be
referred to from PRI #453.
This beta review page is
left active, however, for convenience of access to the prepublication versions
of the Unicode 15.0.0 data files and annexes, until the formal release
planned for September 13, 2022.
The next version of the Unicode Standard will be Version 15.0.0, planned for release on
September 13, 2022. This version updates several annexes to deal with
segmentation issues and adds significant new repertoire.
A total of 4489 new characters are encoded, including
20 new emoji characters,
two new scripts, and multiple additions to existing blocks.
A beta version of the 15.0.0 Unicode Character Database files is available for public review.
We strongly encourage implementers to review the summary description,
download the beta 15.0.0 Unicode Character Database files,
and test their programs with the new data, well before the end of the beta period. It is especially important
to review the Notable Issues for Beta Reviewers.
We encourage users to check the code charts carefully
to verify correctness of the new characters added to Unicode 15.0.0 and to ensure
that there are no regressions
in glyph shapes for previously encoded characters.
Related Unicode Technical Standards
In addition to the Unicode Standard proper, four other Unicode Technical
Standards have significant text and data file updates that are
correlated with the new additions for Unicode 15.0.0. Review of that text
and data is also encouraged during the beta review period.
Review and Feedback
For guidance on how to focus your review, see the section
Notable Issues for Beta Reviewers.
Any feedback should be
reported using the contact form.
Comments on the Unicode Standard Version 15.0.0
or the Unicode Character Database data files should refer to the beta review
Public Review Issue #453.
Comments on specific Version 15.0.0 UAXes and UTSes should refer to the respective
Public Review Issue Numbers
for each document, where available.
The comment period ends
July 12, 2022.
All substantive technical comments must have been received by that date for
consideration at the July UTC meeting. Editorial comments (typos,
etc.) may still be submitted after that date for consideration in the final
editorial work.
Note: All beta files may be updated, replaced, or
superseded by other files at any time. The beta files will be
discarded once Unicode 15.0.0 is final. It is inappropriate to cite
these files as other than a work in progress. No
products or implementations should be released based on the beta
UCD data files—use only the final, approved Version 15.0.0 data
files, expected on September 13, 2022.
The Unicode Consortium provides early access to updated versions of the data files
and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of
Version 15.0.0.
The assignment of characters for Unicode 15.0.0 is
now stable. There will be no further
additions or modifications of code points and no further changes to character names.
Please do not submit feedback requesting changes to code points
or character names for Unicode 15.0.0, as such feedback is not actionable.
One of the main purposes of the beta review period is to verify and
correct the preliminary character property assignments in the Unicode Character
Database. Reviewers should check for property changes to existing Unicode 14.0.0
characters, as well as the property values for the new Unicode 15.0.0 character
additions. The Auxiliary
HTML charts include the new characters highlighted in yellow, with names
appearing when hovering over a cell. These charts
may be useful for reviewing information such as the default collation order,
Script property assignments, and so forth during beta review.
To facilitate verification of the property changes and additions,
diffable XML versions
of the Unicode Character Database are available. For more information,
see the
diffs.readme.txt
file.
The beta review period is a good opportunity to add support for the new
Unicode 15.0.0 characters in internal versions of software, so that software can
be tested to verify that the new characters and property assignments do not cause
problems when upgraded to Version 15.0.0 of Unicode.
Notable Issues for Beta Reviewers
Changes to Unicode Standard Annexes
Some of the Unicode Standard Annexes have modifications for
Unicode 15.0.0, often in coordination with changes to character properties.
Most notably for Unicode 15.0.0:
- There is a series of updates to UAX #31, Unicode Identifier and
Pattern Syntax, and to UTS #39, Unicode Security Mechanisms, to
clarify issues regarding identifiers in programming languages,
particularly in bidirectional contexts, as well as the use of
ZWJ and ZWNJ in identifiers. A coordinated example was also added
to UAX #9, Unicode Bidirectional Algorithm, to illustrate the appropriate
use of a higher-level protocol for identifiers in a bidirectional
context. See the Modifications sections of these three specifications
for details.
- In UAX #38, Unicode Han Database (Unihan), there have been significant
updates to the descriptions of
some data fields, and additional information is provided
about sources. UAX #38 also has updated regex values for numerous
Unihan properties.
- In UAX #45, U-Source Ideographs, a new status value has been added
for Extension H, and there is more description provided for the IDS field.
See the Modifications section of each Annex for details of the relevant changes.
Core Specification Update
The core specification is undergoing extensive review, with
numerous additions for Version 15.0.0. Although the draft text for Version 15.0.0
is not yet available, specific reports of any technical or editorial
issues in the currently published core specification
are also welcome during the beta review
period. Such reports will be taken into consideration for corrections
to the Version 15.0.0 draft. (Note: The Unicode Consortium has ongoing
opportunities for subject-matter volunteers: experts interested in contributing to or
editing relevant parts of the core specification or other Unicode specifications.)
Script-specific Issues
Two new scripts have been added in Unicode 15.0.0. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- Kawi is a Brahmic script with complex rendering rules. See the original proposal
documentation in L2/20-284
for an extensive discussion. Note also that the UTC recommendation for
handling line breaking in Kawi is to follow Western line-breaking rules,
depending on use of spaces in text, rather than depending on dictionary
lookup rules.
Numeric Property Issues
- Two new sets of decimal digits have been added, for the Kawi and Nag
Mundari scripts.
Implementations of digits will need to take those
into account.
- Kaktovik numerals have been added. This is another vigesimal
number system, similar in structure to Mayan numerals.
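As a rough illustration of what "taking the new digits into account" can involve, the sketch below maps the new decimal digits to their values and reads a string of Kaktovik numerals as a base-20 number. The zero code points (U+11F50, U+1E4F0, U+1D2C0) and the positional reading of Kaktovik numerals are assumptions, not statements from this page; check them against UnicodeData.txt before relying on them. The function names are hypothetical.

```python
# Zero-digit code points for the two new decimal digit sets.
# Assumed values; confirm them against the beta UnicodeData.txt.
NEW_DIGIT_ZEROS = {
    "Kawi": 0x11F50,
    "Nag Mundari": 0x1E4F0,
}

# Assumed code point of the Kaktovik numeral for zero (20 numerals, 0 to 19).
KAKTOVIK_ZERO = 0x1D2C0


def decimal_digit_value(ch):
    """Return 0-9 for a Kawi or Nag Mundari digit, otherwise None."""
    cp = ord(ch)
    for zero in NEW_DIGIT_ZEROS.values():
        if zero <= cp <= zero + 9:
            return cp - zero
    return None


def kaktovik_value(text):
    """Read a string of Kaktovik numerals as a base-20 positional number."""
    total = 0
    for ch in text:
        digit = ord(ch) - KAKTOVIK_ZERO
        if not 0 <= digit <= 19:
            raise ValueError(f"not a Kaktovik numeral: U+{ord(ch):04X}")
        total = total * 20 + digit
    return total
```

An implementation that already derives digit values from the UCD needs no such table; the sketch is mainly relevant where digit ranges are hard-coded.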
Unihan-related Issues
All Unihan
properties should be reviewed carefully. Additionally, the following
deserve special attention:
- A new provisional property, kAlternateTotalStrokes, has been added to Unihan. This property supplements the existing informative kTotalStrokes property with the total number of strokes for ideographs other than those with G and T source identifiers.
- Nearly 50,000 additions to the kKangXi property were derived from the kIRG_GSource and kIRGKangXi properties.
- There are large changes and additions in the values for the kDefinition, kSimplifiedVariant, kTraditionalVariant, kSemanticVariant, and kSpecializedSemanticVariant properties.
- The kCihaiT property has been moved from the Unihan_DictionaryLikeData.txt file
to the Unihan_DictionaryIndices.txt file. Parsers that assume that particular
Unihan properties are included in particular parts of the Unihan database files
will need to be updated. [Post UTC-#172 change in Unihan.]
See UAX #38 for further details on these changes, especially Section 4.2, Listing
by Date of Addition to the Unicode Standard, and Section 4.3, Listing by
Location within Unihan.zip.
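One way to avoid depending on which file carries a given property, such as kCihaiT, is to index every Unihan line by property name and ignore the file it came from. The following is a minimal sketch under that assumption: `load_unihan` is a hypothetical helper, and the sample value in the usage note is an illustrative placeholder rather than real data.

```python
from collections import defaultdict


def load_unihan(files):
    """Index Unihan data by property name, regardless of source file.

    `files` maps a file name to an iterable of lines in the usual
    tab-separated form: U+XXXX <tab> property <tab> value.
    """
    by_property = defaultdict(dict)
    for _name, lines in files.items():
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            code_point, prop, value = line.split("\t", 2)
            by_property[prop][int(code_point[2:], 16)] = value
    return by_property
```

For example, `load_unihan({"Unihan_DictionaryIndices.txt": ["U+4E00\tkCihaiT\t1.101"]})["kCihaiT"]` returns the same mapping whichever file name is used as the key, so a property moving between files does not affect callers.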
Code Charts
As always, careful review of the updated code charts for Version 15.0.0 is advised,
especially for all newly added scripts.
Particular issues to take note of include:
- There was a significant update in the fonts used for many CJK extended blocks,
to improve the design and consistency of glyphs. Details of the affected ranges
of glyphs can be found in the Glyph and Variation Sequence Changes table
on the
single block delta charts page.
- There have also been systematic updates to many glyphs in the
UCAS and
UCAS Extended
blocks, to more accurately reflect current practice, particularly for Carrier.
Collation-related Issues
The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 15.0.0
repertoire for UCA 15.0. For the most part, the additions for new scripts and other
characters are unremarkable, but implementations should be checked to ensure
the new additions do not cause problems.
Other Issues
Please also check the following specific items carefully:
- 20 new emoji characters have been added. However, in addition
to those individual characters, many new emoji sequences have been
recognized, as well. If your implementation supports emoji,
be sure to carefully review
UTS #51, Unicode Emoji
(PRI #454).
WARNING: There is a change to the end of one existing
CJK unified ideograph range in Unicode 15.0.0. Because implementations often hard-code
ideographic ranges to short-cut lookups and reduce table sizes, it is
especially important that implementers pay close attention to the
implications of range changes for Version 15.0.0. This extension bumps up the end
range of the encoded ideographs by one code point within the block:
- 1 code point for Extension C: ending at U+2B739
See Section 4.4,
Listing of Characters Covered by the Unihan Database
in UAX #38
for the version history of all these small CJK unified ideograph additions
inside existing blocks.
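A quick way to catch a stale hard-coded range is to compare it with the current data. The sketch below is illustrative only: the tables are deliberately abbreviated to the ranges mentioned on this page, the Extension C start (U+2A700) is an assumption not stated here, and `stale_ranges` and `is_ideograph` are hypothetical helper names.

```python
def stale_ranges(hard_coded, current):
    """Return names whose hard-coded (start, end) differs from current data."""
    return sorted(
        name for name, rng in current.items() if hard_coded.get(name) != rng
    )


def is_ideograph(cp, ranges):
    """Range-based shortcut of the kind implementations often hard-code."""
    return any(start <= cp <= end for start, end in ranges.values())


# Hypothetical table baked into an implementation written against 14.0.0.
HARD_CODED_14 = {"Extension C": (0x2A700, 0x2B738)}

# Abbreviated 15.0.0 table: only the ranges this page mentions.
CURRENT_15 = {
    "Extension C": (0x2A700, 0x2B739),
    "Extension H": (0x31350, 0x323AF),
}
```

With these tables, `stale_ranges(HARD_CODED_14, CURRENT_15)` flags both entries, and `is_ideograph(0x2B739, HARD_CODED_14)` is false while the same call with `CURRENT_15` is true.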
The following blocks are new in Unicode 15.0.0. Check implementations
carefully for any range or property value assumptions regarding
these new blocks. See also the single-block delta charts.
Range | Block Name
10EC0..10EFF | Arabic Extended-C
11B00..11B5F | Devanagari Extended-A
11F00..11F5F | Kawi
1D2C0..1D2DF | Kaktovik Numerals
1E030..1E08F | Cyrillic Extended-D
1E4D0..1E4FF | Nag Mundari
31350..323AF | CJK Unified Ideographs Extension H
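For spot-checking block assumptions, the table above can be turned into a small lookup. This is a sketch, not a replacement for parsing Blocks.txt: it covers only the seven new blocks, and `new_block_of` is a hypothetical name.

```python
from bisect import bisect_right

# The new 15.0.0 blocks listed above, sorted by start code point.
NEW_BLOCKS = [
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
    (0x11B00, 0x11B5F, "Devanagari Extended-A"),
    (0x11F00, 0x11F5F, "Kawi"),
    (0x1D2C0, 0x1D2DF, "Kaktovik Numerals"),
    (0x1E030, 0x1E08F, "Cyrillic Extended-D"),
    (0x1E4D0, 0x1E4FF, "Nag Mundari"),
    (0x31350, 0x323AF, "CJK Unified Ideographs Extension H"),
]
_STARTS = [start for start, _end, _name in NEW_BLOCKS]


def new_block_of(cp):
    """Return the name of the new block containing cp, or None."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= NEW_BLOCKS[i][1]:
        return NEW_BLOCKS[i][2]
    return None
```

Running existing code over every code point for which `new_block_of` returns a name is one way to surface assumptions that these ranges are unassigned.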
The new Arabic block, Arabic Extended-C, defaults the entire range
of code points in the block, 10EC0..10EFF, to Bidi_Class=AL. This is
a change from Unicode 14.0, in which that unassigned range defaulted
to Bidi_Class=R.
In addition to the new blocks, one existing block had a slight adjustment to its
end range. The Egyptian Hieroglyph Format Controls block range was extended by two columns to end at U+1345F, instead of U+1343F.
Implementations should be checked carefully for any hard-coded assumptions about
the end ranges of existing blocks.
Some blocks have also had font updates; see the
single-block delta charts for details.
In such cases, careful review of the blocks in question
is advised, to ensure that there have not been any
regressions in representative glyph display.
Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property. This is currently the case for DerivedBidiClass.txt. UCD file parsers will need to be updated to treat the additional @missing lines like data lines. See UAX #44 Section 4.2.10, @missing Conventions for details.
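A parser that handles multiple @missing lines might look roughly like the sketch below. It assumes the single-property line format `# @missing: START..END; Value` and that a later line overrides an earlier one where their ranges overlap; both assumptions should be checked against UAX #44 Section 4.2.10. The sample lines are illustrative, not copied from DerivedBidiClass.txt.

```python
import re

# Matches lines such as "# @missing: 0000..10FFFF; Left_To_Right".
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(\S+)"
)


def parse_missing(lines):
    """Collect (start, end, value) for every @missing line, in file order."""
    defaults = []
    for line in lines:
        m = MISSING_RE.match(line)
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return defaults


def default_value(cp, defaults):
    """Return the default for cp, letting later lines override earlier ones."""
    value = None
    for start, end, candidate in defaults:
        if start <= cp <= end:
            value = candidate
    return value


# Illustrative sample in the assumed format.
SAMPLE = [
    "# @missing: 0000..10FFFF; Left_To_Right",
    "# @missing: 10EC0..10EFF; Arabic_Letter",
]
```

With this sample, a code point in 10EC0..10EFF resolves to `Arabic_Letter`, while code points covered only by the first line resolve to `Left_To_Right`.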
The file IdnaTestV2.txt is now written with certain characters escaped using the \uXXXX and \x{XXXX} conventions. This was already documented in the file header, and the same escaping conventions were used in the earlier IdnaTest.txt file.
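Test harnesses that read IdnaTestV2.txt need to undo these escapes before comparing strings. A minimal sketch, assuming the two forms are exactly `\uXXXX` (four hex digits) and `\x{XXXX}` (one to six hex digits); `unescape` is a hypothetical helper name, and the file header remains the authority on the exact syntax.

```python
import re

# \uXXXX with exactly four hex digits, or \x{...} with one to six hex digits.
_ESCAPE_RE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x\{([0-9A-Fa-f]{1,6})\}")


def unescape(field):
    """Replace \\uXXXX and \\x{XXXX} escapes with the characters they denote."""
    def repl(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return _ESCAPE_RE.sub(repl, field)
```

Note that `chr` raises `ValueError` for values above 0x10FFFF, so malformed escapes are reported instead of being passed through silently.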
General Issues
For current proposed updates to the particular UAXes, see
Proposed Updates for Standard Annexes
or use the links in the navigation bar on this page.
Particular issues in the UAXes may also be the focus of specific
Public Review Issues.
Each proposed textual change in a UAX is highlighted, so that you can focus
your review on those sections if you have limited time. The changes
are also listed in detail in the Modifications sections (linked from the table
of contents of each document), and are summarized in
UAX changes,
so you can check on those areas that might be of most
interest.
Some links between beta documents and the proposed
updates for UAXes will not work correctly during the
beta review period. This is a known problem which does
not need to be reported, as such links point to
the eventual final names or revision numbers for the
released versions.
Stability
Certain character properties for newly assigned characters cannot be
changed after the formal release of each version of the standard, because of the
Character Encoding Stability Policy.
Such character property values need special attention during the beta review process, as they
cannot be corrected after publication. These include:
- Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.
- The determination of whether a character is included in identifiers (XID_Start, XID_Continue).
- Case mappings and case foldings.