BETA Unicode 8.0.0
Note: The beta review period for Unicode 8.0.0 has closed, as
of April 27, 2015. Feedback received during the public review can be
referred to from PRI #297. This beta review page is
left active, however, for convenience of access to the prepublication versions
of the Unicode 8.0.0 data files and annexes, until the formal release
planned for mid-June 2015.
The next version of the Unicode Standard will be Version 8.0.0, planned for release in
June, 2015. This is the first version that follows a predictable,
yearly release schedule. Its major features are the conversion of Cherokee to a bicameral script,
a different encoding model for New Tai Lue, and the addition of significant new repertoire.
A total of 7,716 new characters are encoded, including a large collection of CJK unified ideographs,
several popular emoji symbols and symbol modifiers for implementing skin tone diversity,
a new currency sign for the Georgian lari, 6 new scripts, and multiple additions to existing blocks.
A beta version of the 8.0.0 Unicode Character Database files is available for public review.
We strongly encourage implementers to review the summary description, download the beta 8.0.0 Unicode Character Database files,
and test their programs with the new data, well before the end of the beta period. It is especially important
to review the Notable Issues for Beta Reviewers.
We encourage users to check the code charts carefully
to verify correctness of the new characters added to Unicode 8.0.0 and to ensure that there are no regressions
in glyph shapes for previously encoded characters.
Summary description | Unicode 8.0.0
Unicode character database (UCD) | http, ftp
Summary of beta charts | Readme.txt
Single-block charts with yellow highlighting for new characters | delta charts
Single-block charts for all of Unicode 8.0.0 | http, ftp
Code charts - single download (98MB) | http, ftp
Auxiliary HTML charts for beta review (in preparation) | HTML charts
Related Unicode Technical Standards
In addition to the Unicode Standard proper, two other Unicode Technical
Standards have significant text and data file updates that are
correlated with the new additions for Unicode 8.0.0. Review of that text
and data is also encouraged during the beta review period.
Review and Feedback
For guidance on how to focus your review, see the section
Notable Issues for Beta Reviewers.
Any feedback should be
reported using the contact form.
Comments on the Unicode Standard Version 8.0.0
or the Unicode Character Database data files should refer to the beta review
Public Review Issue #297.
Comments on specific Version 8.0.0 UAXes and UTSes should refer to the respective
Public Review Issue Numbers
for each document, where available.
The comment period ends
April 27, 2015.
All substantive technical comments must have been received by that date for
consideration at the May UTC meeting. Editorial comments (typos,
etc.) may still be submitted after that date for consideration in the final
editorial work.
Note: All beta files may be updated, replaced, or
superseded by other files at any time. The beta files will be
discarded once Unicode 8.0.0 is final. It is inappropriate to cite
these files as other than a work in progress. No
products or implementations should be released based on the beta
UCD data files -- use only the final, approved Version 8.0.0 data
files, expected in June 2015.
The Unicode Consortium provides early access to updated versions of the data files
and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of
Version 8.0.0.
The assignment of characters for Unicode 8.0.0 is now stable. There will be no further
additions or modifications of code points and no further changes to character names.
Please do not submit feedback requesting changes to code points
or character names for Unicode 8.0.0, as such feedback is not actionable.
One of the main purposes of the beta review period is to verify and
correct the preliminary character property assignments in the Unicode Character
Database. Reviewers should check for property changes to existing Unicode 7.0.0
characters, as well as the property values for the new Unicode 8.0.0 character
additions. The Auxiliary
HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts
may be useful for reviewing information such as the default collation order,
Script property assignments, and so forth during beta review.
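As an illustration of checking for property changes to existing characters, the following minimal sketch compares the General_Category field of two listings in the semicolon-delimited UnicodeData.txt format. It is not a Consortium tool: the function names are hypothetical, the sample records are illustrative rather than copied from the beta files, and range records such as "<CJK Ideograph, First>" are not handled.

```python
def parse_general_category(text):
    """Map code point -> General_Category from UnicodeData.txt-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        # Field 0 is the code point in hex, field 2 is General_Category.
        table[int(fields[0], 16)] = fields[2]
    return table


def changed_categories(old_text, new_text):
    """Return {code point: (old_gc, new_gc)} for characters in both listings."""
    old = parse_general_category(old_text)
    new = parse_general_category(new_text)
    return {
        cp: (old[cp], new[cp])
        for cp in sorted(old.keys() & new.keys())
        if old[cp] != new[cp]
    }


# Illustrative records only (assumed for this sketch).
OLD_SAMPLE = "\n".join([
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "13A0;CHEROKEE LETTER A;Lo;0;L;;;;;N;;;;;",
])
NEW_SAMPLE = "\n".join([
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "13A0;CHEROKEE LETTER A;Lu;0;L;;;;;N;;;;AB70;",
    "AB70;CHEROKEE SMALL LETTER A;Ll;0;L;;;;;N;;;13A0;;13A0",
])

for cp, (old_gc, new_gc) in changed_categories(OLD_SAMPLE, NEW_SAMPLE).items():
    print("U+%04X: %s -> %s" % (cp, old_gc, new_gc))
```

Characters that appear only in the newer listing are deliberately ignored here, so the report is limited to reclassifications of previously encoded characters.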
To facilitate verification of the property changes and additions, diffable XML versions
of the Unicode Character Database are available. These XML
files are dated, so that people can check the details of changes that occurred
during the beta review period. The XML
files are in the http://www.unicode.org/Public/8.0.0/diffs/ directory. For more information,
see the
diffs.readme.txt
file.
The beta review period is a good opportunity to add support for the new
Unicode 8.0.0 characters in internal versions of software, so that software can
be tested to verify that the new characters and property assignments do not cause
problems when upgraded to Version 8.0.0 of Unicode.
Notable Issues for Beta Reviewers
Changes to Unicode Standard Annexes
Some of the Unicode Standard Annexes have modifications for
Unicode 8.0.0, often in coordination with changes to character properties.
Most notably for Unicode 8.0.0:
- In UAX #9, Unicode Bidirectional Algorithm,
the formal definition of the algorithm was updated to resolve certain edge cases in a manner more
consistent with the overall intent and behavior of the algorithm. The edge cases consist of
specific patterns of isolating run sequences within embeddings, paired brackets within overrides,
and nonspacing marks applied to paired brackets.
- In UAX #14, Unicode Line Breaking Algorithm,
a new rule was introduced to prevent line breaks at U+002F SOLIDUS between Hebrew letters, as the solidus
is extensively used in Hebrew to create gender-neutral verb forms.
- In UAX #29, Unicode Text Segmentation,
the exception list used in the derivation of the Grapheme_Cluster_Break property value SpacingMark
was updated in coordination with the encoding model change for New Tai Lue to visual order, and
to include two Ahom characters.
- In UAX #31, Unicode Identifier and Pattern Syntax,
recommendations were added for programming language specifications that employ case to distinguish between
different categories of lexical tokens. The recommendations are to use case folding operations, which are
guaranteed to be stable, instead of the General_Category, Lowercase, or Uppercase properties, which may change
between versions of the Unicode Standard. Text was added to draw attention to the unusual case folding of
Cherokee characters to uppercase instead of lowercase.
Table 4,
Candidate Characters for Exclusion from Identifiers, was updated by adding all 6 new scripts in Unicode 8.0.0.
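The UAX #14 change above can be pictured with a toy check. This is only a sketch of the behaviour as described on this page, not the rule text of UAX #14: a real implementation works from the Line_Break property values in the UCD, whereas the helper names here are hypothetical and Hebrew letters are approximated by the range U+05D0..U+05EA.

```python
SOLIDUS = "\u002F"


def is_hebrew_letter(ch):
    # Assumption: approximates the Hebrew letter class by U+05D0..U+05EA only.
    return "\u05D0" <= ch <= "\u05EA"


def suppressed_solidus_breaks(text):
    """Return the indexes of solidi that sit between two Hebrew letters.

    At each returned index the sketch treats a line break around the
    solidus as not allowed.
    """
    positions = []
    for i in range(1, len(text) - 1):
        if (
            text[i] == SOLIDUS
            and is_hebrew_letter(text[i - 1])
            and is_hebrew_letter(text[i + 1])
        ):
            positions.append(i)
    return positions


print(suppressed_solidus_breaks("\u05D0/\u05D1"))
print(suppressed_solidus_breaks("a/b"))
```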
Core Specification Update
The core specification is undergoing extensive review, with
numerous additions for Version 8.0.0. Although the draft text for Version 8.0.0
is not yet available, specific reports of any technical or editorial
issues in the currently published core specification
are also welcome during the beta review
period. Such reports will be taken into consideration for corrections
to the Version 8.0.0 draft. (Note: The Unicode Consortium has ongoing
opportunities for subject-matter volunteers: experts interested in contributing to or
editing relevant parts of the core specification or other Unicode specifications.)
Casing and case folding of Cherokee
The character encoding model for the Cherokee script changed from unicameral to bicameral. The conversion was done
by reclassifying all existing syllables as uppercase and adding a corresponding set of lowercase syllables.
In terms of properties, the General_Category of the existing characters changed from Other_Letter to Uppercase_Letter,
and the new characters were given the value Lowercase_Letter. A new case pair for the archaic syllable mv was also added.
The casing was chosen in order to reduce the migration cost for implementations, allowing them to preserve
the font metrics for the existing characters and limit the impact on layout. However, the formation of case pairs
by adding lowercase characters is unusual. As a result, case folding of Cherokee maps to uppercase instead of lowercase.
This mapping also has consequences for identifiers, as described in the changes to
UAX #31, Unicode Identifier and Pattern Syntax.
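The folding direction can be observed from Python, whose str.casefold() follows the Unicode case folding data bundled with the interpreter. The sketch below assumes an interpreter whose unicodedata.unidata_version is 8.0.0 or later; the helper name is hypothetical.

```python
import unicodedata

CHEROKEE_LETTER_A = "\u13A0"        # existing character, now uppercase
CHEROKEE_SMALL_LETTER_A = "\uAB70"  # new lowercase counterpart


def describe(ch):
    """Return (General_Category, case-folded form, lowercased form)."""
    return (unicodedata.category(ch), ch.casefold(), ch.lower())


major_version = int(unicodedata.unidata_version.split(".")[0])
if major_version >= 8:
    for ch in (CHEROKEE_LETTER_A, CHEROKEE_SMALL_LETTER_A, "A", "a"):
        gc, folded, lowered = describe(ch)
        print("U+%04X gc=%s casefold=U+%04X lower=U+%04X"
              % (ord(ch), gc, ord(folded), ord(lowered)))
else:
    print("This interpreter predates Unicode 8.0.0 data.")
```

For Latin, the folded and lowercased forms coincide. For Cherokee they differ, which is why code that treats "case-folded" and "lowercase" as interchangeable should be reviewed.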
Change in encoding model for New Tai Lue to visual order
The character encoding model for New Tai Lue changed from logical order, in which pre-base vowels are stored
after an initial consonant, to visual order, in which the pre-base vowels are stored before the initial consonant,
as for Thai, Lao, and Tai Viet. The model was changed to better serve the primary user community in the
Xishuangbanna region of China, who have been accumulating data entered and stored in visual order, and have been using
fonts with a visual order encoding to render it.
The encoding model change entailed a uniform General_Category reclassification of all New Tai Lue vowel signs
and tone marks from Spacing_Mark to Other_Letter, the assignment of the property value Logical_Order_Exception=Yes
to the pre-base vowels U+19B5..U+19B7 and U+19BA, and the addition of 176 pre-base vowel + initial consonant contractions
to the Default Unicode Collation Element Table.
A visual order model complicates syllable identification and the processes for searching and sorting.
Implementations switching to the visual order model can take advantage of techniques developed for processing Thai
script data to address the issues associated with visual order encoding, and data stored in logical order should be
carefully migrated.
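A naive migration of logical-order data might look like the sketch below, which moves each pre-base vowel in front of the consonant it follows. It is an illustration only: the function name is hypothetical, the consonant range U+1980..U+19AB is an assumption not stated on this page, and real data may need additional handling.

```python
import re

# Pre-base vowels named on this page: U+19B5..U+19B7 and U+19BA.
PRE_BASE_VOWELS = "\u19B5-\u19B7\u19BA"
# Assumption: initial consonants are taken to be U+1980..U+19AB.
CONSONANTS = "\u1980-\u19AB"

_LOGICAL_PAIR = re.compile(
    "([" + CONSONANTS + "])([" + PRE_BASE_VOWELS + "])"
)


def logical_to_visual(text):
    """Swap each consonant + pre-base vowel pair into vowel + consonant."""
    return _LOGICAL_PAIR.sub(r"\2\1", text)


logical = "\u1980\u19B5"  # consonant followed by pre-base vowel
visual = logical_to_visual(logical)
print(["U+%04X" % ord(c) for c in visual])
```

The transformation is not idempotent. Applied to text that is already in visual order, it can pair a consonant with the pre-base vowel of the following syllable and corrupt the data, so a migration should run exactly once over data known to be in logical order.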
Other Issues
Please also check the following specific items carefully:
- The properties Indic_Syllabic_Category and Indic_Positional_Category (renamed from Indic_Matra_Category)
were promoted from provisional to informative status. Both properties were substantially revised and expanded.
Implementations of Indic properties should be checked and upgraded carefully.
- Six new scripts were added, and the Script property of the Arabic-Indic digits U+0660..U+0669 was changed
from Common to Arabic. There have also been significant additions to the Script_Extensions property.
Implementations that process script data or use script extensions should be checked carefully.
- The Line_Break and Terminal_Punctuation properties of new punctuation marks, as well as other UCD changes listed in
UAX #44, Unicode Character Database,
should be examined closely, as they may affect some implementations.
- A total of 5,771 CJK unified ideographs were added: 9 in the main CJK Unified Ideographs block
and 5,762 in a new block, CJK Unified Ideographs Extension E.
- In the Unihan data files, over 2,800 values of the normative kIRG_JSource field were updated to reflect
the more contemporary JIS X 0213:2004 (J3, J3A, and J4) source references, replacing outdated JIS X 0212-1990 (J1)
and "Unified Japanese IT Vendors Contemporary Ideographs" (JA) source references.
- A total of 41 emoji symbols were added, including 5 symbol modifiers for implementing skin tone diversity,
U+1F3FB..U+1F3FF. Implementations should refer to the newly introduced
Draft UTR #51, Unicode Emoji for guidelines and
data for improving the interoperability of emoji characters.
- In UTS #39, Unicode Security Mechanisms,
the confusable data types SL, SA, and ML, and the corresponding mapping tables, were eliminated, leaving
only MA (mixed script, any case) as the single data type and mapping table.
Implementations should follow the guidelines in the
Migration section of UTS #39.
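As a small illustration of the emoji item above, the following sketch recognizes the five symbol modifiers U+1F3FB..U+1F3FF and groups each one with the character immediately before it. The function names are hypothetical, and the sketch does not check whether the preceding character is a valid base for a modifier; that information comes from the data associated with UTR #51.

```python
def is_skin_tone_modifier(ch):
    """True for the five symbol modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def attach_modifiers(text):
    """Split text into units, keeping a modifier with the preceding character."""
    units = []
    for ch in text:
        if (
            is_skin_tone_modifier(ch)
            and units
            and not is_skin_tone_modifier(units[-1][-1])
        ):
            units[-1] += ch
        else:
            units.append(ch)
    return units


sample = "\U0001F44D\U0001F3FD!"
print([[hex(ord(c)) for c in unit] for unit in attach_modifiers(sample)])
```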
The following blocks are new in Unicode 8.0.0. Check implementations
carefully for any range or property value assumptions regarding
these new blocks.
Range | Block Name
AB70..ABBF | Cherokee Supplement
108E0..108FF | Hatran
10C80..10CFF | Old Hungarian
11280..112AF | Multani
11700..1173F | Ahom
12480..1254F | Early Dynastic Cuneiform
14400..1467F | Anatolian Hieroglyphs
1D800..1DAAF | Sutton SignWriting
1F900..1F9FF | Supplemental Symbols and Pictographs
2B820..2CEAF | CJK Unified Ideographs Extension E
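One way to test range assumptions is to encode the table above as data and probe code points against it. The sketch below is a minimal lookup using binary search; the names NEW_BLOCKS_8_0 and new_block_of are hypothetical, and the table covers only the blocks listed here, not the full Blocks.txt.

```python
from bisect import bisect_right

# (first code point, last code point, block name), sorted by first code point.
NEW_BLOCKS_8_0 = [
    (0xAB70, 0xABBF, "Cherokee Supplement"),
    (0x108E0, 0x108FF, "Hatran"),
    (0x10C80, 0x10CFF, "Old Hungarian"),
    (0x11280, 0x112AF, "Multani"),
    (0x11700, 0x1173F, "Ahom"),
    (0x12480, 0x1254F, "Early Dynastic Cuneiform"),
    (0x14400, 0x1467F, "Anatolian Hieroglyphs"),
    (0x1D800, 0x1DAAF, "Sutton SignWriting"),
    (0x1F900, 0x1F9FF, "Supplemental Symbols and Pictographs"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
]
_STARTS = [start for start, _, _ in NEW_BLOCKS_8_0]


def new_block_of(cp):
    """Return the name of the new 8.0.0 block containing cp, or None."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        start, end, name = NEW_BLOCKS_8_0[i]
        if start <= cp <= end:
            return name
    return None


for probe in (0xAB70, 0x1F9FF, 0x2CEB0, 0x0041):
    print("U+%04X -> %s" % (probe, new_block_of(probe)))
```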
General Issues
For current proposed updates to the particular UAXes, see
Proposed Updates for Standard Annexes
or use the links in the navigation bar on this page.
Particular issues in the UAXes may also be the focus of specific
Public Review Issues.
Each proposed textual change in a UAX is highlighted, so that you can focus
your review on those sections if you have limited time. The changes
are also listed in detail in the Modifications sections (linked from the table
of contents of each document), and are summarized in
UAX changes,
so you can check on those areas that might be of most
interest.
Some links between beta documents and the proposed
updates for UAXes will not work correctly during the
beta review period. This is a known problem which does
not need to be reported, as such links point to
the eventual final names or revision numbers for the
released versions.
Stability
Certain character properties for newly assigned characters cannot be
changed after the formal release of each version of the standard, because of the
Character Encoding Stability Policy.
Such character property values need special attention during the beta review process, as they
cannot be corrected after publication. These include:
- Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.
- The determination of whether a character is included in identifiers (XID_Start, XID_Continue).
- Case mappings and case foldings.
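For reviewers who want a quick look at these stability-sensitive values, the sketch below collects them for a single character using Python's standard library. It reports whatever Unicode version the interpreter bundles (see unicodedata.unidata_version), not the beta files. The function name is hypothetical, the identifier checks only approximate XID_Start and XID_Continue through Python's own identifier rules, and Composition_Exclusion is not exposed by the unicodedata module.

```python
import unicodedata


def stability_snapshot(ch):
    """Collect stability-sensitive values for one character."""
    return {
        "decomposition": unicodedata.decomposition(ch),
        "ccc": unicodedata.combining(ch),
        # Approximation: Python identifiers also allow "_" as a start.
        "xid_start": ch != "_" and ch.isidentifier(),
        "xid_continue": ("a" + ch).isidentifier(),
        "casefold": ch.casefold(),
        "upper": ch.upper(),
        "lower": ch.lower(),
    }


for ch in ("A", "1", "\u00E9", "\u0301"):
    print("U+%04X" % ord(ch), stability_snapshot(ch))
```

Saving such snapshots for the newly assigned characters and comparing them once the final data files are adopted is one way to confirm that none of these values moved after the beta.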