BETA Unicode 5.0.0
The next version of the Unicode Standard will be Version 5.0.0.
The beta version of the documentation for Unicode 5.0.0 is located
in:
http://www.unicode.org/versions/Unicode5.0.0/
The Unicode Character Database portion is planned for release at
the end of May 2006. A beta version of the 5.0.0 Unicode Character
Database files is available for public comment.
We strongly encourage implementers
to download these files and test them with their programs, well
before the end of the beta period. These files are located in:
http://www.unicode.org/Public/5.0.0/
ftp://www.unicode.org/Public/5.0.0/
Any comments on the beta Unicode Character Database should be
reported using the Unicode
reporting form. The comment period ends May 9, 2006.
All substantive comments must be received by that date for
consideration at the next UTC meeting. Editorial comments (typos,
etc.) may be submitted after that date for consideration in the final
editorial work.
Note: Except as noted below, all beta files may be updated, replaced, or
superseded by other files at any time. Derived and extracted data
files may not always be completely in sync with the primary data files.
The beta files will be
discarded once Unicode 5.0.0 is final. It is inappropriate to cite
these files as other than a work in progress.
The Unicode Consortium provides early
access to the best known version of the data files to give reviewers and
developers as much time as possible to ensure a problem-free adoption of version
5.0.0.
Frozen Data and Data Files (March 2006)
The UTC has frozen
a subset of the Unicode Character Database
as of March 7, 2006, while still allowing beta review to continue
on other portions of the data files not affected by this
freeze.
Note that in some cases, the freeze applies to the entire
data file, which will remain untouched through the remainder
of the Unicode 5.0 beta review period. In other cases, the
freeze applies to specific properties only. In the latter
case, the defining data file might see further updates to
other properties or comment fields, but will not be changed
in any way that materially impacts the particular frozen
properties. This is to allow implementations which depend
on certain core property values to proceed with early
use of the data files, knowing that the values will be stable
and unchanged for the eventual Unicode 5.0 release.
Below is an exact list of which files and properties are now
officially frozen for Unicode 5.0.
1. Data files frozen
UnicodeData.txt (at UnicodeData-5.0.0d10.txt)
EastAsianWidth.txt (at EastAsianWidth-5.0.0d3.txt)
Scripts.txt (at Scripts-5.0.0d14.txt)
The freezing of these data files implies that all properties
defined in UnicodeData.txt are frozen, as well as the
East_Asian_Width and Script properties.
2. Other properties frozen
White_Space
Hex_Digit
Diacritic
Ideographic
Numeric_Type
Numeric_Value
White_Space, Hex_Digit, Diacritic, and Ideographic are defined
in PropList.txt. Their values will remain unchanged, but it
is possible that PropList.txt itself will change further,
because it also defines other properties that are not yet
frozen.
Numeric_Type and Numeric_Value are based on values defined
in UnicodeData.txt and Unihan.txt. They are explicitly listed in
the derived data files, DerivedNumericType.txt and
DerivedNumericValues.txt. The property freeze guarantees that
no material change will be made in those two derived data
files, although it is possible that the delta level of the
files may change to fix informative material such as comment
headers in the files. Most of the other derived data files
will, by implication, be similarly stabilized, since they
are based, for the most part, on property values from
the frozen UnicodeData.txt.
The ReadMe.txt in the 5.0.0/ucd directory has been updated
to reflect the status of the frozen files and other properties.
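An implementer relying on the freeze may want to confirm that a frozen property really is unchanged between two drafts of a file such as PropList.txt. The following is a minimal sketch of such a check, not a procedure defined by the Consortium. The function names are hypothetical, and the sample data is abbreviated and deliberately altered so that the comparison has something to report; real use would read two downloaded drafts from disk.

```python
def parse_property_file(text):
    """Map each property name to the set of code points listed for it."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props.setdefault(fields[1], set()).update(range(start, end + 1))
    return props


def changed_code_points(old_text, new_text, prop):
    """Return code points whose membership in `prop` differs between drafts."""
    old = parse_property_file(old_text).get(prop, set())
    new = parse_property_file(new_text).get(prop, set())
    return sorted(old ^ new)


# Hypothetical samples in PropList.txt layout. OLD_DRAFT omits one
# Hex_Digit range on purpose so that the check reports a difference.
OLD_DRAFT = """\
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0030..0039    ; Hex_Digit # Nd  [10] DIGIT ZERO..DIGIT NINE
"""
NEW_DRAFT = """\
# A comment header may change without affecting property values.
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0030..0039    ; Hex_Digit # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..0046    ; Hex_Digit # L&   [6] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER F
"""

if __name__ == "__main__":
    for prop in ("White_Space", "Hex_Digit"):
        diff = changed_code_points(OLD_DRAFT, NEW_DRAFT, prop)
        print(prop, ["%04X" % cp for cp in diff])
```

Comparing sets of code points rather than raw lines means that changes to comments or to the way ranges are split do not register as property changes, which matches the kind of stability the freeze describes.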
Notable Issues for Beta Testers
The beta version of the UCD includes three files which are intended to
help beta testers evaluate the changes since 4.1.0. These files are
not part of the UCD, and will be present only during the beta period.
They are informative only: while we believe that they present accurate
data, their accuracy is not guaranteed. The files are not intended to
be machine readable.
http://www.unicode.org/Public/5.0.0/diffs/
Details of the diffs are outlined in the Readme.txt file. The
4.1.0-5.0.0.nounihan.0.diffs.txt file gives only a summary, in the form
of counts of what has changed; the 4.1.0-5.0.0.nounihan.1.diffs.txt file
adds a list of the new characters and a list of the property changes for
existing characters; the 4.1.0-5.0.0.nounihan.2.diffs.txt file adds the
property changes for the new characters. None of the files covers the
Unihan properties.
These diff files will be refreshed on an ongoing basis during the beta
period, but may be slightly out of sync with the data files at any given time.
They will be discarded at the end of the beta period.
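Because the diff files are not intended to be machine readable, a tester who wants a programmatic comparison could compute one directly from two versions of UnicodeData.txt. The sketch below is one possible approach, under stated assumptions: the function names are hypothetical, range entries (the "First" and "Last" pairs) are treated as ordinary lines, and the sample data is abbreviated, with the old sample deliberately truncated and altered to exercise the comparison.

```python
def load_unicode_data(text):
    """Map code point -> list of the remaining semicolon-separated fields."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = fields[1:]
    return table


def summarize_changes(old_text, new_text):
    """Return (added, changed) code points between two UnicodeData.txt texts."""
    old = load_unicode_data(old_text)
    new = load_unicode_data(new_text)
    added = sorted(cp for cp in new if cp not in old)
    changed = sorted(cp for cp in new if cp in old and new[cp] != old[cp])
    return added, changed


# Hypothetical samples: OLD_DATA drops U+0062 and blanks the case mappings
# of U+0061 on purpose, so both kinds of difference show up.
OLD_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;;;
"""
NEW_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
"""

if __name__ == "__main__":
    added, changed = summarize_changes(OLD_DATA, NEW_DATA)
    print("added:  ", ["%04X" % cp for cp in added])
    print("changed:", ["%04X" % cp for cp in changed])
```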
In addition to the repertoire additions which need to
be checked and tested, there have been a number of significant
changes to the Unicode Character Database files and the
properties in them. In particular:
- Case mappings have been updated in instances where the case pair of a
formerly uncased character has been added. This is to allow for
case-folding stability from Unicode 5.0.0 onward. Note the case
mapping of glottal stops, in particular.
- In LineBreak.txt, in addition to classifications for all of the new
characters, a number of Southeast Asian characters
have been re-classified.
- Word_Break=ALetter is now defined in terms of Line_Break=Complex_Context (SA).
- Bidi_Mirrored has undergone revision, as per the resolution of
PRI #80.
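For the case mapping changes in the first item above, one way a tester might look for case pairs that do not round-trip is to read the simple uppercase and lowercase mappings (fields 12 and 13 of UnicodeData.txt) and list lowercase mappings that do not map back. This is only a sketch of that idea, with hypothetical function names and a three-line sample; it does not consult SpecialCasing.txt or CaseFolding.txt, and a reported character is not necessarily an error.

```python
def simple_case_maps(text):
    """Return (upper, lower) dicts built from fields 12 and 13 of UnicodeData.txt."""
    upper, lower = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed lines
        cp = int(fields[0], 16)
        if fields[12]:
            upper[cp] = int(fields[12], 16)
        if fields[13]:
            lower[cp] = int(fields[13], 16)
    return upper, lower


def one_way_lowercase(text):
    """List code points whose simple lowercase does not map back via simple uppercase."""
    upper, lower = simple_case_maps(text)
    return sorted(cp for cp, lc in lower.items() if upper.get(lc) != cp)


# Abbreviated sample in UnicodeData.txt layout. U+0130 lowercases to U+0069,
# whose uppercase is U+0049, so U+0130 is reported as a one-way mapping.
SAMPLE = """\
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
"""

if __name__ == "__main__":
    for cp in one_way_lowercase(SAMPLE):
        print("one-way lowercase mapping: U+%04X" % cp)
```

Running the same check against the 4.1.0 and 5.0.0 files and comparing the two lists would show which formerly one-way mappings now have a full case pair.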
Beta testers should take note of the following newly added
complete scripts, checking their character properties in particular
and investigating rendering and font support.
- N'Ko
- Balinese
- Phags-pa
- Phoenician
- Cuneiform
Note: Unicode 5.0.0 includes four characters from PDAM 3 of ISO/IEC
10646:2003. The publication of these characters has been accelerated to meet
requirements for implementations supporting Sindhi.
A number of Unicode Standard Annexes are being modified in
Unicode 5.0.0, and may be coordinated with changes in properties. To see the
current proposed updates to the particular UAXes, see
Technical Reports.
Particular issues in the UAXes will also be the focus of specific
Public Review Issues.