
BETA Unicode 5.0.0

The next version of the Unicode Standard will be Version 5.0.0. The beta version of the documentation for Unicode 5.0.0 is located in:

http://www.unicode.org/versions/Unicode5.0.0/

The Unicode Character Database portion is planned for release at the end of May 2006. A beta version of the 5.0.0 Unicode Character Database files is available for public comment. We strongly encourage implementers to download these files and test them with their programs, well before the end of the beta period. These files are located in:

http://www.unicode.org/Public/5.0.0/
ftp://www.unicode.org/Public/5.0.0/

Any comments on the beta Unicode Character Database should be reported using the Unicode reporting form. The comment period ends May 9, 2006. All substantive comments must be received by that date for consideration at the next UTC meeting. Editorial comments (typos, etc.) may be submitted after that date for consideration in the final editorial work.

Note: Except as noted below, all beta files may be updated, replaced, or superseded by other files at any time. Derived and extracted data files may not be completely in sync with the primary data files at all times. The beta files will be discarded once Unicode 5.0.0 is final. It is inappropriate to cite these files as other than a work in progress.

The Unicode Consortium provides early access to the best known version of the data files to give reviewers and developers as much time as possible to ensure a problem-free adoption of version 5.0.0.

Frozen Data and Data Files (March 2006)

The UTC has frozen a subset of the Unicode Character Database as of March 7, 2006, while still allowing beta review to continue on other portions of the data files not affected by this freeze.

Note that in some cases, the freeze applies to the entire data file, which will remain untouched through the remainder of the Unicode 5.0 beta review period. In other cases, the freeze applies to specific properties only. In the latter case, the defining data file might see further updates to other properties or comment fields, but will not be changed in any way that materially impacts the particular frozen properties. This is to allow implementations which depend on certain core property values to proceed with early use of the data files, knowing that the values will be stable and unchanged for the eventual Unicode 5.0 release.

Below is an exact list of which files and properties are now officially frozen for Unicode 5.0.

1. Data files frozen

UnicodeData.txt (at UnicodeData-5.0.0d10.txt)
EastAsianWidth.txt (at EastAsianWidth-5.0.0d3.txt)
Scripts.txt (at Scripts-5.0.0d14.txt)

The freezing of these data files implies that all properties defined in UnicodeData.txt are frozen, as well as the East_Asian_Width and Script properties.
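For implementers who want to start loading the frozen UnicodeData.txt early, the following is a minimal Python sketch of a reader for its semicolon-separated records. It assumes the usual 15-field layout (code point, name, General_Category, and so on, with the simple uppercase and lowercase mappings in fields 12 and 13). The names `UcdRecord`, `parse_unicode_data`, and `SAMPLE` are hypothetical, the sample lines are illustrative rather than copied from the beta file, and range records such as `<CJK Ideograph, First>` are not expanded.

```python
from typing import Dict, Iterable, NamedTuple


class UcdRecord(NamedTuple):
    """A few fields of one UnicodeData.txt record (hypothetical helper type)."""
    code_point: int
    name: str
    general_category: str
    simple_uppercase: str  # hex string, empty if there is no mapping
    simple_lowercase: str  # hex string, empty if there is no mapping


def parse_unicode_data(lines: Iterable[str]) -> Dict[int, UcdRecord]:
    """Parse UnicodeData.txt-style lines into records keyed by code point."""
    records: Dict[int, UcdRecord] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Assumption: every record has exactly 15 fields.
        if len(fields) != 15:
            raise ValueError(f"expected 15 fields, got {len(fields)}: {line!r}")
        code_point = int(fields[0], 16)
        records[code_point] = UcdRecord(
            code_point, fields[1], fields[2], fields[12], fields[13]
        )
    return records


# Illustrative records in UnicodeData.txt format; check them against the real file.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]

records = parse_unicode_data(SAMPLE)
print(records[0x41].name, "->", records[0x41].simple_lowercase)
```

To read the actual file, pass an open file object instead of `SAMPLE`, for example `parse_unicode_data(open("UnicodeData.txt", encoding="utf-8"))`.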

2. Other properties frozen

White_Space
Hex_Digit
Diacritic
Ideographic
Numeric_Type
Numeric_Value

White_Space, Hex_Digit, Diacritic, and Ideographic are defined in PropList.txt. Their values will remain unchanged, but it is possible that PropList.txt itself will change further, because it also defines other properties that are not yet frozen.
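Because PropList.txt mixes frozen and unfrozen properties, an implementation that wants to rely only on the frozen ones can filter by property name while reading the file. The sketch below assumes the usual `range ; Property # comment` line layout. The names `FROZEN_PROPLIST`, `parse_property_file`, and `SAMPLE_PROPLIST` are hypothetical, and the sample lines are illustrative rather than copied from the beta file.

```python
from typing import Dict, Iterable, Set

# The four PropList.txt properties listed above as frozen.
FROZEN_PROPLIST = {"White_Space", "Hex_Digit", "Diacritic", "Ideographic"}


def parse_property_file(lines: Iterable[str], wanted: Set[str]) -> Dict[str, Set[int]]:
    """Collect the code points of each wanted property, ignoring all others."""
    result: Dict[str, Set[int]] = {name: set() for name in wanted}
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] not in result:
            continue  # malformed line or a property we do not rely on
        range_part, prop = fields[0], fields[1]
        if ".." in range_part:
            first, last = range_part.split("..")
        else:
            first = last = range_part
        result[prop].update(range(int(first, 16), int(last, 16) + 1))
    return result


# Illustrative excerpt in PropList.txt format; check it against the real file.
SAMPLE_PROPLIST = [
    "# illustrative excerpt",
    "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>",
    "0020          ; White_Space # Zs       SPACE",
    "0030..0039    ; Hex_Digit # Nd  [10] DIGIT ZERO..DIGIT NINE",
    "002D          ; Dash # Pd       HYPHEN-MINUS",
]

props = parse_property_file(SAMPLE_PROPLIST, FROZEN_PROPLIST)
print(sorted(props["White_Space"]))
```

Filtering at parse time keeps later changes to the unfrozen properties in PropList.txt from leaking into code that only needs the frozen values.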

Numeric_Type and Numeric_Value are based on values defined in UnicodeData.txt and Unihan.txt. They are explicitly listed in the derived data files, DerivedNumericType.txt and DerivedNumericValues.txt. The property freeze guarantees that no material change will be made in those two derived data files, although it is possible that the delta level of the files may change to fix informative material such as comment headers in the files. Most of the other derived data files will, by implication, be similarly stabilized, since they are based, for the most part, on property values from the frozen UnicodeData.txt.
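One way to check that a newer draft of a derived file differs from an earlier one only in informative material is to compare the data entries after stripping comments and whitespace. The following is a rough sketch under that assumption. The function names `data_lines` and `material_changes` are hypothetical, and the two drafts are invented examples in the style of DerivedNumericType.txt.

```python
from typing import Iterable, List, Set, Tuple


def data_lines(lines: Iterable[str]) -> Set[str]:
    """Return the data entries of a UCD-style file, without comments or padding."""
    entries: Set[str] = set()
    for raw in lines:
        data = raw.split("#", 1)[0]  # comments are informative only
        fields = [f.strip() for f in data.split(";")]
        if any(fields):  # skip blank and comment-only lines
            entries.add(";".join(fields))
    return entries


def material_changes(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (entries removed, entries added) between two drafts."""
    old, new = data_lines(old_lines), data_lines(new_lines)
    return sorted(old - new), sorted(new - old)


# Invented drafts: only the header and a comment differ between them.
draft_a = [
    "# draft A header",
    "0030..0039    ; Decimal # Nd  [10] DIGIT ZERO..DIGIT NINE",
]
draft_b = [
    "# draft B header, reworded",
    "0030..0039 ; Decimal # comment text changed",
]

print(material_changes(draft_a, draft_b))  # both lists are empty here
```

This comparison works line by line, so a draft that splits or merges ranges without changing any value would still be reported as changed. Expanding ranges to individual code points before comparing would avoid that.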

The ReadMe.txt in the 5.0.0/ucd directory has been updated to reflect the status of the frozen files and other properties.

Notable Issues for Beta Testers

The beta version of the UCD includes three files intended to help beta testers evaluate the changes since 4.1.0. These files are not part of the UCD and will be present only during the beta period. They are informative only: we believe the data they present is accurate, but there is no guarantee that it is. The files are not intended to be machine readable.

http://www.unicode.org/Public/5.0.0/diffs/

Details of the diffs are outlined in the Readme.txt file. The 4.1.0-5.0.0.nounihan.0.diffs.txt file gives only a summary, in the form of the number of things which have changed; the 4.1.0-5.0.0.nounihan.1.diffs.txt file adds a list of the new characters, and a list of the property changes for existing characters; the 4.1.0-5.0.0.nounihan.2.diffs.txt file adds the property changes for the new characters. None of the files covers the Unihan properties.

These diff files will be refreshed on an ongoing basis during the beta period, but may be slightly out of sync with the data files at any given time. They will be discarded at the end of the beta period.

In addition to the repertoire additions which need to be checked and tested, there have been a number of significant changes to the Unicode Character Database files and the properties in them. In particular:

  • Case mappings have been updated in instances where the case pair of a formerly uncased character has been added. This is to allow for case-folding stability from Unicode 5.0.0 forwards. Note the case mapping of glottal stops, in particular.
  • In LineBreak.txt, in addition to classifications for all of the new characters, a number of Southeast Asian characters have been re-classified.
  • Word_Break=ALetter is now defined in terms of Line_Break=Complex_Context (SA).
  • Bidi_Mirrored has undergone revision, as per the resolution of PRI #80.
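As a quick sanity check of the case-pair behaviour mentioned in the first item, the sketch below tests whether an uppercase/lowercase pair round-trips. It assumes the glottal stop pair in question is U+0241 and U+0242, and it uses the Unicode data built into the Python runtime, which is a later released version and not the beta files. The helper name `is_stable_case_pair` is hypothetical.

```python
import unicodedata

# Assumption: these are the glottal stop characters the note refers to.
CAPITAL_GLOTTAL_STOP = "\u0241"  # LATIN CAPITAL LETTER GLOTTAL STOP
SMALL_GLOTTAL_STOP = "\u0242"    # LATIN SMALL LETTER GLOTTAL STOP


def is_stable_case_pair(upper: str, lower: str) -> bool:
    """True if lowercasing and uppercasing map the two characters to each other."""
    return upper.lower() == lower and lower.upper() == upper


# The result reflects the runtime's Unicode version, printed here for reference.
print("Unicode data version:", unicodedata.unidata_version)
print("Round-trips:", is_stable_case_pair(CAPITAL_GLOTTAL_STOP, SMALL_GLOTTAL_STOP))
```

To test against the beta data itself, the same round-trip check can be run over the simple case mapping fields parsed from UnicodeData.txt instead of the runtime's tables.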

Beta testers should take note of the following newly added complete scripts, checking their character properties in particular and investigating rendering and font support.

  • N'Ko
  • Balinese
  • Phags-pa
  • Phoenician
  • Cuneiform

Note: Unicode 5.0.0 includes four characters from PDAM 3 of ISO 10646:2003. The publication of these characters has been accelerated to meet requirements for implementations supporting Sindhi.

A number of Unicode Standard Annexes (UAXes) are being modified in Unicode 5.0.0, and the modifications may be coordinated with changes in properties. To see the current proposed updates to particular UAXes, see Technical Reports. Particular issues in the UAXes will also be the focus of specific Public Review Issues.