
BETA Unicode 4.0.1

The next version of the Unicode Standard will be Version 4.0.1, due for release in September, 2003. A BETA version of the updated Unicode Character Database files is available for public comment. We strongly encourage implementers to download these files and test them with their programs, well before the end of the beta period. These files are located in


Any comments on the beta Unicode Character Database should be reported using the Unicode reporting form. The comment period ends January 27, 2004. All substantive comments must be received by that date for consideration at the next UTC meeting. Editorial comments (typos, etc.) may be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 4.0.1 is final. It is inappropriate to cite these files as other than a work in progress.

New Unihan Data

The main focus of the Unicode 4.0.1 update is to make Unihan.txt available with a large number of fixes and additions since Unicode 3.2.0 -- fixes that were not available in time to be released with the Unicode Character Database for Unicode 4.0.0. Unihan.txt is available in the beta directory as a plain text file, as a gzipped file, and as a WinZipped file. For beta evaluation, please download one of the compressed versions if you can handle it, to reduce the bandwidth burden of downloading the very large uncompressed Unihan.txt text file.
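
For implementers testing against the compressed download, a minimal Python sketch along the following lines reads the data straight from the gzipped file without unpacking it first. The file name `Unihan.txt.gz` and both function names are assumptions made for illustration; use whatever name the beta directory actually provides. The sketch also assumes the usual Unihan layout of tab-separated code point (`U+XXXX`), field tag, and value, with `#` marking comment lines.

```python
import gzip
from collections import defaultdict


def parse_unihan_lines(lines):
    """Collect Unihan records as {code_point: {field_tag: value}}.

    Assumes each data line is 'U+XXXX<TAB>kTag<TAB>value'. Blank lines,
    '#' comment lines, and lines not in that layout are skipped.
    """
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3 or not parts[0].startswith("U+"):
            continue  # not a data line in the assumed layout
        code_point = int(parts[0][2:], 16)
        records[code_point][parts[1]] = parts[2]
    return dict(records)


def load_unihan_gz(path="Unihan.txt.gz"):  # hypothetical file name
    """Open a gzipped Unihan file in text mode and parse it line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_unihan_lines(handle)
```

Streaming the file through `gzip.open` keeps only the parsed records in memory, not the full uncompressed text, which matters for a file of this size.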

Other Updates

Other updates for Unicode 4.0.1 include:

  • Index.txt has been updated to correspond to the character index published as part of the Unicode Standard, Version 4.0.
  • UnicodeData.txt has been updated with a very minor fix to remove a trailing space in two character name fields:
    < 0615;ARABIC SMALL HIGH TAH ;Mn;230;NSM;;;;;N;;;;;
    > 0615;ARABIC SMALL HIGH TAH;Mn;230;NSM;;;;;N;;;;;
    (Note: This fix is not formally necessary, since UCD.html makes it clear that data from fields is to be considered without leading or trailing spaces.)
  • ArabicShaping.txt has been updated with a very minor fix -- moving one line in the file to put it in the expected code point order.
  • PropertyAliases.txt has been updated to add two new property aliases for newly defined properties: Sentence_Terminal (based on UAX #29), and Variation_Selector. The property aliases have also been rearranged into somewhat more meaningful categories.
  • PropList.txt has been updated to add the explicit definition of the Sentence_Terminal and Variation_Selector properties. The character assignments for the Other_Math property have also been extensively modified, to reflect the UTC decision to bring the math property more closely into alignment with the discussion of math characters in UTR #25, Unicode Support for Mathematics.
  • DerivedCoreProperties.txt has been regenerated to reflect the changes in PropList.txt. The exact list of characters with the Math property is significantly different, of course. Note also that Variation Selector characters have been added, via the Variation_Selector property, into the derivation of Default_Ignorable_Code_Point, and that noncharacters have been given that property as well.
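
The note on the UnicodeData.txt item above says that field data is to be read without leading or trailing spaces. As a hedged illustration of what that means for a parser, the sketch below (the helper name `parse_ucd_fields` is hypothetical) splits a semicolon-delimited line and trims each field, so the old and new forms of the U+0615 line yield identical data.

```python
def parse_ucd_fields(line):
    """Split one semicolon-delimited UCD line into whitespace-trimmed fields."""
    return [field.strip() for field in line.rstrip("\r\n").split(";")]


# The two forms of the line quoted in the list above.
old_line = "0615;ARABIC SMALL HIGH TAH ;Mn;230;NSM;;;;;N;;;;;"
new_line = "0615;ARABIC SMALL HIGH TAH;Mn;230;NSM;;;;;N;;;;;"

old_fields = parse_ucd_fields(old_line)
new_fields = parse_ucd_fields(new_line)

# With trimming, the trailing space in the name field makes no difference.
print(old_fields == new_fields)
print(repr(new_fields[1]))
```

A parser that compares raw, untrimmed fields would instead see the two lines as different, which is one reason to test against the beta files.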

Known Issues

In Unihan.txt, decompositions for some CJK compatibility characters have not yet been updated to match Technical Corrigenda #3 and #4. Some of the compatibility mappings in Unihan.txt need to be updated, and some Mandarin readings (in the kMandarin field) need to be renormalized. Unihan.txt still needs to have its IRG sources synchronized with 10646:2003. The "RELEASE NOTES" and "KNOWN ERRORS" sections of Unihan.txt list corrections and known errors.