Frequently Asked Questions

Unicode Character Database

Q: What is the Unicode Character Database?

A: It is a set of data files defining character properties and other information about Unicode characters. It is commonly known by the acronym "UCD".
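
As an informal illustration of the kind of information the UCD carries, the sketch below looks up a few properties using Python's standard unicodedata module, which is generated from a UCD release. The helper name describe is made up for this example, and the module exposes only a subset of the properties in the UCD.

```python
import unicodedata


def describe(ch):
    """Return a few UCD-derived properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }


if __name__ == "__main__":
    # The UCD version depends on the Python build, so it is printed, not assumed.
    print("UCD version bundled with this Python:", unicodedata.unidata_version)
    print(describe("A"))
```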

Q: Where can I find the Unicode Character Database?

A: The latest version of the UCD is always found online at: http://www.unicode.org/Public/UCD/latest/.

Q: Where can I find general information about the Unicode Character Database?

A: See About the Unicode Character Database.

Q: O.k., but I need detailed information about the data file structure and character properties. Where do I find that?

A: Unicode Standard Annex #44, Unicode Character Database, provides the detailed documentation for the UCD. It covers the file formats, the specialized files (including the test data files), and each character property defined in the UCD.
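
Many of the UCD text files share a simple line-oriented layout: fields separated by semicolons, with "#" starting a comment. The following is a minimal sketch of reading that layout, not a full implementation of everything the annex specifies. The function names and the SAMPLE text are invented for this example; SAMPLE mixes lines in the style of UnicodeData.txt and Scripts.txt purely for illustration.

```python
SAMPLE = """\
# Comment-only lines and blank lines are skipped.
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

0041..005A    ; Latin # trailing comments are stripped
"""


def parse_ucd_lines(text):
    """Yield one list of whitespace-trimmed fields per data line."""
    for raw in text.splitlines():
        # Drop everything from the first '#' onward, then surrounding spaces.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        yield [field.strip() for field in line.split(";")]


def code_point_range(field):
    """Turn '0041' or '0041..005A' into an inclusive (first, last) pair."""
    first, _, last = field.partition("..")
    return int(first, 16), int(last or first, 16)


if __name__ == "__main__":
    for fields in parse_ucd_lines(SAMPLE):
        print(code_point_range(fields[0]), fields[1:3])
```

Real files differ in how many fields they carry and what each field means, so the per-file descriptions in the annex remain the authority on interpretation.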

Q: Whoa! Some of those file formats are pretty arcane and hard to parse. Can't you provide this character data in a format that can be parsed with standard tools?

A: Of course! Starting with Unicode 5.1, the entire UCD is also available in XML format. The XML version is available in the versioned directory for each release. The latest version of the XML files can always be found at: http://www.unicode.org/Public/UCD/latest/ucdxml/.

Q: Where is the documentation for the XML data representation?

A: Start with the readme.txt, which explains what each of the zipped XML data files contains: http://www.unicode.org/Public/UCD/latest/ucdxml/.
The detailed specification of the attributes and other conventions used in the XML can be found in Unicode Standard Annex #42, Unicode Character Database in XML.
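
For a rough idea of what reading the XML looks like with standard tools, here is a small sketch using Python's xml.etree.ElementTree. The SAMPLE_XML document is hand-written for this example and carries only three attributes per character (cp, na, gc); the published files carry many more attributes and can also describe ranges of code points, which this sketch does not handle. The function name read_chars is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD XML elements, per the annex.
NS = {"ucd": "http://www.unicode.org/ns/2003/ucd/1.0"}

# Hand-written sample in the shape of the UCD XML; not real release data.
SAMPLE_XML = """\
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <description>Illustrative sample only</description>
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
  </repertoire>
</ucd>
"""


def read_chars(xml_text):
    """Map each code point to its (name, general category) attributes."""
    root = ET.fromstring(xml_text)
    result = {}
    for char in root.findall("ucd:repertoire/ucd:char", NS):
        cp = char.get("cp")
        if cp is None:
            # Entries without a single code point (e.g. ranges) are skipped.
            continue
        result[int(cp, 16)] = (char.get("na"), char.get("gc"))
    return result


if __name__ == "__main__":
    for cp, (name, gc) in read_chars(SAMPLE_XML).items():
        print(f"U+{cp:04X} {gc} {name}")
```

For the full-size files, a streaming parser such as ET.iterparse may be a better fit than loading the whole document at once.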

Q: Why don't you provide an XML Schema for the XML representation of the UCD?

A: We found that developing a RELAX NG schema (an ISO standard, by the way) is considerably simpler than developing a W3C XML Schema. Furthermore, there are tools to convert from RELAX NG to XML Schema (for example, trang, available from thaiopensource.com), should the need arise. It is also worth noting that an XML Schema is not required for the proper interpretation of the data for the Unicode Character Database, because there are no default values provided. [EM]

Q: Are there other FAQs which deal with Unicode character properties?

A: Yes, questions about particular character properties might be answered at Character Properties, Case Mappings & Names.

Q: What about older versions of the Unicode Character Database?

A: All older versions of the UCD, which are formally a part of earlier releases of the Unicode Standard, are permanently archived on the Unicode web site. They can be found by following the links to component listings for specific versions at: http://www.unicode.org/versions/enumeratedversions.html.