Technical Reports |
Author | Richard Cook |
Date | 2006-12-12 |
This Version | http://www.unicode.org/reports/tr43/tr43-0.html |
Previous Version | None |
Latest Version | http://www.unicode.org/reports/tr43/ |
Tracking Number | 0 |
This document describes the organization and content of the UniTangut Database.
This document is a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
[Note to reviewers: This document describes the UniTangut.txt data set, proposed for addition to the UCD. The proposal and associated data have not yet been submitted, reviewed, revised, or accepted, and consequently this TR is a pre-proposed draft. This document was constructed as a merger of Unihan.html 5.0 and tr38-3.html, first by globally changing all references to UniHan to UniTangut, and then rewriting, rearranging, and adding content as needed. In general, please consider this document as a draft of a template which might also be more generally applied to ongoing revisions to and consolidation of Unihan documentation, and to documentation of mapping data for other large character sets (e.g. Egyptian Hieroglyphs).]
The UniTangut Database is the repository for the Unicode Consortium’s collective knowledge regarding the Tangut character block of the Unicode Standard. The UniTangut Database contains mapping data linking the encoded characters to the primary print-sources and legacy encodings. The UniTangut Database is modelled after the Unihan Database (documented in TR38), and employs the same structural conventions, allowing the same data management tools to be used on both data sets.
This document is a guide to the UniTangut Database, describing the mechanics of the database, the nature of its contents, and the status of its various fields. The UniTangut Database is a work in progress: existing data is being refined, and new data is being added on a regular basis.
The UniTangut Database exists in three forms, two of which are available to the public:
UniTangut.txt and documentation, distributed as part of the Unicode Standard (public)
The structures and relations among these forms of the UniTangut Database are described in Section 2; general Property Types (Status and Category) are described in Section 3; and Property Metadata Types are outlined in Section 4.
The working copy of the UniTangut Database is maintained by the Unicode Consortium. The two public versions are reflections of this data at the time of a version release.
As with Unihan, the master (working) copy of the UniTangut data is stored in an SQL database with two main tables (joined on their tag fields):
For public release, the above two tables are exported to a pair of tab-delimited UTF-8 files:
UniTangut.txt file, and also serves as the basis for statistical information included in the Property Metadata;
Most UniTangut database fields in the master SQL database are made available in the public releases. Fields not part of a public release are of several types:
When the UniTangut Database has been publicly released, the release version of the data serves as input to a second SQL database, used for the online browser-based query system. It is important to note that this searchable version of the database is identical in content to the version release. End users of the online browser-based query system do not query the working copy of the UniTangut Database.
The searchable web interface to Unicode’s Tangut data is available through the Main UniTangut Data Portal.
The public UniTangut.txt property list file is encoded in UTF-8. The file consists of one or more header comment lines (/^#/) followed by lines of data; a trailing comment (/^#/) ends the whole file, giving the overall line count of the file (including all comment lines). The file has Unix line breaks (U+000A).
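The layout described above (leading comment lines, data lines, trailing comment) can be separated with a few lines of Python. This is a minimal sketch; the function name `split_sections` and the wording of the sample header and trailer are assumptions made for illustration, not taken from the released file.

```python
def split_sections(text):
    """Split file text into (header comments, data lines, trailing comments)."""
    lines = text.split("\n")  # the file uses Unix line breaks (U+000A)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty string left after the final line break
    start = 0
    while start < len(lines) and lines[start].startswith("#"):
        start += 1
    end = len(lines)
    while end > start and lines[end - 1].startswith("#"):
        end -= 1
    return lines[:start], lines[start:end], lines[end:]


# Hypothetical sample: the real header and trailer wording may differ.
sample = (
    "# UniTangut.txt (illustrative header)\n"
    "U+17000\ttB5\tFA40\n"
    "U+17000\ttSN\t1\n"
    "# EOF (illustrative trailer)\n"
)
header, data, trailer = split_sections(sample)
print(len(header), len(data), len(trailer))  # 1 2 1
```

The line count reported in the trailing comment could then be compared with `len(header) + len(data) + len(trailer)`, once the exact wording of that comment is known.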
The UniTangut.txt text file consists of nearly two million bytes of data in roughly 100,000 lines, covering all 5,805 encoded Tangut characters.
Each line (record) of the file UniTangut.txt consists of three tab-separated fields. Field 1 (the code point) matches /^U\+[0-9A-F]{5}$/; that is, there are five hex digits following the U+ prefix.
The following table shows example records from UniTangut.txt for the example Code Point U+17000 (Field 1); each line provides a Property Value (Field 3) for a unique Property Tag (Field 2):
Field 1 | Field 2 | Field 3 |
---|---|---|
Code Point | Property Tag | Property Value |
U+17000 | tB5 | FA40 |
U+17000 | tNevsky | I-178 |
U+17000 | tNishida | 1-051 |
U+17000 | tPUA | U+E000 |
U+17000 | tSN | 1 |
U+17000 | tSofronov | 1075 |
U+17000 | tTY | 0001 |
U+17000 | tTYBH | 5-6 |
U+17000 | tTYBS | 1 |
U+17000 | tTYP | 9 |
U+17000 | tTYYY | 1.43 |
U+17000 | tTYYZ | 53B18 |
U+17000 | tUNI | 17000 |
U+17000 | tWHYJ | 53.171 |
U+17000 | tWenHai | 1460 |
U+17000 | tXiaHan | 0100 |
U+17000 | tYTYL | 2126.10 |
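As an illustration of the three-field record structure, the following sketch groups records by code point and tag. The helper name `parse_records` is hypothetical; the sample rows are copied from the table above.

```python
from collections import defaultdict


def parse_records(lines):
    """Group tab-separated records as {code point: {tag: value}}."""
    table = defaultdict(dict)
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip comment lines and blank lines
        code_point, tag, value = line.split("\t", 2)
        table[code_point][tag] = value
    return dict(table)


rows = [
    "U+17000\ttB5\tFA40",
    "U+17000\ttNevsky\tI-178",
    "U+17000\ttTYYZ\t53B18",
]
entry = parse_records(rows)["U+17000"]
print(entry["tTYYZ"])  # 53B18
```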
The UniTangut Database consists of a number of fields containing data for each Tangut character in the Unicode Standard. The field names consist entirely of ASCII letters and digits with no spaces or other punctuation except for underscore. On the model of Unihan.txt (which for historical and perhaps other reasons uses a lower-case k [= Kanji?] field name prefix), all UniTangut.txt field names start with a lower-case t.
This general naming convention (/^[tk][A-Z0-9_]+$/i) is admittedly unnecessary in an XML UCD, but nevertheless provides some redundancy in the tab-delimited text files, and also emphasizes the fact that UniTangut.txt is intended to be managed with trivial (or no) modification to existing Unihan.txt processing tools. This may prove useful in simplifying migration of these parts of the UCD to XML.
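A field-name check along these lines might look as follows in Python. The pattern in this sketch admits digits as well as letters and underscore, on the assumption that tags such as tB5 are meant to be valid; `is_tag_name` is a hypothetical helper.

```python
import re

# Assumption: digits are admitted so that tags such as tB5 pass.
TAG_NAME = re.compile(r"^[tk][A-Z0-9_]+$", re.IGNORECASE)


def is_tag_name(name):
    """True if name follows the t/k prefix naming convention."""
    return TAG_NAME.match(name) is not None


names = ("tB5", "tWenHai", "kMandarin", "B5", "t-B5")
print([is_tag_name(n) for n in names])  # [True, True, True, False, False]
```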
For most mapping sources, if multiple values are possible in Field 3, the values are separated by spaces (but see Section 4: Syntax). No Tangut character may have more than one instance of a given field associated with it, and no empty fields are included in the UniTangut.txt file. Each code point in the block may occur at the head (Field 1) of one or more records (lines), depending on the number of fields (tags) with records for that code point.
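The two constraints above (at most one record per code point and tag, and space-separated multiple values) can be checked with a short sketch. The function names are hypothetical, and the duplicate record and the multi-valued string are invented for illustration only; they are not database content.

```python
def find_duplicate_keys(records):
    """Return (code point, tag) pairs that occur on more than one record."""
    seen, duplicates = set(), []
    for code_point, tag, _value in records:
        key = (code_point, tag)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def split_values(value):
    """Split a space-separated Field 3 into its individual values."""
    return value.split(" ")


records = [
    ("U+17000", "tSN", "1"),
    ("U+17000", "tTY", "0001"),
    ("U+17000", "tTY", "0002"),  # hypothetical duplicate, for illustration only
]
print(find_duplicate_keys(records))  # [('U+17000', 'tTY')]
print(split_values("0001 0002"))     # ['0001', '0002'] (hypothetical value)
```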
There is no formal limit on the lengths of any of the field values. Any Unicode characters may be used in the field values except for unescaped control characters (especially tab, newline, and carriage return). Most fields have a more restricted Syntax.
The data lines are sorted with code point as primary sort key, and field-name as secondary sort key. If the property value itself is structured, its values may also be sorted according to a sorting method detailed in the property description.
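The sort order can be verified with a two-part key, sketched below. It assumes that code points compare numerically and that field names compare in plain code point (case-sensitive) order, which is consistent with the example records, where tWHYJ precedes tWenHai; `sort_key` and `is_sorted` are hypothetical names.

```python
def sort_key(record):
    """Primary key: numeric code point; secondary key: field name."""
    code_point, tag, _value = record
    return (int(code_point[2:], 16), tag)


def is_sorted(records):
    keys = [sort_key(r) for r in records]
    return keys == sorted(keys)


in_order = [
    ("U+17000", "tUNI", "17000"),
    ("U+17000", "tWHYJ", "53.171"),
    ("U+17000", "tWenHai", "1460"),
]
print(is_sorted(in_order))        # True
print(is_sorted(in_order[::-1]))  # False
```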
The header comment lines contain very limited metadata regarding the file itself, including the file name, version, date of production, and a pointer to the main documentation (→ you are here ←).
Ranges of Tangut code points valid for Field 1 of UniTangut.txt are listed in the following table:
Code point range | Block name | Release |
---|---|---|
U+17000 .. U+186AC | TANGUT | unassigned |
U+186AD .. U+186FF | TANGUT EXTENSION A | unproposed |
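A range check for Field 1 based on the table above can be sketched as follows. The ranges are copied from the table, which marks them as unassigned or unproposed, so they may change; `in_field1_range` is a hypothetical helper.

```python
import re

FIELD1 = re.compile(r"^U\+[0-9A-F]{5}$")

# Ranges from the table above (draft values, subject to change).
TANGUT_RANGES = [
    (0x17000, 0x186AC),  # TANGUT
    (0x186AD, 0x186FF),  # TANGUT EXTENSION A
]


def in_field1_range(code_point):
    """True if code_point is well formed and falls in a listed range."""
    if not FIELD1.match(code_point):
        return False
    value = int(code_point[2:], 16)
    return any(low <= value <= high for low, high in TANGUT_RANGES)


print(in_field1_range("U+17000"), in_field1_range("U+18700"))  # True False
print(0x186AC - 0x17000 + 1)  # 5805 code points in the TANGUT range
```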
Note that Tangut characters in the following ranges do not have mapping data in Field 1 of UniTangut.txt (though they may at some future time):
Code point range | Block name | Release |
---|---|---|
N.A. | TANGUT RADICALS | unproposed |
N.A. | TANGUT RADICALS SUPPLEMENT | unproposed |
N.A. | TANGUT STROKE TYPES | unproposed |
N.A. | TANGUT COMPATIBILITY CHARACTERS | unproposed |
Future incarnations of the public UniTangut Database release may include an XML representation of the data and metadata (UniTangut.xml).
Each UniTangut Database field (a.k.a. tag, property) is classified according to the formal Status of this property within the Unicode Standard, as determined by the UTC.
Each field is also classified by usage Category, according to the purpose it serves.
We provide here a general discussion of these two basic classifications (UniTangut tags, by Status and by Category), followed by an overview of Property Metadata, and detailed descriptions of the individual Properties, alphabetically arranged.
Note that all data in the UniTangut Database has been donated to the Unicode Consortium, and that proofing, augmentation and publication of the data is an ongoing process, subject to available resources. If data satisfying a certain need is not currently present in the database, end users are encouraged to contribute well-documented data for possible inclusion.
Each property has a formal Status, as determined by the UTC.
In the list of UniTangut properties (fields, tags) given below, each property is assigned a formal status. Only a few UniTangut properties may eventually correspond to Unicode Normative or Informative properties; at present all are Provisional. For information on the meanings of the Normative, Informative and Provisional Status flags, see definitions D33, D35, and D36 in Chapter 3 Properties of Unicode 5.0 [U5.0]. For information on properties and on the general structure of the Unicode Character Database, see UCD.html.
Status | Properties with this status |
---|---|
Normative | none |
Informative | none |
Provisional | all |
Each property is also assigned to one or more functional Categories, according to the presumed utility of the field data. We distinguish the following usage categories for fields.
Category | Category Description | Properties in this category |
---|---|---|
Dictionary Indices | References to primary lexical treatments of this script entity. | tTYYZ, tTY, tWenHai, tXiaHan, tSofronov, tKychanov, etc. |
Dictionary-like Data | Data derived from primary lexical sources, including phonologic, gloss, frequency, etc. | tTYYY, tTYP, etc. |
WG Mappings | Mappings established by a WG2 working group. | pending (a “TRG” might expand this character set, add mappings, and resolve variant issues for non-extinct users) |
Numeric Values | The numeric value(s) of a character with this property. | pending |
Other Mappings | Mappings to legacy encodings. | tB5, tPUA |
Radical-Stroke Counts | Radical (lexical classifier) and Residual Stroke-count assignments | tTYBS, tTYBH |
Variant relations | Mappings between encoded characters, establishing usage identity according to some usage authority. | tHXM |
For each field the alphabetical listing (4.2) gives the following information:
Property Metadata Type | Description |
---|---|
Tag | The immutable abbreviation serving as a unique key identifying this property. Each tag matches regex /^t[A-Z0-9_]+$/i . The tag is used in the UniTangut.txt file (and in the SQL database whence it derives) to mark each instance of this field. The concatenation of Unicode code point + tag uniquely identifies each record in the database. |
Status | Formal UTC status, Normative, Informative, or Provisional, depending on whether it is a Normative part of the standard, an Informative part of the standard, or neither; see 3.1. |
Category | Usage classification; see 3.2. |
Added | Unicode version in which this property first appeared. |
Modified | Unicode version in which this property was last modified; modification may be indicated as the result of change in any mutable Property Metadata category (i.e., Tag cannot change, but any of Status, Records, Description, and Values might). Not all types of modification need be noted here, but important ones relating to Status, Records, and Values will be. |
Syntax | Constraints on property values, as described by a regex; fields which allow multiple values will also specify the delimiter in the regex; Syntax is a Perl 5.8 regular expression describing the formal structure of an individual Value. For example, the Syntax for the tTYYZ field is /^\d{1,2}[AB]\d{1,2}$/ , which means that this tag permits only single values, and that the value begins with one or two digits, followed by an A or B, followed by one or two digits. The syntax can be used to validate the contents of a field. The regex can be written to varying degrees of stringency, given ample time and testing. |
Records | Total number of Unicode characters having a record for this property. |
Description | Description of the property, including a unique source identifier (bibliographic etc.), and notes of various types relevant to interpretation of the property data; the Description contains not only a description of what the field contains, but also source information, known limitations, methodology used in deriving the data, and so on. |
Values | The actual property data per tag associated with each character in the UniTangut Database. These values are not included in 4.2 below, but must be obtained from the UniTangut Database itself. |
As is the property data, so too the property metadata is a work in progress. Property Metadata types may be added, and existing types may be refined in the future. For example, bibliographic information may be extracted from the Description, and isolated in a new Bibliography property metadata type.
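As an illustration of how the Syntax metadata can be used to validate field contents, the sketch below applies the tTYYZ pattern quoted in the table above. The `SYNTAX` dictionary and `is_valid` helper are hypothetical; only the tTYYZ pattern comes from this document, and the Perl regex is assumed to carry over to Python's `re` module unchanged.

```python
import re

# Only tTYYZ is documented here; other tags would need their own patterns.
# re.ASCII restricts \d to ASCII digits (an assumption about the intent).
SYNTAX = {
    "tTYYZ": re.compile(r"^\d{1,2}[AB]\d{1,2}$", re.ASCII),
}


def is_valid(tag, value):
    """Return True/False per the tag's Syntax, or None if no Syntax is known."""
    pattern = SYNTAX.get(tag)
    if pattern is None:
        return None
    return pattern.match(value) is not None


print(is_valid("tTYYZ", "53B18"))  # True
print(is_valid("tTYYZ", "53C18"))  # False
print(is_valid("tB5", "FA40"))     # None (no pattern registered in this sketch)
```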
We now list the Properties of the UniTangut Database alphabetically, giving the above types of Property Metadata (4.1) for each.
Property Tag | $tTag | ||
---|---|---|---|
Status | $status | Category | $category |
Added | $added | Modified | $modified |
Syntax | $syntax | Records | $records |
Description | $description |
[Feedback] | http://www.unicode.org/reporting.html For reporting errors and requesting information online. |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[Unicode] | The Unicode Standard, Version 5.0 |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
This section indicates the changes introduced by each revision.
Revision 0
Zeroth version
Copyright © 2006 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.