[Unicode]  Technical Reports
 

Proposed Draft Unicode Technical Report #43

A User’s Guide to the UniTangut Database

Author Richard Cook
Date 2006-12-12
This Version http://www.unicode.org/reports/tr43/tr43-0.html
Previous Version None
Latest Version http://www.unicode.org/reports/tr43/
Tracking Number 0

Summary

This document describes the organization and content of the UniTangut Database.

Status

This document is a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

[Note to reviewers: This document describes the UniTangut.txt data set, proposed for addition to the UCD. The proposal and associated data have not yet been submitted, reviewed, revised, or accepted, and consequently this TR is a pre-proposal draft. This document was constructed as a merger of Unihan.html 5.0 and tr38-3.html, first by globally changing all references to Unihan to UniTangut, and then by rewriting, rearranging, and adding content as needed. In general, please consider this document a draft of a template which might also be applied more generally to ongoing revision and consolidation of the Unihan documentation, and to documentation of mapping data for other large character sets (e.g. Egyptian Hieroglyphs).]



1 Introduction

The UniTangut Database is the repository for the Unicode Consortium’s collective knowledge regarding the Tangut character block of the Unicode Standard. The UniTangut Database contains mapping data linking the encoded characters to the primary print-sources and legacy encodings. The UniTangut Database is modelled after the Unihan Database (documented in TR38), and employs the same structural conventions, allowing the same data management tools to be used on both data sets.

This document is a guide to the UniTangut Database, describing the mechanics of the database, the nature of its contents, and the status of its various fields. The UniTangut Database is a work in progress: existing data is being refined, and new data is being added on a regular basis.

The UniTangut Database exists in three forms, two of which are available to the public:

The structures of and relations among these forms of the UniTangut Database are described in Section 2; general Property Types (Status and Category) are described in Section 3; and Property Metadata Types are outlined in Section 4.


2 Mechanics

2.1 Database design

The working copy of the UniTangut Database is maintained by the Unicode Consortium. The two public versions are reflections of this data at the time of a version release.

As with Unihan, the master (working) copy of the UniTangut data is stored in an SQL database with two main tables (joined on their tag fields):

For public release, the above two tables are exported to a pair of tab-delimited UTF-8 files:

Most UniTangut database fields in the master SQL database are made available in the public releases. Fields not part of a public release are of several types:

2.2 Web Access

Once the UniTangut Database has been publicly released, the release version of the data serves as input to a second SQL database, used for the online browser-based query system. It is important to note that this searchable version of the database is identical in content to the version release. End users of the online browser-based query system do not query the working copy of the UniTangut Database.

The searchable web interface to Unicode’s Tangut data is available through the Main UniTangut Data Portal.

2.3 UniTangut.txt

The public UniTangut.txt property list file is encoded in UTF-8. The file consists of one or more header comment lines (/^#/) followed by lines of data; a trailing comment (/^#/) ends the file, giving the overall line count of the file (including all comment lines). The file has Unix line breaks (U+000A).
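The file layout described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name split_sections and the sample header and trailer text are hypothetical, and the exact wording of the real comment lines is not assumed.

```python
def split_sections(text):
    """Split file text into (header comments, data lines, trailing comments)."""
    if "\r" in text:
        raise ValueError("expected Unix line breaks (U+000A) only")
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty string left by a final line break
    # Leading comment lines form the header.
    start = 0
    while start < len(lines) and lines[start].startswith("#"):
        start += 1
    # Comment lines at the very end form the trailer.
    end = len(lines)
    while end > start and lines[end - 1].startswith("#"):
        end -= 1
    return lines[:start], lines[start:end], lines[end:]


# Hypothetical miniature file; real header and trailer text will differ.
sample = (
    "# UniTangut.txt (hypothetical header)\n"
    "U+17000\ttTY\t0001\n"
    "U+17000\ttTYP\t9\n"
    "# EOF (hypothetical trailer)\n"
)
header, data, trailer = split_sections(sample)
```

The sketch does not interpret the line count in the trailing comment, since the format of that comment is not specified here.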

The UniTangut.txt text file consists of nearly two million bytes of data in roughly 100,000 lines, covering all 5,805 encoded Tangut characters.

Each line (record) of the file UniTangut.txt consists of three tab-separated fields.

The following table shows an example record from UniTangut.txt, for the example Code Point U+17000 (Field 1); each line provides a Property Value (Field 3) for a unique Property Tag (Field 2):

Field 1       Field 2         Field 3
Code Point    Property Tag    Property Value
U+17000       tB5             FA40
U+17000       tNevsky         I-178
U+17000       tNishida        1-051
U+17000       tPUA            U+E000
U+17000       tSN             1
U+17000       tSofronov       1075
U+17000       tTY             0001
U+17000       tTYBH           5-6
U+17000       tTYBS           1
U+17000       tTYP            9
U+17000       tTYYY           1.43
U+17000       tTYYZ           53B18
U+17000       tUNI            17000
U+17000       tWHYJ           53.171
U+17000       tWenHai         1460
U+17000       tXiaHan         0100
U+17000       tYTYL           2126.10
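As an illustration of the three-field record structure, a single data line might be parsed as in the following Python sketch. The names Record and parse_record are hypothetical and are not part of any existing Unihan or UniTangut tooling; the sample line is taken from the example record for U+17000.

```python
from typing import NamedTuple


class Record(NamedTuple):
    code_point: int  # Field 1, converted from "U+XXXXX" to an integer
    tag: str         # Field 2, the Property Tag
    value: str       # Field 3, the Property Value


def parse_record(line):
    """Parse one tab-separated data line into a Record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    cp_text, tag, value = fields
    if not cp_text.startswith("U+"):
        raise ValueError(f"Field 1 is not a code point: {cp_text!r}")
    return Record(int(cp_text[2:], 16), tag, value)


rec = parse_record("U+17000\ttTYYZ\t53B18")
print(f"U+{rec.code_point:04X}", rec.tag, rec.value)  # U+17000 tTYYZ 53B18
```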

 

The UniTangut Database consists of a number of fields containing data for each Tangut character in the Unicode Standard. The field names consist entirely of ASCII letters and digits with no spaces or other punctuation except for underscore. On the model of Unihan.txt (which for historical and perhaps other reasons uses a lower-case k [= Kanji?] field name prefix), all UniTangut.txt field names start with a lower-case t.

This general naming convention (/^[tk][A-Z_]+$/i) is admittedly unnecessary in an XML UCD, but nevertheless provides some redundancy in the tab-delimited text files, and also emphasizes the fact that UniTangut.txt is intended to be managed with trivial (or no) modification to existing Unihan.txt processing tools. This may prove useful in simplifying migration of these parts of the UCD to XML.

For most mapping sources, if multiple values are possible in Field 3, the values are separated by spaces (but see Section 4: Syntax). No Tangut character may have more than one instance of a given field associated with it, and no empty fields are included in the UniTangut.txt file. Each code point in the block may occur at the head (Field 1) of one or more records (lines), depending on the number of fields (tags) with records for that code point.
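The constraints in the preceding paragraph (no empty fields, at most one instance of a tag per code point, space-separated multiple values) could be checked while loading the data, roughly as in this Python sketch. The function name build_index is hypothetical, and splitting every value on spaces is a simplifying assumption, since individual fields may define a different delimiter in their Syntax.

```python
def build_index(records):
    """Map code point -> tag -> list of values, enforcing the stated constraints."""
    index = {}
    for code_point, tag, value in records:
        if not value:
            raise ValueError(f"empty field: {code_point} {tag}")
        tags = index.setdefault(code_point, {})
        if tag in tags:
            raise ValueError(f"duplicate field: {code_point} {tag}")
        # Assumption: multiple values are separated by single spaces.
        tags[tag] = value.split(" ")
    return index


# Rows taken from the example record for U+17000.
sample = [
    ("U+17000", "tTY", "0001"),
    ("U+17000", "tTYBH", "5-6"),
    ("U+17000", "tTYYZ", "53B18"),
]
index = build_index(sample)
```

Keying the index on code point and then tag mirrors the rule that code point plus tag identifies a record.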

There is no formal limit on the lengths of any of the field values. Any Unicode characters may be used in the field values except for unescaped control characters (especially tab, newline, and carriage return). Most fields have a more restricted Syntax.

The data lines are sorted with code point as primary sort key, and field-name as secondary sort key. If the property value itself is structured, its values may also be sorted according to a sorting method detailed in the property description.
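The two-level sort order can be sketched in Python as below. The names sort_key and is_sorted are hypothetical. Comparing tags by plain code point order (that is, case-sensitively) is an assumption; it is consistent with the example record for U+17000, in which tWHYJ precedes tWenHai.

```python
def sort_key(record):
    """Primary key: numeric code point; secondary key: field name."""
    code_point, tag, _value = record
    return (int(code_point[2:], 16), tag)


def is_sorted(records):
    """Return True if the records are in non-descending sort_key order."""
    keys = [sort_key(r) for r in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Rows taken from the example record for U+17000, in file order.
sample = [
    ("U+17000", "tWHYJ", "53.171"),
    ("U+17000", "tWenHai", "1460"),
    ("U+17000", "tXiaHan", "0100"),
]
print(is_sorted(sample))  # True
```

A case-insensitive comparison of tags would place tWenHai before tWHYJ and so would not reproduce the order of the example record.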

The header comment lines contain very limited metadata regarding the file itself, including the file name, version, date of production, and a pointer to the main documentation (→ you are here ←).

Ranges of Tangut code points valid for Field 1 of UniTangut.txt are listed in the following table:

Code point range      Block name            Release
U+17000 .. U+18715    TANGUT                unassigned
U+18716 .. U+186FF    TANGUT EXTENSION A    unproposed

Note that Tangut characters in the following ranges do not have mapping data in Field 1 of UniTangut.txt (though they may at some future time):

Code point range Block name Release
N.A. TANGUT RADICALS unproposed
N.A. TANGUT RADICALS SUPPLEMENT unproposed
N.A. TANGUT STROKE TYPES unproposed
N.A. TANGUT COMPATIBILITY CHARACTERS unproposed

 

2.4 UniTangut.xml

Future incarnations of the public UniTangut Database release may include a UniTangut.xml XML representation of the data and metadata.

3 Property Types: Status and Category

Each UniTangut Database field (a.k.a. tag, property) is classified according to the formal Status of this property within the Unicode Standard, as determined by the UTC.

Each field is also classified by usage Category, according to the purpose it serves.

We provide here a general discussion of these two basic classifications (UniTangut tags, by Status and by Category), followed by an overview of Property Metadata, and detailed descriptions of the individual Properties, alphabetically arranged.

Note that all data in the UniTangut Database has been donated to the Unicode Consortium, and that proofing, augmentation, and publication of the data is an ongoing process, subject to available resources. If data satisfying a certain need is not currently present in the database, end users are encouraged to contribute well-documented data for possible inclusion.

3.1 UniTangut Properties by Status

Each property has a formal Status, as determined by the UTC.

In the list of UniTangut properties (fields, tags) given below, each property is assigned a formal status. A few UniTangut properties may eventually correspond to Unicode Normative or Informative properties, but at present all are Provisional. For information on the meanings of the Normative, Informative, and Provisional Status flags, see definitions D33, D35, and D36 in Chapter 3 Properties of Unicode 5.0 [Unicode]. For information on properties and on the general structure of the Unicode Character Database, see UCD.html.

Status         Properties with this status
Normative      none
Informative    none
Provisional    all

 

3.2 UniTangut Properties by Category

Each property is also assigned to one or more functional Categories, according to the presumed utility of the field data. We distinguish the following usage categories for fields.

Category    Category Description    Properties in this category
Dictionary Indices    References to primary lexical treatments of this script entity.    tTYYZ, tTY, tWenHai, tXiaHan, tSofronov, tKychanov, etc.
Dictionary-like Data    Data derived from primary lexical sources, including phonologic, gloss, frequency, etc.    tTYYY, tTYP, etc.
WG Mappings    Mappings established by a WG2 working group.    pending (a “TRG” might expand this character set, add mappings, and resolve variant issues for non-extinct users)
Numeric Values    The numeric value(s) of a character with this property.    pending
Other Mappings    Mappings to legacy encodings.    tB5, tPUA
Radical-Stroke Counts    Radical (lexical classifier) and Residual Stroke-count assignments    tTYBS, tTYBH
Variant relations    Mappings between encoded characters, establishing usage identity according to some usage authority.    tHXM

 

4 UniTangut Properties in Detail

4.1 Property Metadata Types

For each field the alphabetical listing (4.2) gives the following information:

Property Metadata Type    Description
Tag    The immutable abbreviation serving as a unique key identifying this property. Each tag matches regex /^t[A-Z_]+$/i. The tag is used in the UniTangut.txt file (and in the SQL database whence it derives) to mark each instance of this field. The concatenation of Unicode code point + tag uniquely identifies each record in the database.
Status    Formal UTC status: Normative, Informative, or Provisional, depending on whether the property is a Normative part of the standard, an Informative part of the standard, or neither; see 3.1.
Category    Usage classification; see 3.2.
Added    Unicode version in which this property first appeared.
Modified    Unicode version in which this property was last modified; modification may be indicated as the result of change in any mutable Property Metadata category (i.e., Tag cannot change, but any of Status, Records, Description, and Values might). Not all types of modification need be noted here, but important ones relating to Status, Records, and Values will be.
Syntax    Constraints on property values, expressed as a Perl 5.8 regular expression describing the formal structure of an individual Value; fields which allow multiple values also specify the delimiter in the regex. For example, the Syntax for the tTYYZ field is /^\d{1,2}[AB]\d{1,2}$/, which means that this tag permits only single values, and that the value begins with one or two digits, followed by an A or B, followed by one or two digits. The Syntax can be used to validate the contents of a field. The regex can be written to varying degrees of stringency, given ample time and testing.
Records    Total number of Unicode characters having a record for this property.
Description    Description of the property, including a unique source identifier (bibliographic etc.) and notes of various types relevant to interpretation of the property data: what the field contains, source information, known limitations, methodology used in deriving the data, and so on.
Values    The actual property data per tag associated with each character in the UniTangut Database. These values are not included in 4.2 below, but must be obtained from the UniTangut Database itself.
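To illustrate how a Syntax expression can be used to validate field contents, the tTYYZ pattern quoted above can be applied in Python roughly as follows. This is a sketch only: the names TTYYZ_SYNTAX and is_valid are hypothetical, and restricting \d to ASCII digits with re.ASCII is an assumption made here, not something the Syntax definition states.

```python
import re

# The tTYYZ Syntax from the table above, transcribed from Perl to Python.
# re.ASCII limits \d to the digits 0-9 (an assumption for this sketch).
TTYYZ_SYNTAX = re.compile(r"^\d{1,2}[AB]\d{1,2}$", re.ASCII)


def is_valid(value, syntax=TTYYZ_SYNTAX):
    """Return True if the value matches the given Syntax pattern."""
    return syntax.match(value) is not None


print(is_valid("53B18"))  # True: the tTYYZ value from the U+17000 example
print(is_valid("53C18"))  # False: the letter must be A or B
print(is_valid("123A1"))  # False: at most two leading digits
```

Passing the compiled pattern as a parameter lets the same helper be reused for other tags once their Syntax expressions are known.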

 

As is the property data, so too the property metadata is a work in progress. Property Metadata types may be added, and existing types may be refined in the future. For example, bibliographic information may be extracted from the Description, and isolated in a new Bibliography property metadata type.

4.2 Property Metadata, Alphabetically by Property Tag

We now list the Properties of the UniTangut Database alphabetically, giving the above types of Property Metadata (4.1) for each.

Property Tag    $tTag
Status    $status    Category    $category
Added    $added    Modified    $modified
Syntax    $syntax    Records    $records
Description    $description


References

[Feedback] http://www.unicode.org/reporting.html
For reporting errors and requesting information online.
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode] The Unicode Standard, Version 5.0
[Versions] Versions of the Unicode Standard
http://www.unicode.org/versions/
For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

This section indicates the changes introduced by each revision.

Revision 0


Copyright © 2006 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.