[Unicode]   Technical Reports

Proposed Draft Unicode Technical Report #43

A User’s Guide to the UniTangut Database

Author Richard Cook
Date 2007-09-21
This Version L2/07-290
Previous Version L2/07-158
Latest Version PDUTR #43
Tracking Number 1


This document describes the organization and content of the UniTangut Database.


This document is a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

[Note to reviewers : This document describes the UniTangut.txt data set, proposed and accepted by UTC for inclusion in a future version of the Unicode Standard. The proposed Tangut characters and associated mapping data are not yet a formal part of the Unicode Standard, and consequently this TR is a proposed draft. This document was constructed initially as a merger of Unihan.html 5.0 and tr38-3.html, first by globally changing all references to UniHan to UniTangut, and then rewriting, rearranging, and adding content as needed. Please consider this document as a draft of a template which might also be more generally applied to ongoing revisions to and consolidation of UniHan documentation, and to documentation of mapping data for other large character sets. ]


1 Introduction

Tangut is the English name for the extinct language and writing system of the ancient Tangut people of central China. In Chinese the name is 西夏 Xīxià. Tangut civilization (with capital in the region of modern 寧夏銀川 Níngxià Yínchuān [N38.4,E106.3]) was conquered by the Mongols in 1227. Tangut characters were in use for a total of less than 500 years (1036-1502), including some 300 years of classical use after the Mongol conquest. See L2/07-289 for more background information on Tangut Script.

Tangut is also the name of the UCS block of Tangut characters (U+17000..U+18715).

The UniTangut Database is the repository of the Unicode Consortium’s collective knowledge regarding the Tangut character block of the Unicode Standard. The UniTangut Database stores various types of Property Data linking (mapping) the encoded characters to the primary print-sources and legacy encodings whence they derive.

The UniTangut Database is modelled after the UniHan Database (documented in TR38), and employs the same structural conventions, allowing the same data management tools to be used on both data sets.

The present document (TR43) is a guide to the UniTangut Database, describing the mechanics of the database and the nature and status of its contents.

The UniTangut Database is a work in progress: existing data is being refined, and new data is being added on a regular basis. All data in the UniTangut Database has been donated to the Unicode Consortium. Proofing, augmentation and publication of the data is an ongoing process, subject to available resources. If data satisfying a certain need is not currently present in the database, end-users are encouraged to contact the Unicode Consortium, to contribute well-documented data for possible inclusion.

The UniTangut Database exists in three forms, two of which are available to the public:

In this document, the structures of and relations among these forms of the UniTangut Database are described in Section 2 (Mechanics); general Property Types (Status and Category) are described in Section 3; and Property Metadata Types are outlined in Section 4.

2 Mechanics

2.1 Database design

The working copy of the UniTangut Database is maintained by the Unicode Consortium. The two public versions are reflections of this data at the time of a version release.

As with UniHan, the master (working) copy of the UniTangut data lives in an SQL database with two main tables, joined on their Tag fields (a Tag is a short abbreviation serving as the unique identifier of this Property):

For public release, the above two tables are exported to a pair of tab-delimited UTF-8 files:

Most UniTangut database Properties in the master SQL database are made available in the public releases. Properties not part of a public release are of several types:

2.2 Web Access

When the UniTangut Database has been publicly released, the release version of the data serves as input to a second SQL database, used for the online browser-based query system. It is important to note that this searchable version of the database is identical in content to the version release. End-users using the online browser-based query system do not query the working copy of the UniTangut Database.

An online search interface to Unicode’s Tangut data is available at the Main UniTangut Data Portal.

2.3 UniTangut.txt

The public UniTangut.txt property list file has UTF-8 encoding form (without BOM), and Unix line breaks (U+000A). The file consists of one or more header comment lines (/^#/) followed by lines of data; a trailing comment (/^#/) ends the whole file, giving the overall line-count of the file (including all comment lines). The header comment lines contain very limited meta-data regarding the file itself, including the file name, version, date of production, and a pointer to this documentation file (TR 43).
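The encoding-form constraints above can be checked mechanically. The following is a minimal sketch only: the function name `check_encoding_form` and the sample bytes are hypothetical, and the exact wording of the header and trailing comments is not assumed, only that they begin with `#`.

```python
def check_encoding_form(raw):
    """Return a list of problems found in the raw bytes of a property list file."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    if b"\r" in raw:
        problems.append("file contains carriage returns (non-Unix line breaks)")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty string that follows the final line break
    if not lines or not lines[0].startswith("#"):
        problems.append("file does not begin with a header comment line")
    if not lines or not lines[-1].startswith("#"):
        problems.append("file does not end with a trailing comment line")
    return problems


# Hypothetical three-line sample; the comment texts are placeholders.
sample = "# header\nU+17000\ttLFW1997Num\t1 [cardinal]\n# trailer\n".encode("utf-8")
print(check_encoding_form(sample))
```

An empty list means that none of the checked constraints was violated; the line-count stated in the trailing comment is deliberately not parsed here, since its exact layout is not specified in this section.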

In the original release, the UniTangut.txt text file consists of some 5 million bytes of data in ~200,000 lines, covering all 5,910 encoded Tangut characters.
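As a quick cross-check, the character count quoted above follows directly from the proposed block range U+17000..U+18715. The variable names below are illustrative only.

```python
# Proposed Tangut block range, as given in this document.
FIRST_CODE_POINT = 0x17000
LAST_CODE_POINT = 0x18715

# Both endpoints are included, hence the +1.
block_size = LAST_CODE_POINT - FIRST_CODE_POINT + 1
print(block_size)  # 5910
```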

• The latest version of the UniTangut.txt property list file is available as part of the Unicode Character Database (UCD).

Each line of the file UniTangut.txt consists of three tab-separated columns (fields), numbered (1..3) for the present discussion. The intersection of a given column and a given row (line) is a cell. The following table lists each of the three cell-types, and gives a brief description of its contents.

# | Column Name | Column Description
1 | Code Point | Each Column 1 cell contains exactly one Unicode Code Point in the Tangut Block, valid for this release, expressed in U+ prefixed uppercase hexadecimal form. In the original data release, the following regex holds true: /^U\+([0-9A-F]{5})$/; hex($1) >= 0x17000 && hex($1) <= 0x18715; Each code point in the block may occur at the head (i.e., in Column 1) of one or more records (lines), depending on the number of different Property Tags with records for that code point.
2 | Property Tag | Each Column 2 cell contains a Property Tag, i.e. an abbreviation serving as a unique key identifying this Property, and indicating the source or type of information occurring in the corresponding Column 3 cell (in the same line), for the code point given in the corresponding Column 1 cell (in that same line). All values valid for Column 2 are Tags documented in the present document (Sections 3 and 4), and no other values may occur in Column 2. Of the valid Tags, some may not be included for a given code point, since empty Column 3 values are not permitted. No Tangut character may have more than one instance of a given Property Tag associated with it. The Property Tag is used in the UniTangut.txt file (and in the SQL database whence it derives) to mark each instance of this Property for each Tangut character in the Unicode Standard: the concatenation of Unicode Code Point + Tag uniquely identifies each record in the database. On the model of Unihan.txt (which for historical reasons uses a lower-case k [= Kanji?] Property Tag prefix), each UniTangut.txt Property Tag starts with a lower-case t, and consists entirely of ASCII letters and digits with no spaces or other punctuation except for underscore (see 4.1). This naming convention provides additional structure in the tab-delimited text files (unnecessary in XML components of the UCD): the Property Tags cannot be confused with valid English words (such as might occur in Property Values), and so their unique form facilitates simple searches on the tab-delimited text file. The UniTangut.txt data file may be managed with trivial (or no) modification to existing Unihan.txt processing tools.
3 | Property Value | Each Column 3 cell contains the Property Value (in UTF-8) deriving from the source indicated by the Property Tag given in the corresponding Column 2 cell (in the same line), for the code point given in the corresponding Column 1 cell (in that same line). There is no formal limit on the length of any Property Value. These values take forms which are Property-specific, as described in the present document (Section 4). No empty values are permitted in Column 3. Almost any Unicode character may occur in a Property Value except for unescaped control characters (especially tab, newline, and carriage return). However, most Property Values have a more restricted Syntax. If multiple values are possible in Column 3, the values are typically separated by “•” U+2022 BULLET, a character which would not otherwise occur in Property Values. Each Property Value may however have its own syntax requirements (see Section 4: Syntax).
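The three-column layout described in this table can be parsed along the following lines. This is a sketch under stated assumptions rather than a reference implementation: the function name `parse_data_line` is hypothetical, the code point and Tag patterns are the ones quoted in this document (2.3 and 4.1), and stripping whitespace around the U+2022 separator is an assumption, since individual Properties may define their own syntax.

```python
import re

# Patterns taken from this document; fullmatch() is used instead of ^...$ anchors.
CODE_POINT_RE = re.compile(r"U\+([0-9A-F]{5})")
TAG_RE = re.compile(r"t[A-Z][A-Za-z0-9_]+")
BULLET = "\u2022"  # typical separator for multiple values in Column 3


def parse_data_line(line):
    """Split one data line into (code_point, tag, values); raise ValueError if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(fields))
    cp_text, tag, value = fields
    match = CODE_POINT_RE.fullmatch(cp_text)
    if match is None:
        raise ValueError("malformed code point: %r" % cp_text)
    code_point = int(match.group(1), 16)
    if not 0x17000 <= code_point <= 0x18715:
        raise ValueError("code point outside the Tangut block: %r" % cp_text)
    if TAG_RE.fullmatch(tag) is None:
        raise ValueError("malformed Property Tag: %r" % tag)
    if value == "":
        raise ValueError("empty Property Value")
    # Assumption: whitespace around the bullet separator is not significant.
    values = [part.strip() for part in value.split(BULLET)]
    return code_point, tag, values


record = parse_data_line("U+17000\ttLFW1997Num\t1 [cardinal]")
print(record)
```

Comment lines (those beginning with `#`) would need to be filtered out before calling this function.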

The data lines of the UniTangut.txt data file are sorted with Unicode Code Point (Column 1) as primary sort key, and Property Tag (Column 2) as secondary sort key. If the Property Value (Column 3) itself is structured, its values may also be sorted according to a sorting method detailed in the Property Description.
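The two-level ordering can be verified with a short check such as the one below. It is a sketch that assumes Property Tags are compared by plain code-point (binary) string order, which is consistent with the Tag listing in 4.2 (for example, tLFW1997Num precedes tLFW1997note) but is not stated explicitly here; the function name is hypothetical. Because Code Point + Tag uniquely identifies a record, the comparison is strict, so duplicates are also reported as failures.

```python
def is_strictly_sorted(records):
    """Check that (code_point, tag, value) records ascend by code point, then by tag."""
    keys = [(code_point, tag) for code_point, tag, _value in records]
    return all(a < b for a, b in zip(keys, keys[1:]))


# The two records shown in this document for U+17000.
records = [
    (0x17000, "tLFW1997Num", "1 [cardinal]"),
    (0x17000, "tLFW1997note", "HXM rad?"),
]
print(is_strictly_sorted(records))  # True
```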

The following table shows the properties for the first Tangut character in the original release of the UniTangut.txt data file. Each row of the table corresponds to one line of the UniTangut.txt data file. In each row, the Code Point U+17000 in Column 1 is followed in Column 2 by a Property Tag (unique for this code point), followed by the associated Property Value in Column 3:

Column 1 | Column 2 | Column 3
Code Point | Property Tag | Property Value
U+17000 | tLFW1997Num | 1 [cardinal]
U+17000 | tLFW1997note | HXM rad?

Ranges of Tangut code points valid for Column 1 of UniTangut.txt are listed in the following table:

Code point range | Block name | Release
U+17000 .. U+18715 | TANGUT | proposed
U+18716 .. U+1871F | TANGUT | unassigned
U+18716 .. U+186FF | TANGUT EXTENSION A | unproposed

Note that Tangut characters in the following ranges do not have mapping data in Column 1 of UniTangut.txt (though they may at some future time):

Code point range Block name Release


2.4 UniTangut.xml

Future incarnations of the public UniTangut Database release may include UniTangut.xml XML representations of the data and metadata.

3 Property Types: Status and Category

The UniTangut database stores various types of Unicode Character Property Data, linking (mapping) the encoded characters to the primary print-sources and legacy encodings whence they derive. Unicode Character Property Data is key to the Unicode Standard, since this meta-data frames each character code point in meaningful context, determining the rights and wrongs of software handling and end-user usage in any particular case.

A Unicode Code Point value is assigned to each unitary legacy orthographic element in its migration to the universal digital encoding. Representative glyphs formally associated with a code point (in the cells of a single- or multi-column code chart) provide the next higher layer of property data. The code point (number) and associated representative glyph (image) together constitute a fundamental unit of information in Unicode character encoding. But this information alone may be insufficient for some usages, especially as regards scripts with difficult rendering issues, or long histories and complex interrelations. Representative glyphs in the single-column code charts published in the Unicode Standard provide a sort of generalized (abstract, low-res) representation of the character. Multi-column charts are higher-resolution, showing particular character forms extracted from particular source documents contributing to the unified encoding. Since Unicode consolidates under a single code point characters derivative of various sources (and sometimes with divergent usages and appearances, variant stroke types/counts, competing component analyses, etc.), Property Data is the bread-crumb-trail for tracking each abstract character out of the enchanted forest and back to the specific concrete instances in the original sources. This is valuable information, useful to implementers of basic block support, and essential to specialist end-users who need to develop their own usage protocols beyond those defined by the Unicode Standard.

In order to assist both implementers and end-users, the Unicode Standard broadly categorizes Property Data. Each UniTangut Database Property is classified according to its formal Status within the Unicode Standard, as determined by the UTC. And each Property is also classified by usage Category, according to the purpose (or purposes) it may serve.

We provide here a general discussion of these two classifications (UniTangut Properties, by Status and by Category), followed by an overview of Property Metadata, and detailed descriptions of the individual Properties, alphabetically arranged.

Note again that all data in the UniTangut Database has been donated to the Unicode Consortium, and that proofing, augmentation and publication of the data is an ongoing process, subject to available resources. If data satisfying a certain need is not currently present in the database, end-users are encouraged to contribute well-documented data for possible inclusion.

3.1 UniTangut Properties by Status

Each Character Property has a formal Status, as determined by the UTC.

In the list of UniTangut properties given below, each Property is assigned a formal Status. Only a few UniTangut properties correspond (or may eventually correspond) to Unicode Normative or Informative properties: almost all are Provisional. For information on the meanings of the Normative, Informative and Provisional Status flags, see definitions D33, D35, and D36 in Chapter 3 Properties of Unicode 5.0 [U5.0]. For more information on properties and on the general structure of the Unicode Character Database, see UCD.html.

Status | Properties with this status


3.2 UniTangut Properties by Usage Category

Each Character Property is also assigned to a functional Category (or to multiple functional categories), according to presumed utility of the Property Data. We distinguish the following usage categories for properties.

Category | Category Description | Properties in this category
Dictionary Indices | References to primary lexical treatments of this script entity. | tHXM2004, tHXM2004Page, tLFW1986, tLFW1986YZ, tLFW1997, tLFW1997FCC, tLFW1997Index, tLFW1997IndexPR, tNevsky, tNishida, tSofronov, tUniMCCC, tWenhai, tWenhaiYanjiu, tYitongYilei
Dictionary-like Data | Data derived from primary lexical sources, including phonologic, semantic, usage classes and statistics, etc. | tHXM2004Usage, tLFW1986Init, tLFW1986Rhyme, tLFW1986note, tLFW1997GlossChi, tLFW1997GlossEng, tLFW1997Init, tLFW1997Phon, tLFW1997Rhyme, tLFW1997note
Numeric Values | The numeric value(s) of a character with this Property. | tLFW1997Num
Other Mappings | Mappings to legacy encodings. | tHXM2004Order, tHXM2004PUA, tLFW1986B5, tLFW1986PUA, tLFW1997PUA
Radical-Stroke Counts | Radical (lexical classifier) and Residual Stroke-Count assignments. | tHXM2004RS, tHXM2004RadBrk, tLFW1986Rad, tLFW1986ResStr, tLFW1997Rad, tLFW1997ResStr
Variant Relations | Mappings between encoded characters, establishing usage identity according to some usage authority. | tHXM2004TL, tHXM2004VC, tLFW1997VC, tLFW1997vars, tUniRepGlyph
WG Mappings | Mappings contributed by or codified in association with National Body representatives to a WG2 working group. | tUnicode

Some usage categories listed above may be subject to change, and new categories may be added in the future.

4 UniTangut Properties in Detail

4.1 Property Metadata Types

For each Property Tag the alphabetical listing (4.2) includes the following types of information:

Property Metadata Type | Mutability | Description
Property Tag | Immutable | The Property Tag is an abbreviation serving as a unique key identifying this Property in the UniTangut.txt file (and in the SQL database whence it derives; see 2.1) to mark each instance of this Property. Each Property Tag matches regex /^t[A-Z][A-Za-z0-9_]+$/ (though in actual practice, Tags are of finite length; see Syntax below). The concatenation of Unicode code point + Tag uniquely identifies each record in the database (see also 2.3).
Status | Mutable | Formal UTC Status is one of three: Normative, Informative, or Provisional, depending on whether it is a Normative part of the standard, an Informative part of the standard, or neither (see 3.1).
Category | Mutable | Usage classification (see 3.2).
Added | Immutable | Unicode version [and date] in which this Property first appeared.
Modified | Mutable | Unicode version [and date] in which this Property was last modified; modification may be indicated as the result of change in any mutable Property Metadata category (i.e., the Tag cannot change, but any of Status, Records, Description, and Values might). Not all types of modification need be noted here, but only important ones relating to Status, Records, and Values.
Syntax | Mutable | Constraints on Property Values, as described by a regular expression (regex); Properties which allow multiple values will also specify the delimiter in the regex; Syntax is a Perl 5.8 compatible regular expression (PCRE) describing the formal structure of an individual Property Value. For example, the Syntax for the tLFW1986YZ Property is (at the first approximation) /^\d{1,2}[AB]\d{1,2}$/, which means that this Property permits only single values (in each Property Value cell), and that each such value begins with one or two digits, followed by an A or B, followed by one or two digits. The Syntax regex can be used to validate the Property Value, and can be written to varying degrees of stringency, according to available resources. (Regexes in this draft have not all been validated.)
Records | Mutable | Total number of Unicode characters having (a record for) this Property.
Description | Mutable | Description of the Property, including a unique source identifier (bibliographic etc.), and notes of various types relevant to interpretation of the Property Data, including known limitations, methodology used in deriving the data, and so on.
Values | Mutable | The actual Property Data per Property Tag associated with each character in the UniTangut Database. These values are not included in 4.2 below, but must be obtained from a version of the UniTangut Database itself (2.1).
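To illustrate how a Syntax regex might be used for validation, the sketch below applies the first-approximation tLFW1986YZ pattern quoted in the table above. The `SYNTAX` lookup table and the function name `is_valid_value` are hypothetical; only the pattern itself comes from this document. Python's `re.ASCII` flag is used so that `\d` matches only the digits 0-9, as the Perl pattern would for this data.

```python
import re

# Hypothetical lookup table: Property Tag -> compiled Syntax pattern.
# Only the tLFW1986YZ pattern is quoted in this document's Syntax description.
SYNTAX = {
    "tLFW1986YZ": re.compile(r"\d{1,2}[AB]\d{1,2}", re.ASCII),
}


def is_valid_value(tag, value):
    """Return True/False if a Syntax is known for the tag, or None if it is not."""
    pattern = SYNTAX.get(tag)
    if pattern is None:
        return None  # no Syntax registered, so nothing can be checked
    return pattern.fullmatch(value) is not None


print(is_valid_value("tLFW1986YZ", "47B21"))  # True
print(is_valid_value("tLFW1986YZ", "47C21"))  # False
```

Returning `None` for Tags without a registered pattern keeps "unchecked" distinct from "invalid", which matters because not every Property in this draft has a Syntax.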

As is the Property Data, so too the Property Metadata is a work in progress. Property Metadata types may be added, and existing types may be refined in the future. For example, bibliographic information may be extracted from the Description, and isolated in a new Bibliography property metadata type.

4.2 Property Metadata, Alphabetically by Property Tag

The table below lists Properties of the UniTangut Database sorted alphabetically by Property Tag, giving the above types of Property Metadata (4.1) for each.

Property Tags: tHXM2004, tHXM2004Order, tHXM2004PUA, tHXM2004Page, tHXM2004RS, tHXM2004RadBrk, tHXM2004TL, tHXM2004Usage, tHXM2004VC, tLFW1986, tLFW1986B5, tLFW1986Init, tLFW1986PUA, tLFW1986Rad, tLFW1986ResStr, tLFW1986Rhyme, tLFW1986YZ, tLFW1986note, tLFW1997, tLFW1997FCC, tLFW1997GlossChi, tLFW1997GlossEng, tLFW1997Index, tLFW1997IndexPR, tLFW1997Init, tLFW1997Num, tLFW1997PUA, tLFW1997Phon, tLFW1997Rad, tLFW1997ResStr, tLFW1997Rhyme, tLFW1997VC, tLFW1997note, tLFW1997vars, tNevsky, tNishida, tSofronov, tUniMCCC, tUniRepGlyph, tUnicode, tWenhai, tWenhaiYanjiu, tYitongYilei.

Property Tag: tHXM2004
Status: Provisional
Category: Dictionary Indices
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^(\d{4})$/; $1 >= 1 && $1 <= 6066;
Records: 5861
Description: 韓小忙 Hán Xiǎománg (2004): 《西夏文正字研究》. 西安: 陝西師範大學. [On Tangut Orthography; Xi’an: Shaanxi Normal Univ.; Ph.D. dissertation K246.3 H211.7, directed by 李范文 Lǐ Fànwén (see tLFW1997, tLFW1986)]. HXM undertakes a comprehensive and systematic collation of Tangut characters, based on nine Tangut dictionaries (《同音》, 《文海寶韻》, 《同音文海寶韻合編》, 《番漢合時掌中珠》, 《三才雜字》, 《纂要》, 《同義》, 《五音切韻》, 《新集碎金置掌文》), and catalogues a total of 6,066 forms, including 169 variants, 36 errors, and 5,861 unique ‘standard-style characters’ (“正字” zhèngzì ‘orthography’). In addition to the primary source mappings, this work contains mappings to Lǐ (1997) and Sofronov (1968). All field names beginning with tHXM2004 relate to this source. The font “HXM.ttf” (Column Y in the Multi-Column Code Chart) was produced from hi-res scans of this source; a lo-res scan of the entire source also exists in PDF.
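The tHXM2004 Syntax combines a pattern with a numeric range test. A Python rendering of that two-part check might look as follows; the function name is hypothetical, and the final lines simply confirm that the counts quoted in the Description add up to the upper bound of the range.

```python
import re

FOUR_DIGITS = re.compile(r"(\d{4})", re.ASCII)


def is_valid_thxm2004(value):
    """Apply the tHXM2004 Syntax: exactly four digits, with numeric value in 1..6066."""
    match = FOUR_DIGITS.fullmatch(value)
    return match is not None and 1 <= int(match.group(1)) <= 6066


# Counts quoted in the Description: variants + errors + standard-style characters.
total_forms = 169 + 36 + 5861
print(total_forms)  # 6066
```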
Property Tag: tHXM2004Order
Status: Provisional
Category: Other Mappings
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tHXM2004 for the full bibliographic reference.] This field controlled the ordering of the UCS repertory, i.e. it determined the Unicode code-point assignments in the block. Here we outline the relation of the tHXM2004 ordering to the UCS order. Values with a decimal point are virtual tHXM2004 assignments, or reflect a modification of the tHXM2004 order. The UCS ordering is ~99.9% tHXM2004, but the trailing ~0.1% is the reason the proposal says that the ordering “derives from”, rather than “exactly follows”, the tHXM2004 source. First, tHXM2004 treats the whole complex character U+17000 ‘one’ as a radical, since it happens to occur as the left-side component in one other relatively rare character (U+17D9C ‘a surname’). Because of this, both characters appear far down in his list (along with the six-stroke radicals). A more logical place for this character ‘one’, and the place in which it appears in most radical indexes, is at the beginning of the character set. Its top horizontal stroke is the more obvious radical, and the associated character (U+17D9C ‘a surname’) has a perfectly good alternate radical assignment (depending only on the span of the top line of the left-most component). The tHXM2004 character set begins with ‘poison’ rather than ‘one’, but clearly ‘one’ at the beginning is more intuitive, and better for UCS. This is the major exception relative to the tHXM2004 source ordering. For other apparent differences in the “Column Y” Multi-Column Code Chart ordering, relative to the UCS order: HXM organizes his characters into three general categories (see tHXM2004Usage). Briefly summarizing, the first category includes the bulk, 5,784 types total, all of the most important characters. The other two categories include the rest, poorly attested in the nine native dictionaries. Each of these three categories is organized separately using the HXM radical system. Since they are enumerated separately (all of Category 1 is listed before Category 2 ...), when the three categories were merged into a single set, some of the sequential HXM serial numbers became apparently disordered. Finally, “virtual assignments” means that the few characters not in tHXM2004, and not unifiable variants of characters in tHXM2004 (unifiable according to tLFW1997), were assigned positions in the tHXM2004 repertory as if they had occurred in that repertory, according to the same principles determining the ordering of the tHXM2004 repertory. These characters together constitute a virtual “Category 4”.
Property Tag: tHXM2004PUA
Status: Provisional
Category: Other Mappings
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^([0-9a-f]{4})$/i; hex($1) >= 0xE000 && hex($1) <= 0xF7B1;
Records: 5861
Description: [See tHXM2004 for the full bibliographic reference.] This field gives the Unicode Private Use Area (PUA) mappings of the glyphs, sequentially assigned according to the tHXM2004 index values. This is the encoding of “HXM.ttf” (“Column Y” in the Multi-Column Code Chart).
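The Syntax range 0xE000..0xF7B1 spans exactly 6,066 code points, the same as the number of tHXM2004 forms, which suggests a simple offset between index and PUA value. The sketch below makes that relationship explicit. It rests on an assumption not stated in this document, namely that index 1 maps to U+E000 with no gaps; the property values in the database itself remain authoritative, and the function name is hypothetical.

```python
PUA_FIRST = 0xE000
PUA_LAST = 0xF7B1
HXM_MAX_INDEX = 6066


def hxm_index_to_pua(index):
    """Map a tHXM2004 index (1..6066) to a PUA code point, assuming a gapless offset."""
    if not 1 <= index <= HXM_MAX_INDEX:
        raise ValueError("tHXM2004 index out of range: %d" % index)
    return PUA_FIRST + index - 1


# The size of the PUA range matches the number of tHXM2004 forms.
print(PUA_LAST - PUA_FIRST + 1)        # 6066
print("%04X" % hxm_index_to_pua(6066))  # F7B1
```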
Property Tag: tHXM2004Page
Status: Provisional
Category: Dictionary Indices
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^(\d{1,3})$/; $1 >= 14 && $1 <= 348;
Records: 5861
Description: [See tHXM2004 for the full bibliographic reference.] This field gives the number of the page in tHXM2004 on which the character is found.
Property Tag: tHXM2004RS
Status: Provisional
Category: Radical-Stroke Counts
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tHXM2004 for the full bibliographic reference.] This field gives the stroke count (1..16) of the “radical” (tHXM2004RadBrk) for characters with tHXM2004Usage value == 1, and otherwise (tHXM2004Usage value > 1) the four-digit index (tHXM2004VC) of the “radical” to which it is assigned.
Property Tag: tHXM2004RadBrk
Status: Provisional
Category: Radical-Stroke Counts
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tHXM2004 for the full bibliographic reference.] This field marks the transitions between “radicals” in tHXM2004. All left-hand, top- and bottom-spanning components are “radicals” in his system, and hence the relatively large set of (474) “radicals”.
Property Tag: tHXM2004TL
Status: Provisional
Category: Variant Relations
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tHXM2004 for the full bibliographic reference.] This field marks with an asterisk (adjacent) members of the same tHXM2004VC (TL = 同類), or marks with “???” characters having only virtual tLFW1997 mappings.
Property Tag: tHXM2004Usage
Status: Provisional
Category: Dictionary-like Data
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tHXM2004 for the full bibliographic reference.] This field gives the tHXM2004 usage class. There are three classes (1, 2, 3) according to commonness in the Tangut sources, 1 being the most common and 3 the least common. Each of these usage classes is arranged separately by Radical/Stroke (see tHXM2004RS).
• Category 1 has 5,981 forms (tokens, including variants), distributed over 5,784 types (there are 197 variants; 5,784 + 197 = 5,981); the characters in this category are rather well attested in the surviving literature (occurring in two or more sources, including 《同音》TY and 《文海》WH), and have mappings to Lǐ (1997) and Sofronov (1968). Only 2 characters in this class lack mappings to Lǐ (1997), HXM 1066 = varclass 1015 (《同音》乙 47B21) and 1877 = varclass 1805 (雜字 06B6, 《同義》甲 0916.01).
• Category 2 contains only 22 poorly attested “孤證” characters, occurring only once in a version of a primary lexical source (《同音》, 《文海》, 《合編》); 7 of these have mappings to Lǐ (1997), 2 of these 7 also having Sofronov (1968) mappings; there are no variant sets in this class. A total of 15 characters in this class lack mappings to Lǐ (1997) (HXM varclasses: 5786, 5788, 5790, 5792, 5794, 5796, 5797, 5798, 5799, 5800, 5802, 5803, 5804, 5805, 5806).
• Category 3 includes a total of 55 rare graphs, attested only in secondary lexical sources, or in commentaries; only 3 of these graphs have mappings to Lǐ (1997), and none has a mapping to Sofronov (1968); there are 7 variant pairs in this class, including mis-spellings. A total of 52 characters in this class lack mappings to Lǐ (1997) (HXM varclasses: 5807, 5808, 5809, 5810, 5811, 5812, 5813, 5814, 5815, 5816, 5818, 5819, 5820, 5821, 5822, 5823, 5824, 5825, 5826, 5827, 5828, 5829, 5830, 5831, 5833, 5834, 5835, 5836, 5837, 5838, 5839, 5840, 5841, 5842, 5843, 5844, 5845, 5846, 5847, 5848, 5849, 5850, 5851, 5852, 5853, 5854, 5855, 5857, 5858, 5859, 5860, 5861).
See also tHXM2004Order for comments on a virtual “Category 4”.
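The category figures in this Description can be cross-checked with a few lines of arithmetic. The variable names are illustrative; all numbers are those quoted in the text.

```python
# Types (variant classes) per usage category, as quoted in the Description.
category1_types = 5784
category1_variants = 197
category2_types = 22
category3_types = 55

# Category 1 tokens = types + variants.
category1_forms = category1_types + category1_variants
print(category1_forms)  # 5981

# The three categories together account for all unique standard-style characters.
total_types = category1_types + category2_types + category3_types
print(total_types)  # 5861

# Characters lacking Lǐ (1997) mappings in Categories 2 and 3.
category2_unmapped = category2_types - 7
category3_unmapped = category3_types - 3
print(category2_unmapped, category3_unmapped)  # 15 52
```

The total of 5,861 agrees with the Records count given for tHXM2004 and tHXM2004VC.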
Property Tag: tHXM2004VC
Status: Provisional
Category: Variant Relations
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^(\d{4})$/; $1 >= 1 && $1 <= 5861;
Records: 5861
Description: [See tHXM2004 for the full bibliographic reference.] This field gives the tHXM2004 varclass (variant class) assignment of the glyph. Note that four tHXM2004 glyphs (tokens) were each assigned by him to two tHXM2004VC varclasses (characters, types), and these are intentionally disunified in the UCS. tHXM2004 1820 → tHXM2004VC [1748,5755]; 5950 → [1748,5755]; 2963 → [2823,2849]; 2934 → [2823,2849]. See tLFW1997VC [2.4664, 2.5746], i.e. tLFW1997 [4664,4665,5746,5982]. See also tUniMCCC.
Property Tag: tLFW1986
Status: Provisional
Category: Dictionary Indices
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^(\d{4})$/; ($1 >= 1 && $1 <= 5815) or ($1 >= 9990 && $1 <= 9993)
Records: 5782
Description: 李范文 Lǐ Fànwén (1986): 《同音研究》Tóngyīn Yánjiū. 寧夏: 寧夏人民出版社. [‘Homophones’ Research.] The native Tangut rhyme book Tóngyīn (TY; see tHXM2004 for manuscript collation) enumerates a total of 5,815 Tangut characters, which is reduced to 5,809 by elimination of duplicates. The font “xixia.ttf” (Column W in the Multi-Column Code Chart) was produced from scans of this source (see tLFW1986PUA, and tLFW1986note). All field names beginning with tLFW1986 relate to this source.
Property Tag: tLFW1986B5
Status: Provisional
Category: Other Mappings
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tLFW1986 for the full bibliographic reference.] This field contains Academia Sinica’s Big5-based encoding of this character (see also tLFW1986PUA).
Property Tag: tLFW1986Init
Status: Provisional
Category: Dictionary-like Data
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tLFW1986 for the full bibliographic reference.] This field assigns one of nine phonological classes to the syllable’s initial (cp. tLFW1997Init; see also tLFW1986Rhyme).
Property Tag: tLFW1986PUA
Status: Provisional
Category: Other Mappings
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^([0-9a-f]{4})$/i; hex($1) >= 0xE000 && hex($1) <= 0xF6B0;
Records: 5782
Description: [See tLFW1986 for the full bibliographic reference.] This field contains Academia Sinica’s Unicode Private Use Area (PUA) encoding of tLFW1986B5. The “xixia.ttf” (Column W) font in the Multi-Column Code Chart derives from this font, correcting several glyph errors, and adding four missing TY characters (see tLFW1986note).
Property Tag: tLFW1986Rad
Status: Provisional
Category: Radical-Stroke Counts
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Syntax: /^(\d{1,3})$/; $1 >= 1 && $1 <= 364;
Records: 5782
Description: [See tLFW1986 for the full bibliographic reference.] This field assigns the tLFW1986 radical (Radical/Stroke index, LFW 1986:771-845; cp. tLFW1997Rad).
Property Tag: tLFW1986ResStr
Status: Provisional
Category: Radical-Stroke Counts
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tLFW1986 for the full bibliographic reference.] This field contains the Residual Stroke Count, as a single value or as a hyphenated range (see tLFW1986Rad; cp. tLFW1997ResStr).
Property Tag: tLFW1986Rhyme
Status: Provisional
Category: Dictionary-like Data
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tLFW1986 for the full bibliographic reference.] This field assigns a phonological class to the syllable’s final (+tone). The first digit gives the tone class, and number(s) following the decimal point indicate the final class (cp. tLFW1997Rhyme; see also tLFW1986Init).
Property Tag: tLFW1986YZ
Status: Provisional
Category: Dictionary Indices
Added: 5.X [2007-08-19]
Modified: 5.X [2007-08-19]
Description: [See tLFW1986 for the full bibliographic reference.] This field gives the grid coordinates for the character on the page of the TY hand-copy; a letter “A” or “B” indicates a right- or left-hand page (respectively), and the digit(s) before this letter indicate the page number; the first digit after this letter gives the column number on the page (counting from the right), and the second digit gives the number of the character in the column (counting from the top down).
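The coordinate scheme described above can be unpacked as in the following sketch. The function name `parse_yz` and the dictionary keys are hypothetical. The pattern here requires exactly two digits after the letter (one for the column, one for the position), which is narrower than the first-approximation Syntax given in 4.1; how a value with a single trailing digit should be read is not specified in this document, so such values are rejected rather than guessed at.

```python
import re

# page number, page side, column digit, position-in-column digit
YZ_RE = re.compile(r"(\d{1,2})([AB])(\d)(\d)", re.ASCII)


def parse_yz(value):
    """Decompose a tLFW1986YZ value into page, side, column and position."""
    match = YZ_RE.fullmatch(value)
    if match is None:
        raise ValueError("unsupported tLFW1986YZ value: %r" % value)
    page, side, column, position = match.groups()
    return {
        "page": int(page),
        "side": "right" if side == "A" else "left",  # A = right-hand, B = left-hand
        "column": int(column),      # counted from the right
        "position": int(position),  # counted from the top down
    }


# 47B21 is the TY reference quoted in the tHXM2004Usage Description.
print(parse_yz("47B21"))
```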
Property TagtLFW1986note
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1986 for the full bibliographic reference.] This field contains notes from the proofing of the Academia Sinica font and mapping data. As with the tLFW1997note data, it may be useful to font mappers/proofers. The four-digit codes beginning each record point to a specific tLFW1986 record.
Property TagtLFW1997
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Syntax/^(\d{4})$/; $1 >= 0001 && $1 <= 6217;Records5910
Description李范文 Lǐ Fànwén (1997):《夏漢字典》Xià-Hàn Zìdiǎn [Tangut / Chinese Dictionary; ISBN: 7-5004-2113-3.] This dictionary has 6,000 numbered entries, including variants and duplicates mapped in the dictionary entries. The four-digit serial numbers in this source reconcile the serial numbers and glyphs appearing in the “Four-Corner” index (LFW 1997:1-30) to the serial numbers and glyphs in the body of the dictionary. Serial numbers > 6000 are virtual mappings (for non-tLFW1997 characters). The font “xiahan.ttf” (Column X in the Multi-Column Code Chart) was produced from scans of the “Four-Corner” index of this source (p. 1-30; see tLFW1997Index). All field names beginning with tLFW1997 relate to this source.
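The Syntax entry and the “virtual mapping” rule above translate directly into a small check. The sketch below assumes only what the record states (four digits, range 0001..6217, serial numbers above 6000 are virtual); the function name `classify_serial` and the labels it returns are illustrative.

```python
import re

SERIAL_RE = re.compile(r"([0-9]{4})")

def classify_serial(value):
    """Return "dictionary", "virtual", or None for an invalid serial number."""
    m = SERIAL_RE.fullmatch(value)
    if m is None:
        return None
    n = int(m.group(1))
    if not 1 <= n <= 6217:
        return None
    # Serial numbers above 6000 are virtual mappings for characters
    # that do not appear in the 1997 dictionary itself.
    return "virtual" if n > 6000 else "dictionary"

for s in ("0001", "6000", "6001", "6218"):
    print(s, classify_serial(s))
```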
Property TagtLFW1997FCC
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field contains the “Four Corner Code” data from the index (LFW 1997:1-30), in the form “1010.00” (prefixed with a tLFW1997 serial number, where tLFW1997VC has two or more members). The first four digits give the Four-Corner Code, and the final two digits serve to distinguish identical codes by means of the codes of the two stroke types at the bottom center of the character. In LFW’s 1997 FCC system (which serializes the dictionary entries), there are three “single stroke” classes (1..3), and six “compound (assemblages of strokes)” classes (4..9), for a total of 9 FCC classes.
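A single Four Corner Code value of the form “1010.00” can be split into its parts as sketched below. The optional serial-number prefix is assumed to use the same “four digits + colon + space” form given for tLFW1997PUA; that assumption, the function name `parse_fcc`, and the prefixed sample value are illustrative only.

```python
import re

# Optional "dddd: " serial prefix (assumed form), four corner digits,
# a decimal point, and two digits for the bottom-center stroke types.
FCC_RE = re.compile(r"(?:([0-9]{4}):\s)?([0-9]{4})\.([0-9]{2})")

def parse_fcc(value):
    """Return a dict with serial (or None), corners and bottom, or None."""
    m = FCC_RE.fullmatch(value)
    if m is None:
        return None
    serial, corners, bottom = m.groups()
    return {"serial": serial, "corners": corners, "bottom": bottom}

print(parse_fcc("1010.00"))
print(parse_fcc("3004: 1010.00"))  # hypothetical prefixed value
```

The digits are kept as strings so that each corner can be inspected individually without losing leading zeros.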
Property TagtLFW1997GlossChi
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field contains a version of the Chinese gloss data appearing in the lexical entries of tLFW1997. Input of this data is a work in progress, and contributions are welcome. A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997GlossEng
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field contains the English gloss data from tLFW1997, spell-checked, corrected, and augmented. A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members. The English gloss given in tLFW1997 provides a short translation of some of the more important subentries in the Chinese gloss (tLFW1997GlossChi): the Chinese text often contains a great deal more information. The following conventions were used in the input of this data: “[aux.]” indicates an “auxiliary verb” (of some type); “[modal aux.]” → “modal auxiliary”; “[prep.]” → “preposition”; “[translit.]” → “transliteration” (“音”; indicates the class of Tangut characters used in transliteration of non-native syllables, i.e. Chinese, Buddhist terms); in the few cases where a Chinese gloss was present in the original, but no English translation at all was given in the lexical entry, an English gloss has been added here; “[var.]” → “variant”: where there was no Chinese gloss and no English gloss, but there is a cross-reference to a variant form of the character (see tLFW1997, tLFW1997VC, or tLFW1997vars for the variant mappings, not duplicated here); “???” → “不識” (indicates that the meaning of the Tangut character was unknown to LFW, and the character is alone in its varclass); several other classes of character usage are given in square brackets, e.g.: “[affix]”, “[animal name]”, “[bird name]”, “[grass name]”, “[insect name]”, “[place name]”, “[prefix]”, “[star name]”, “[suffix]”, “[surname]”, “[tree name]”, “[trigram name]”, “[vegetable name]”, “[verb]”.
Property TagtLFW1997Index
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field gives the grid coordinates for the character in the index (LFW 1997:1-30), reconciling this data with the serialization in the main body of the dictionary. (A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.) Values have the form “01-06-05”, with three hyphenated pairs of digits: “page number (01..30)” - “column number (left-to-right)” - “row number (top-to-bottom)”. This data was generated in the image processing for the building of “xiahan.ttf” (Column X in the Multi-Column Code Chart). Discrepancies between the index and the body of the dictionary were resolved on the basis of the lexical entries themselves (see tLFW1997IndexPR).
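The “01-06-05” form described above is simple to decode. The following sketch handles one unprefixed value and checks only the page range stated in the description (01..30); the function name `parse_index` is illustrative, and no bounds are assumed for columns or rows because the report does not give any.

```python
import re

INDEX_RE = re.compile(r"([0-9]{2})-([0-9]{2})-([0-9]{2})")

def parse_index(value):
    """Return (page, column, row) as integers, or None if invalid."""
    m = INDEX_RE.fullmatch(value)
    if m is None:
        return None
    page, column, row = (int(g) for g in m.groups())
    if not 1 <= page <= 30:  # the index occupies pages 1-30
        return None
    return page, column, row

# The example from the description: page 1, column 6, row 5.
print(parse_index("01-06-05"))
```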
Property TagtLFW1997IndexPR
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field documents “Pair Reversals” in the data for tLFW1997Index (q.v.). A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997Init
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field gives the Initial Class (cp. tLFW1986Init), or “--” if the class is unknown (see tLFW1986). A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997Num
StatusProvisionalCategoryNumeric Values
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field gives the numeric value of characters which have numeric (or more loosely, rather specific quantitative) glosses in tLFW1997; common characters for the cardinal numbers (1..10; 100; 1,000; 10,000; 100,000,000) are marked “[cardinal]”.
Property TagtLFW1997PUA
StatusProvisionalCategoryOther Mappings
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Syntax/(?:\d{4}:\s)?([0-9a-f]{4})/ig; hex($1) >= 0xE000 && hex($1) <= 0xF848;Records5910
Description[See tLFW1997 for the full bibliographic reference.] Sequential Unicode Private Use Area (PUA) mappings (0xE000..0xF848) of all 6,000 tLFW1997 characters, including unified duplicates and variants, plus the 217 non-tLFW1997 characters (at the end). Where the tLFW1997VC varclass has two or more members (i.e., there are multiple rows in the Multi-Column Code Chart), there are multiple PUA code point values, each prefixed with the tLFW1997 serial number followed by “colon + space”. This PUA encoding was used for both “xiahan.ttf” (Column X) and “XXT.ttf” (Column Z) in the Multi-Column Code Chart; “HXM.ttf” (Column Y) uses a different PUA encoding, sequential for that source (see “HXM2007”). All three of these TrueType fonts contain glyphs for the entire repertory of 6,217 characters (with variants and duplicates); the tLFW1986 font (Column W) has only the TY subset.
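A minimal sketch of reading this field in Python follows, using the pattern from the Syntax entry with one extra capture group for the optional serial prefix. The function name `parse_pua` and the multi-value sample string are hypothetical; only the pattern and the range 0xE000..0xF848 come from the record.

```python
import re

# Pattern from the Syntax entry, with the optional serial captured too.
PUA_RE = re.compile(r"(?:([0-9]{4}):\s)?([0-9a-f]{4})", re.IGNORECASE)

def parse_pua(value):
    """Return a list of (serial_or_None, code_point) pairs."""
    pairs = []
    for serial, hexcode in PUA_RE.findall(value):
        cp = int(hexcode, 16)
        if not 0xE000 <= cp <= 0xF848:
            raise ValueError(f"{hexcode} is outside the documented PUA range")
        # findall yields "" when the optional prefix is absent.
        pairs.append((serial or None, cp))
    return pairs

# The documented range holds exactly 6,217 code points.
assert 0xF848 - 0xE000 + 1 == 6217

print(parse_pua("E000"))
print(parse_pua("3004: EBBB 6096: F7CF"))  # hypothetical two-member record
```

The range width matches the repertory size of 6,217 given in the description, which is a useful sanity check when proofing a mapping table.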
Property TagtLFW1997Phon
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field gives a phonological reconstruction of the syllable (as given in the Radical/Stroke index, LFW 1997:1091-1166; reconstructions after 龚煌城 Gong Huang-cherng), or a cross-reference in the form “= #0486” to a variant Tangut character (tLFW1997 serial number); reconstructions in square brackets are variants; data input by Andrew West, and collated by him against data from Guillaume Jacques “龚煌城西夏文拟音”. A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997Rad
StatusProvisionalCategoryRadical-Stroke Counts
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Syntax/(?:\d{4}:\s)?(\d{3})/; $1 >= 1 && $1 <= 385;Records5841
Description[See tLFW1997 for the full bibliographic reference.] This field assigns the tLFW1997 radical (Radical/Stroke index, LFW 1997:1091-1166; cp. tLFW1986Rad). A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997ResStr
StatusProvisionalCategoryRadical-Stroke Counts
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field assigns the tLFW1997Rad Residual Stroke Count (Radical/Stroke index, LFW 1997:1091-1166; cp. tLFW1986ResStr), as a hyphenated range. A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997Rhyme
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field gives the Rhyme Class (cp. tLFW1986Rhyme), or “--” if the Rhyme Class is not given. A tLFW1997 serial number is prefixed, where tLFW1997VC has two or more members.
Property TagtLFW1997VC
StatusProvisionalCategoryVariant Relations
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This data derives from (and is best understood in relation to) tLFW1997vars, grouping variants according to the size of the variant class, into doubles, triples, and quadruples. A decimal value of the type “2.3004” indicates a varclass of size two (two adjacent records with the same Unicode code point in the “Multi-Column Code Chart”), unified under tLFW1997 index “3004” (the lowest index in the pair “3004” and “6096”). If tLFW1997 is virtual (and tLFW1997vars is empty), this record contains the tHXM2004 or tLFW1986 (TY) mapping (redundantly). A value of zero indicates that there is no tHXM2004 mapping. See tUniMCCC.
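The decimal form “2.3004” can be read as “class size, then unifying index”. The sketch below handles only that form and returns None for anything else (such as the zero value or the redundant mappings stored for virtual records); the function name `parse_varclass` is illustrative, and accepting any single digit for the class size is an assumption, since the description mentions only doubles, triples, and quadruples.

```python
import re

# One digit for the class size (an assumption; the text mentions sizes
# two to four), a decimal point, then a four-digit tLFW1997 index.
VC_RE = re.compile(r"([0-9])\.([0-9]{4})")

def parse_varclass(value):
    """Return (class_size, unifying_index) or None for other value forms."""
    m = VC_RE.fullmatch(value)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

# The example from the description: a pair unified under index 3004.
print(parse_varclass("2.3004"))
```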
Property TagtLFW1997note
StatusProvisionalCategoryDictionary-like Data
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This field (extracted from the main “xiahan” database controlling the whole repertory) contains notes created in the various stages of the font and mapping proofing process. A four-digit code prefix “0001: ” points to a specific tLFW1997 record. These notes in general relate to tUniRepGlyph (Column Z), though they may relate to other columns (W,X,Y) in the “Multi-Column Code Chart” (see tUniMCCC). This data identifies a subset of problematic glyphs, and is included in the public release since it may be useful to anyone proofing an existing Tangut font or set of mappings. In particular, we here note a number of issues relating to errors or divergences in the print sources, and errors in the Mojikyo character set bitmaps and mappings. Noted glyph errors relating to Columns (X,Y,Z) in the “Multi-Column Code Chart” were corrected; some glyph errors in tLFW1986 (Column W) are simply noted (see also tLFW1986note).
Property TagtLFW1997vars
StatusProvisionalCategoryVariant Relations
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description[See tLFW1997 for the full bibliographic reference.] This data duplicates the tLFW1997 mapping of duplicates and variants, but sometimes includes some additional information. Each record contains one or more four-digit tLFW1997 codes; a “?” prefix indicates a questionable variant assignment; records with line-initial “?” are notes, indicating that all noted variants are speculative; if no variants are noted after a leading “???”, this indicates a non-tHXM2004 character that is not a member of a tLFW1997 varclass (i.e. rare or poorly understood characters).
Property TagtNevsky
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
DescriptionНевский, Н. А. (Николай Александрович) [1892-1938] (1960) Тангутская Филология. Издательство Восточной литературы, Москва [Tangut Philology. 2 vols (Russian), Moscow]. UCB MAIN PL 3801 S5N4 v.1-2. This data is unpublished material provided by Sofronov during a visit to Taiwan; it chiefly maps Sofronov’s serial numbers to positions in Nevsky’s dictionary. A character appears at most four times in Nevsky’s dictionary, so the database divides this data into four columns. (This field may have a maximum of four space-delimited values.)
Property TagtNishida
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
DescriptionSerial numbers from 《西夏語的研究》 (西田龍雄 Nishida Tatsuo), vol. 2, 西夏文字小字典 [Small Dictionary of Tangut Characters], Appendix I, p. 303-507.
Property TagtSofronov
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
DescriptionСофронов, М. В. (M. V. Sofronov): 《西夏語文法》 (Грамматика Тангутского Языка [Grammatika Tangutskogo Yazyka ‘Tangut Grammar’], 1968).
Property TagtUniMCCC
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
DescriptionUnicode Multi-Column Code Chart. This field documents the rows of mappings appearing in the “Multi-Column Code Chart”. For each tLFW1997VC varclass with two or more members, the “Multi-Column Code Chart” has multiple rows per Unicode code point. Column designations in the “Multi-Column Code Chart” are as follows (W: tLFW1986; X: tLFW1997; Y: tHXM2004; Z: tUniRepGlyph). The mappings between the first three columns (W,X,Y) for a given record are given as a single tUniMCCC value, and each tUniMCCC record will have two or more values (except for the four cases with only one value; see tHXM2004VC). If there is no tUniMCCC value for a given UCS code point, this code point has but a single row in the “Multi-Column Code Chart”, and a tUniMCCC value would be simply the concatenation of any source mappings, at most one per source (W,X,Y). See also tUniRepGlyph.
Property TagtUniRepGlyph
StatusProvisionalCategoryVariant Relations
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Syntax/^[01]$/; /^(\d{4})$/; $1 >= 0001 && $1 <= 6217;Records298
DescriptionOriginally a binary field, controlling selection of the “representative glyph” appearing in the Single-Column Code Chart. A digit one (1) in a record selected that record’s glyph as the representative glyph. In varclasses with more than one member, exactly one class member is selected; in varclasses with only a single member, that one member is (of course) selected. In export of this data, the binary value == 1 is replaced by the associated tLFW1997 serial number (identifying the record as a whole providing mappings to the rep glyph; see tUniMCCC for the full set of mappings). In a few cases errors in the print-source source-mappings slightly confuse the issue of rep glyph identification. In general the Unicode rep glyphs seek to adhere as closely as possible to the stroke counts and components given in tHXM2004, though the lack of explicit stroke counts in that source, which reflects general variation and uncertainty in stroke counting, imposes a practical limit. For example, the “hp” (橫撇) stroke is variably treated as “h” + “p” (and vice versa), and we have not sought to clarify this beyond what is evident in the sources.
Property TagtUnicode
StatusProvisionalCategoryWG Mappings
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Syntax/^U\+([0-9A-F]{5})$/; hex($1) >= 0x17000 && hex($1) <= 0x18715;Records5910
DescriptionUnicode Code Point. The “U+” prefixed uppercase hexadecimal representation of the Unicode “UCS Plane 1” code point assignment of this Tangut character (Column 1 in “UniTangut.txt”). The code point assignment itself can be viewed as a property of the abstract character, especially in the encoding process, and in the mapping of the abstract character to concrete instances in particular sources. The Unicode Code Point (number) and associated Representative Glyph (image) together constitute a fundamental unit of information in Unicode character encoding (see also tUniRepGlyph).
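The Syntax entry for tUnicode can be checked mechanically, as in the sketch below. The range U+17000..U+18715 is the allocation proposed in this draft and is taken from the record as-is; the function name `parse_tunicode` is illustrative.

```python
import re

TUNICODE_RE = re.compile(r"U\+([0-9A-F]{5})")

def parse_tunicode(value):
    """Return the code point as an int, or None if the value is invalid."""
    m = TUNICODE_RE.fullmatch(value)
    if m is None:
        return None
    cp = int(m.group(1), 16)
    if not 0x17000 <= cp <= 0x18715:  # range proposed in this draft
        return None
    return cp

# The proposed range holds 5,910 code points, matching the Records count.
assert 0x18715 - 0x17000 + 1 == 5910

print(hex(parse_tunicode("U+17000")))
```

Lowercase hexadecimal and a lowercase “u+” prefix are rejected, since the description calls for the uppercase representation.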
Property TagtWenhai
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description《文海》 Wén Hǎi (Ксения Борисовна Кепинг [K. B. Keping] et al., 1969).
Property TagtWenhaiYanjiu
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description《文海研究》 Wén Hǎi Yánjiū (史金波, 白濱, 黃振華, 1983).
Property TagtYitongYilei
StatusProvisionalCategoryDictionary Indices
Added5.X [2007-08-19]Modified5.X [2007-08-19]
Description《義同》一類 Yì Tóng yīlèi (李范文, 韓小忙, 2000; cf. 韓小忙 2004:354).


[Feedback] http://www.unicode.org/reporting.html
For reporting errors and requesting (or offering) information online.
[Reports] Unicode Technical Reports
For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode] The Unicode Standard, Version 5.0
[Versions] Versions of the Unicode Standard
For details on the precise contents of each version of the Unicode Standard, and how to cite them.


This section indicates the changes introduced by each revision.


Copyright © 2007 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.






