Proposed Update Unicode® Standard Annex #45

U-source Ideographs

Version	Unicode 15.1.0
Editor	Ken Lunde 小林劍󠄁
Date	2023-06-25
This Version	https://www.unicode.org/reports/tr45/tr45-28.html
Previous Version	https://www.unicode.org/reports/tr45/tr45-27.html
Latest Version	https://www.unicode.org/reports/tr45/
Latest Proposed Update	https://www.unicode.org/reports/tr45/proposed.html
Revision	28

Summary

This annex describes U-source ideographs as used by the Ideographic Research Group (IRG) in its CJK ideograph unification work.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].

1 Introduction
- 1.1 U-Source Identifier Prefixes
- 1.2 Classes of U-Source Ideographs
- 1.3 Data Files
- 1.4 Submission Process
2 Text File Data
- 2.1 The Status Field
- 2.2 The Source Field
- 2.3 The Comments Field
- 2.4 The First Residual Stroke Field
- 2.5 The Ideographic Description Sequence (IDS) Field
3 U-Source Additions
References
Acknowledgements
Modifications

1 Introduction

This annex describes a subset of IRG U-source ideographs as used by the Ideographic Research Group (IRG) in its CJK ideograph unification work.

The IRG is a subgroup of ISO/IEC JTC 1/SC 2/WG 2 and has the formal responsibility of developing extensions to the encoded repertoires of unified CJK ideographs. The IRG consists of members of ISO/IEC contributors and liaison organizations, including many East Asian countries and the USA. The Unicode Consortium participates in this group as a liaison member of ISO. Each time the IRG begins the process of preparing a new CJK Unified Ideographs extension, IRG members submit a set of ideographs for potential inclusion in that extension. The IRG classifies these into sources, one for each member body, such as the J-source for Japan, the V-source for Vietnam, and so on.

The IRG U-source consists primarily of submissions from the Unicode Technical Committee (UTC). These include ideographs submitted by the UTC on behalf of the United Kingdom. The IRG U-source also includes ideographs encoded because they were originally submitted to the IRG by some other member body. Some of these are ideographs which were submitted to the UTC for consideration but were not submitted by the UTC to the IRG, and were later associated with a U-source. (The IRG refers to such cases as horizontal extensions.) Others were left without a formal IRG source by changes made by the IRG in its source mappings; these were “adopted” by the UTC as explained below.

1.1 U-Source Identifier Prefixes

Each ideograph in the IRG U-source has an identifier which consists of two or three letters (the prefix) followed by a hyphen and five zero-padded decimal digits. The prefixes currently used are listed below.

Identifier Prefix	Responsible Body
`UTC`	Unicode Technical Committee
`UCI` (obsolete)	Ideographic Research Group
`UK`	United Kingdom
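
The identifier format described above can be checked mechanically. The following is a minimal sketch in Python, not part of the U-source tooling; the names `U_SOURCE_ID` and `parse_identifier` are hypothetical and chosen only for illustration.

```python
import re

# Prefixes from the table above. UCI is obsolete but still occurs in existing data.
# The pattern requires exactly five ASCII decimal digits after the hyphen.
U_SOURCE_ID = re.compile(r"(UTC|UCI|UK)-([0-9]{5})")


def parse_identifier(identifier: str) -> tuple[str, int]:
    """Split a U-source identifier into (prefix, number).

    Raises ValueError if the string does not have the documented shape.
    """
    match = U_SOURCE_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a U-source identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


print(parse_identifier("UTC-00932"))  # ('UTC', 932)
print(parse_identifier("UK-01313"))   # ('UK', 1313)
```

Using `fullmatch` rather than `match` rejects strings with trailing characters, such as an identifier with six digits.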

This UAX provides a formal reference to U-source ideographs, so that they may be referred to in other documents by their U-source identifiers. In many instances, it also provides a public record of ideographs which were submitted to the UTC for consideration.

1.2 Classes of U-Source Ideographs

The U-source database consists of four classes of CJK ideograph:

Ideographs which have been submitted to the UTC as potential candidates for encoding. Note that not all such ideographs are actually suitable for encoding as a CJK Unified Ideograph. Those that are not have a status of Rejected.
Placeholder ideographs required to maintain continuity of U-source identifiers. Early versions of the U-source database allowed for the possibility of ideographs being withdrawn, generally because they had been added erroneously. Replacement ideographs were added in their place to keep any U-source identifier from being skipped. All such ideographs have a status of Rejected. (Ideographs are no longer withdrawn from the U-source database after they have been added.)
Placeholder ideographs required to provide encoded CJK Unified Ideographs with IRG source information. All CJK Unified Ideographs in ISO/IEC 10646 are required to have at least one source identifier. Changes to IRG source information, however, can leave a given ideograph without any such sources. In such cases, the ideograph is included in the U-source database to guarantee it has at least one source. Such ideographs are indicated by a source prefix of UCI instead of UTC. This practice is no longer followed and the UCI prefix is now obsolete.
Formerly, further changes to IRG source information may have restored non–U-source sources or provided new ones for such ideographs. In such cases, the ideograph retained the UCI prefix, but the encoded ideograph lost its kIRG_USource property value in the Unihan database and ISO/IEC 10646.
Ideographs submitted by the UK to the IRG via the UTC for Working Set 2015. These ideographs are included to provide a standard, stable source reference for IRG purposes. Such ideographs are indicated by a source prefix of UK instead of UTC. UK-submitted ideographs are no longer submitted through the UTC. No further ideographs with the UK prefix will be added. For more information about these ideographs, please see UK-Source Ideographs.

1.3 Data Files

The actual U-source data are found in the following three additional files:

[Glyphs45], a PDF file showing the glyphs for the U-source ideographs. This document is a simple matrix with the representative glyph for a U-source ideograph and its identifier in each cell. The representative glyphs used are drawn in a modern style, such as is used by the IRG in its unification work. The use of modern forms for some ideographs originally drawn in a seal style should not be taken as implying any mechanism for the inclusion of seal forms as a whole in [Unicode].
[RSChart45], a PDF file providing a radical-stroke index for U-source ideographs. This index uses the glyphs found in the main glyph chart.
[Data45], a text file containing information regarding the ideographs. A detailed description of this file can be found in Section 2, Text File Data.

1.4 The Submission Process

Additions to the U-source may be proposed by submitting a document to the UTC. The document should contain adequate data to process the request. This includes:

A representative glyph of sufficient detail to use in the production of a TrueType font and in IRG work. Requests with a large number of ideographs should include a TrueType font containing the representative glyphs.
Metadata as found in the various U-source database fields.
Sufficient evidence for inclusion in a proposal to the IRG.

Requests with insufficient data are likely to be declined.

2 Text File Data

The text file consists of UTF-8 text. Each line consists of the following fields separated by semicolons.

The ideograph’s U-source identifier. This consists of the letters UTC, UK, or UCI, followed by a hyphen and five zero-padded decimal digits, starting with 00001. Identifier numbers are not skipped, and are not reused. Identifier numbers are assigned sequentially. Ideographs whose prefix is UTC are either those submitted to the UTC for consideration or those included in the U-source database for placeholder purposes. Ideographs included to guarantee an IRG source reference have the prefix UCI.
A string indicating the ideograph’s current status. These are described below.
A Unicode code point. This field is generally empty. Its interpretation depends on the ideograph’s status and is documented below.
A radical-stroke index for the ideograph, as described in Unicode Standard Annex #38, “Unicode Han Database (Unihan)” [UAX38].
A KangXi dictionary index for the ideograph, as described in Unicode Standard Annex #38, “Unicode Han Database (Unihan)” [UAX38]. This field is no longer used and contains no data.
An Ideographic Description Sequence (IDS) for the ideograph, if one can be generated.
A string indicating the ideograph’s source and an optional index within the source.
General comments regarding the ideograph.
The ideograph’s total stroke count, as described in Unicode Standard Annex #38, “Unicode Han Database (Unihan)” [UAX38].
The ideograph’s first residual stroke.
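
The ten fields listed above can be read into a simple record type. The following Python sketch assumes only what is stated above: one record per line, fields separated by semicolons, in the listed order. The type name `USourceRecord`, the function `parse_record`, and the sample line (including its identifier and source index) are invented for illustration and are not taken from [Data45].

```python
from typing import NamedTuple


class USourceRecord(NamedTuple):
    """One line of the text file, with fields in the documented order."""
    identifier: str
    status: str
    code_point: str             # generally empty
    radical_stroke: str
    kangxi_index: str           # no longer used, contains no data
    ids: str
    source: str
    comments: str
    total_strokes: str
    first_residual_stroke: str


def parse_record(line: str) -> USourceRecord:
    """Split one data line on semicolons and check the field count."""
    fields = line.rstrip("\r\n").split(";")
    if len(fields) != len(USourceRecord._fields):
        raise ValueError(f"expected 10 fields, found {len(fields)}")
    return USourceRecord(*fields)


# Invented sample line, for illustration only.
sample = "UTC-99999;NoAction;;1.0;;〾一;UTCDoc L2/00-000 1;illustrative only;1;0"
record = parse_record(sample)
print(record.identifier, record.status, record.ids)
```

Splitting on every semicolon is safe here because the comments field cannot contain semicolons and IDS entities are delimited by square brackets in the file.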

2.1 The Status Field

The status field reflects the ideograph’s current status. The value of this field can change over time. The possible values are Comp, ExtA, ExtB, ExtC, ExtD, ExtE, ExtF, ExtG, ExtH, FutureWS, NoAction, Rejected, UK-2015, URO, Variant, WS-2017, WS-2021, and strings matching the regular expressions UTC-\d{5}, UCI-\d{5}, and UK-\d{5}. New values may be added in the future and existing values removed.

Status	Meaning	Value of Unicode Field
`Comp`	Encoded as a CJK Compatibility Ideograph	The ideograph’s code point
`ExtA`	Encoded in Extension A	The ideograph’s code point
`ExtB`	Encoded in Extension B	The ideograph’s code point
`ExtC`	Encoded in Extension C	The ideograph’s code point
`ExtD`	Encoded in Extension D	The ideograph’s code point
`ExtE`	Encoded in Extension E	The ideograph’s code point
`ExtF`	Encoded in Extension F	The ideograph’s code point
`ExtG`	Encoded in Extension G	The ideograph’s code point
`ExtH`	Encoded in Extension H	The ideograph’s code point
`FutureWS`	Earmarked to be included in a proposal from the UTC to the IRG for a future extension	The code point of an ideograph to which this is related, generally as a variant
`NoAction`	Appropriate disposition has not been determined	The code point of an ideograph to which this is related, generally as a variant
`Rejected`	Not suitable for encoding as a CJK Unified Ideograph (see below)	The code point of an ideograph to which this is related, generally as a variant
`UK-2015`	Submitted by the UK for IRG Working Set 2015	The code point of an ideograph to which this is related, generally as a variant
`URO`	Encoded in the URO, or as a unified ideograph in the CJK Compatibility Ideographs block	The ideograph’s code point
`Variant`	A variant of an encoded ideograph (see below)	The code point of the ideograph of which this is a variant
`WS-2017`	Submitted by the UTC for IRG Working Set 2017	The code point of an ideograph to which this is related, generally as a variant
`WS-2021`	Submitted by the UTC for IRG Working Set 2021	The code point of an ideograph to which this is related, generally as a variant
Strings matching the regular expressions `UTC-\d{5}`, `UCI-\d{5}`, and `UK-\d{5}`	Duplicate entries deprecated in favor of other entries; the status value is the identifier of the non-deprecated ideograph	The ideograph’s code point, or the code point of an ideograph to which this is related, generally as a variant
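
The table above ties the meaning of the code point field to the status value. A consumer of the data might encode that relationship as in the following Python sketch. The function name `code_point_role` and the four return labels are hypothetical, and the status sets simply mirror the rows of the table.

```python
import re

# Statuses for which the code point field holds the ideograph's own code point.
ENCODED = {"Comp", "ExtA", "ExtB", "ExtC", "ExtD", "ExtE",
           "ExtF", "ExtG", "ExtH", "URO"}
# Statuses for which the field holds a related ideograph, generally a variant.
RELATED = {"FutureWS", "NoAction", "Rejected", "UK-2015", "WS-2017", "WS-2021"}
# A status that is itself a U-source identifier marks a deprecated duplicate.
DUPLICATE = re.compile(r"(UTC|UCI|UK)-[0-9]{5}")


def code_point_role(status: str) -> str:
    """Return a label describing how to read the code point field."""
    if status in ENCODED:
        return "own"
    if status == "Variant":
        return "variant-of"
    if status in RELATED:
        return "related"
    if DUPLICATE.fullmatch(status):
        return "own-or-related"
    raise ValueError(f"unknown status: {status!r}")


print(code_point_role("ExtB"))       # own
print(code_point_role("UK-2015"))    # related
print(code_point_role("UTC-00123"))  # own-or-related
```

Because new status values may be added and existing ones removed, raising on an unknown value is only one possible design; a lenient reader could return a fallback label instead.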

A status of Comp means that the ideograph is encoded in the Unicode Standard as a compatibility ideograph. The value of the ideograph’s IDS field is U+303E 〾 IDEOGRAPHIC VARIATION INDICATOR followed by the ideograph of which it is a compatibility variant. For example, UTC-00932 is encoded as U+FA26 都, which is a compatibility variant of U+90FD 都. UTC-00932 therefore has the IDS field value 〾都 (U+303E U+90FD).

A status of Variant means that the ideograph is a variant of an ideograph encoded in the Unicode Standard. These variants are not limited to z-variants. Other variants include glyphs with components rearranged (for example UTC-00344, which rearranges the components of U+69AB 榫 but is pronounced the same and means the same), simplified versions of encoded ideographs (for example UTC-00842), and ideographs which mean the same and are pronounced the same as encoded ideographs and have a sufficiently similar shape as to be easily mistaken for one another (for example UTC-00399). This is a deliberately less strict, if somewhat more subjective, standard than is used for unification work.

A status of Rejected means that the ideograph is not suitable for encoding as a CJK Unified Ideograph. An example here is UTC-00326, which is a nonce form specifically coined for use with Figure 18-9, Using the Ideographic Description Characters, in [Unicode] and to fill an empty slot in the U-source database. While the ideograph does have an intended meaning (“frog at the bottom of a well”), it isn’t suitable for encoding because of its ad hoc nature and lack of generalized use. The bulk of the ideographs with a status of Rejected are Wenlin-specific z-variants which should be represented, if at all, via a variation sequence defined by Wenlin, not by the UTC.

2.2 The Source Field

The source field consists of source information, which consists of a source tag usually followed by a source-specific index string. Source tags and indices are separated by a space, and multiple source indices are separated by commas. Multiple sources are separated by asterisks.
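
The separators described above suggest a straightforward way to take a source field apart. The following Python sketch is one reading of that description: it splits sources on asterisks, separates the tag from the index string at the first space only (an assumption, made so that an index string may itself contain a space), and splits indices on commas. The function name `parse_source_field` and the sample value are invented for illustration.

```python
def parse_source_field(field: str) -> list[tuple[str, list[str]]]:
    """Return a list of (source tag, list of indices) pairs."""
    sources = []
    for entry in field.split("*"):
        entry = entry.strip()
        if not entry:
            continue  # an empty field yields no sources
        # Split at the first space only; the rest is the index string.
        tag, _, index_string = entry.partition(" ")
        indices = [part.strip() for part in index_string.split(",") if part.strip()]
        sources.append((tag, indices))
    return sources


# Invented sample: two indices for the WL tag, and a tag with no index.
print(parse_source_field("WL E0A1,E0A2*ABC2"))
# [('WL', ['E0A1', 'E0A2']), ('ABC2', [])]
```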

Note that the sources listed here may not provide adequate evidence of use for IRG work. This is partly because ideographs listed here may not be suitable candidates for encoding, but also because IRG requirements for evidence have become increasingly stringent over time. Many of the ideographs in each of the sets encoded prior to Extension D do not have adequate evidence of use by current IRG standards.

The source tag may be a URI, in which case the index string is the date (year-month-day) when the URI was accessed. The source tag may also be a U-source identifier for cases where an ideograph was added to the U-source twice. The source tags beginning with a lowercase k correspond to fields within the Unihan database. Please consult Unicode Standard Annex #38, “Unicode Han Database (Unihan)” [UAX38], for information on these sources and the format and meaning of the index strings.

The remaining sources are listed below. The left column contains the source tag. The center column contains bibliographic information for the source. The third column contains a description of source index, if any. The description frequently includes a regular expression which the index matches; see Unicode Standard Annex #38, “Unicode Han Database (Unihan)” [UAX38], for more information.

Source Tag	Source Bibliographic Information	Source Index
ABC2	DeFrancis, John. ABC Chinese-English Dictionary. Honolulu: University of Hawaiʻi Press, 1999.	None
Adobe‑CNS1	The Adobe-CNS1 glyph collection	The glyph index within the set matching the regular expression `(C\+)?[0-9]{1,5}`
Adobe‑Japan1	The Adobe-Japan1 glyph collection	The glyph index within the set matching the regular expression `(C\+)?[0-9]{1,5}`
Cheng	Cheng Tso-Hsin, ed. A complete checklist of species and subspecies of the Chinese birds. Beijing: Science Press, 2000.	None
CN	Vũ Văn Kính, ed. Đại Tự Điển Chữ Nôm. Ho Chi Minh City: Nhà xuất bản văn nghệ, 1998.	A string matching the regular expression `[01][0-9]{3}\.[0-9]{2}` indicating the page and position on the page.
DYC	《說文解字•注》 Shuō Wén Jiě Zì — Zhù [Annotated Qīng Dynasty recension of the Eastern Hàn Chinese analytic dictionary SWJZ]. 〖東漢〗許慎著 (121 AD)，〖清〗段玉裁注 (1815)。 [上海古籍出版社, 1981.] See Cook (2003:461 ff; UMI #3105189) for complete references to the various editions: http://linguistics.berkeley.edu/~rscook/html/writing.html#EHC Ideographs from the DYC were added to the U-source database as part of a preliminary exploration of the possibility of encoding them. They will not be used for any effort to actually encode the contents of the DYC and should not be taken as the basis for any such encoding.	A string matching the regular expression `[0-9]{3}\.[0-9]{2}[01]` indicating the page and position on the page.
GB18030-2000	GB 18030-2000	None
LDS	Required Character List Supplied by The Church of Jesus Christ of Latter-day Saints	The ideograph index within the document
Shangwu	Huang Giangshang, ed. Shangwu Xin Cidian. Hong Kong: The Commercial Press, 1991. ISBN 962-07-0133-X	A string matching the regular expression `[0-9]{3}\.[0-9]{2}` indicating the page and position on the page.
TUS	[Unicode]	The ideograph’s code point matching the regular expression `U\+2?[0-9A-F]{4}`
UDR	A defect report filed against the Unicode Standard or other direct communication with the Unicode Editorial Committee	None
UTCDoc	A UTC document	The document number optionally followed by a decimal index for the ideograph within the document
XHC	《现代汉语词典》 [Xiàndài Hànyǔ Cídiǎn = XHC; ‘Modern Chinese Dictionary’]. 中国社会科学院语言研究所词典编辑室编 [Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office, eds.]. 北京: 商务印书馆, 2002. This is a later edition of the `kXHC1983` source.	The page and position information in the format used by the `kXHC1983` source
WG2	A WG2 document	The document number
WL	Wenlin v. 3.1.8 http://www.wenlin.com	The PUA code point assigned the ideograph matching the regular expression `E[0-9A-F]{3}`
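
The regular expressions in the table above can be used to sanity-check source indices. The following Python sketch copies those expressions verbatim into a lookup table. It assumes that the tags appear in the data with an ordinary ASCII hyphen, and the names `INDEX_PATTERNS` and `index_matches` are hypothetical. Tags without a documented pattern are reported as unknown rather than rejected.

```python
import re
from typing import Optional

# Patterns copied from the Source Index column of the table above.
INDEX_PATTERNS = {
    "Adobe-CNS1": r"(C\+)?[0-9]{1,5}",
    "Adobe-Japan1": r"(C\+)?[0-9]{1,5}",
    "CN": r"[01][0-9]{3}\.[0-9]{2}",
    "DYC": r"[0-9]{3}\.[0-9]{2}[01]",
    "Shangwu": r"[0-9]{3}\.[0-9]{2}",
    "TUS": r"U\+2?[0-9A-F]{4}",
    "WL": r"E[0-9A-F]{3}",
}


def index_matches(tag: str, index: str) -> Optional[bool]:
    """Return True or False for tags with a documented pattern, else None."""
    pattern = INDEX_PATTERNS.get(tag)
    if pattern is None:
        return None
    return re.fullmatch(pattern, index) is not None


print(index_matches("TUS", "U+90FD"))  # True
print(index_matches("WL", "F000"))     # False
print(index_matches("ABC2", ""))       # None
```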

2.3 The Comments Field

The comments field is a general-purpose, unstructured field. It is generally empty. It can contain any Unicode character other than tabs, semicolons, and any line-break character. The purpose of this field is to provide any additional relevant information for an ideograph which is not included in any other fields. For example:

The comment field for UTC-00119 indicates why it is inappropriate for encoding.
The comment field for UTC-00595 clarifies its variation relationship to other ideographs.
The comment field for UTC-03215 contains a kCantonese property value pending encoding.

A colon is used within the comments field as a separator for multiple comments.

2.4 The First Residual Stroke Field

The first residual stroke is a number 1 through 5. It indicates the stroke type of the first-written stroke of the ideograph, exclusive of the radical. The first residual stroke is determined by the standard rules for writing CJKV ideographs. The five-stroke system is frequently encountered in East Asian sorting algorithms or input methods. In particular, it is used by the IRG in its unification work and is required for IRG submissions. The five strokes, with their Chinese names in parentheses, are listed below:

A horizontal stroke, such as 一 (橫, héng)
A vertical stroke, such as 丨 (豎, shù)
A downward right-to-left stroke, such as 丿 (撇, piě)
A dot or downward left-to-right stroke, such as 丶 (點, diǎn)
A hook, such as 乙 (折, zhé)

If an ideograph has no residual strokes, its first residual stroke value shall be set to 0 (zero).
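
The numbering above, together with the value 0 for ideographs with no residual strokes, maps naturally onto a small lookup table. The following Python sketch is illustrative only; the names `STROKE_TYPES` and `describe_first_residual_stroke` are hypothetical.

```python
# Stroke types in the order listed above, keyed by field value.
STROKE_TYPES = {
    1: "horizontal (橫, héng)",
    2: "vertical (豎, shù)",
    3: "downward right-to-left (撇, piě)",
    4: "dot or downward left-to-right (點, diǎn)",
    5: "hook (折, zhé)",
}


def describe_first_residual_stroke(value: str) -> str:
    """Translate the field value into a readable description."""
    number = int(value)  # raises ValueError for non-numeric input
    if number == 0:
        return "no residual strokes"
    if number not in STROKE_TYPES:
        raise ValueError(f"first residual stroke out of range: {value!r}")
    return STROKE_TYPES[number]


print(describe_first_residual_stroke("3"))  # downward right-to-left (撇, piě)
print(describe_first_residual_stroke("0"))  # no residual strokes
```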

2.5 The Ideographic Description Sequence (IDS) Field

The Ideographic Description Sequence is of a form used by the IRG in its unification work. These IDSes follow the syntax defined in [Unicode] with two extensions to handle unrepresentable ideographs:

The IDS may be prefixed with U+303E 〾 IDEOGRAPHIC VARIATION INDICATOR to mark the IDS as approximate and not exact.
The IRG defines a number of SGML-like entities for unencoded ideograph components. In IRG work, these are delimited by an ampersand (&) and semicolon (;). To maintain compatibility with the use of semicolons as field separators, they are delimited by square brackets ([ and ]) in [Data45].
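
Both extensions can be handled with a few lines of string processing. The following Python sketch converts between the IRG entity notation and the bracketed notation used in [Data45], and tests for the approximation prefix. The function names are hypothetical, the entity name `EXAMPLE-0001` is invented, and the assumption that entity names contain no ampersands, semicolons, or square brackets is not stated in the text above.

```python
import re

# Assumed: an entity name is any run of characters other than & ; [ ]
_NAME = r"([^&;\[\]]+)"


def to_data45(ids: str) -> str:
    """Rewrite IRG-style entities (&name;) in the bracketed form ([name])."""
    return re.sub("&" + _NAME + ";", r"[\1]", ids)


def to_irg(ids: str) -> str:
    """Rewrite bracketed entities ([name]) in the IRG form (&name;)."""
    return re.sub(r"\[" + _NAME + r"\]", r"&\1;", ids)


def is_approximate(ids: str) -> bool:
    """True if the IDS starts with U+303E IDEOGRAPHIC VARIATION INDICATOR."""
    return ids.startswith("\u303E")


print(to_data45("⿰&EXAMPLE-0001;木"))  # ⿰[EXAMPLE-0001]木
print(to_irg("⿰[EXAMPLE-0001]木"))     # ⿰&EXAMPLE-0001;木
print(is_approximate("〾都"))            # True
```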

3 U-Source Additions

The table below lists the U-source ideographs that were added in each version of the Unicode Standard, with those whose identifier prefixes are other than UTC highlighted for easier identification:

Version	Count	Range
Prior to 6.3.0	952	UTC-00001 .. UTC-00936, UCI-00937, UCI-00938, UTC-00939, UCI-00940 .. UCI-00942, UTC-00943, UCI-00944 .. UCI-00948, UTC-00949 .. UTC-00952
6.3.0	245	UTC-00953 .. UTC-01197
7.0.0	1	UTC-01198
8.0.0	3	UCI-01199, UTC-01200, UTC-01201
9.0.0	1,768	UTC-01202 .. UTC-01312, UK-01313 .. UK-02968, UCI-02969
10.0.0	6	UTC-02970 .. UTC-02975
11.0.0	192	UTC-02976 .. UTC-03158, UCI-03159, UTC-03160 .. UTC-03167
12.0.0	37	UTC-03168 .. UTC-03204
13.0.0	6	UTC-03205 .. UTC-03210
14.0.0	28	UTC-03211 .. UTC-03238
15.0.0	59	UTC-03239 .. UTC-03297
15.1.0	39	UTC-03298 .. UTC-03336
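
Each Count value in the table above is the number of identifiers covered by the corresponding Range value. The arithmetic can be checked with the following Python sketch, which assumes only the notation used in the table (commas between items, `..` between the ends of a range). The function name `count_identifiers` is hypothetical.

```python
def count_identifiers(ranges: str) -> int:
    """Count the identifiers covered by a Range cell from the table above."""
    total = 0
    for item in ranges.split(","):
        # A single identifier yields one end; a range yields two.
        ends = [end.strip() for end in item.split("..")]
        first = int(ends[0].split("-")[1])
        last = int(ends[-1].split("-")[1])
        total += last - first + 1
    return total


print(count_identifiers("UTC-00953 .. UTC-01197"))                                # 245
print(count_identifiers("UCI-01199, UTC-01200, UTC-01201"))                       # 3
print(count_identifiers("UTC-01202 .. UTC-01312, UK-01313 .. UK-02968, UCI-02969"))  # 1768
```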

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Acknowledgements

John H. Jenkins 井作恆 (RIP) was the author of the initial version of this annex, and served as the editor up through and including Version 27 for Unicode Version 15.0.0.

The UTC gratefully acknowledges the contributions of Eiso Chan, Henry Chan, Jaemin Chung, Lee Collins, Richard Cook, Jing Zuoheng, Ken Lunde, Ming Fan, William Nelson, Andrew West, and others to the U-source database.

Modifications

The following summarizes modifications from the previous revision of this annex.

Revision 28

Proposed update for Unicode 15.1.0.
Removed John H. Jenkins 井作恆 as editor.
Added Ken Lunde 小林劍󠄁 as editor.
Changed the N, V, W, and X status values to FutureWS, Variant, Rejected, and NoAction, respectively.
Removed the now-obsolete UK-2015 and WS-2017 status values.
Inserted headers for Section 1.1, Section 1.2, and Section 1.3.
Renumbered Section 1.1 as Section 1.4.
Added Section 3.
Added a paragraph about the editor change to the Acknowledgements section.

Revision 27

Reissued for Unicode 15.0.0.
Changed status values “A”, “B”, “C”, “D”, “E”, “F”, “G”, and “U” to “ExtA”, “ExtB”, “ExtC”, “ExtD”, “ExtE”, “ExtF”, “ExtG”, and “URO”, respectively.
Added status ExtH.
Added Section 2.5.

Modifications for previous versions are listed in those respective versions.

© 2023 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.