Table of Contents
- 1. Introduction
- 1.1. Scope of IRG Work
- 1.2. Scope of This Document
- 2. Development of CJK Unified Ideographs
- 2.1. Principles on Identification of CJK Unified Ideographs
- 2.2. Principles on Submission of Ideographs to IRG
- 2.3. Principles on Production of IRG Working Drafts
- 2.4. Principles on Reviewing IRG Working Drafts
- 2.5. Principles on Discussions at IRG Meetings
- 2.5.1. Record-based Discussion
- 2.5.2. Discussion Procedure
- 2.5.3. Recording of Discussions
- 2.5.4. Time and Quality Management
- 2.6. Principles on Submission of Ideographs to WG 2
- 2.6.1. Checking of Stabilized M-set
- 2.6.2. Preparation for WG 2 Submission
- 3. Procedures
- 4. Guidelines for Comments and Resolutions on Working Sets
- 4.1. Guidelines for M-set
- 4.2. Guidelines for D-set
- 5. IRG Website
- 6. IRG Document Registration
- Annex A: Sorting Algorithm of Ideographs
- Annex B: IDS Matching
- Annex C: Urgently Needed Ideographs
- C.1. Introduction
- C.2. Requirements
- C.3. Dealing with Urgent Requests
- Annex D: Up-to-date CJK Unified Ideograph Sources and Source References
- Annex E: Maintenance Procedure of IRG Working Document Series
- E.1. Introduction
- E.2. IRG Working Document Series
- E.3. Maintenance Procedure
- Annex F: IRG Repertoire Submission Summary Form
- Annex G: Examples of New CJK Unified Ideographs Submissions (aka Vertical Extensions)
- Annex H: [Reserved for future use]
- Annex I: Guideline for Handling of CJK Ideograph Unification and/or Dis-unification Error
- Annex J: Guideline for Correction of CJK Ideograph Mapping Table Errors
- Annex K: List of First Strokes
- Annex L: Guidelines for Forming Working Sets with an Upper Limit
- References
- Glossary
- Versions
1. Introduction
This document is a standing document of the ISO/IEC JTC 1/SC 2/WG 2/IRG for the standardization of Chinese-Japanese-Korean-Vietnamese (CJKV, though CJK will be used throughout this document) Unified Ideographs. It consists of a set of principles and procedures on a number of items relevant to the preparation, submission, and development of repertoires of CJK Unified Ideographs extensions for addition to the ISO/IEC 10646 standard as new CJK Unified Ideographs extension blocks or appended to existing CJK Unified Ideographs blocks. Submitters should check the standard documents, including all the amendments and corrigenda, before preparing new submissions.
For any issue that is not explicitly covered in this document, IRG will follow WG 2 Principles and Procedures and other higher-level directives.
- 1.1. Scope of IRG Work
IRG works on CJK ideograph–related tasks under the supervision of WG 2 (per SC 2 Resolution M20-07). The following is a list of current and completed IRG projects:
- CJK Unified Ideographs Repertoire and its extensions
- Kangxi Radicals and CJK Radicals Supplements — completed
- Production of Ideographic Description Sequences (IDSes)
- International Ideographs Core (IICore) — completed
- CJK Strokes — completed
- Update of CJK Unification Rules
Work on new IRG projects requires the approval of WG 2, and preparation of relevant documents for such approval is required before IRG can officially launch any new projects.
- 1.2. Scope of This Document
The following sections are dedicated to the standardization of CJK Unified Ideographs, describing the set of principles and procedures to be applied in the development of new extensions to the CJK Unified Ideograph Repertoire as specified in Section 1.1.a above. In addition, the maintenance of the IRG website and the registration procedure for IRG documents are detailed in Section 5 and Section 6, respectively.
This document does not cover other IRG work items listed in Section 1.1. Standardizing CJK Compatibility Ideographs maintained in UCS for the purpose of round-trip integrity with other standards is out of IRG scope. However, CJK compatibility characters submitted to WG 2 must be reviewed by IRG to avoid potential problems. For the handling of mis-unification and duplicate ideographs, Annexes I and J of WG 2 Principles and Procedures attached to this document should be referenced.
2. Development of CJK Unified Ideographs
All new extension work must be approved by WG 2 before the actual consolidation and review can be formally carried out. There are no fixed rules for initiating a new extension. IRG can initiate a call for proposals once its current collection is near completion. Any WG 2 member body, authoritative organization, international consortium, or individual expert can initiate a new extension by submitting a proposal which states the need of a required repertoire. Submission of such a proposal must follow the principles and procedures stated in this document. IRG will vet and confirm if the proposal is within its scope of work.
Taking into consideration 1) the urgency and justifications of the proposal; 2) the proposed repertoire size; and 3) IRG’s current workload, IRG may take one of the following actions:
- Endorse the proposal and submit it to WG 2 for approval as an urgently needed repertoire (see Annex C for guidelines on Urgently Needed Ideographs).
- Invite other reviewers to submit characters of a similar nature so as to estimate the real workload before submitting the repertoire to WG 2 for endorsement.
- Accept the proposal as a contribution to a current IRG work item.
- Reject the proposal with justifications. A rejected proposal may be revised and re-submitted to IRG.
- 2.1. Principles on Identification of CJK Unified Ideographs
- 2.1.1. Principles on Encoding
Ideographs that have the same abstract shape are unified under the unification rules (Annex S of ISO/IEC 10646 as well as IRG’s additional examples in IRG’s Working Document Series; see details in Section 2.1.4) and assigned a single character code. A CJK ideographic character can take many actual forms depending on the writing style adopted. Examples of common writing styles include Song style and Ming style as typical print forms, Kai style as a handwritten form, and Cao style as a cursive form. Stylistically different forms of the same character may involve different numbers or different types of strokes or components, which may in turn affect identification of the abstract shape of the character. In order to reach a common ground for identifying abstract shapes to be encoded as distinct CJK Unified Ideographs, IRG only accepts submissions using a print form of glyphs (usually Song style or Ming style). Other styles of writing are generally not accepted unless an approved transcription normalization specification is accepted by IRG.
IRG further spells out its additional requirements for encoding of characters in all submissions in all its extensions (from IRG Meeting #53) as follows:
- Script limitation (文種限制): Encoding request must be for Han characters of the Han scripts.
- Writing style limitation (字體限制): The supporting evidence for submitted characters in printed form must be in regular scripts (楷書). Other styles, such as clerical style, small seal, and so on, cannot be used as evidence for encoding.
- Running text limitation (文本限制): Characters must appear in running text. Logos and images used separately from running text are not considered acceptable.
- 2.1.2. Unification Procedure of CJK Ideographs
Standard print forms of CJK ideographs are constructed with a combination of known components or stroke types. Many can be broken down into two components — a radical chosen to classify the character in dictionaries and possibly reflect the meaning of the character, and a phonetic component which represents the pronunciation of the character. Basically, two submitted print forms of glyphs with different radicals are distinct characters even if they have the same phonetic component, such as U+5606 嘆 and U+6B4E 歎. For non-trivial cases, further shape analysis must be conducted. Similar glyphs should be decomposed into radicals, components or stroke types and evaluated by following the unification procedure described in Annex S of ISO/IEC 10646.
- 2.1.3. Non-cognate Rule
Ideographs with different glyph shapes that are unrelated in historical derivation (non-cognate characters) are not unified no matter how similar their glyph shapes may be. The following gives examples of semantically different characters with very similar glyphs. They are considered to have different abstract shapes because they are non-cognate.
- U+620C 戌 and U+620D 戍 differ only in rotated strokes or dots (Per Annex S S.1.5 a)).
- U+4E8E 于 and U+5E72 干 differ only in folding back at the stroke termination (Per Annex S S.1.5 f)).
The non-cognate rule does not apply to characters that have identical glyphs even if the characters are historically unrelated. For example, ⿰木几 (U+673A 机; wooden table) and ⿰木几 (Chinese simplified form of U+6A5F 機) shall not be separately encoded because they have identical glyphs despite being unrelated in historical derivation.
Characters that are related in historical derivation may also be disunified as long as the difference in glyph shape is sufficient for reflecting different semantics. An example is U+9593 間 and U+9592 閒, which are related in historical derivation but are disunified because they have different semantics and are typically used in mutually exclusive contexts in present-day use. For the purpose of IRG processes, these characters have been and are still considered applicable for the non-cognate rule even if they are related in historical derivation and technically cognate.
For disunification to an encoded character under the non-cognate rule, information and supporting evidence provided by a submitter should include the pronunciation of the submitted character as well as the meaning of the submitted character. However, pronunciation alone is not sufficient information for separate encoding.
- 2.1.4. Maintaining Up-to-date Unification/Non-unification Examples
In Annex S of ISO/IEC 10646, unification/non-unification examples are summarized from past practice and the lists are not exhaustive. If there is ambiguity in applying the unification/non-unification rules, IRG must first have a formal discussion for agreement. In case there are worthy examples for recording, IRG will add them to its lists of unification/non-unification examples maintained as IRG Working Document Series (IWDS) on IRG website. The lists will be reported to WG 2 from time to time as an input for Annex S revisions. The detailed procedure of IWDS update is given in Annex E.
- 2.2. Principles on Submission of Ideographs to IRG
- 2.2.1. Basic Rules of Submission and Required Data to be Submitted
IRG accepts various types of submissions as specified below. Along with their submissions, the submitters are required to provide the necessary information for IRG’s consideration.
- New Sources to Standardized Ideographs. For submissions specifying new sources (such as an existing or a new national standard) to existing standardized ideographs, the new sources must be reviewed and approved by IRG before submission to WG 2. Sources and source references in the current ISO/IEC 10646 standard can be found in the “Current IRG Source Prefixes” table on the IRG Source Prefixes page. See also Annex D for an up-to-date IRG list of sources.
- New Sources to Working Sets. For submissions specifying new sources to remaining characters in previous standardization stages, the new sources must be reviewed and approved by IRG before they are incorporated by IRG Chief Editor into the up-to-date IRG list of sources for the current IRG working sets. For sources of miscellaneous nature with reference to individual documents that are too tedious to enumerate by IRG, the submitter should group them together and make a permanent site available for reference.
- New CJK Compatibility Ideographs (Vertical extension). Please be aware that WG 2 accepts new CJK Compatibility Ideographs only under very extreme circumstances due to the effects of normalization and the need to add standardized variation sequences to accommodate them. The preferred method of treating unifiable characters whose distinctions are deemed important is by registering them in a new or existing Ideographic Variation Database (IVD) collection as new Ideographic Variation Sequences (IVSes). See Section 2.2.1.g.
To add CJK Compatibility Ideographs, a submitter needs to supply the following information, which will be reviewed by IRG before submission to WG 2 to avoid possible problems of unification or dis-unification with other CJK Unified Ideographs.
- (1) Table showing the following data for each proposed CJK Compatibility Ideograph
- a) UCS code position of the corresponding CJK Unified Ideograph
- b) Glyph(s) of the corresponding CJK Unified Ideograph
- c) Glyph of the CJK Compatibility Ideograph to be printed in the appropriate column of CJK Compatibility Ideographs Code Table
- d) Source reference (for detailed format, see Section 2.2.1.d.(5))
- e) Evidence showing why the CJK Compatibility Ideograph needs to be added to UCS (for example, a national standard showing two distinct code positions for two glyphs that are one and the same)
- (2) TrueType font containing the glyph to be printed in the appropriate column of CJK Compatibility Ideographs Code Table (for detailed format, see Section 2.2.2.b.)
- New CJK Unified Ideographs (Vertical extension). All CJK Unified Ideograph submissions are subject to the following rules:
- (1) Size for the Working Sets of an IRG Collection: As all collections are defined by submitters according to their own criteria, IRG does not impose a limit on the collection size. However, to rationalize the feasibility of a timely checking process and to achieve a high quality of work within a reasonably short period of time, the size of a collection, or a part of an IRG collection, to be reviewed by IRG as a working set normally does not exceed 10,000 ideographs. Based on this principle, submitters should refrain from submitting more than 2,500 characters in each call for an IRG collection. Submitters may also be asked to divide their submissions into subsets to be processed in different working sets of an IRG collection. The guidelines for forming the working set are given in Annex L.
- (2) Pre-submission Unification Checking: Submitters should be EXTREMELY CAREFUL not to submit CJK Unified Ideographs that are already standardized or previously discussed and recorded at IRG meetings. By the nature of ideographs, it is very difficult for IRG experts to find out all unifiable ideographs. Thus, it is important to maintain high quality at the time of submission. Therefore, any character submission that does not fulfill all the requirements stipulated in Section 2.1.1 will be rejected. Furthermore, a character submission must be accompanied by evidence to satisfy at least one of the following conditions:
- a) Original Source (證據源限制): The source of evidence must be considered authoritative by IRG, as validated by past literature and IRG experts. IRG has the right to reject characters from questionable sources.
- b) Multiple Sources (多源證據): Supply character use evidence from multiple independent sources. IRG has the right to reject characters with evidence of use from only a single source, especially if the source is not considered authoritative by IRG.
- c) Semantics (字理考證): Supply sufficient evidence on the meaning and phonetics. Supply of other information on its origin and evolution would be very helpful.
- d) Context (上下文信息): Sufficient context in text to decipher the semantic meaning of the character. IRG has the right to reject characters that do not have sufficient evidence for IRG to decipher its semantics.
- e) Usage (需求限制): The use of characters must be for justifiable public interest. Examples of public use include evidence of: governmental needs; scientific use; digitization projects for public use; and working systems of significance as accepted by IRG. IRG has the right to reject characters that do not have sufficient evidence for IRG of justifiable public interest.
Submitters must make sure that the ideographs they submit do not fall into any of the following categories:
- a) Ideographs already standardized in the ISO/IEC 10646 standard (including its amendments).
- b) Ideographs currently in WG 2’s working drafts.
- c) Ideographs currently in IRG working sets including both M-set and D-set (see Section 2.3.4 for the purposes of M-set and D-set).
- d) Ideographs mis-unified or over-unified with ideographs in the current standard based on the lists maintained by IRG in its working document series, namely IWDS_MUI and IWDS_NUC.
- e) Ideographs from ancient documents that are rare and not in general use, along with variants from tombstone carvings that are not in circulation nor used in printed form, should have an appropriate base character identified through the use of authoritative dictionaries and other references, then be submitted as IVSes to be registered in a new or existing IVD collection. See Section 2.2.1.g.
- f) Nonce characters are not in general considered suitable for submission to IRG and evidence from the original publications alone of such characters is insufficient. Nonce characters should only be submitted if there is also evidence of significant wider usage.
Low quality submissions may be rejected under the “5% Rule” defined in Section 2.2.5 below.
- (3) Document Registration: All submission documents should be registered as IRG documents with an IRG document number (IRG N). The file names should be in the form of:
nnnnn_mmmm_pppp_sub_sss
whereby nnnnn indicates an IRG document number assigned by IRG Convenor, mmmm indicates the submitter’s abbreviation (as listed in Section 2.2.1.d.(5)), pppp indicates the collection year (such as 2015 for the 2015 collection), “sub” is an abbreviation for submission, and sss can be any submitter-designated indicator.
- (4) Submission of Over-unified or Mis-unified Ideographs: Submissions of ideographs that are found mis-unified or over-unified within the current standard should follow the principles in Annex I of WG 2 Principles and Procedures. Lists of over-unified and mis-unified ideographs should be maintained by IRG Chief Editor and made available for update in IRG working document series (i.e. IWDS_NUC and IWDS_MUI) according to the maintenance procedure defined in Annex E of this document. For mis-unified non-cognate characters, requests can be made to add new code points for disunified characters.
Note: The source separation rule described in Annex S of ISO/IEC 10646 applies only to those national standards listed in that section and is not applicable to new IRG submissions.
- (5) The following data items for each proposed ideograph must be submitted in CSV (Comma Separated Value) text format (in UTF-8) or Microsoft Excel file format (sequence number starting from 1 is required in the first column of each row):
- a) Source Reference to indicate the source and the name of the glyph image for tracking. The source reference should begin with a designated WG 2 member body abbreviation (G, H, J, K, KP, M, MY, T, UK, UTC or V) or an international consortium designation (currently, the only one IRG has is SAT for the SAT Project) followed by no more than nine characters. It should contain only Latin capital letters and Arabic numbers to indicate the source. Numeric values to indicate the position in a specific source should only be followed by a hyphen (“-”). Please note that underscores (“_”) must not be used for source reference. An exhaustive list of source references accepted by ISO/IEC 10646 can be found in the “Current IRG Source Prefixes” table on the IRG Source Prefixes page. See Annex D for details of IRG source reference abbreviations.
Note: WG 2 Member body abbreviations correspond to the source standard categories in Section 23 of ISO/IEC 10646 except MY.
- b) Glyph Image should specify a unique PUA code point in a TrueType font. The character glyph must have a unique U-code corresponding to its font. The font file should be named using the source reference followed by a submission date in the form of YYYYMMDD.
- c) Kangxi Radical Code (primary radical) from 1 to 214 with an additional “.0,” “.1,” “.2,” or “.3” to indicate a traditional radical (0), simplified radical (1), non-Chinese simplified radical (2), or second non-Chinese simplified radical (3). The selection of 0, 1, 2, or 3 is based on the radical’s glyph shape. The list of radicals with both traditional and simplified glyphs is provided in Annex A.a. If the technically correct (aka semantic) radical for an ideograph hampers its discoverability, or is region-dependent, the primary radical shall be assigned as though made by an ideograph expert who is neither a specialist in the history of the Han script nor familiar with ideograph etymology. The technically correct radical can be assigned as a second radical. Both are shown in the code charts, though the primary one serves as the basis for ordering within a CJK Unified Ideographs block.
Note: The corresponding code point range for Kangxi radicals in ISO/IEC 10646 is from U+2F00 to U+2FD5.
- d) Stroke Count (primary radical) of components other than the radical. Assignment of stroke count should be based on IRG agreed rules (see documents IRG N954AR, IRG N1105, and IRG N2221) regardless of the actual shape, with the final decision resting with the IRG Chief Editor.
- e) First Stroke (primary radical) for components other than the radical, from 1 to 5 as listed in Annex K. Assignment of first stroke should be based on IRG agreed rules (see documents IRG N954AR, IRG N1105, and IRG N2221) regardless of the actual shape, with the final decision resting with the IRG Chief Editor. If the technically correct (aka semantic) radical for an ideograph hampers its discoverability, or is region-dependent, the primary radical shall be assigned as though made by an ideograph expert who is neither a specialist in the history of the Han script nor familiar with ideograph etymology. The technically correct radical can be assigned as a secondary radical. Both are shown in the code charts, though the primary one serves as the basis for ordering within a CJK Unified Ideographs block.
- f) Total stroke count is an integer indicating the total number of strokes of a character including that of its radical. Assignment of stroke count will be based on IRG agreed rules (see documents IRG N954AR, IRG N1105, and IRG N2221) regardless of the actual shape, with the final decision resting with the IRG Chief Editor.
Note: IRG will supply IRG total stroke count data to UTC for the Unihan database. But individual submissions for locale specific total stroke count will not be checked by IRG and IRG takes no responsibility for their correctness.
- g) Flag is to show whether the ideograph is traditional (0) or simplified (1).
- h) Ideographic Description Sequence (IDS) (see document IRG N1183).
- i) Similar Ideographs if available (identified by their code points in the standard in the form of U+xxxxx). If there are multiple ideographs, please separate them by a comma. Enter “None” if no known variants; leave the column empty if not checked.
- j) Pronunciation gives phonetic denotation. Multiple pronunciations can be supplied, separated by comma.
- k) Normalization reference gives the normalization rule (rule number) used if this character is a normalized form.
- l) Total number of evidences indicates the number of evidences you supply. IRG encourages multiple evidences to show the meaning and use of this character.
A separate table should be supplied to show information of the evidences. Table should have the following items:
- 1 Character reference should be the source reference number, the same as that in the attribute table.
- 2 File name is for the evidence file. Accepted file types include PNG, JPEG, PDF, and WebP, so that they can be uploaded individually to IRG’s ORT. Please ensure that all files are under 1 MB and preferably under 400 KB. The source reference is suggested as the file name of the evidence, optionally followed by a “-” (hyphen) and a multi-digit number, especially if multiple pieces of evidence are supplied. An example would be GDM-00001-001.jpg.
- 3 Source of the evidence which should indicate Source and page number.
- 4 Source URL is optionally supplied if online information is available.
- 5 Any other information can be included if submitter considers it useful for reviewers.
- m) Notes can be used to input any useful information for IRG review.
- n) Additional optional Information in text format can be included in additional columns starting from n). If a character was submitted in a previous working set, the information should be supplied with the respective serial number. Examples of additional information include secondary radicals, secondary stroke count, and secondary first stroke. Please add separate columns with appropriate column names after the Notes column.
- o) Each submission should include a Microsoft Excel file of data description, identified by the IRG document number assigned to the submission. The glyph images should be supplied as a TrueType font. Evidences may be packed into one or more ZIP files, with the assigned document number, a hyphen, and the string “evidences” (followed by an optional hyphen and padded number for multiple ZIP files) as the ZIP file name(s). Examples of file names are given below:
IRGN1000-glyphs.ttf
IRGN1000-evidences-001.zip
IRGN1000-evidences-002.zip
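The naming conventions for evidence files and evidence ZIP archives described above can be sketched in Python as follows. This is only an illustration, not an IRG tool: the function names, the default ".jpg" extension, and the three-digit padding width (inferred from the examples GDM-00001-001.jpg and IRGN1000-evidences-001.zip) are assumptions.

```python
from typing import List, Optional


def evidence_file_name(source_ref: str, index: Optional[int] = None,
                       ext: str = "jpg", width: int = 3) -> str:
    """Source reference, optionally followed by a hyphen and a padded number.

    The padding width and the extension are assumptions for illustration.
    """
    if index is None:
        return f"{source_ref}.{ext}"
    return f"{source_ref}-{index:0{width}d}.{ext}"


def evidence_zip_names(doc_number: str, count: int, width: int = 3) -> List[str]:
    """Document number, hyphen, "evidences", and a padded number when
    more than one ZIP file is needed."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [f"{doc_number}-evidences.zip"]
    return [f"{doc_number}-evidences-{i:0{width}d}.zip"
            for i in range(1, count + 1)]


if __name__ == "__main__":
    print(evidence_file_name("GDM-00001", 1))
    print(evidence_zip_names("IRGN1000", 2))
```

Keeping the padding width as a parameter leaves room for submissions with more evidence files than three digits can number.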
Note from IRG Convenor: Item 2.2.1.d.(5).e) above is being removed per Recommendation IRG M64.18 in document IRG N2765 and document IRG N2713.
Note from IRG Convenor: Item 2.2.1.d.(5).g) above was struck out in Version 17, so it will be removed in this version when it is finalized.
Each submission must strictly follow the formats given above so that the data can be imported into the IRG Online Review Tool (ORT). Some sample submissions are provided in Annex G for reference. A blank form in Microsoft Excel format is available for submitters’ use as a separate document.
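Before submitting, a submitter might check two of the attribute formats described above: the source reference in item a) and the Kangxi Radical Code in item c). The following Python sketch is one possible reading of those rules and is not the ORT's actual import logic. In particular, it assumes that the hyphen counts toward the nine-character limit, that at least one character follows the prefix, and that any listed prefix may match; the function names are hypothetical.

```python
import re
from typing import Tuple

# Prefixes listed in item a): WG 2 member bodies plus the SAT consortium.
PREFIXES = ("G", "H", "J", "K", "KP", "M", "MY", "T", "UK", "UTC", "V", "SAT")

# Assumption: one to nine capital letters, digits, or hyphens after the prefix.
_REST = re.compile(r"[A-Z0-9-]{1,9}")
_RADICAL = re.compile(r"([0-9]{1,3})\.([0-3])")


def is_valid_source_reference(ref: str) -> bool:
    """True if ref starts with a listed prefix and the remainder fits _REST."""
    return any(
        ref.startswith(prefix) and _REST.fullmatch(ref[len(prefix):])
        for prefix in PREFIXES
    )


def parse_radical_code(code: str) -> Tuple[int, int]:
    """Split a code such as "75.0" into (radical number, variant flag)."""
    match = _RADICAL.fullmatch(code)
    if match is None:
        raise ValueError(f"malformed radical code: {code!r}")
    radical, variant = int(match.group(1)), int(match.group(2))
    if not 1 <= radical <= 214:
        raise ValueError(f"radical number out of range: {radical}")
    return radical, variant


def kangxi_radical_char(radical: int) -> str:
    """Map radical 1..214 onto the Kangxi Radicals range U+2F00..U+2FD5."""
    if not 1 <= radical <= 214:
        raise ValueError(f"radical number out of range: {radical}")
    return chr(0x2F00 + radical - 1)


if __name__ == "__main__":
    print(is_valid_source_reference("GDM-00001"))
    print(parse_radical_code("75.0"))
```

The code point mapping follows directly from the stated range: radical 1 is U+2F00 and radical 214 is U+2FD5.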
Note: It was decided during IRG Meeting #56 that for all future submissions to IRG working sets, all attributes and evidence images should be readily importable into the IRG Online Review Tool (ORT).
Note: IRG is in the process of building an IRG Working Set collection system, which will be used for automatic checking of the first submissions of a new working set. Reviewers can give comments, which are for the sole benefit of the submitters. Submitters can also withdraw characters that are problematic. Only those characters that pass the automatic checking system will be imported to the ORT for the formal IRG review process.
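Returning to the document registration rule in item (3) above, the file name pattern nnnnn_mmmm_pppp_sub_sss can be assembled and taken apart mechanically. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the document number is passed through as a string without any check on its length, and the values used in the demonstration are made up.

```python
from typing import Tuple


def submission_file_name(doc_number: str, submitter: str,
                         year: int, indicator: str) -> str:
    """Join the four fields with underscores around the fixed "sub" marker."""
    return f"{doc_number}_{submitter}_{year}_sub_{indicator}"


def parse_submission_file_name(name: str) -> Tuple[str, str, int, str]:
    """Split a file name back into (document number, submitter, year, indicator).

    Assumption: only the trailing indicator may itself contain underscores,
    so the name is split at the first four underscores only.
    """
    parts = name.split("_", 4)
    if len(parts) != 5 or parts[3] != "sub":
        raise ValueError(f"not a submission file name: {name!r}")
    doc_number, submitter, year, _, indicator = parts
    return doc_number, submitter, int(year), indicator


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    name = submission_file_name("12345", "UTC", 2015, "draft")
    print(name)
    print(parse_submission_file_name(name))
```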
- Existing CJK Compatibility Ideographs (Horizontal extension). To add new source references to existing CJK Compatibility Ideographs, a submitter needs to supply the following information, which will be reviewed by IRG before submission to WG 2 to avoid possible problems.
- (1) Table showing the following data for each proposed horizontal extension of CJK Compatibility Ideographs
- a) Code position of the existing UCS CJK Compatibility Ideograph
- b) Glyph(s) of the existing UCS CJK Compatibility Ideograph
- c) Code position of the corresponding UCS CJK Unified Ideograph
- d) Glyph(s) of the corresponding UCS CJK Unified Ideograph
- e) Glyph of the Compatibility Ideograph in the source reference
- f) Glyph of the Compatibility Ideograph to be printed in the appropriate column of CJK Compatibility Ideographs Code Table
- g) New source reference (for detailed format, see Section 2.2.1.d.(5))
- h) Evidence showing why a new source reference for the CJK Compatibility Ideograph needs to be added to UCS (for example, a national standard showing two distinct code positions for two glyphs that are one and the same)
- (2) TrueType font containing the glyph to be printed in the appropriate column of CJK Compatibility Ideographs Code Table (for detailed format, see Section 2.2.2.b.)
- Existing CJK Unified Ideographs (Horizontal extension). To add new source references to existing CJK Unified Ideographs, a submitter needs to supply the following information. These characters must be reviewed by IRG before submission to WG 2 to avoid possible problems.
- (1) Table showing the following data for each proposed horizontal extension of CJK Unified Ideographs
- a) Code position of the existing UCS CJK Unified Ideograph
- b) Glyph(s) of the existing UCS CJK Unified Ideograph
- c) Glyph of the CJK Unified Ideograph to be printed in the appropriate column of CJK Unified Ideographs Code Table
- d) New source reference (for detailed format, see Section 2.2.1.d.(5).a)
- e) Evidence showing why a new source reference for the CJK Unified Ideograph needs to be added to UCS (for example, a national standard showing the relevant glyph)
- (2) TrueType font containing the glyph to be printed in the appropriate column of CJK Unified Ideographs Code Table (for detailed format, see Section 2.2.2.b).
- Ideographic Variation Database (IVD). For unifiable characters, which may be present in an IRG working set and identified as such, members are strongly encouraged to register them as IVSes in a new or existing IVD collection according to the procedures described in UTS #37 (Unicode Ideographic Variation Database). If IRG approves and authorizes the registration of the IVSes in a new or an existing IVD collection, registration fees will not be charged.
Note: IRG does not consider it appropriate to maintain an IRG unified IVD collection. However, IRG is willing to help review individual submissions to UTC.
It should be noted that separate encoding of variant characters should be discouraged. The use of IVD registration is the more appropriate mechanism for encoding variants of already encoded characters. This rule applies to submission of all new IRG working sets after IRG Meeting #53.
Note from IRG Convenor: All subsequent references to Section 23 of ISO/IEC 10646, which became Section 24 in ISO/IEC 10646:2020 (Sixth Edition), and Annex D of this document will be replaced by references to the “Current IRG Source Prefixes” table on the IRG Source Prefixes page. Section 24 of ISO/IEC 10646 will soon reference UAX #38 for this information, hence these changes.
- 2.2.2. Required Font to be Submitted
IRG Meeting #56 agreed that submitters to a new working set must supply a TrueType font. Glyph image files are no longer accepted. Font file names should follow the requirement given in Section 2.2.1.d.(5).b). The general font specification can be found under point 5 of A.1. - Submitter’s Responsibilities in Annex A of WG 2 Principles and Procedures.
- 2.2.3. Required Evidence to be Submitted
Supporting Evidence: Evidence of the proposed glyph shape, its usage and context with pronunciation(s), meaning(s), and so on should be supplied to convince IRG that it is actually in use or non-cognate with other similar ideographs. The appearance of a character as a head entry in a dictionary is generally considered evidence of actual use if the dictionary is listed in the “Current IRG Source Prefixes” table on the IRG Source Prefixes page or is otherwise accepted by IRG as an authoritative source.
- Evidence for each character must be supplied as scanned images. Evidence of character usage, including use in personal names, must still be provided and is not exempted. A declaration of character use without accompanying evidence is generally not acceptable. Considering privacy issues, IRG has suggested some compromise provisions. Details are given in Annex G.3.
Note: To support e-government-related initiatives, IRG may at its discretion accept submissions of characters that are used in computer systems administered by government bodies for public service with wide access by government agencies and citizens. Factors considered for such acceptance are further elaborated in Annex G.4.
- Questionable Characters (optional): For candidate ideographs with possible unification questions, in addition to listing the possible unifiable characters as required in Section 2.2.1.d.(5).i), submitters are encouraged to provide for review detailed evidence of use from authoritative sources, and evidence showing their relationship to other standardized ideographs or variants having similar shape or meaning. Characters with this information are not counted as problem characters for quality assurance assessment given in Section 2.2.5.
Note: Section 2.2.1.g considers submission of a variant of an encoded character with similar shape to be a mistake, unless other sufficient justifications for encoding are supplied.
- Avoidance of Derived Simplified Ideographs: To avoid encoding derived simplified characters that are not in actual use, submissions of derived simplified ideographs require actual usage evidence. Providing only their corresponding traditional ideographs will not be considered as producing usage evidence. Derived simplified characters from a dictionary as a source should not be used as the sole evidence of actual use unless the dictionary is an IRG-accepted authoritative dictionary.
- 2.2.4. Required Summary Form to be Submitted
Each submission for an ideograph proposal should be accompanied by a duly completed “Proposal Summary Form for Addition of CJK Unified Ideographs to the Repertoire of ISO/IEC 10646” (see Annex F).
- 2.2.5. Quality Assurance: The 5% Rule
For all character encoding standards, a common general principle is to encode the same character once and only once.
- Before any submission, it is the submitter’s responsibility to filter out ideographs that are already in the ISO/IEC 10646 international coding standard:
- - the published standard
- - any of its published amendments
- - any of its amendments under ballot in JTC 1/SC 2
- - IVD, or
- - any of the working sets of IRG
- It is the submitter’s responsibility to supply sufficient evidence for the semantics and use of each submitted character. In addition to the requirement of clear images for all evidence, submitters are asked to supply evidence according to the rules suited to the submission:
- (1) Complete page evidence showing the submitted character should be provided if possible (clipping image showing only a small incomplete section of the relevant text should be avoided);
- (2) If text relating to the submitted character extends over more than one page, images of all relevant pages should be provided;
- (3) Complete references for source evidence should be supplied (author, title, publisher, year, page number, and so on) for each evidence image;
- (4) The red square used to highlight the submitted character should avoid touching any part of the submitted character;
- (5) For characters found in quotations from classical or pre–20th-century texts in a modern typeset edition, an image of an original edition of the text should also be provided, in order to be sure that the character form given in the modern edition is not an error form;
- (6) For characters for which the evidence is a pre-modern woodblock printed text, at least two evidence images from different sources (which may comprise the same or different content) should be provided if possible, in order to demonstrate that the character is not a one-off error in a single edition; and
- (7) Characters used in captions and subtitles can be accepted as supporting evidence if agreed by IRG experts. In deciding whether to accept such evidence, IRG shall consider the authoritativeness of the multimedia material, the popularity of the material, cultural influences, and other factors that warrant its acceptance.
Note: Currently, IRG mainly accepts evidence from printed material if it is accepted as an IRG source. In general, IRG does not accept multimedia material as IRG sources, but the acceptance of evidence from captions and subtitles may warrant the acceptance of some multimedia materials as IRG sources in the future.
In assessing the suitability of a proposed ideograph for encoding, IRG will evaluate the credibility and quality of the submitter’s proposal. If, during the IRG review process, IRG finds that more than 5% of the submitter’s source set are duplicates of characters in the sources listed above, the whole submission will be removed from the subsequent IRG working drafts for that particular IRG project. Suitable rules in Section 2.2.5.b should also be followed. Evidence that is of poor quality or insufficient for review will be rejected in the first instance, and it is also counted as poor submission under the 5% rule. However, the 5% rule does not apply if the submitter explicitly raises questions about unification/dis-unification for concrete cases in the proposal of characters.
It should be noted that the 5% rule is a general yardstick to remind submitters to adhere to IRG submission requirements and do a good screening job before submission to reduce the workload of reviewers for quality review. In practice, most submissions should have problems within the 1% range. In this regard, submitters should not interpret the rule as submissions with problems within the 5% range will definitely be accepted. IRG has the right to review the problem cases and decide not to accept a submission even if it has problems within the 5% range (especially when the figure is very close to 5%).
- If a submission has many quality issues validated by analysis of submission data during the review process of the current working set, IRG can impose a capped submission size to the submitter in the next IRG working set. Quality issues include a high rate of unification issues, ill-formed IDSes, evidence quality, rejection rate, and so on.
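The threshold check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ReviewTally` structure, its field names, and the choice to subtract cases the submitter raised up front are hypothetical and are not defined by IRG.

```python
from dataclasses import dataclass


@dataclass
class ReviewTally:
    """Per-submitter counts collected during review (hypothetical structure)."""
    submitted: int                # size of the submitter's source set
    duplicates: int               # characters found to be already covered
    poor_evidence: int            # characters rejected for poor or insufficient evidence
    raised_by_submitter: int = 0  # subset of the above that the submitter flagged up front


def problem_count(tally: ReviewTally) -> int:
    """Problem characters that count toward the rule (never negative)."""
    counted = tally.duplicates + tally.poor_evidence - tally.raised_by_submitter
    return max(counted, 0)


def exceeds_limit(tally: ReviewTally, limit_percent: int = 5) -> bool:
    """True if strictly more than limit_percent of the source set is problematic."""
    if tally.submitted <= 0:
        raise ValueError("submitted must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return problem_count(tally) * 100 > limit_percent * tally.submitted
```

A result of `False` only means the yardstick was not exceeded. As stated above, IRG may still decline a submission whose problem rate is within the 5% range.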
- 2.3. Principles on Production of IRG Working Drafts
After IRG accepts submissions based on principles specified in Section 2.2, and follows the guidelines to form working sets of the current collection, the development process of the current working set begins. IRG Chief Editor and the IRG ORT manager will first produce a set of IRG working drafts.
Note: In case of multiple working sets in a collection, the review will be conducted for one working set at a time. Once a working set is completed and submitted to WG 2, the review of the next working set will start. The process repeats until all working sets are finished in sequence.
- 2.3.1. Principles on Submitted Ideographs
- All the original ideograph submissions, including submissions of glyph font, IDS, radical (primary), stroke count (primary), first stroke (primary), total stroke count, secondary radical (optional and stroke count and first stroke if secondary radical is provided), and evidence, must have registered IRG document numbers.
- If any required information is missing, IRG Chief Editor and IRG ORT Manager can ask for additional information from the submitter. Without timely supply of such information, the submission may be rejected by IRG Chief Editor in producing the working drafts. This is permitted provided the total number of such cases is extremely small. IRG Chief Editor and/or IRG ORT Manager should report such cases to IRG for quality assurance purposes. Based on the quality report, IRG may apply the 5% rule for rejection.
- 2.3.2. Principles on Assignment of Serial Numbers
- IRG Chief Editor will consolidate and sort the submitted ideographs in accordance with Annex A of this document.
- A unique serial number will be assigned to each submitted ideograph after consolidation. The serial numbers must be unique throughout the standardization process. They must not be changed, re-set or re-assigned unless there is an agreed dis-unification during the process. This principle allows easy reference to past discussions. In case of a split, one ideograph will keep the original serial number and the other will be assigned a new serial number.
- If ideographs submitted by different submitters are obviously unifiable, such ideographs may be unified and assigned the same serial number by IRG Chief Editor.
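The serial number principles above (unique, never re-assigned, and a new number only for the split-off ideograph) can be illustrated with a toy registry. This is a sketch, not the IRG Chief Editor's tooling: the class, the five-digit serial format, and the source reference strings are assumptions made for the example.

```python
class SerialRegistry:
    """Toy registry of serial numbers; the five-digit format is an assumption."""

    def __init__(self):
        self._next = 1
        self._members = {}  # serial -> list of source references sharing it

    def assign(self, source_ref):
        """Give a newly consolidated ideograph the next unused serial number."""
        serial = f"{self._next:05d}"
        self._next += 1  # never decremented, so withdrawn serials are not reused
        self._members[serial] = [source_ref]
        return serial

    def unify(self, serial, source_ref):
        """Record an obviously unifiable submission under an existing serial."""
        self._members[serial].append(source_ref)

    def split(self, serial, source_ref):
        """Agreed dis-unification: one ideograph keeps the serial, the other gets a new one."""
        members = self._members[serial]
        if source_ref not in members or len(members) < 2:
            raise ValueError("nothing to split under this serial number")
        members.remove(source_ref)
        return self.assign(source_ref)

    def withdraw(self, serial):
        """Withdraw all ideographs under a serial; the number itself stays retired."""
        self._members[serial] = []

    def members(self, serial):
        return list(self._members[serial])
```

Because the counter only moves forward, a comment recorded against a serial number in an earlier review round still points at the same ideograph later on.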
- 2.3.3. Principles on Machine-checking of IDSes of Submitted Ideographs
- IRG Chief Editor or his/her designate will check the submitted IDSes with existing IDS data to detect possible unifiable or duplicate ideographs.
- Machine checking sometimes detects obviously non-unifiable pairs. In such cases, when detected, they will be noted and assigned with different serial numbers before proceeding to the next stage.
- The IDS checking algorithm would satisfy the requirements described in Annex B.
- 2.3.4. Production of IRG Working Drafts
- Division of Character Subsets: By the result of IDS checking, submitted ideographs will be grouped into the following two subsets:
- (1) M-set (main working set): for ideographs with proper IDSes that are found not to be unifiable with current standardized ideographs or previously discussed ideographs with proper IDSes. Initially, all ideographs in the current working set are included in this set. Ideographs whose attribute data, validity, or evidence is questioned by experts during the review process, and whose problems cannot be resolved during IRG discussions without further information, can be moved to the D-set (discussion set). Characters that are unifiable with standardized characters or deemed problematic can be withdrawn by submitters.
- (2) D-set (discussion set): for ideographs whose attribute data, validity, or evidence is questioned by experts during the review process and whose problems cannot be resolved during IRG discussions without further information. By decision of IRG, such ideographs are moved from M-set to D-set as postponed characters, pending follow-up action to supply further information. Ideographs in the D-set should be withdrawn by submitters if further information cannot be supplied at the next IRG meeting.
- Naming of Working Drafts: The file name should follow the format of “nNNNNN-WS####v#.#[XXX]” whereby NNNNN is the IRG-assigned document number and #.# is the version number. No spaces are allowed, but the use of underscore “_” and period “.” for separation is permissible. Examples are “n2632-WS2021v6.2,” “n2690-WS2024v1.1Draft,” and so on.
- Glyph Images: An archive of consolidated glyph images will be produced from the font file in the ORT so that glyph changes in the fonts can be compared.
- Addition of Characters: No ideographs should be added to a working set once the development process begins.
- Alteration of Characters: Alteration of characters is generally not allowed because it indicates instability and may have impact on other characters in the collection. However, submitters may submit proposals of minor alterations of characters either as a result of IRG recommendation or self initiated with justifications with explicit approval from IRG if the altered glyphs are unifiable with the character glyphs in the original submission. A change of glyph beyond the Annex S (and IRG UCV list) unification criteria is considered to be an addition of a new character and is NOT acceptable during the development process. The submitter of any alteration proposal must provide the results of thorough checks and verification showing that the alteration does not affect other characters in existing standards and working sets. IRG, based on its evaluation, may decide to accept the alteration, reject the proposal or request the withdrawal of such a character by the submitter. If the submitter finds that the glyph of a character is wrong at any working stage, the character will be rejected by IRG and should be withdrawn by the submitter.
- After consolidation, IRG Chief Editor can ask IRG editors and contributing experts (collectively referred to as reviewers) to review M-set and D-set based on an agreed IRG review schedule and task division.
- 2.4. Principles on Reviewing IRG Working Drafts
If IRG instructs reviewers to review the working drafts (different portions may be assigned to different reviewers), reviewers should submit review result according to the agreed schedule, preferably using ORT. They should follow the principles set out below during the review process.
- 2.4.1. General Principles on Reviews
- Each reviewer should check the ideographs of the current working set assigned by the IRG Chief Editor for the following issues:
- (1) Correctness of Kangxi radical, Kangxi index, stroke count, total number of strokes, first stroke, and IDS.
- (2) Correctness and quality of glyphs and source information, along with the quality of evidence images at the initial stage for quality assurance purposes, if necessary.
- (3) Presence of duplicate or unifiable ideographs based on Annex S guidelines as well as examples in IWDS.
- (4) Consistency of submitted characters with the submitted evidence and documentary proof.
- When any data of an ideograph, including IDS, Kangxi radical (primary), stroke count (primary), first stroke (primary), total strokes, and secondary radical if available (and its stroke count and first stroke), are found to be incorrect, they should be corrected during the IRG meeting. Characters that are questionable with respect to these attributes or to the evidence should be moved from M-set to D-set, as their standing data are no longer valid. Until the ideograph is confirmed to be unique by manual checking (procedure described in Section 2.4.2 below), it should not be moved back to M-set.
- 2.4.2. Principles on Manual Checking
- Duplication and Unification: For D-set ideographs, reviewers should ensure that they are not duplicates of or unifiable with any ideograph in the standard, working set(s) submitted to WG 2, or in the current working set.
- Radical Checking: Assurance is done by enumerating all possible radicals of a target ideograph and looking for any duplicate or unifiable ideographs in the range of ±2 stroke counts of ideographs in the standard, working set(s) submitted to WG 2, and current working set. For example, U+805E 聞 may have the radical of 門 with 6 strokes for the remaining component, or the radical of 耳 with 8 strokes for the remaining component. In such a case, checking the standard, working set(s) submitted to WG 2, and current working set for ideographs with radical of 門 and 4–8 strokes, or ideographs with radical of 耳 and 6–10 strokes manually can better assure that the ideograph does not have duplicate or unifiable ideographs. If a secondary radical information is also available, the same process should be done for the secondary radical.
- Recording of Review Results: The checking work should be recorded in the review comments as “Checked against all ideographs in the standard, working set(s) submitted to WG 2, and current working set with radical X and stroke count of Y±2.”
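The ±2 stroke window used in radical checking can be written out as a short Python sketch. The function names and the toy `index` mapping (keyed by radical and residual stroke count) are assumptions for illustration; an actual check would run against the standard and working set data.

```python
def stroke_window(residual_strokes, margin=2):
    """Residual stroke counts to inspect: the given count plus or minus margin."""
    low = max(residual_strokes - margin, 0)
    return range(low, residual_strokes + margin + 1)


def candidates(index, analyses, margin=2):
    """Collect ideographs to compare manually, for every possible radical analysis.

    index maps (radical, residual_strokes) to a list of ideograph identifiers;
    analyses is a list of (radical, residual_strokes) pairs for the target.
    """
    found = []
    for radical, residual in analyses:
        for strokes in stroke_window(residual, margin):
            found.extend(index.get((radical, strokes), []))
    return found


def review_comment(radical, residual_strokes, margin=2):
    """Format the review record in the wording given above."""
    return ("Checked against all ideographs in the standard, working set(s) "
            "submitted to WG 2, and current working set with radical "
            f"{radical} and stroke count of {residual_strokes}±{margin}.")
```

For U+805E 聞, the two analyses are `("門", 6)` and `("耳", 8)`, which give windows of 4 to 8 and 6 to 10 residual strokes respectively, matching the example in the text.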
- 2.4.3. Submission of Possibly Unifiable Ideographs
- Preparation of Comments: Reviewers should prepare comments and feedback quoting the assigned serial numbers of the ideographs in question. In the ORT, reviewers should use the pulldown menu to select any known issue; if no suitable item is available, the comment should be entered in the remark field. For offline (with respect to ORT) reviews, comments should also be written in as standardized a form as possible. The guidelines on comments are described in Section 4 of this document. Comment files should be tabulated in CSV text format, Microsoft Excel, or Microsoft Word file format. All offline review comment files should use the pre-assigned IRG document number in the current version, with the source designation of the reviewer appended to the file name.
- Additional Evidence and Arguments: For each proposed ideograph in the D-set that has been questioned for possible unification, the submitter should prepare a response with further evidence of its use and documentary proof (for example, from dictionaries, legal documents or other publications) showing that it is not unifiable with any standardized ideograph or ideograph proposed in the same or another working draft. When submitters disagree with a suggested unification, it is insufficient to simply point out that there are no unifiable component variation (UCV) examples. The UCV list and other working documents in the IWDS, which is updated as a result of IRG Working Set reviews, should also be used. As documents in IWDS will be updated during IRG reviews at IRG meetings, “No case in UCV list” is no longer a sufficient reason for dis-unification. IRG requests submitters to also provide unification and dis-unification examples from published versions of the standard as well as those accepted for publication. The additional information will help IRG to determine whether the ideographs in question should be dis-unified or unified (which may result in additional UCVs or other related documents).
- Submission Deadlines: Each reviewer should submit review comments at least two months before the next IRG meeting. IRG Chief Editor will consolidate them and register the results as IRG documents one month before the next IRG meeting.
- Written Responses from Submitters: Submitters should examine the consolidated comments on their respective characters and send the IRG Chief Editor their written responses to the comments, together with additional evidence, at least one week before the next IRG meeting. Responses can be made either in ORT or in a document named after the current working document with “ResponseXXX” appended, whereby XXX is the source submitter designation.
- Rejection: Questioned ideographs with no counter arguments supplied to IRG meeting will either be moved to D-set or withdrawn.
- Revised Font: In case of a glyph that does not match the evidence or that deviates from normalization for consistency, the submitter needs to provide a revised font to IRG for review and acceptance or rejection. The revision is accepted only if it is a unifiable change or if the change has no impact on other characters in the working set (including corresponding revised attribute data).
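The deadlines in this section (two months, one month, and one week before the next IRG meeting) can be derived from the meeting start date. The sketch below assumes that "month" means a calendar month and clamps to the last day of shorter months; the function names and dictionary keys are illustrative only, and an IRG-approved working schedule would take precedence.

```python
import calendar
from datetime import date, timedelta


def months_before(day, months):
    """Return the date the given number of calendar months before day."""
    total = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    # Clamp, for example, 30 April minus two months to 28 or 29 February.
    return date(year, month, min(day.day, last_day))


def review_deadlines(meeting_start):
    """Latest dates for each step, counted back from the meeting start date."""
    return {
        "reviewer_comments": months_before(meeting_start, 2),
        "consolidated_comments": months_before(meeting_start, 1),
        "submitter_responses": meeting_start - timedelta(weeks=1),
    }
```

For a hypothetical meeting starting on 17 March 2025, this yields 17 January for reviewer comments, 17 February for the consolidated comments, and 10 March for submitter responses.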
- 2.5. Principles on Discussions at IRG Meetings
- 2.5.1. Record-based Discussion
For efficient and smooth work, all discussion items and evidence must be presented as registered IRG documents or registered in ORT before the commencement of an IRG meeting. Items or evidence that are not contained in a registered IRG document (or on the ORT) will not be discussed or treated as evidence during IRG meetings.
- 2.5.2. Discussion Procedure
Discussions will be based on the review comments of the current working set. This includes two parts. The first part is based on comments and response for questionable characters in M-set. The second part is on response/feedback of characters in D-set.
- (1) For unification issues. Submitters should present evidence documents showing that the suspected unifiable ideographs are distinctively used as non-cognate characters in the same region, or that they cannot be unified in accordance with Annex S (and IWDS). When IRG has reached a consensus that two ideographs are unifiable, the submitter concerned should take one of the following actions, and the decision must be recorded.
- Withdraw the now-unified ideograph and add a new source reference to the existing standardized or working set ideograph. This is particularly important if the existing standardized or working set ideograph to which it is unified does not yet have an assigned source reference that corresponds to the submitted and now-unified ideograph.
- Register the now-unified ideograph in a new or an existing IVD collection as a new IVS, particularly if the existing standardized or working set ideograph to which it is unified already has an assigned source reference that corresponds to the submitted and now-unified ideograph.
- To avoid misunification of ideographs, IRG may specify etymological constraints on the application of a particular UCV rule, meaning that an etymological relationship must be proven between a proposed character and an encoded or another Working Set character for the rule to be applicable. An etymological constraint will only be specified for a particular UCV rule when one or more of the suggested unifiable forms may typically be etymologically related to another radical or component, so that the rule is at high risk of causing misunification of unrelated characters. The unification shall only apply when there is sufficient evidence to prove that the two characters in discussion are etymologically related, meaning that the burden of proof lies with the reviewer instead of the submitter. A proposed character will also not be postponed unless there is reasonable doubt that the character is etymologically related to another encoded or Working Set character.
- Even if an etymological relationship can be proven to exist between two characters, the non-cognate rule still applies, meaning that as long as there is sufficient evidence to show that the two characters are used with mutually exclusive semantics in a certain language or region, these characters will not be unified. Because shape analysis alone cannot indicate non-cognateness or semantic differences, it is the submitter’s responsibility to provide information and supporting evidence in order to invoke the non-cognate rule.
IRG at its discretion can allow a character discussion to be labeled as “pending”, with a specified time for response during the meeting, if the submitter considers that the additional information can be supplied quickly. This allows for some off-line discussion and makes the discussion progress more efficiently. If a response is not prepared by the specified time, the pending character will remain in the D-set or be withdrawn by the submitter.
Discussions on evidence or items raised after the commencement of an IRG meeting may be postponed to the next IRG meeting if any submitter (or reviewer) requests longer time to examine such evidence or items (remain in the D-set).
- (2) For radical and related attributes. When characters are reviewed by different people, different choices of Kangxi radical, stroke count or first stroke code are possible for the same ideograph. IRG should agree on the most appropriate ones based on the commonest abstract shape of the specific glyph. When the Kangxi radical or stroke count of an ideograph is found to be incorrect, the ideograph will be moved to D-set for another manual review to prevent any unification errors caused by not having conducted the review with ideographs having the correct Kangxi radical or stroke count.
IRG recognizes that radical assignments by different submitters may be different as radical use is locale dependent. As each character must be assigned a unique radical for sequencing in code chart production, IRG will determine the radical used for IRG working set sequence based on Kangxi custom or the most appropriate one determined at IRG if multiple assignments are supplied. Submitters are also given the option to provide secondary radicals (with the corresponding stroke count and first stroke). If there are no unifiable characters from multiple submissions, IRG only makes a suggestion for the appropriate radical. In this case, the submitter decides the radical to be used.
Guidelines on typical comments and resolutions are given in Section 4 of this document.
- 2.5.3. Recording of Discussions
Comments, rationales, and decisions must be recorded in ORT for each ideograph reviewed for reference and checking. A document should also be produced so that the record can be made available to all reviewers.
- 2.5.4. Time and Quality Management
Before a discussion begins, the number of ideographs under review will be counted and the schedule will be estimated based on it. During the discussion, the number of comments reviewed per hour will be noted and the schedule will be adjusted according to the progress (Note: It is recognized that some comments may take longer than others to discuss and resolve). If the comments cannot be handled in one IRG meeting, they may be partitioned and resolved in subsequent IRG meetings. Due to the limited time the editorial group has to deal with individual characters during an IRG meeting, submitters and reviewers can use emails to discuss and reach agreement on simple, straightforward cases before and after an IRG meeting.
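The schedule adjustment described above is simple arithmetic, sketched here for illustration. The function names and the figures in the usage note are hypothetical; as noted in the text, an observed rate is only a rough guide because some comments take longer than others.

```python
import math


def hours_needed(remaining_comments, comments_per_hour):
    """Estimated discussion time at the observed review rate."""
    if comments_per_hour <= 0:
        raise ValueError("comments_per_hour must be positive")
    return remaining_comments / comments_per_hour


def plan_meeting(remaining_comments, comments_per_hour, hours_available):
    """Return (comments handled in this meeting, comments carried over)."""
    if comments_per_hour <= 0:
        raise ValueError("comments_per_hour must be positive")
    capacity = math.floor(comments_per_hour * hours_available)
    handled = min(remaining_comments, capacity)
    return handled, remaining_comments - handled
```

With 600 comments, an observed rate of 40 comments per hour, and 12 hours of editorial time, `plan_meeting(600, 40, 12)` returns `(480, 120)`, so 120 comments would be partitioned into a subsequent IRG meeting.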
- 2.6. Principles on Submission of Ideographs to WG 2
- 2.6.1. Checking of Stabilized M-set
- Once M-set is consolidated and stabilized, the ideographs in M-set will be checked intensively as a complete set at least once to ensure data and glyph integrity.
- Approval by a majority vote of IRG experts is needed before the set can be prepared for WG 2 submission.
- 2.6.2. Preparation for WG 2 Submission
After the approval by IRG, the IRG Chief Editor, with the help of the ISO/IEC 10646 Project Editor, will prepare the proposal to be forwarded to WG 2. The preparation includes the following:
- Sort the final stable M-set ideographs by the sorting algorithm described in Annex A.
- Assign provisional UCS code positions to the sorted M-set ideographs (with agreement from the ISO/IEC 10646 Project Editor on block assignment).
- Make available all the attribute data and prepare for the WG 2 submission summary form.
- Make available the TrueType font from each submitter, mapped to the assigned provisional UCS code positions provided by the IRG Chief Editor and verified by the submitters (fonts have to be available in accordance with the requirement stated in point 5 of A.1. - Submitter’s Responsibilities in Annex A of WG 2 Principles and Procedures). Each submitter should prepare and submit its own font to the ISO/IEC 10646 Project Editor for best font quality.
- Prepare a list of source references.
- Produce a packed Multi-column Ideograph Chart using the TrueType fonts.
IRG will conduct at least one round of review of the proposal and the chart generated using TrueType font before submission to WG 2.
3. Procedures
This section describes the basic development procedure of CJK Unified Ideograph extensions. The ultimate purpose of the procedure is to realize the production of high quality CJK Unified Ideograph sets in an efficient manner.
The basic development procedure described in this section consists of eight stages, and it may take two to three years to create a high quality ideograph set for standardization.
- 3.1. Call for Submission
- When a submitter representing government, an organization or an international consortium requests a new project for CJK Unified Ideograph extension or IRG’s current collection is near completion, IRG may decide at an IRG meeting to call for submission of new ideographs to form a new collection. IRG will specify the deadline for submission.
- IRG will give a name to the new collection using the year of the call (in four-digit format). The first IRG collection using this naming convention is IRG Working Set 2015 that was standardized as the CJK Unified Ideographs Extension G block in ISO/IEC 10646:2020 (Sixth Edition).
- Each submitter with proposed ideographs must submit the ideographs before the specified deadline with the required data described in Section 2 of this document.
- Submitters must ensure that the ideographs are submitted with all the required information. If only minor problems are encountered, such as some required information is found missing or misplaced, the IRG Chief Editor may ask the submitter to re-submit the information or supply additional information. Otherwise, the submission may be rejected because consolidation with other submissions cannot be carried out.
- An initial review of all submissions for a new collection will be done during an IRG meeting to estimate the size of the collection. Each submitter is allowed to submit no more than 2,500 characters. As the normal working set size is set at 10,000, IRG will use the guidelines given in Annex L to estimate the number of working sets for the collection in case the total number of characters is much larger than 10,000. If multiple working sets are deemed necessary, the IRG meeting will determine how the collection will be split so that IRG can work on one working set at a time. The decision on the number of working sets may be revised after the review of the first consolidation in an IRG meeting as stated in Section 3.4.f.
- 3.2. Consolidation and Grouping of Submitted Ideographs
Consolidation of submissions is normally done between IRG meetings. The consolidation involves the following tasks:
- IRG Chief Editor will sort and assign serial numbers to submitted ideographs as described in Section 2.3.2.
- After serial numbers have been assigned, submitted ideographs must undergo IDS checking to detect any duplication and unification. Based on the result of IDS checking as described in Section 2.3.3, submitted ideographs will be grouped into M-set and D-set as described in Section 2.3.4.
- After consolidation, the working drafts will be assigned an IRG document number with a version number. They will be distributed to contributing experts and made available on the official IRG website so that any other experts can have access to them. IRG Chief Editor may assign ORT editors and other contributing reviewers to check M-set and D-set ideographs for either the entire working set or certain portions of it depending on a reasonable estimation of the workload.
- Once the consolidation is done, the IRG Chief Editor will send the data file to the IRG ORT Manager. The consolidated data will be uploaded to ORT to be ready for online review by ORT editors. It needs to be emphasized that submissions must strictly follow the submission rules to ensure that data and evidence can be uploaded into ORT. ORT is a working platform, not a learning platform. Only designated ORT editors, who are either IRG editors or experts who are familiar with the IRG review process and have made active contributions, are granted access to ORT with the permission of the IRG Convenor. Reviews can still be conducted in the traditional paper-based manner, and comments can also be submitted as described in Section 3.3 to Section 3.7 below. Online reviews follow the same review process described in Section 3.3 to Section 3.7, except that feedback is written online and the IRG ORT Manager will export the review comments to the IRG Chief Editor for consolidation if written reviews are submitted.
- 3.3. First Checking Stage
This stage, which is between IRG meetings, involves the following tasks:
- Each reviewer must check the assigned M-set and D-set for data integrity, correctness, missing data and duplication. Checking for unification is not mandatory, but desirable. Typical review comment examples for each set are provided in Section 4.
- Off-line reviewers must submit their comments in registered IRG documents to IRG Chief Editor at least two months before the next IRG meeting or according to IRG approved working schedule.
- IRG Chief Editor will consolidate the comments with the help of IRG ORT Manager and produce a registered IRG document for circulation and discussion at least one month before the next IRG meeting or according to IRG approved working schedule.
- Submitters must submit their responses to questionable characters either in ORT or in a registered IRG document one week before the next IRG meeting. All experts are encouraged to prepare and submit supplementary documents (with IRG document numbers) so that they can be discussed at the next IRG meeting. For offline comments, IRG ORT Manager can opt to import them into ORT to facilitate discussion at the next IRG meeting.
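The deadlines in this stage are all counted back from the next IRG meeting. The following is a minimal sketch of that arithmetic, assuming that "two months" and "one month" mean calendar months and that the meeting start date is the reference point. The function names are hypothetical, and an IRG approved working schedule takes precedence over any computed date.

```python
import calendar
from datetime import date, timedelta


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month.
    """
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def first_checking_deadlines(meeting_start: date) -> dict:
    """Compute the three deadlines of the first checking stage (assumed reading)."""
    return {
        # Off-line reviewers: at least two months before the meeting.
        "reviewer_comments": months_before(meeting_start, 2),
        # Chief Editor's consolidated document: at least one month before.
        "consolidated_comments": months_before(meeting_start, 1),
        # Submitters' responses: one week before.
        "submitter_responses": meeting_start - timedelta(days=7),
    }
```

For a hypothetical meeting starting on 2025-03-17, the sketch yields 2025-01-17, 2025-02-17, and 2025-03-10 respectively.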
- 3.4. First Discussion and Conclusion Stage
This stage, which is during an IRG meeting, involves the following tasks:
- All participating experts should review the comments which are officially submitted before the meeting either in the ORT or with assigned IRG document numbers. The editorial group must reach a conclusion for each commented ideograph with written records. Guidelines for typical conclusions are provided in Section 4.
- All the conclusions must be agreed and endorsed by IRG plenary in its recommendations. As a result of the recommendations, some ideographs may be withdrawn or moved between M-set and D-set.
- IRG Chief Editor with the help of IRG ORT Manager will create a new version of the M-set and D-set and the list of withdrawn characters one month after IRG meeting.
- If more than 5% of the ideographs submitted by a specific submitter are removed as a result of duplication or unification with existing standardized ideographs, the entire submission of this submitter will be removed to ensure high quality of the project. This is known as the 5% rule described in Section 2.2.5 above.
- If new unification or non-unification cases or rules are agreed upon, such decisions must be recorded in a separate editorial report document. IRG should also instruct the IWDS Editor/co-editor to modify and update the IWDS according to the guidelines set out in Annex E of this document. This update will be reviewed and confirmed in the following IRG meeting so that it can be used in all future work. The 5% rule will not be applicable to unifications based on newly agreed unification rules.
- If decision on the number of working sets is not made in Section 3.1.e when the collection is first submitted, IRG will review the size of the collection and determine the number of working sets at this stage. The working sets will go through the review process one at a time from Section 3.2 to Section 3.8 until all working sets are completed.
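The 5% rule in this stage is a simple threshold test. The sketch below shows one way to express it, assuming that the counts are available as integers and that removals caused by newly agreed unification rules are tracked separately so that they can be excluded. The function and parameter names are hypothetical.

```python
def exceeds_five_percent_rule(submitted: int,
                              removed_total: int,
                              removed_under_new_rules: int = 0) -> bool:
    """Return True if more than 5% of a submitter's ideographs were removed.

    `removed_total` counts ideographs removed because of duplication or
    unification with existing standardized ideographs. Removals based on
    newly agreed unification rules are excluded, as the rule does not
    apply to them.
    """
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    counted = removed_total - removed_under_new_rules
    # Integer comparison avoids floating-point rounding at the boundary.
    return counted * 100 > submitted * 5
```

Note that exactly 5% does not trigger the rule, because the text says "more than 5%".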
- 3.5. Subsequent Checking Stage
This stage, which is between IRG meetings, involves the following tasks:
- Reviewers must check the newly created M-set and D-set for correctness and duplication.
- Reviewers should submit their comments either in ORT or in registered IRG documents to IRG Chief Editor at least two months before the next IRG meeting or according to IRG approved working schedule.
- IRG Chief Editor will consolidate the comments and produce a registered IRG document for circulation and discussion at least one month before the next IRG meeting or according to IRG approved working schedule.
- Submitters must submit their responses to questionable characters one week before the next IRG meeting. Questioned characters that receive no response will be postponed at the coming IRG meeting. All experts are encouraged to prepare and submit supplementary documents to facilitate discussion during the next IRG meeting.
- 3.6. Subsequent Consolidation and Conclusion Stage
This stage, which is during an IRG meeting, involves the following tasks:
- Participating experts in the editorial group must review the comments and draw a conclusion for each ideograph. Typical comment and conclusion examples for each set are provided in Section 4.
- All the conclusions must be agreed and endorsed by IRG plenary in its resolutions. As a result of the resolutions, some ideographs may be removed or moved between M-set and D-set.
- IRG Chief Editor will create a new version of M-set and D-set one month after IRG meeting, and produce a registered IRG document. The same update will be made in ORT by IRG ORT Manager.
- If more than 5% of the ideographs submitted by a specific submitter are removed as a result of duplication or unification with existing standardized ideographs, the entire submission of this submitter will be removed to ensure high quality of the project. This rule will not be applicable to new unifications based on rules added after the first checking stage.
Note from IRG Convenor: Item 3.6.d above was struck out in Version 17, so it will be removed in this version when it is finalized.
- 3.7. Final Checking Stage
This final stage, which is between IRG meetings, is a decision of a previous IRG meeting. This stage involves the following tasks:
- All reviewers are requested to check M-set intensively based on comments and conclusions made at all previous stages. At the final checking stage, all characters in the D-set of previous IRG meeting will be considered withdrawn characters. No ideographs are allowed to be moved from D-set to M-set although ideographs in the M-set can still be moved to D-set if problems are found.
- Reviewers must submit their comments in ORT or in registered IRG documents to IRG Chief Editor at least two months before the next IRG meeting.
- IRG Chief Editor will consolidate the comments and produce a registered IRG document for circulation and discussion at least one month before the next IRG meeting so that reviewers can have time to review them before the next IRG meeting. ORT will also be updated by IRG ORT Manager.
- Submitters who have questionable ideographs in the consolidated comments should submit their written response one week before the next IRG meeting.
- 3.8. Approval and Submission to WG 2
This stage, which is during an IRG meeting, involves the following tasks:
- Participating experts should review the comments on M-set and reach a conclusion for each ideograph.
- If there is no positive decision on an M-set ideograph, it will be withdrawn.
- With the approval of the majority of IRG experts, M-set is considered frozen as the new ideograph extension set to be submitted to WG 2. IRG Chief Editor with the help of the ISO/IEC 10646 Project Editor will prepare the submission in accordance with Section 2.6 of this document.
Once M-set is frozen as completed for submission to WG 2, records of characters in the D-set will no longer be maintained by IRG. Characters remaining in the D-set can be re-submitted in future working sets if pending problems are resolved. Reference to the serial number of the previous working set should be supplied in the new working set.
4. Guidelines for Comments and Resolutions on Working Sets
Generally speaking, reviewers should record comments on any problems to which they want to alert other reviewers. For comments related to glyph shape, the relevant component(s) of the problem glyph and the referenced glyph(s) should be marked in red circles/boxes in the comment files. Similarly, for comments concerning identical or different components of two or more ideographs, the corresponding components should be indicated in red circles/boxes in the comment files.
Note: For editors and experts using ORT to review data, most of the comments are standardized as selections, which speeds up the review process and makes the comments easier to consolidate.
All comments must be accompanied by the date (in YYYY-MM-DD format) and the designated IRG abbreviation (G, H, J, K, KP, M, MY, SAT, T, UK, UTC, V, or Z). All conclusions must be dated.
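The date and abbreviation requirement can be checked mechanically. The following is a minimal sketch, assuming that the abbreviation list given above is complete and that a comment's stamp is available as two separate strings. The function name is hypothetical.

```python
from datetime import datetime

# Abbreviations as listed in this section.
IRG_ABBREVIATIONS = {
    "G", "H", "J", "K", "KP", "M", "MY", "SAT", "T", "UK", "UTC", "V", "Z",
}


def is_valid_comment_stamp(abbreviation: str, date_text: str) -> bool:
    """Check the abbreviation and a strict YYYY-MM-DD calendar date."""
    if abbreviation not in IRG_ABBREVIATIONS:
        return False
    try:
        parsed = datetime.strptime(date_text, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates "2025-3-7"; the round trip enforces zero padding.
    return parsed.strftime("%Y-%m-%d") == date_text
```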
- 4.1. Guidelines for M-set
The ultimate target of M-set is a standardized ideograph set. As such, it must be carefully examined. If any suspicious characters are found, they will be moved to D-set or removed from the current working set altogether.
For comments on glyph shape, the relevant components of the ideographs should be marked in red circles/boxes in the comment file as shown below.
Similarly, for comments concerning identical or different components of two or more ideographs, the corresponding components should be marked in red circles/boxes in the comment file as shown below.
The table below gives examples of review comments and possible actions associated with these problems:
Possible Comment by a Reviewer | Possible Resolutions
Wrong or Missing Glyph | The wrong glyph is corrected, or the missing glyph supplied if evidence is provided. The ideograph can also be moved to D-set for manual checking if insufficient information is provided by the submitter.
Wrong Kangxi radical / stroke count / first stroke | Data will be corrected if agreement is made. Otherwise, the ideograph will be moved to D-set for further manual checking.
Wrong IDS | IDS will be corrected and the character will be moved to D-set for checking by the IDS checker. Move to D-set (in case IDS cannot be corrected).
May be unifiable with U+xxxxx (standardized ideograph) | Unified with U+xxxxx and the submitter will request a new source reference to U+xxxxx. Unified with U+xxxxx and the submitter will request that this character be treated as a Compatibility Ideograph. Unified to U+xxxxx and this entry will be removed. (May consider to register it as IVS.) Not unifiable.
May be unifiable with xxxxx (M-set ideograph) | Unified with xxxxx and this source reference will be attached to xxxxx. Unified with xxxxx and the submitter may consider registering it as a Compatibility Ideograph Character or an IVS. Not unifiable.
- 4.2. Guidelines for D-set
Ideographs in D-set are either the ones that cannot be checked automatically by the IDS checking algorithm or the ones whose attribute data have been questioned by reviewers or whose unification with other ideographs in the standard, working set(s) submitted to WG 2 or current working set has been proposed. For those ideographs that cannot be machine-checked by IDS matching, at least two non-submitter reviewers must check them manually to ensure that they are not unifiable with any ideographs in the standard, working set(s) submitted to WG 2, or current working set. For those ideographs that might be unifiable with other ideographs, the submitters are requested to prepare arguments and evidence to show that such ideographs should be separately encoded.
Possible Comment by IDS Checker | Possible Conclusions
Incomplete IDS / IDS with extra character / component is not an ideograph | IDS will be corrected and the character will be moved to M-set when next IDS-checking is done. Proper IDS cannot be generated and manual checking is needed.

Possible Comment by a Reviewer | Possible Conclusions
Wrong Kangxi radical / stroke count / first stroke | Data will be corrected. Proposal to correct data is not accepted, as it is an ambiguous case and IRG agrees that the original data are more appropriate.
Wrong IDS | IDS will be corrected and checked by the IDS checker again. Correct IDS cannot be generated and manual checking is needed.
May be unifiable with U+xxxxx (standardized ideograph) | Unified with U+xxxxx and a new source will be added to U+xxxxx. The new candidate entry should be deleted. Not unifiable, as shown by the evidence IRG Nxxxx. Move to M-set.
May be unifiable with xxxxx (M-set or D-set ideograph) | Unified with xxxxx in M-set and a new source will be added to xxxxx. The new candidate entry should be deleted from D-set. Unified with xxxxx in D-set and a new source will be added to xxxxx. The new candidate entry should be removed from D-set. Not unifiable, as shown by the evidence IRG Nxxxx. Move to M-set.
Checked against all ideographs in the standard, working set(s) submitted to WG 2 and current working set with radical X and stroke count of Y±2 for characters that cannot be described by IDS for automatic checking. | Move to M-set, as two non-submitter reviewers (XX and YY) confirmed that this ideograph is not unifiable with any existing ideographs in the standard, working set(s) submitted to WG 2, or current working set. Checking against ideographs with radical X may not be enough. This ideograph will also be checked against ideographs with radical Z.
Evidence does not match glyph | New evidence must be supplied or the character will be moved to D-set.
Evidence not clear | Character will be moved to D-set unless clear evidence is supplied.
Note from IRG Convenor: WG 2 is no longer accepting proposals for new CJK Compatibility Ideographs, so it seems prudent to strike out the items in the table above. The last CJK Compatibility Ideographs to be encoded were U+FA2E and U+FA2F in ISO/IEC 10646:2012 (Third Edition).
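The manual check mentioned in the D-set guidelines (radical X, stroke count Y±2) amounts to a filter over the ideographs to be compared. The sketch below is illustrative only and assumes each ideograph is a record with hypothetical "radical" and "strokes" fields. It is not a description of how ORT or any IRG tool selects candidates.

```python
def manual_check_candidates(radical, stroke_count, repertoire, tolerance=2):
    """Select ideographs with the same radical and a stroke count within
    `tolerance` of the given count (Y±2 by default)."""
    return [
        entry for entry in repertoire
        if entry["radical"] == radical
        and abs(entry["strokes"] - stroke_count) <= tolerance
    ]
```

As the table notes, a single radical may not be enough, so a reviewer might run the same filter again with a second radical.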
5. IRG Website
IRG maintains its own website at https://www.unicode.org/irg/, hosted by the Unicode Consortium. IRG meeting notices, recommendations, the document register, documents, standing documents, and other resources are made available on this website. The IRG Document Register page provides a document search feature. Compressed files are in ZIP format with a “.zip” extension.
6. IRG Document Registration
All documents to be formally discussed by IRG must be registered with IRG document numbers assigned by IRG Convenor and contain the submission date, title, name of the submitter or author, purpose (or summary), and the “IRG Repertoire Submission Summary Form” (when applicable).
- 6.1. Registration Procedure
The following gives the registration procedure:
- Request for Document Number: All documents submitted to IRG must have a document number. The number is to be assigned by IRG Convenor. The submitter should first contact IRG Convenor to request a document number, along with providing a document title. Once the document number is assigned, the information will be posted to the IRG document register. Document numbers may be pre-assigned during IRG meetings for submission between IRG meetings.
- Submission of Documents: All registered documents must be submitted to IRG Convenor. All documents must include an IRG document number in the form IRG Nxxxx, where xxxx is the assigned four-digit number. Note: Feedback to and comments on a given document will not be assigned a new document number. Rather, they will use the same document number with extensions to facilitate tracking for the same topic.
- File Size: While there is no actual file size limit for documents that are posted to the IRG document register, submitters are encouraged to keep file sizes to a minimum to the extent possible. Documents that include large attachments, or a large number of attachments, may include a download link in the IRG document register for a separate ZIP file. If the file size of the ZIP file is exceedingly large (over 100MB), which is usually the case for evidence images for IRG working set submissions, the size of the ZIP file will be shown in parentheses. DO NOT use ZIP files as attachments to PDF files, because some security settings may prevent such attachments from being extracted.
- Posting of Documents: Properly submitted documents are then posted by IRG Convenor to the IRG document register as official documents, and the submitters will be notified by IRG Convenor by email. The submitters should double-check the posted documents upon receiving the emails to ensure that the intended documents are properly posted for viewing by the public. In case of a large file, the submitter must first provide a method (for example, web download, FTP) for IRG Convenor to obtain the file. IRG Convenor will then post it on IRG large-file posting sites and provide the link(s) on IRG website for download.
- Disqualified Documents: Documents with certain basic information missing, such as the submitter’s name, title, or purpose, or files that are not of an appropriate size, may be rejected by IRG Convenor for posting. All other documents that fail to comply with the above registration process and the preliminary review by IRG Convenor for basic information will not be treated as IRG documents. As such, issues contained in such documents will not be formally discussed by IRG.
- Document Format: Submitted documents should use the most commonly used document format for easy viewing by members on all platforms. Static documents should use PDF. Data files intended for consolidation, revision, and processing can use other appropriate document formats depending on the nature of data, such as Microsoft Excel, CSV, plain text, or PNG.
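Because document numbers follow the form IRG Nxxxx, they can be located in text with a simple pattern. The sketch below assumes exactly four digits followed by an optional run of uppercase letters as the extension (for example, a trailing "R"). That suffix convention is an assumption, and older numbers with fewer digits would not match this pattern.

```python
import re

# "IRG N" + four digits + optional uppercase extension (assumed convention).
DOC_NUMBER = re.compile(r"\bIRG N(\d{4})([A-Z]*)\b")


def find_document_numbers(text: str) -> list:
    """Return (number, extension) pairs for each IRG document number found."""
    return [(m.group(1), m.group(2)) for m in DOC_NUMBER.finditer(text)]
```

A check of this kind could help confirm that a submitted document carries its assigned number in text form before it is sent to IRG Convenor.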
- 6.2. Contact for IRG Document Registration
Note from IRG Convenor: The items above were adjusted to reflect the current procedure.
The current IRG Convenor is Dr Ken LUNDE, whose contact information is shown below:
Dr Ken Lunde
Apple
One Apple Park Way
MS 953-1DES
Cupertino, CA 95014
USA
Mobile: +1-408-515-2618
Email: lunde@unicode.org
Annex A: Sorting Algorithm of Ideographs
IRG recognizes that the choice of radicals, the sequence of strokes, and the stroke counting methods are region dependent. Submitters may also have different preferences for character orderings. However, for the convenience of IRG editorial work, IRG must adopt a sorting order which may be different from the submitters’ preferences. Thus the principles of sorting of ideographs given below are internal for IRG editing purposes only. Ideographs consolidated for unification review must be sorted according to the following order:
- Kangxi Radical Order
Note: Ideographs with the simplified radicals listed below must be ordered after ideographs with the corresponding traditional radicals.
Traditional Radicals | Chinese Simplified Radicals | Non-Chinese Simplified Radicals
R090.0 爿 | R090.1 丬 |
R120.0 糸 | R120.1 纟 |
R147.0 見 | R147.1 见 |
R149.0 言 | R149.1 讠 |
R154.0 貝 | R154.1 贝 |
R159.0 車 | R159.1 车 |
R167.0 金 | R167.1 钅 |
R168.0 長 | R168.1 长 |
R169.0 門 | R169.1 门 |
R178.0 韋 | R178.1 韦 |
R181.0 頁 | R181.1 页 |
R182.0 風 | R182.1 风 | R182.2 𲋄 (⿻𠘨二)
R183.0 飛 | R183.1 飞 |
R184.0 食 | R184.1 饣 |
R187.0 馬 | R187.1 马 |
R195.0 魚 | R195.1 鱼 |
R196.0 鳥 | R196.1 鸟 |
R197.0 鹵 | R197.1 卤 |
R199.0 麥 | R199.1 麦 |
R201.0 黃 | R201.1 黄 |
R205.0 黽 | R205.1 黾 |
R208.0 鼠 | | R208.2 鼡
R210.0 齊 | R210.1 齐 | R210.2 斉
R211.0 齒 | R211.1 齿 | R211.2 歯
R212.0 龍 | R212.1 龙 | R212.2 & R212.3 竜 & 𱷥 (⿱立兆)
R213.0 龜 | R213.1 龟 | R213.2 亀
Note from IRG Convenor: A row for Radical #201 was added, because the UCD (Unicode Character Database) CJKRadicals.txt data file includes an entry for its Chinese simplified form as follows:
201'; 2EE9; 9EC4
Also, a second non-Chinese simplified radical was added for Radical #212, which is already supported by the ORT (212.3). Six ideographs in Unicode Version 17.0 specify this second non-Chinese simplified radical, one of which specifies it as its primary radical, and is ordered as such in the CJK Unified Ideographs Extension J block (U+33479; aka IRG Working Set 2021 #00027).
- Stroke Count
Note: For ideographs with multiple stroke count values, the first value will be used for sorting purposes.
Note: Simplified characters must be placed after traditional characters within the same stroke-number group.
Note from IRG Convenor: We no longer do this. Also see Section 2.2.1.d.(5).g). Hence, the original note will be removed.
- First Stroke
IRG Chief Editor will assign the first stroke based on documents IRG N954AR and IRG N1105. In case of previously unseen components, the IRG Chief Editor will use the value 6 per Annex K without regard to the submitters’ locale conventions.
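The three-level ordering above (Kangxi radical, stroke count, first stroke) can be expressed as a composite sort key. The sketch below is illustrative only. It assumes a hypothetical record layout in which the radical is written as "number.variant" (for example "182.1"), and it assumes that ordering simplified radicals after traditional ones means sorting by radical number first and variant number second.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    label: str            # hypothetical identifier for the ideograph
    radical: str          # "number.variant", e.g. "182.0" or "182.1"
    stroke_counts: list   # one or more values; the first is used for sorting
    first_stroke: int     # first stroke value


def sort_key(entry: Entry) -> tuple:
    number, _, variant = entry.radical.partition(".")
    return (
        int(number),             # Kangxi radical order
        int(variant or 0),       # traditional (0) before simplified (1, 2, ...)
        entry.stroke_counts[0],  # first stroke count value only
        entry.first_stroke,
    )


def sort_entries(entries):
    return sorted(entries, key=sort_key)
```

Because `sorted` is stable, entries that tie on all four levels keep their submission order.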
Annex B: IDS Matching
- B.1. Guidelines on Creation of IDSes
Each submitter should consult document IRG N1183 on IDSes. In addition to the Character Description Components (CDCs) defined in document IRG N1183, all CJK Unified Ideographs accepted by ISO/IEC 10646 and in its amendments are also qualified as CDCs in constructing IDSes.
The use of “overlapping” Ideographic Description Characters (IDCs) or more than four IDCs is considered to be “inappropriate” and may not be subject to IDS comparison.
- B.2. Requirements of IDS Matching
The IDS matching algorithm used by IRG should support the following features:
- Handling different split points.
(For example, ⿰亻頃 and ⿰化頁 should be matched.)
- Handling different split levels.
(For example, ⿰亻悉 and ⿰亻⿱釆心 should be matched.)
- Matching different glyphs of the same abstract shape.
(For example, ⿰礻申 and ⿰示申 should be matched.)
- Matching similar glyphs.
(For example, ⿰忄生 and ⿰小生 should be matched.)
- Matching IDSes with different orderings of overlapping IDCs.
(For example, ⿻三丨 and ⿻丨三 should be matched.)
- Matching unifiable IDC patterns.
(For example, ⿰麥离 and ⿺麥离 should be matched.)
- Handling any combinations of the above.
- Detecting any inappropriate IDSes, such as IDSes being too long, IDSes with non-ideographic CDCs, or missing or extra CDCs or IDCs.
- B.3. Limitation of IDS Matching
It should be noted that IDS matching cannot detect unification or duplication if a component cannot be encoded by an IDS, or if the glyph itself is very complex. IDS matching is done algorithmically. It is not versatile in detecting unifiable ideographs unless rules are explicitly given to the algorithm. Thus, it is not meant to be a replacement for manual checking. Rather, it is an assistive tool for quality assurance purposes to identify duplication and known cases of unification. Therefore, it is very important for submitters to make sure that their submitted ideographs are not going to be unified with any standardized ideographs, working set ideographs, or ideographs previously discussed in IRG.
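Three of the requirements in Section B.2 (different split points, different split levels, and different glyphs of the same abstract shape) can be illustrated with a small normalization routine. This is a sketch only and not the algorithm used by IRG. The decomposition and variant tables are hypothetical stand-ins for much larger data, and only a subset of IDCs is handled.

```python
# Arity of the IDCs handled by this sketch (a subset of all IDCs).
IDC_ARITY = {"⿰": 2, "⿱": 2, "⿻": 2, "⿺": 2, "⿲": 3, "⿳": 3}

# Hypothetical tables; real data would be far larger.
DECOMPOSE = {"頃": "⿰匕頁", "化": "⿰亻匕", "悉": "⿱釆心"}
VARIANTS = {"礻": "示"}


def parse(ids, pos=0):
    """Parse a prefix-notation IDS into a nested tuple; return (tree, next_pos)."""
    ch = ids[pos]
    if ch in IDC_ARITY:
        children = []
        pos += 1
        for _ in range(IDC_ARITY[ch]):
            child, pos = parse(ids, pos)
            children.append(child)
        return (ch, *children), pos
    return ch, pos + 1


def normalize(tree):
    """Expand components, map variants, and flatten same-direction nesting."""
    if isinstance(tree, str):
        tree = VARIANTS.get(tree, tree)
        if tree in DECOMPOSE:
            return normalize(parse(DECOMPOSE[tree])[0])
        return tree
    idc, *children = tree
    flat = []
    for child in map(normalize, children):
        # Merge nested left-right or top-bottom runs so that the split point
        # chosen by the submitter does not affect the comparison.
        if idc in ("⿰", "⿱") and isinstance(child, tuple) and child[0] == idc:
            flat.extend(child[1:])
        else:
            flat.append(child)
    return (idc, *flat)


def ids_match(a, b):
    return normalize(parse(a)[0]) == normalize(parse(b)[0])
```

The flattened tuples are a comparison form only and are not valid IDSes. Similar-glyph matching, reordering of overlapping IDCs, and unifiable IDC patterns would need additional rules, which is the limitation that Section B.3 describes.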
Annex C: Urgently Needed Ideographs
- C.1. Introduction
When a WG member body or an internationally-recognized organization, consortium, or individual, as a submitter, demonstrates an urgent need for a small number of ideographs to be standardized for justifiable reasons, such as ideographs in a recently developed regional or national standard that must be implemented by a particular deadline, IRG may submit the ideographs, independent of the current IRG working set to WG 2. Each urgently needed submission will be treated as a separate urgently needed repertoire, and a submitter can have no more than one active urgently needed submission at a time. The process will be started only sparingly with demonstrated need.
- C.2. Requirements
Each submission should include no more than 30 ideographs. Submissions of more than 30 characters will be accepted at the sole discretion of IRG. A submitter of urgently needed ideographs must prepare the following:
- All the documents required for normal ideograph submissions.
- Justifications of the submission. A submission is deemed urgently needed only if the submitter demonstrates urgency or a rationale for rapid standardization. Evidence of current use is not in and of itself evidence of urgent need. The type of use also needs to be taken into account. For example, requirements of government, industry, science, or scholarship will generally be taken as evidence of urgent need.
- A document that indicates whether, among the submitted urgently needed ideographs, there are any ideographs that can be unified with ideographs in the current IRG working set in addition to those in the standard or its amendments. When a particular urgently needed repertoire is accepted by WG 2, any unifiable ideographs in the current working set will be removed as explained in Section C.3 below.
- For the rest of the submitted urgently needed ideographs, the document must prove that they are not unifiable with any ideographs in the current working set. The proof may be provided by listing the documents the submitter has checked, and for each proposed ideograph, a list of ideographs whose radicals and strokes have been checked against. It is an important responsibility of the submitter to check with not only the standardized CJK ideographs, but also the working set(s) submitted to WG 2 for consideration and IRG working set for any unifiable characters against its submission. If a submitter fails to do the above, the submission will not be approved by IRG as an IRG-endorsed independent submission to WG 2.
- C.3. Dealing with Urgent Requests
Accepted urgently needed ideographs as independent submissions must be checked by IRG for correctness, duplication and unification against the latest published ISO/IEC 10646 as well as the current IRG working set. When an urgently needed ideograph is found to be identical or unifiable with any ideograph in the current IRG working set, the latter must be noted and removed from the current IRG working set.
Annex D: Up-to-date CJK Unified Ideograph Sources and Source References
The IRG Source Prefixes page provides the most current IRG sources, source prefixes, and source prefix descriptions, along with information about future IRG source prefixes, usually based on submissions for an IRG working set that is currently under review.
See Section 2.2.1.d.(5).a) for more information about constraints on the format of IRG source references.
This Annex is retained for historical purposes.
Note from IRG Convenor: The content of this Annex is being removed per Recommendation IRG M64.18 in document IRG N2765 and document IRG N2714R, and all references to this Annex are being replaced with a reference to the “Current IRG Source Prefixes” table on the IRG Source Prefixes page.
Annex E: Maintenance Procedure of IRG Working Document Series
- E.1. Introduction
IRG Working Document Series (IWDS) is a set of IRG-maintained documents which keep the up-to-date examples of CJK unification related cases to supplement the published Annex S of ISO/IEC 10646 for IRG unification work.
- E.2. IRG Working Document Series
The formats of the IWDS and the specific lists of examples are maintained as a separate set of documents as follows:
- Series 1: IRG Principles and Procedures (P&P), First Residual Stroke (FS) values, and Stroke Count (SC) values
- Series 2: List of Unifiable Component Variations (UCV) and Non-Unifiable Component Variations (NUCV) of ideographs
- Series 3: Assigned and unassigned CJK Unified Ideographs and CJK Compatibility Ideographs code points in the Unicode Standard (CJKUI); the original Series 3, NUCV, was merged with Series 2
- Series 4: List of Disunified Ideographs (DUI); the original Series 4, list of possibly Mis-Unified Ideographs (MUI), which included only a single document (IRG N1395) from IRG Meeting #29, was determined to be obsolete
- Series 5: Guidelines for normalizing handwritten ideographs to printed ideographs
Note from IRG Convenor: The descriptions of each IWDS series were synchronized with those on the IWDS page.
- E.3. Maintenance Procedure
The maintenance procedure describes how entries in the IWDS are added, removed, or changed. IRG has an appointed IWDS Editor (currently Mr. Yi BAI), who is in charge of the maintenance of the IWDS.
In principle, all update requests are results of IRG unification review work. A review cycle between two IRG meetings is needed. Every update must be discussed in at least one IRG meeting and confirmed in writing. An update normally starts from the unification review work assigned to reviewers in the past IRG meeting (Meeting No. N−1). During the review work before the next IRG meeting (Meeting No. N), if reviewers find duplicates, unifiable cases or mistakes which warrant a change in the IWDS, they need to report these cases in a specific form attached to IWD Series 1. These reported cases will then be consolidated by IRG Chief Editor before IRG Meeting No. N. During IRG Meeting No. N, time must be allocated to discuss these reported cases and conclusions must be recorded during this IRG meeting. Based on the confirmed conclusions on IWDS updates, IWDS Editor or IRG Convenor will update the IWDS. Any unclear conclusions will be further discussed in future meetings.
It takes time to update the IWDS, and sometimes it is difficult to find appropriate examples. IRG has, therefore, requested IWDS Editor to keep a log of the actions carried out based on IRG instructions so that better tracking of changes can be carried out.
See the Microsoft Excel file named “AnnexE.xlsx” in the attachments ZIP file to view the maintenance procedure as a flow chart.
Annex F: IRG Repertoire Submission Summary Form
See the Microsoft Word file named “AnnexF.docx” in the attachments ZIP file for the IRG Repertoire Submission Summary Form.
Annex G: Examples of New CJK Unified Ideographs Submissions (aka Vertical Extensions)
- Sample Data Files
All submitted characters must follow the submission format described in Section 2.2.1.d. See the data files that are attached to the submissions for the most recent IRG working set in the “Submitters” column of the first table in the “IRG Working Sets” section of the IRG home page for examples.
See the Microsoft Excel file named “SubmissionTemplate.xlsx” in the attachments ZIP file as a template for preparing the required data file.
- Sample Evidence
All character submissions must include evidence of use as specified in Section 2.2.3. For examples, see the evidence images and their descriptions in the ORT (Online Review Tool) for the most recent IRG working set in the “Working Set” column of the first table in the “IRG Working Sets” section of the IRG home page.
- Handling of Data with Privacy Concerns
IRG understands that privacy laws and practices in some countries and regions can make it difficult to submit complete records as evidence when they contain personal information. As a compromise, IRG suggests that submitters provide evidence in such a way that it does not reveal complete personal or internal information. However, the character information itself must be shown in the supplied evidence. In other words, partial document images should be supplied with certain sensitive information redacted.
As different departments/organizations may have different types of documents, IRG suggests that, for each type of document, a submitter should provide a sample document with any private information deleted. A good example is the original Basic Certificate of Family Relation Register in Korea as shown in Fig. G1. The evidence can be submitted as partial data in the form shown in Fig. G2.
Figure G1. The original Basic Certificate of Family Relation Register
Figure G2. A modified Basic Certificate of Family Relation Register (private information such as full name and date of birth has been deleted)
- Consideration for Acceptance of Characters that Cannot be Provided in Printed Form
The consideration for acceptance of characters that cannot be provided with evidence in printed form is not meant to relax IRG’s requirement for the provision of evidence of modern use. Rather, it is meant to facilitate e-government initiatives for computerized processing of information. It is under this presumption that IRG is willing to consider acceptance of characters already supported in computer systems that are maintained by a designated government body with wide use by government bodies and citizens for administrative public service.
IRG recognizes that some of the characters included in e-government systems cannot be provided with supporting evidence of actual use according to Section 2.2.3.a, and yet it is technically and administratively not practical to remove them from the systems. Thus IRG is willing to consider their acceptance without actual evidence provided that they are from already implemented working systems only. However, IRG requires the submitter to provide information on the quality assurance process for the maintenance of the character collection concerned. The submitter must supply information on the accessibility of the character collection and the working system, the stability and traceability of the collection, and the kind of evidence/information needed for approval of character removal, modification and addition by the administrative body of the collection.
Annex H: [Reserved for future use]
Annex H is purposely left out for the time being so that the numbers of IRG Annexes I and J synchronize with those of WG 2 Annexes I and J, whose subjects are the same.
This Annex is retained for historical purposes.
Annex I: Guideline for Handling of CJK Ideograph Unification and/or Dis-unification Error
See Annex I of WG 2 Principles and Procedures.
This Annex is retained for historical purposes.
Note from IRG Convenor: It is error-prone and a maintenance burden to maintain the same content in two places, so in lieu of repeating Annex I from WG 2 Principles and Procedures here, it is more efficient and practical to simply link to it as done above.
Annex J: Guideline for Correction of CJK Ideograph Mapping Table Errors
See Annex J of WG 2 Principles and Procedures.
This Annex is retained for historical purposes.
Note from IRG Convenor: It is error-prone and a maintenance burden to maintain the same content in two places, so in lieu of repeating Annex J from WG 2 Principles and Procedures here, it is more efficient and practical to simply link to it as done above.
Annex K: List of First Strokes
The table below lists the first strokes, including their glyphs and names in English and Chinese (with pinyin provided).
Glyph | Stroke Number | Name | Name in Chinese | Pinyin |
---|---|---|---|---|
一 | 1 | Horizontal bar | 橫 | héng |
丨 | 2 | Vertical bar | 豎 | shù |
丿 | 3 | Slash | 撇 | piě |
丶 | 4 | Dot | 點 | diǎn |
乙 | 5 | Turn | 折 | zhé |
Note that if a character has no residual strokes besides the radical, the value 0 should be used, and if its residual stroke value is unclear, the value 6 should be used.
Note from IRG Convenor: The value 6 was added per Recommendation IRG M64.18 in document IRG N2765 and document IRG N2715R.
Annex L: Guidelines for Forming Working Sets with an Upper Limit
As stated in Section 2.2.1.d.(1), IRG sets an upper limit for the number of ideographs in a working set to ensure sufficient time for delivering quality output in a timely manner. The current limit (LimitIRG) is set to about 10,000 ideographs. Also, each submission should not exceed 2,500 ideographs. Since the number of submissions and their repertoire sizes may differ each time a new collection is formed, IRG needs some basic guidelines on how the working sets can be formed in a fair manner to accommodate various needs. This Annex serves this purpose.
At the start of the development work, submitters submit their proposals. Let us assume that the number of submissions is N.
If the total number of ideographs is less than LimitIRG (or reasonably close to LimitIRG), all submissions will be used to form the working set of this collection.
If the total number of ideographs is much larger than LimitIRG, an upper limit must be set for each submission. The general principle, based on a simple mathematical calculation, is given below:
- Scenario 1: In the simplest case, the number of ideographs from each submission should not exceed Limitsingle_submission, where Limitsingle_submission = LimitIRG / N. This works especially well when all submission sizes are larger than Limitsingle_submission.
- Scenario 2: In case there is a submission with a total number of ideographs (TOTALsingle) less than Limitsingle_submission, the spare quota, Sparesingle_submission = Limitsingle_submission − TOTALsingle, can be equally divided among those submissions which exceed Limitsingle_submission. If spare quota remains afterwards, it can be divided under the same principle (recursively) among submissions which can still take the unused quota.
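The two scenarios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an IRG-defined procedure: the function name `allocate_quotas` is hypothetical, quotas are rounded down to whole ideographs, and any remainder smaller than the number of submissions still competing for it is left unallocated for IRG to settle by discussion.

```python
def allocate_quotas(sizes, limit_irg):
    """Return how many ideographs each submission may contribute.

    sizes     -- number of ideographs in each submission (TOTALsingle values)
    limit_irg -- upper limit for the whole working set (LimitIRG)
    """
    quotas = [0] * len(sizes)
    remaining = limit_irg
    # Submissions that can still take more quota.
    active = [i for i, size in enumerate(sizes) if size > 0]
    while remaining > 0 and active:
        # Equal share of whatever is left (first pass: LimitIRG / N).
        share = remaining // len(active)
        if share == 0:
            break  # assumption: tiny remainders are not split further
        for i in active:
            grant = min(share, sizes[i] - quotas[i])
            quotas[i] += grant
            remaining -= grant
        # Fully served submissions drop out; their spare quota is
        # redistributed among the rest on the next pass (Scenario 2).
        active = [i for i in active if quotas[i] < sizes[i]]
    return quotas


if __name__ == "__main__":
    # Three submissions competing for a hypothetical limit of 6,000.
    print(allocate_quotas([3000, 500, 4000], 6000))
```

In the usage example, the first pass gives each submission up to 2,000; the second submission needs only 500, so its spare 1,500 is split equally between the other two on the next pass.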
Even though the above mathematical method can set a quick and undisputed limit for each submission, it may not be the best solution when considering the practical needs of the submitters for different applications. Submitters are encouraged to subdivide their submissions and assign priority levels to the parts, with explanations and justifications. IRG can consider these justifications and agree on a division of LimitIRG close to the one given by the mathematical model above, with minor modifications.
It should be noted that the upper limit, LimitIRG, is indicative and set based on IRG’s experience from past reviews that targeted a three-year review cycle. Minor modifications to this limit are allowed because unification among submissions and the withdrawal of characters by submitters can potentially reduce the total number of characters eventually included in the repertoire for WG 2 submission.
If the current collection is too large to form a single working set, IRG will use the above principle to split the current collection into multiple working sets, and work on each of them in sequence. In this case, the subdivided sets will be named using the original call name with a subset name assigned in alphabetic sequence, such as IRG Working Set 2015A, IRG Working Set 2015B, and so on.
References
Document numbers in the first column in the following table refer to IRG working documents (ISO/IEC JTC 1/SC 2/WG 2/IRGNxxxx), except where noted otherwise. For documents with no link, one may try http://www.cse.cuhk.edu.hk/~irg/ ; some older documents may only be available in paper form (contact IRG Convenor Prof. Qin LU).
Doc Number | Subject | Source | Date |
---|---|---|---|
IRG N681 | Annex S (ISO/IEC 10646-1:2000 Excerpt) | PETERSON, Bruce; IRG Rapporteur | 1999-11-18 |
IRG N881 | CJK Extension C Submission Format | IRG | 2001-12-04 |
IRG N953 | Minutes (IRG Meeting #20) | IRG | 2002-11-21 |
IRG N954 | Report on first stroke/stroke count by ad hoc group | IRG | 2002-11-21 |
IRG N954AR | First residual strokes & stroke counts | IRG | 2002-11-21 |
IRG N955 | IRG Radical Classification | Ideographic Radical Ad Hoc | 2002-11-21 |
IRG N956 | Ideograph Unification | Ideographic Radical Ad Hoc | 2002-11-21 |
IRG N1105 | Amendments to IRG N954AR | IRG | 2005-01-03 |
IRG N1183 | Guidelines on IDS Decomposition | Japan | 2005-12-26 |
IRG N1197 | Sample evidences for CJK C1 candidates | Japan | 2006-05-22 |
JTC 1 N8557 | ISO/IEC JTC 1 Directives, 5th Edition, Version 3.0 (aka SC 2 N3933) | JTC 1 Secretariat | 2007-04-05 |
IRG N1372 | On better usage of IDS for IRG development process | KAWABATA, Taichi; KOBAYASHI, Tatsuo | 2007-11-09 |
WG 2 N4502 | Principles and Procedures for Allocation of New Characters and Scripts and handling of Defect Reports on Character Names | WG 2 | 2014-01-28 |
IRG N2221 | Stroke Counting Guidelines | IRG | 2017-07-24 |
IRG N2857 | Annex S (ISO/IEC 10646:2020 Excerpt) | WG 2 | 2020-12 |
Note from IRG Convenor: I was able to find the documents that were missing in Version 17 and posted them to the IRG document register, and I also used the opportunity to update and add documents. I still feel that some references can be removed or added.
Glossary
Abstract shape: Ideographic characters are used as symbols to represent different entities and used for different purposes. The same character conceptually can sometimes be written in different actual shapes with minor stroke differences, due to preference, which do not affect the recognition of the character as a unique symbol. These characters having the same abstract shapes are not encoded separately because ISO/IEC 10646 is a character (symbol) standard, not a glyph standard. In other words, character glyphs (actual shapes) that are considered to have the same abstract shapes are to be unified under the CJK unification rules (defined in Annex S of ISO/IEC 10646).
As ideographs are formed by both the components and the relative positioning of the components, glyph differences are examined by taking into consideration the meaning, the components, and their relative positions. Characters having different meanings and different actual shapes are not considered to have the same abstract shapes. Characters having the same components yet different relative positions are generally considered to have different abstract shapes. However, component differences are subject to examination by experts to determine whether they influence the recognition of the character as a whole, with consideration of the character’s origin and use. Annex S of ISO/IEC 10646 defines the examination procedure, which is given below:
The following features of each ideograph to be compared are examined:
- a) the number of components,
- b) the relative position of the components in each complete ideograph,
- c) the structure of corresponding components.
If one or more of the features a) to c) above are different between the ideographs in the comparison, the ideographs are considered to have different abstract shapes and are therefore not unified.
If all of the features a) to c) above are the same between the ideographs, the ideographs are considered to have the same abstract shape and are therefore unified.
Please also refer to Annex S in ISO/IEC 10646 for examples of characters and components that are considered to have the same abstract shape. IRG maintains an up-to-date Unification Examples List.
Character Description Component (CDC): Any symbol that can be used with the Ideographic Description Characters to form an Ideographic Description Sequence. It includes all encoded CJK unified ideographs, Kangxi Radicals, CJK Radical Supplements, and encoded CJK Compatibility ideographs.
CJK Compatibility Ideographs: Compatibility ideographs are defined in Section 19 of ISO/IEC 10646. Below is a direct quote from ISO/IEC 10646:2020 (Sixth Edition):
“The CJK Compatibility ideographs are ideographs that should have been unified with one of the CJK Unified ideographs, per the unification rule described in Annex S. However, they are included in this document as separate characters, because, based on various national, cultural, or historical reasons for some specific country and region, some national and regional standards assign separate code points for them.”
CJK Unified Ideographs: It refers to the collection of unified Han characters in the ISO/IEC 10646 standard. CJK stands for Chinese, Japanese and Korean. The term CJK Unified Ideographs was adopted in the earlier years of IRG to reflect the development work of the Han character unification from the three languages at that time. It is obvious today that Han unification extends far beyond the scripts used in China, Japan and Korea. However, the term continues to be used in the standardization process and has not changed.
D-set (discussion set): D-set is the set of characters that have been reviewed by IRG experts with pending issues which need further discussion/evidence for inclusion in the M-set of a working set.
Ideographic Description Characters (IDC): The 17 characters defined in ISO/IEC 10646 at the code points U+2FF0 through U+2FFF and U+31EF: ⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻⿼⿽⿾⿿㇯.
Ideographic Description Sequence (IDS): IDSes describe characters using their components and indicating the relative positions of the components. IDCs are considered operators to the components.
IDSes can be expressed by a context-free grammar in Backus-Naur Form (BNF). The grammar G has four components:
Let G = {Σ, N, P, S}, where
- Σ: the set of terminal symbols including all encoded radicals and encoded ideographs (referred to as CDC, Character Description Components), and the 17 IDCs.
- N: the set of 6 non-terminal symbols
- N = {IDS, IDS1, Unary_Symbol, Binary_Symbol, Ternary_Symbol, CDC}
- P: a set of rewriting rules
- S: {IDS}, which is the start symbol of the grammar
The following is the set of rewriting rules P:
- <IDS> ::= <Unary_Symbol><IDS1> | <Binary_Symbol><IDS1><IDS1> | <Ternary_Symbol><IDS1><IDS1><IDS1>
- <IDS1> ::= <IDS> | <CDC>
- <CDC> ::= encoded_ideograph | encoded_radical | encoded_component
- <Unary_Symbol> ::= ⿾ | ⿿
- <Binary_Symbol> ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻ | ⿼ | ⿽ | ㇯
- <Ternary_Symbol> ::= ⿲ | ⿳
Note #1: Even though the IDCs are terminal symbols, they are not part of the CDCs.
Note #2: Other than the binary symbol ⿻, whose use indicates the overlay of two components, all the other binary (12) and ternary (2) IDCs take the IDS components (either 2 or 3) in a specific order. The order is indicated in the following table, which includes all IDCs and their code points in code-point order:
U+2FF0 | U+2FF1 | U+2FF2 | U+2FF3 | U+2FF4 | U+2FF5 | U+2FF6 | U+2FF7 |
---|---|---|---|---|---|---|---|
⿰ | ⿱ | ⿲ | ⿳ | ⿴ | ⿵ | ⿶ | ⿷ |
U+2FF8 | U+2FF9 | U+2FFA | U+2FFB | U+2FFC | U+2FFD | U+2FFE | U+2FFF |
⿸ | ⿹ | ⿺ | ⿻ | ⿼ | ⿽ | ⿾ | ⿿ |
U+31EF | | | | | | | |
㇯ | | | | | | | |
Note from IRG Convenor: We should seriously consider replacing the grammar and its description above with the grammar that is provided in Section 18.2.1 of the Core Specification of the Unicode Standard:
- IDS := Ideographic | Radical | CJK_Stroke | Private Use | U+FF1F | IDS_UnaryOperator IDS | IDS_BinaryOperator IDS IDS | IDS_TrinaryOperator IDS IDS IDS
- CJK_Stroke := U+31C0 | ... | U+31E5
- IDS_UnaryOperator := U+2FFE | U+2FFF
- IDS_BinaryOperator := U+2FF0 | U+2FF1 | U+2FF4 | ... | U+2FFD | U+31EF
- IDS_TrinaryOperator := U+2FF2 | U+2FF3
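As an illustration of how grammar G can be checked mechanically, the following Python sketch tests whether a string is a well-formed IDS by counting the operands that each IDC still expects. The names `ARITY` and `is_valid_ids` are hypothetical. As a simplifying assumption, every character that is not an IDC is treated as a CDC, so the sketch does not verify that a component is actually an encoded ideograph, radical, or component.

```python
# Operand counts for the 17 IDCs, following the rewriting rules of grammar G.
UNARY = {"\u2FFE", "\u2FFF"}
TERNARY = {"\u2FF2", "\u2FF3"}
# U+2FF0..U+2FFD without the two ternary IDCs, plus U+31EF: 13 binary IDCs.
BINARY = ({chr(cp) for cp in range(0x2FF0, 0x2FFE)} - TERNARY) | {"\u31EF"}

ARITY = {}
ARITY.update({idc: 1 for idc in UNARY})
ARITY.update({idc: 2 for idc in BINARY})
ARITY.update({idc: 3 for idc in TERNARY})


def is_valid_ids(text):
    """Return True if text is exactly one IDS under grammar G."""
    # The start symbol IDS always begins with an IDC; a bare CDC is only an IDS1.
    if not text or text[0] not in ARITY:
        return False
    pending = 1  # number of IDS1 operands still expected
    for ch in text:
        if pending == 0:
            return False  # extra characters after a complete sequence
        # An IDC consumes one slot and opens ARITY[ch] new ones;
        # a CDC (assumed: any non-IDC character) just consumes one slot.
        pending += ARITY.get(ch, 0) - 1
    return pending == 0


if __name__ == "__main__":
    print(is_valid_ids("\u2FF0\u6728\u6728"))              # left-right, two components
    print(is_valid_ids("\u2FF1\u8279\u2FF0\u65E5\u6708"))  # nested IDS as an operand
    print(is_valid_ids("\u2FF0\u6728"))                    # one operand missing
```

A single counter is enough because the sequences are written in prefix order, so no explicit parse tree is needed to decide well-formedness.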
Ideographic Variation Database (IVD): A database maintained by the Unicode Consortium to keep registered Ideographic Variation Sequences (IVSes). See IVD and UTS #37 (Unicode Ideographic Variation Database) for more details.
Ideographic Variation Sequence (IVS): A sequence of two encoded characters, the first being a character with the Ideographic property that is not canonically nor compatibly decomposable, the second being a variation selector character in the range from U+E0100 to U+E01EF per UTS #37. The first character is referred to as a base character. The purpose of IVSes is to define specific variant glyphs that are unifiable and otherwise cannot be encoded according to CJK unification rules. The sequence is a representation of a unifiable character glyph that is not identical to the base character. A registered IVS in an IVD collection has a specific glyph shape defined. Once registered, an IVS can be used as though it were a standardized ideograph.
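The structural part of this definition can be sketched as a short Python check. The function name `looks_like_ivs` is hypothetical, and the check is deliberately partial: Python's standard library does not expose the Ideographic property, so the sketch only verifies that the sequence has two code points, that the base has no canonical or compatibility decomposition, and that the selector lies in the range U+E0100 to U+E01EF. It does not consult the IVD, so it cannot tell whether a sequence is registered.

```python
import unicodedata

VS_FIRST = 0xE0100  # first variation selector usable in an IVS
VS_LAST = 0xE01EF   # last variation selector usable in an IVS


def looks_like_ivs(seq):
    """Return True if seq has the two-character shape of an IVS.

    Assumption: the Ideographic property of the base character is NOT
    checked here, because the standard library does not provide it.
    """
    if len(seq) != 2:
        return False
    base, selector = seq
    # An empty decomposition mapping means the base is neither
    # canonically nor compatibly decomposable.
    if unicodedata.decomposition(base) != "":
        return False
    return VS_FIRST <= ord(selector) <= VS_LAST


if __name__ == "__main__":
    print(looks_like_ivs("\u845B\U000E0100"))  # unified ideograph + selector
    print(looks_like_ivs("\uFA10\U000E0100"))  # compatibility ideograph as base
```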
IRG Collection: An IRG Collection refers to a collection of ideographs consolidated from submissions received upon IRG’s call for proposals. The collection is named using the year of the call in four-digit format. A collection called in 2015 is named IRG Collection 2015. An IRG collection may be split into multiple working sets to warrant effective reviews.
IRG Working Document Series (IWDS): A set of IRG-maintained documents which keep the up-to-date examples of CJK unification related cases to supplement the published Annex S of ISO/IEC 10646 for IRG unification work.
M-set (main set): M-set is the set of characters that have been reviewed and accepted by IRG experts without pending questions in the current working set.
New Source: Any CJK source that is newly submitted by IRG experts which is not yet accepted by ISO/IEC 10646, and thus is not present in the “Current IRG Source Prefixes” table on the IRG Source Prefixes page. The experts may first submit their new source to IRG for acceptance. Once accepted, the characters in that source can be accepted by IRG for consideration for inclusion in future extensions. IRG will also submit the source to WG 2 for approval and inclusion in the “Current IRG Source Prefixes” table on the IRG Source Prefixes page.
Nonce characters/ideographs: A nonce character is a character created specifically for use in a literary work, or series of works, and not intended for general usage. In some cases these characters may even be copyrighted or registered in some way. Nonce words are common in literature like science fiction as the name of some imaginary gadget or alien race, or a character created for literature, commercial or fiscal use. Nonce words in any language are usually quite short lived and so are not included in ordinary dictionaries unless over many years they have gained wider use.
Normalization: In text processing, a process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence. In the ISO/IEC 10646 standard, normalization refers specifically to processing to ensure that canonical-equivalent (and/or compatibility-equivalent) strings have unique representations. All CJK Compatibility Ideographs normalize to become their canonical equivalents, which are CJK Unified Ideographs. In IRG work, a process of considering alternate forms of Han ideographs, sometimes handwritten, that are ultimately represented using a common, correct, or modern form.
Note from IRG Convenor: This glossary entry was blank in Version 17, so I defined it to the best of my ability. Any and all feedback is welcome.
Regular Script (楷書): Regular script refers to text written in Song style (宋體) and Ming style (明體) which are considered printed forms. It also refers to writing in Kai style (楷體) which is a formal brush style in hand written form. Other more ancient text written formally such as clerical style (隸書) and small seal (小篆) are quite different and are not considered regular script by IRG. Informal writing and artistic expressions written in semi-cursive (行書) and cursive (草書) styles are not considered regular scripts.
Source: A reputable published document such as a dictionary, a standardization document, or a well published and widely read or referenced book which is accepted by IRG as authoritative such that the characters in this source are considered reliable, stable, and suitable for consideration of inclusion. A set of ISO/IEC 10646 accepted sources is listed in the “Current IRG Source Prefixes” table on the IRG Source Prefixes page.
Urgently Needed Characters: Urgently Needed Characters are a collection of ideographs submitted by a WG 2 member body or an ISO-recognized organization with a size no bigger than 30 (normally). If the submitter can demonstrate an urgent need for the ideographs to be standardized for justifiable reasons, IRG will accept them for review and endorsement to WG 2 for acceptance independent of the current working set. IRG will not initiate any call for urgently needed characters.
Working set: A working set is the set of characters accepted by IRG as a collection or part of a collection to work on for standardization in ISO/IEC 10646. Characters accepted in a working set are subject to review by IRG experts for inclusion in a particular extension. IRG uses the year of the call in four-digit format to name its new collections. If a new collection is split into multiple working sets, an additional letter of the alphabet will be used to name these working sets.
Versions
See the Series 1 — P&P, FS & SC column in the table on the IRG Working Document Series (IWDS) page for a complete listing of the previous approved versions of this document.
Additional interim drafts are available in the IRG document register, specifically for Versions 1, 2, 3, 6, 10, and 12, but they are not listed in that table. That table is intended to list only the latest approved version—or the latest draft, if not approved—of each version of this document. In addition, while documents IRG N1741, IRG N1772, and IRG N1823 (and its three revisions: IRG N1823R, IRG N1823R2, and IRG N1823R3) appear to be various—and conflicting—drafts of Version 5, they are actually drafts of Version 6. Document IRG N1646 is the approved version of Version 5.
Note from IRG Convenor: It seemed better to add this new section rather than to try to exhaustively list previous versions at the very beginning of the document, some of which were incorrect: IRG N1942 should have been IRG N1952, and IRG N2427 should have been IRG N2424.