Chapter 18
East Asia
This chapter presents scripts used in East Asia. This includes major writing systems associated with Chinese, Japanese, and Korean. It also includes several scripts for minority languages spoken in southern China, as well as the historic Khitan Small Script of northern China, and the historic Tangut script.
The characters that are now called East Asian ideographs, and known as Han ideographs in the Unicode Standard, were developed in China in the second millennium BCE. The basic system of writing Chinese using ideographs has not changed since that time, although the set of ideographs used, their specific shapes, and the technologies involved have developed over the centuries. The encoding of Chinese ideographs in the Unicode Standard is described in Section 18.1, Han. For more on usage of the term ideograph, see “Logosyllabaries” in Section 6.1, Writing Systems.
As civilizations developed surrounding China, they frequently adapted China’s ideographs for writing their own languages. Japan, Korea, and Vietnam all borrowed and modified Chinese ideographs for their own languages. Chinese is an isolating language, monosyllabic and noninflecting, and ideographic writing suits it well. As Han ideographs were adopted for unrelated languages, however, extensive modifications were required.
Chinese ideographs were originally used to write Japanese, for which they are, in fact, ill suited. As an adaptation, the Japanese developed two syllabaries, Hiragana and Katakana, whose shapes are simplified or stylized versions of certain ideographs. (See Section 18.4, Hiragana and Katakana.) Chinese ideographs are called kanji in Japanese and are still used, in combination with Hiragana and Katakana, in modern Japanese.
In Korea, Chinese ideographs were originally used to write Korean, for which they are also ill suited. The Koreans developed an alphabetic system, Hangul, discussed in Section 18.6, Hangul. The shapes of Hangul syllables or the letter-like jamos from which they are composed are not directly influenced by Chinese ideographs. However, the individual jamos are grouped into syllabic blocks that resemble ideographs both visually and in the relationship they have to the spoken language (one syllable per block). Chinese ideographs are called hanja in Korean and are still used together with Hangul in South Korea for modern Korean. The Unicode Standard includes a complete set of Korean Hangul syllables as well as the individual jamos, which can also be used to write Korean. Section 3.12, Conjoining Jamo Behavior, describes how to use the conjoining jamos and how to convert between the two methods for representing Korean.
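The relationship between precomposed Hangul syllables and conjoining jamos can be illustrated with a short sketch. It assumes the index arithmetic and constants given in Section 3.12 (the names `S_BASE`, `decompose_syllable`, and so on are chosen here for illustration) and cross-checks the result against the normalization forms provided by Python's standard `unicodedata` module.

```python
import unicodedata

# Constants for the Hangul syllable arithmetic (see Section 3.12).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables


def decompose_syllable(syllable: str) -> str:
    """Return the conjoining jamo sequence for one precomposed syllable."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable; leave unchanged
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    jamos = chr(leading) + chr(vowel)
    if trailing != T_BASE:  # T_BASE itself means "no trailing consonant"
        jamos += chr(trailing)
    return jamos


syllable = "\uD55C"  # HANGUL SYLLABLE HAN
jamos = decompose_syllable(syllable)
print([f"U+{ord(c):04X}" for c in jamos])

# The same conversion, in both directions, via normalization.
print(unicodedata.normalize("NFD", syllable) == jamos)
print(unicodedata.normalize("NFC", jamos) == syllable)
```

In practice an implementation would normally rely on its normalization library rather than reimplementing the arithmetic; the explicit function is shown only to make the mapping visible.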
In Vietnam, a set of native ideographs was created for Vietnamese based on the same principles used to create new ideographs for Chinese. These Vietnamese ideographs were used through the beginning of the 20th century and are occasionally used in more recent signage and other limited contexts.
Yi was originally written using a set of ideographs invented in imitation of the Chinese. Modern Yi as encoded in the Unicode Standard is a syllabary derived from these ideographs and is discussed in Section 18.7, Yi.
Bopomofo, discussed in Section 18.3, Bopomofo, is another recently invented syllabic system, used to represent Chinese phonetics.
In all these East Asian scripts, the characters (Chinese ideographs, Japanese kana, Korean Hangul syllables, and Yi syllables) are written within uniformly sized rectangles, usually squares. Traditionally, the basic writing direction followed the conventions of Chinese handwriting, in top-down vertical lines arranged from right to left across the page. Under the influence of Western printing technologies, a horizontal, left-to-right directionality has become common, and proportional fonts are seeing increased use, particularly in Japan. Horizontal, right-to-left text is also found on occasion, usually for shorter texts such as inscriptions or store signs. Diacritical marks are rarely used, although phonetic annotations are not uncommon. Older editions of the Chinese classics sometimes use the ideographic tone marks (U+302A..U+302D) to indicate unusual pronunciations of characters.
Many older character sets include characters intended to simplify the implementation of East Asian scripts, such as variant punctuation forms for text written vertically, halfwidth forms (which occupy only half a rectangle), and fullwidth forms (which allow Latin letters to occupy a full rectangle). These characters are included in the Unicode Standard for compatibility with older standards.
Appendix E, Han Unification History, describes how the diverse typographic traditions of mainland China, Taiwan, Japan, Korea, and Vietnam have been reconciled to provide a common set of ideographs in the Unicode Standard for all these languages and regions.
Nüshu is a siniform script devised by and for women to write the local Chinese dialect of southeastern Hunan province, China. Nüshu is based on Chinese Han characters. Unlike Chinese, the characters typically denote the phonetic value of syllables. Less often Nüshu characters are used as ideographs. Although very few fluent Nüshu users were alive in the late twentieth century, the script has drawn national and international attention, leading to the study and preservation of the script.
The Lisu script was developed in the early 20th century by using a combination of Latin letters, rotated Latin letters, and Latin punctuation repurposed as tone letters, to create a writing system for the Lisu language, spoken by large communities, mostly in Yunnan province in China. It sees considerable use in China, where it has been an official script since 1992.
The Miao script was created in 1904 by adapting Latin letter variants, English shorthand characters, Miao pictographs, and Cree syllable forms. The script was originally developed to write the Northeast Yunnan Miao language of southern China. Today it is also used to write other Miao dialects and the languages of the Yi and Lisu nationalities of southern China.
Tangut is a large, historic siniform ideographic script used to write the Tangut language, a Tibeto-Burman language spoken from about the 11th century CE until the 16th century in the area of present-day northwestern China. Tangut was re-discovered in the late 19th century, and has been largely deciphered. Today the script is of interest to students and scholars.
Khitan Small Script was created about 925 CE, and was one of two scripts used by the Khitan people of Northern China to write the Khitan language during the Liao dynasty, the Qara Khitai empire, and the Jin dynasty. It is only partially deciphered. The script contains logograms and phonograms written in vertical columns, running right to left, similar to how Chinese is traditionally written.
#18.1 Han
#18.1.1 CJK Unified Ideographs
The Unicode Standard contains a set of unified Han ideographic characters used in the written Chinese, Japanese, and Korean languages. The term Han, derived from the Chinese Han Dynasty, refers generally to Chinese traditional culture. The Han ideographic characters make up a coherent script, which was traditionally written vertically, with the vertical lines ordered from right to left. In modern usage, especially in technical works and in computer-rendered text, the Han script is written horizontally from left to right and is freely mixed with Latin or other scripts. When used in writing Japanese or Korean, the Han characters are interspersed with other scripts unique to those languages (Hiragana and Katakana for Japanese; Hangul syllables for Korean).
Although the term “CJK”—Chinese, Japanese, and Korean—is used throughout this text to describe the languages that currently use Han ideographic characters, it should be noted that earlier Vietnamese writing systems were based on Han ideographs. Consequently, the term “CJKV” would be more accurate in a historical sense. Han ideographs are still used for historical, religious, and pedagogical purposes in Vietnam. For more on usage of the term ideograph, see “Logosyllabaries” in Section 6.1, Writing Systems.
The term “Han ideographic characters” is used within the Unicode Standard as a common term traditionally used in Western texts, although “sinogram” is preferred by professional linguists. Taken literally, the word “ideograph” applies only to some of the ancient original character forms, which indeed arose as ideographic depictions. The vast majority of Han characters were developed later via composition, borrowing, and other non-ideographic principles, but the term “Han ideographs” remains in English usage as a conventional cover term for the script as a whole.
The Han ideographic characters constitute a very large set, numbering in the tens of thousands. They have a long history of use in East Asia. Enormous compendia of Han ideographic characters exist because of a continuous, millennia-long scholarly tradition of collecting all Han character citations, including variant, mistaken, and nonce forms, into annotated character dictionaries.
The Unicode Standard draws its unified Han character repertoire from a number of different character set standards. These standards are grouped into a number of sources listed in tables in Appendix E.3, CJK Sources.
Because of the large size of the Han ideographic character repertoire, and because of the particular problems that the characters pose for standardizing their encoding, this character block description is more extended than that for other scripts and is divided into several subsections. The first subsection, “Blocks Containing Han Ideographs,” describes the way in which the Unicode Standard divides Han ideographs into blocks. This subsection is followed by an extended discussion of the characteristics of Han characters, with particular attention being paid to the problem of unification of encoding for characters used for different languages. There is a formal statement of the principles behind the Unified Han character encoding adopted in the Unicode Standard and the order of its arrangement. For a detailed account of the background and history of development of the Unified Han character encoding, see Appendix E, Han Unification History.
#18.1.2 Blocks Containing Han Ideographs
Han ideographic characters are found in several blocks of the Unicode Standard, as shown in Table 18-1.
Block | Range | Comment |
---|---|---|
CJK Unified Ideographs | 4E00–9FFF | Common |
CJK Unified Ideographs Extension A | 3400–4DBF | Rare |
CJK Unified Ideographs Extension B | 20000–2A6DF | Rare, historic |
CJK Unified Ideographs Extension C | 2A700–2B73F | Rare, historic |
CJK Unified Ideographs Extension D | 2B740–2B81F | Urgently needed |
CJK Unified Ideographs Extension E | 2B820–2CEAF | Rare, historic |
CJK Unified Ideographs Extension F | 2CEB0–2EBEF | Rare, historic |
CJK Unified Ideographs Extension G | 30000–3134F | Rare, historic |
CJK Unified Ideographs Extension H | 31350–323AF | Rare, historic |
CJK Unified Ideographs Extension I | 2EBF0–2EE5F | Urgently needed |
CJK Compatibility Ideographs | F900–FAFF | Duplicates, unifiable variants, corporate characters |
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants |
Characters in the unified ideograph blocks are defined by the IRG, based on Han unification principles explained later in this section.
The two compatibility ideographs blocks contain various duplicate or unifiable variant characters encoded for round-trip compatibility with various legacy standards. For historic reasons, the CJK Compatibility Ideographs block also contains twelve CJK unified ideographs. Those twelve ideographs are clearly labeled in the code charts for that block.
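As an illustration of how the block ranges in Table 18-1 might be consulted in software, the following sketch maps a character to the name of the Han block that contains it. The table `HAN_BLOCKS` and the function `han_block` are hypothetical names introduced here; the ranges are copied from Table 18-1. A block lookup says nothing about whether a given code point is actually assigned.

```python
from bisect import bisect_right

# (first, last, block name), sorted by first code point; from Table 18-1.
HAN_BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
    (0x2CEB0, 0x2EBEF, "CJK Unified Ideographs Extension F"),
    (0x2EBF0, 0x2EE5F, "CJK Unified Ideographs Extension I"),
    (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
    (0x30000, 0x3134F, "CJK Unified Ideographs Extension G"),
    (0x31350, 0x323AF, "CJK Unified Ideographs Extension H"),
]
_STARTS = [first for first, _, _ in HAN_BLOCKS]


def han_block(ch: str):
    """Return the Han block name containing ch, or None if there is none."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0:
        first, last, name = HAN_BLOCKS[i]
        if cp <= last:
            return name
    return None


for sample in ("\u4E00", "\u3400", "\U00020000", "\uFA0E", "A"):
    print(f"U+{ord(sample):04X}", han_block(sample))
```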
#Extensions to the URO. The initial repertoire of the CJK Unified Ideographs block included characters submitted to the IRG prior to 1992, consisting of commonly used characters. That initial repertoire, also known as the Unified Repertoire and Ordering, or URO, was derived entirely from the G, T, J, and K sources. The repertoire in the CJK Unified Ideographs block has subsequently been extended with small sets of unified ideographs or ideographic components needed for interoperability with various standards, or for other reasons, as shown in Table 18-2. The range U+9FFD..U+9FFF filled the reserved space at the end of this block.
Range | Version | Comment |
---|---|---|
9FA6–9FB3 | 4.1 | Interoperability with HKSCS standard |
9FB4–9FBB | 4.1 | Interoperability with GB 18030 standard |
9FBC–9FC2 | 5.1 | Interoperability with commercial implementations |
9FC3 | 5.1 | Correction of mistaken unification |
9FC4–9FC6 | 5.2 | Interoperability with ARIB standard |
9FC7–9FCB | 5.2 | Interoperability with HKSCS standard |
9FCC | 6.1 | Interoperability with commercial implementations |
9FCD–9FCF | 8.0 | Interoperability with TGH 2013 standard |
9FD0 | 8.0 | Correction of mistaken unification |
9FD1–9FD5 | 8.0 | Miscellaneous urgently needed characters |
9FD6–9FE9 | 10.0 | Ideographs for Slavonic transcription |
9FEA | 10.0 | Correction of mistaken unification |
9FEB–9FED | 11.0 | Ideographs for chemical elements |
9FEE–9FEF | 11.0 | Interoperability with government implementations |
9FF0–9FFC | 13.0 | Zoological, chemical, and geological terms |
9FFD–9FFF | 14.0 | Interoperability with government implementations |
4DB6–4DBF | 13.0 | Corrections of mistaken unifications |
2A6D7–2A6DD | 13.0 | Gongche characters for Kunqu Opera |
2A6DE–2A6DF | 14.0 | Interoperability with government implementations |
2B735–2B736 | 14.0 | Corrections of mistaken unifications |
2B737 | 14.0 | Urgently needed character |
2B738 | 14.0 | Correction of mistaken unification |
2B739 | 15.0 | Urgently needed character |
#Extensions to Other CJK Blocks. Starting with Version 13.0, some of the small repertoire extensions have involved reserved ranges at the end of other CJK blocks. Those ranges are also shown in Table 18-2. The range U+4DB6..U+4DBF filled the reserved space at the end of the CJK Unified Ideographs Extension A block, the range U+2A6DE..U+2A6DF filled the reserved space at the end of the CJK Unified Ideographs Extension B block, and the range U+2B735..U+2B739 used reserved space at the end of the CJK Unified Ideographs Extension C block.
#Han Ideographs for Slavonic Transcription. The URO includes twenty CJK Unified Ideographs, U+9FD6 through U+9FE9, which are used for transcribing Slavonic literary documents into Chinese. Renewed contact between the Russian and Chinese Empires from the 18th to the 20th centuries led to the translation of Slavonic literary documents into both classical and vernacular Chinese. The Russian Mission in Beijing was a driving force behind this effort, and many of these characters were coined by Archimandrite Gurias, who was the head of the 14th Russian Mission (1858–1864). Although some existing CJK Unified Ideographs can be used for transcribing Slavonic, these twenty characters are distinct. Many of these characters are unusual in that they represent syllables not usually found in Chinese.
#Other Large CJK Extensions. Characters in the CJK Unified Ideographs Extension A block are rare and are not unifiable with characters in the CJK Unified Ideographs block. They were submitted to the IRG during 1992–1998 and are derived entirely from the G, T, J, K, and V sources.
The CJK Unified Ideographs Extension B block contains rare and historic characters that are also not unifiable with characters in the CJK Unified Ideographs block. They were derived from versions of national standards submitted to the IRG during 1998–2000. The characters encoded in Extension B may, in some instances, differ slightly from published versions of those standards.
The CJK Unified Ideographs Extension C through I blocks mostly contain rare, historic, uncommon, or urgently needed characters that are not unifiable with characters in any previously encoded CJK Unified Ideographs block. Extension D and Extension I stand apart in that they are made up of urgently needed characters: those in Extension D come from various regions, whereas those in Extension I come entirely from China. Extension C ideographs were submitted to the IRG during 2002–2006. Extension D ideographs were submitted to the IRG during 2006–2009. Extension E ideographs were submitted to the IRG during 2006–2013. Extension F ideographs were submitted to the IRG during 2012–2015. Extension G ideographs were submitted during 2015, and Extension H ideographs during 2017.
#Principles for Extensions. The only principled difference in the unification work done by the IRG on the unified ideograph blocks is that the Source Separation Rule (rule R1) was applied only to the original CJK Unified Ideographs block and not to the extension blocks. The Source Separation Rule states that ideographs that are distinctly encoded in a source must not be unified. (For further discussion, see “Principles of Han Unification” later in this section.)
The unified ideograph blocks are not closed repertoires. Each may contain a small range of reserved code points at the end of the block. Additional unified ideographs may eventually be encoded in those ranges—as has already occurred in the CJK Unified Ideographs block, as well as in Extensions A through C. There is no guarantee that any such Han ideographic additions would be of the same types or from the same sources as preexisting characters in the block, and implementations should be careful not to make hard-coded assumptions regarding the range of assignments within the Han ideographic blocks in general.
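One way to avoid hard-coded assumptions about assignments is to ask the character database available at run time. The sketch below is a minimal illustration under that assumption: it counts the assigned code points in the CJK Unified Ideographs block using Python's `unicodedata` module. The helper name `assigned_in_range` is hypothetical, and the count printed depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

BLOCK_FIRST, BLOCK_LAST = 0x4E00, 0x9FFF  # CJK Unified Ideographs block


def assigned_in_range(first: int, last: int) -> int:
    """Count code points in [first, last] that are assigned in this runtime."""
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) != "Cn"  # Cn = unassigned
    )


count = assigned_in_range(BLOCK_FIRST, BLOCK_LAST)
print("Unicode data version:", unicodedata.unidata_version)
print("Assigned in block:", count)
```

Running the same code against a newer database may report a larger count, which is the reason a fixed upper bound such as U+9FA5 should not be baked into an implementation.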
Several Han characters that are unique to the U source and are not unifiable with other characters in the CJK Unified Ideographs block are found in the CJK Compatibility Ideographs block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining characters in the CJK Compatibility Ideographs block and the CJK Compatibility Ideographs Supplement block are either duplicates or unifiable variants of a character in one of the blocks of unified ideographs.
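The distinction between these twelve characters and the true compatibility ideographs is visible in their decomposition mappings. The following sketch, which uses only Python's `unicodedata` module, checks that the twelve code points listed above have no decomposition, whereas a compatibility ideograph such as U+F900 decomposes to a unified ideograph. The list name is introduced here for illustration.

```python
import unicodedata

# The twelve unified ideographs located in the CJK Compatibility Ideographs block.
UNIFIED_IN_COMPAT_BLOCK = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]

# None of the twelve has a decomposition mapping, so normalization
# leaves them unchanged.
for cp in UNIFIED_IN_COMPAT_BLOCK:
    ch = chr(cp)
    assert unicodedata.decomposition(ch) == ""
    assert unicodedata.normalize("NFC", ch) == ch

# A true compatibility ideograph is replaced by its unified counterpart
# under any normalization form.
compat = "\uF900"
print(unicodedata.decomposition(compat))
print(f"U+{ord(unicodedata.normalize('NFC', compat)):04X}")
```

Because normalization replaces compatibility ideographs but not these twelve, text that must preserve a compatibility ideograph across normalization needs another mechanism, such as a variation sequence.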
#IICore. IICore (International Ideograph Core) is a set of important Han ideographs, incorporating characters from all the defined blocks. This set of nearly 10,000 characters has been developed by the IRG and represents the set of characters in everyday use throughout East Asia. By covering the characters in IICore, developers guarantee that they can handle all the needs of almost all of their customers. This coverage is of particular use on devices such as cell phones or PDAs, which have relatively stringent resource limitations. Characters in IICore are explicitly tagged as such in the Unihan Database (see Unicode Standard Annex #38, “Unicode Han Database (Unihan)”).
#UnihanCore2020. UnihanCore2020 is a set of over 20,000 Han ideographs. The set includes 68 compatibility characters necessary for some regions. Like IICore, this set is intended to cover the needs of customers in East Asia, but its repertoire is much larger because of the increased memory and storage capacity of contemporary hardware, including mobile devices. The repertoire of the UnihanCore2020 subset is identified with the kUnihanCore2020 key in the Unihan Database. See Unicode Standard Annex #38, “Unicode Han Database (Unihan)”.
#18.1.3 General Characteristics of Han Ideographs
The authoritative Japanese dictionary Koujien (1983) defines Han characters to be:
...characters that originated among the Chinese to write the Chinese language. They are now used in China, Japan, and Korea. They are logographic (each character represents a word, not just a sound) characters that developed from pictographic and ideographic principles. They are also used phonetically. In Japan they are generally called kanji (Han, that is, Chinese, characters) including the “national characters” (kokuji) such as touge (mountain pass), which have been created using the same principles.
For many centuries, written Chinese was the accepted written standard throughout East Asia. The influence of the Chinese language and its written form on the modern East Asian languages is similar to the influence of Latin on the vocabulary and written forms of languages in the West. This influence is immediately visible in the mixture of Han characters and native phonetic scripts (kana in Japan, hangul in Korea) as now used in the orthographies of Japan and Korea (see Table 18-3).
Han Character | Chinese | Japanese | Korean | English Translation |
---|---|---|---|---|
天 | tiān | ten, ame | chen | heaven, sky |
地 | dì | chi, tsuchi | ji | earth, ground |
人 | rén | jin, hito | in | man, person |
山 | shān | san, yama | san | mountain |
水 | shuǐ | sui, mizu | su | water |
上 | shàng | jou, ue | sang | above |
下 | xià | ka, shita | ha | below |
The evolution of character shapes and semantic drift over the centuries has resulted in changes to the original forms and meanings. For example, the Chinese character 湯 tāng (Japanese tou or yu, Korean thang), which originally meant “hot water,” has come to mean “soup” in Chinese. “Hot water” remains the primary meaning in Japanese and Korean, whereas “soup” appears in more recent borrowings from Chinese, such as “soup noodles” (Japanese tanmen; Korean thangmyen). Still, the identical appearance and similarities in meaning are dramatic and more than justify the concept of a unified Han script that transcends language.
The “nationality” of the Han characters became an issue only when each country began to create coded character sets (for example, China’s GB 2312-80, Japan’s JIS X 0208-1978, and Korea’s KS C 5601-87) based on purely local needs. This problem appears to have arisen more from the priority placed on local requirements and a lack of coordination with other countries than from conscious design. Nevertheless, the identity of the Han characters is fundamentally independent of language, as shown by dictionary definitions, vocabulary lists, and encoding standards.
#Terminology. Several standard romanizations of the term used to refer to East Asian ideographic characters are commonly used. They include hànzì (Chinese), kanzi (Japanese), kanji (colloquial Japanese), hanja (Korean), and Chữ hán (Vietnamese). The standard English translations for these terms are interchangeable: Han character, Han ideographic character, East Asian ideographic character, or CJK ideographic character. For clarity, the Unicode Standard uses some subset of the English terms when referring to these characters. The term Kanzi is used in reference to a specific Japanese government publication. The unrelated term Kangxi (which is a Chinese reign name, rather than another romanization of “Han character”) is used only when referring to the primary dictionary used for determining Han character arrangement in the Unicode Standard. (See Table 18-7.)
#Distinguishing Han Character Usage Between Languages. There is some concern that unifying the Han characters may lead to confusion because they are sometimes used differently by the various East Asian languages. Computationally, Han character unification presents no more difficulty than employing a single Latin character set that is used to write languages as different as English and French. No one expects the characters “c”, “h”, “a”, and “t” alone to indicate whether chat is a French word for cat or an English word meaning “informal talk.” Likewise, we depend on context to identify the American hood (of a car) with the British bonnet. Few computer users are confused by the fact that ASCII can also be used to represent such words as the Welsh word ynghyd, which are strange looking to English eyes. Although it would be convenient to identify words by language for programs such as spell-checkers, it is neither practical nor productive to encode a separate Latin character set for every language that uses it.
Similarly, the Han characters are often combined to “spell” words whose meaning may not be evident from the constituent characters. For example, the two characters “to cut” and “hand” mean “postage stamp” in Japanese, but the compound may appear to be nonsense to a speaker of Chinese or Korean (see Figure 18-1).
Even within one language, a computer requires context to distinguish the meanings of words represented by coded characters. The word chuugoku in Japanese, for example, may refer to China or to a district in central west Honshuu (see Figure 18-2).
Coding these two characters as four so as to capture this distinction would probably cause more confusion and still not provide a general solution. The Unicode Standard leaves the issues of language tagging and word recognition up to a higher level of software and does not attempt to encode the language of the Han characters.
#Simplified and Traditional Chinese. There are currently two main varieties of written Chinese: “simplified Chinese” (jiǎntǐzì), used in most parts of the People’s Republic of China (PRC) and Singapore, and “traditional Chinese” (fántǐzì), used predominantly in the Hong Kong and Macao SARs, Taiwan, and overseas Chinese communities. The process of interconverting between the two is a complex one. This complexity arises largely because a single simplified form may correspond to multiple traditional forms, such as U+53F0 台, which is a traditional character in its own right and the simplified form for U+6AAF 檯, U+81FA 臺, and U+98B1 颱. Moreover, vocabulary differences have arisen between Mandarin as spoken in Taiwan and Mandarin as spoken in the PRC, the most notable of which is the usual name of the language itself: guóyǔ (the National Language) in Taiwan and pǔtōnghuà (the Common Speech) in the PRC. Merely converting the character content of a text from simplified Chinese to the appropriate traditional counterpart is insufficient to change a simplified Chinese document to traditional Chinese, or vice versa. (The vast majority of Chinese characters are the same in both simplified and traditional Chinese.)
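The one-to-many correspondence described above can be made concrete with a small sketch. The mapping below contains only the single example given in the text (U+53F0 and its three additional traditional counterparts); the names `TRADITIONAL_CANDIDATES` and `traditional_readings` are hypothetical, and a real converter would need a full mapping table plus word-level context to choose among the candidates.

```python
from itertools import product

# Simplified form -> possible traditional forms (example from the text only).
TRADITIONAL_CANDIDATES = {
    "\u53F0": ["\u53F0", "\u6AAF", "\u81FA", "\u98B1"],
}


def traditional_readings(text: str) -> list:
    """Enumerate every traditional string a character-level mapping allows."""
    options = [TRADITIONAL_CANDIDATES.get(ch, [ch]) for ch in text]
    return ["".join(choice) for choice in product(*options)]


one = traditional_readings("\u53F0")
two = traditional_readings("\u53F0\u53F0")
print(len(one), len(two))  # candidates multiply with each ambiguous character
```

The number of candidate strings grows multiplicatively with each ambiguous character, which is why character-by-character substitution alone cannot convert a document reliably.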
There are two PRC national standards, GB 2312-80 and GB 12345-90, which are intended to represent simplified and traditional Chinese, respectively. The character repertoires of the two are the same, but the simplified forms occur in GB 2312-80 and the traditional ones in GB 12345-90. These are both part of the IRG G source, with traditional forms and simplified forms separated where they differ. As a result, the Unicode Standard contains a number of distinct simplifications for characters, such as U+8AAC 說 and U+8BF4 说.
While there are lists of official simplifications published by the PRC, most of these are obtained by applying a few general principles to specific areas. In particular, there is a set of radicals (such as U+2F94 ⾔ KANGXI RADICAL SPEECH, U+2F99 ⾙ KANGXI RADICAL SHELL, U+2FA8 ⾨ KANGXI RADICAL GATE, and U+2FC3 ⿃ KANGXI RADICAL BIRD) for which simplifications exist (U+2EC8 ⻈ CJK RADICAL C-SIMPLIFIED SPEECH, U+2EC9 ⻉ CJK RADICAL C-SIMPLIFIED SHELL, U+2ED4 ⻔ CJK RADICAL C-SIMPLIFIED GATE, and U+2EE6 ⻦ CJK RADICAL C-SIMPLIFIED BIRD). The basic technique for simplifying a character containing one of these radicals is to substitute the simplified radical, as in the previous example.
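The radical pairs named above can be inspected programmatically. The following sketch looks up the character names with Python's `unicodedata` module and pairs each traditional radical with its simplified counterpart. The list `RADICAL_PAIRS` and the one-entry table `SIMPLIFIED_FORM` are illustrative names introduced here; note that the substitution itself operates on whole encoded characters through a mapping table, because an encoded ideograph cannot be edited radical by radical.

```python
import unicodedata

# (traditional radical, simplified radical), as listed in the text.
RADICAL_PAIRS = [
    ("\u2F94", "\u2EC8"),  # speech
    ("\u2F99", "\u2EC9"),  # shell
    ("\u2FA8", "\u2ED4"),  # gate
    ("\u2FC3", "\u2EE6"),  # bird
]

for traditional, simplified in RADICAL_PAIRS:
    print(
        f"U+{ord(traditional):04X} {unicodedata.name(traditional)} -> "
        f"U+{ord(simplified):04X} {unicodedata.name(simplified)}"
    )

# Character-level mapping for the example pair mentioned earlier in the text:
# the speech radical is simplified, and the result is a separately encoded character.
SIMPLIFIED_FORM = {"\u8AAC": "\u8BF4"}
print(f"U+8AAC -> U+{ord(SIMPLIFIED_FORM[chr(0x8AAC)]):04X}")
```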
The Unicode Standard does not explicitly encode all simplified forms for traditional Chinese characters. Where the simplified and traditional forms exist as different encoded characters, each should be used as appropriate. The Unicode Standard does not specify how to represent a new simplified form that can be derived algorithmically from an encoded traditional form or, more rarely, a new traditional form that can be derived from an encoded simplified form.
#Early Forms of Chinese. Prior to the 20th century, the standard form of written Chinese was literary Chinese, a form derived from the classical Chinese that was written, but probably not spoken, by Confucius in the sixth century BCE.
The repertoire of CJK unified ideographs encoded in the Unicode Standard covers modern Chinese, literary Chinese, and classical Chinese.
#Sorting Han Ideographs. The Unicode Standard does not define a method by which ideographic characters are sorted; the requirements for sorting differ by locale and application. Possible collating sequences include phonetic, radical-stroke (Kangxi, Xinhua Zidian, and so on), four-corner, and total stroke count. Raw character codes alone are seldom sufficient to achieve a usable ordering in any of these schemes; ancillary data are usually required. (See Table 18-7 for a summary of the authoritative sources used to determine the order of Han ideographs in the code charts.)
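The need for ancillary data can be seen by comparing two orderings of the seven characters in Table 18-3. The sketch below sorts them first by raw code point and then by their Mandarin readings as given in that table. The dictionary `PINYIN` and the helper `toneless` are illustrative; ignoring tones is a simplifying assumption of this sketch, not a recommended collation rule.

```python
import unicodedata

# Ancillary data: Mandarin readings taken from Table 18-3.
PINYIN = {
    "天": "tiān", "地": "dì", "人": "rén", "山": "shān",
    "水": "shuǐ", "上": "shàng", "下": "xià",
}


def toneless(reading: str) -> str:
    """Strip tone marks by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", reading)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


chars = list(PINYIN)
by_code_point = sorted(chars)  # raw code point order
by_pinyin = sorted(chars, key=lambda ch: toneless(PINYIN[ch]))

print("".join(by_code_point))
print("".join(by_pinyin))
```

The code point order follows the radical-stroke arrangement of the code charts, so it bears no relation to the phonetic order; a production system would use a collation library with locale-specific tailoring rather than a hand-built key.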
#Character Glyphs. In form, Han characters are monospaced. Every character takes the same vertical and horizontal space, regardless of how simple or complex its particular form is. This practice follows from the long history of printing and typographical practice in China, which traditionally placed each character in a square cell. For handwritten text, there are also a number of named cursive styles for Han characters, but the cursive forms of the characters tend to be quite idiosyncratic and are not implemented in general-purpose Han character fonts for computers.
There may be a wide variation in the glyphs used in different countries and for different applications. The most commonly used typefaces in one country may not be used in others.
The types of glyphs used to depict characters in the Han ideographic repertoire of the Unicode Standard have been constrained by available fonts. Users are advised to consult authoritative sources for the appropriate glyphs for individual markets and applications. It is assumed that most Unicode implementations will provide users with the ability to select the font (or mixture of fonts) that is most appropriate for a given locale.
#18.1.4 Principles of Han Unification
#Three-Dimensional Conceptual Model. To develop the explicit rules for unification, a conceptual framework was developed to model the nature of Han ideographic characters. This model expresses written elements in terms of three primary attributes: semantic (meaning, function), abstract shape (general form), and actual shape (instantiated, typeface form). These attributes are graphically represented in three dimensions according to the X, Y, and Z axes (see Figure 18-3).
The semantic attribute (represented along the X axis) distinguishes characters by meaning and usage. Distinctions are made between entirely unrelated characters such as 澤 (marsh) and 機 (machine) as well as extensions or borrowings beyond the original semantic cluster such as 机₁ (a phonetic borrowing used as a simplified form of 機) and 机₂ (table, the original meaning).
The abstract shape attribute (the Y axis) distinguishes the variant forms of a single character with a single semantic attribute (that is, a character with a single position on the X axis).
The actual shape (typeface) attribute (the Z axis) is for differences of type design (the actual shape used in imaging) of each variant form.
Z-axis typeface and stylistic differences are generally ignored for the purpose of encoding Han ideographs, but can be represented in text by the use of variation sequences; see Section 23.4, Variation Selectors.
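A variation sequence consists of a base character followed by a variation selector. The sketch below shows how such sequences appear in a string and how a process that only cares about the encoded ideograph might strip the selectors. The function names are hypothetical, and the sample sequence (U+845B followed by U+E0100) is used purely as an illustration; whether a particular sequence is registered is determined by the Ideographic Variation Database, not by this code.

```python
# Variation selector ranges: VS1..VS16 and VS17..VS256.
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return any(first <= cp <= last for first, last in VS_RANGES)


def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors, keeping only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


base = "\u845B"
sequence = base + "\U000E0100"  # base ideograph + VS17 (illustrative)

print(len(sequence))  # two code points, rendered as one glyph if supported
print(strip_variation_selectors(sequence) == base)
```

Stripping selectors discards the Z-axis distinction, so it is appropriate for operations such as searching but not for storage or interchange of text whose glyph shapes matter.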
#18.1.5 Unification Rules
The following rules were applied during the process of merging Han characters from the different source character sets.
#R1 Source Separation Rule. If two ideographs are distinct in a primary source standard, then they are not unified.
- This rule is sometimes called the round-trip rule because its goal is to facilitate a round-trip conversion of character data between an IRG source standard and the Unicode Standard without loss of information.
- This rule was applied only for the work on the original CJK Unified Ideographs block [also known as the Unified Repertoire and Ordering (URO)]. The IRG dropped this rule in 1992 and will not use it in future work.
Figure 18-4 illustrates six variants of the CJK ideograph meaning “sword.”
Each of the six variants in Figure 18-4 is separately encoded in one of the primary source standards—in this case, J0 (JIS X 0208-1990), as shown in Table 18-4.
Unicode | JIS |
---|---|
U+5263 | J0-3775 |
U+528D | J0-5178 |
U+5271 | J0-517B |
U+5294 | J0-5179 |
U+5292 | J0-517A |
U+91FC | J0-6E5F |
Because the six sword characters are historically related, they are not subject to disunification by the Noncognate Rule (R2) and thus would ordinarily have been considered for possible abstract shape-based unification by R3. Under that rule, the fourth and fifth variants would probably have been unified for encoding. However, the Source Separation Rule required that all six variants be separately encoded, precluding them from any consideration of shape-based unification. Further variants of the “sword” ideograph, U+5251 and U+528E, are also separately encoded because of application of the Source Separation Rule—in that case applied to one or more Chinese primary source standards, rather than to the J0 Japanese primary source standard.
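The round-trip goal behind the Source Separation Rule can be illustrated with the data in Table 18-4. The following is a minimal sketch rather than a normative conversion table: the names `SWORD_TO_J0`, `invert`, and `round_trips` are hypothetical, and the J0 codes are copied from the table as opaque labels.

```python
# Table 18-4 as data: Unicode code point -> J0 (JIS X 0208-1990) label.
SWORD_TO_J0 = {
    0x5263: "J0-3775",
    0x528D: "J0-5178",
    0x5271: "J0-517B",
    0x5294: "J0-5179",
    0x5292: "J0-517A",
    0x91FC: "J0-6E5F",
}


def invert(mapping):
    """Build the reverse mapping, refusing to merge two keys into one value."""
    reverse = {}
    for key, value in mapping.items():
        if value in reverse:
            raise ValueError(f"{value!r} is not unique; a round trip would lose data")
        reverse[value] = key
    return reverse


J0_TO_SWORD = invert(SWORD_TO_J0)


def round_trips(code_point):
    """True if Unicode -> J0 -> Unicode returns the original code point."""
    return J0_TO_SWORD[SWORD_TO_J0[code_point]] == code_point
```

Had the fourth and fifth variants been unified, two J0 codes would map to one Unicode code point and `invert` would raise for the J0-to-Unicode table built from that mapping, which is the loss of information the rule was meant to prevent.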
#R2 Noncognate Rule. In general, if two ideographs are unrelated in historical derivation (noncognate characters), then they are not unified.
For example, the ideographs in Figure 18-5, although visually quite similar, are nevertheless not unified because they are historically unrelated and have distinct meanings.
#R3 By means of a two-level classification (described next), the abstract shape of each ideograph is determined. Any two ideographs that possess the same abstract shape are then unified provided that their unification is not disallowed by either the Source Separation Rule or the Noncognate Rule.
#18.1.6 Abstract Shape
#Two-Level Classification. Using the three-dimensional model, characters are analyzed in a two-level classification. The two-level classification distinguishes characters by abstract shape (Y axis) and actual shape of a particular typeface (Z axis). Variant forms are identified based on the difference of abstract shapes.
To determine differences in abstract shape and actual shape, the structure and features of each component of an ideograph are analyzed as follows.
#Ideographic Component Structure. The component structure of each ideograph is examined. A component is a geometrical combination of primitive elements. Various ideographs can be configured with these components used in conjunction with other components. Some components can be combined to make a component more complicated in its structure. Therefore, an ideograph can be defined as a component tree with the entire ideograph as the root node and with the bottom nodes consisting of primitive elements (see Figure 18-6 and Figure 18-7).
#Ideograph Features. The following features of each ideograph to be compared are examined:
- Number of components
- Relative positions of components in each complete ideograph
- Structure of a corresponding component
- Treatment in a source character set
- Radical contained in a component
#Uniqueness or Unification. If one or more of these features are different between the ideographs compared, the ideographs are considered to have different abstract shapes and, therefore, are considered unique characters and are not unified. If all of these features are identical between the ideographs, the ideographs are considered to have the same abstract shape and are unified.
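The comparison just described can be sketched as a record of the five listed features and an all-or-nothing equality test. This is only an illustration: the class `ShapeFeatures`, its field encodings, and the helper names are hypothetical, and the actual analysis is a matter of expert judgment rather than a mechanical comparison.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ShapeFeatures:
    """Hypothetical record of the five features compared between ideographs."""
    component_count: int          # number of components
    layout: str                   # relative positions, e.g. "left-right"
    component_structures: tuple   # structure of each corresponding component
    source_treatment: str         # treatment in a source character set
    radical: str                  # radical contained in a component


def differing_features(a, b):
    """Names of the features on which two ideographs differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def same_abstract_shape(a, b):
    """Candidates share an abstract shape only if every feature is identical."""
    return not differing_features(a, b)
```

A single differing feature is enough to keep two ideographs apart; only when `differing_features` returns an empty list are they candidates for unification, and even then R1 and R2 may still disallow it.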
#Spatial Positioning. Ideographs may exist as a unit or may be a component of more complex ideographs. A source standard may describe a requirement for a component with a specific spatial positioning that would be otherwise unified on the principle of having the same abstract shape as an existing full ideograph. Examples of spatial positioning for ideographic components are left half, top half, and so on.
#Examples. The examples in Table 18-5 illustrate the reasons for not unifying characters, including typical differences in abstract character shape.
Reason | Characters |
---|---|
Non-cognate characters | |
Characters treated as distinct in a source character set | |
Different number of components | |
Same number of components placed in different relative positions | |
Same number and same relative position of components, corresponding components structured differently | |
Characters with different radical in a component | |
Differences in the actual shapes of ideographs that have been unified are illustrated in Table 18-6.
Reason | Characters |
---|---|
Different writing sequence | |
Differences in overshoot at the stroke termination | |
Differences in contact of strokes | |
Differences in protrusion at the folded corner of strokes | |
Differences in bent strokes | |
Differences in stroke termination | |
Differences in accent at the stroke initiation | |
Difference in rooftop modification | |
Difference in rotated strokes/dots† | |
† These ideographs (having the same abstract shape) would have been unified except for the Source Separation Rule.
#18.1.7 Han Ideograph Arrangement
The arrangement of the Unicode Han characters is based on the positions of characters as they are listed in four major dictionaries. The Kangxi Zidian was chosen as primary because it contains most of the source characters and because the dictionary itself and the principles of character ordering it employs are commonly used throughout East Asia.
The Han ideograph arrangement follows the index (page and position) of the dictionaries listed in Table 18-7 with their priorities.
Priority | Dictionary | City | Publisher | Version |
---|---|---|---|---|
1 | Kangxi Zidian | Beijing | Zhonghua Bookstore, 1989 | Seventh edition |
2 | Dai Kan-Wa Jiten | Tokyo | Taishuukan Shoten, 1986 | Revised edition |
3 | Hanyu Da Zidian | Chengdu | Sichuan Cishu Publishing, 1986 | First edition |
4 | Dae Jaweon | Seoul | Samseong Publishing Co. Ltd, 1988 | First edition |
When a character is found in the Kangxi Zidian, it follows the Kangxi Zidian order. When it is not found in the Kangxi Zidian and it is found in Dai Kan-Wa Jiten, it is given a position extrapolated from the Kangxi position of the preceding character in Dai Kan-Wa Jiten. When it is not found in either Kangxi or Dai Kan-Wa, then the Hanyu Da Zidian and Dae Jaweon dictionaries are consulted in a similar manner.
Ideographs with simplified Kangxi radicals are placed in a group following the traditional Kangxi radical from which the simplified radical is derived. For example, characters with the simplified radical ⻈ corresponding to Kangxi radical ⾔ follow the last nonsimplified character having ⾔ as a radical. The arrangement for these simplified characters is that of the Hanyu Da Zidian.
The few characters that are not found in any of the four dictionaries are placed following characters with the same Kangxi radical and stroke count. The radical-stroke order that results is a culturally neutral order. It does not exactly match the order found in common dictionaries.
Information for sorting all CJK ideographs by the radical-stroke method is found in the Unihan Database (see Unicode Standard Annex #38, “Unicode Han Database (Unihan)”). It should be used if characters from the various blocks containing ideographs (see Table 18-1) are to be properly interleaved. Note, however, that there is no standard way of ordering characters with the same radical-stroke count; for most purposes, Unicode code point order would be as acceptable as any other way.
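A radical-stroke sort with a code point tie-breaker might look like the sketch below. It assumes values in the "radical.residual strokes" form used by the kRSUnicode field of the Unihan Database. The table `RS_VALUES` is hand-entered for illustration and should be checked against the Unihan Database before any real use; the function names are hypothetical.

```python
# Illustrative radical-stroke values ("radical.residual_strokes").
# Assumption: hand-entered sample data, not read from the Unihan Database.
RS_VALUES = {
    "說": "149.7",
    "言": "149.0",
    "七": "1.1",
    "丁": "1.1",
    "一": "1.0",
    "㐀": "1.4",   # from Extension A, so its code point precedes U+4E00
}


def radical_stroke_key(ch):
    """Sort key: (radical number, residual strokes, code point)."""
    radical, strokes = RS_VALUES[ch].split(".")
    # Apostrophes mark simplified radicals in Unihan data; this sketch
    # simply ignores them and sorts on the radical number alone.
    return (int(radical.rstrip("'")), int(strokes), ord(ch))


def sort_radical_stroke(chars):
    """Order characters by radical-stroke, breaking ties by code point."""
    return sorted(chars, key=radical_stroke_key)
```

In this sample, 丁 and 七 share the same radical-stroke count and fall back to code point order, while 㐀 sorts after both even though its code point is lower, which is the interleaving across blocks that plain code point order cannot provide.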
Details regarding the form of the online charts for the CJK unified ideographs are discussed in Section 24.2, CJK and Other Ideographs.
#18.1.8 Radical-Stroke Indices
Various radical-stroke indices are provided on the Unicode website to ease the search for particular Han ideographs in the Unicode Standard. An interactive radical-stroke index page enables queries by specific Kangxi radical numbers and the number of residual strokes. Three radical-stroke indices are also provided in PDF format. The more extensive of them covers all of the ideographs in the CJK Unified Ideographs and CJK Compatibility Ideographs blocks. There are also more compact radical-stroke indices that are limited to the Han ideographs as specified by the IICore and UnihanCore2020 subsets.
The most authoritative source for radical-stroke information is the eighteenth-century Kangxi dictionary, which established the classification system of 214 radicals. The main issue with using Kangxi radicals today is that many simplified ideographs are difficult to classify under the system of 214 Kangxi radicals. As a result, various modern radical classification systems have been established. However, none of them is in general use, and the 214 Kangxi radicals remain the most universally recognized to this day. See “CJK and Kangxi Radicals” later in this section for more details.
According to the traditional radical-stroke classification system, each Han ideograph is considered to be written with a radical plus its residual strokes. For example, the ideograph 說 is assigned to the radical 言 and has seven residual strokes. To find the ideograph 說 in a dictionary, one would first locate the section for its radical, 言, and then find the subsection for ideographs with seven residual strokes. With the exception of ideographs that are classified under a simplified radical, simplified ideographs are generally classified under the same radical as their traditional forms. For example, the simplified ideograph 伣 and its traditional form, 俔, are both classified under the radical ⼈.
This classification system is complicated by the fact that there are occasional ambiguities in the counting of strokes of the radical itself or the ideograph’s residual components. It is further complicated in that two or more ideograph dictionaries may disagree under which particular radical an ideograph is classified. Ideographs classified under more than one radical may thus appear more than once in the radical-stroke indices.
#18.1.9 Mappings for Han Ideographs
The mappings defined by the IRG between the ideographs in the Unicode Standard and the IRG sources are specified in the Unihan Database. These mappings are considered to be normative parts of ISO/IEC 10646 and of the Unicode Standard; that is, the characters are defined to be the targets for conversion of these characters in these character set standards.
These mappings have been derived from editions of the source standards provided directly to the IRG by its member bodies, and they may not match mappings derived from the published editions of these standards. For this reason, developers may choose to use alternative mappings more directly correlated with published editions.
Specialized conversion systems may also choose more sophisticated mapping mechanisms—for example, semantic conversion, variant normalization, or conversion between simplified and traditional Chinese.
The Unicode Consortium also provides mapping information that extends beyond the normative mappings defined by the IRG. These additional mappings include mappings to character set standards included in the U source, including duplicate characters from KS C 5601-1987, mappings to portions of character set standards omitted from IRG sources, references to standard dictionaries, and suggested character/stroke counts.
#18.1.10 CJK Compatibility Ideographs: U+F900–U+FAFF
The Korean national standard KS C 5601-1987 (now known as KS X 1001:1998), which served as one of the primary source sets for the Unified CJK Ideograph Repertoire and Ordering, Version 2.0, contains 268 duplicate encodings of identical ideograph forms to denote alternative pronunciations. That is, in certain cases, the standard encodes a single character multiple times to denote different linguistic uses. This approach is like encoding the letter “a” five times to denote the different pronunciations it has in the words hat, able, art, father, and adrift. Because they are in all ways identical in shape to their nominal counterparts, they were excluded by the IRG from its sources. For round-trip conversion with KS C 5601-1987, they are encoded separately from the primary CJK Unified Ideographs block.
Another 34 ideographs from various regional and industry standards were encoded in this block, primarily to achieve round-trip conversion compatibility. Twelve of these ideographs (U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29) are not encoded in blocks for CJK unified ideographs. These 12 characters are not duplicates and should be treated as a small extension to the set of unified ideographs.
Except for the 12 unified ideographs just enumerated, CJK compatibility ideographs from this block are not used in Ideographic Description Sequences.
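One way to tell the 12 unified ideographs in this block apart from the true compatibility ideographs is that the former have no decomposition mapping. The sketch below uses Python's standard `unicodedata` module, whose results depend on the Unicode version bundled with the interpreter; the function name is hypothetical.

```python
import unicodedata


def undecomposable_in_compat_block():
    """Assigned code points in U+F900..U+FAFF with no decomposition mapping."""
    result = []
    for cp in range(0xF900, 0xFB00):
        ch = chr(cp)
        if unicodedata.name(ch, None) is None:
            continue  # unassigned code point in the block
        if unicodedata.decomposition(ch) == "":
            result.append(cp)
    return result
```

The remaining characters in the block carry canonical decompositions, so normalization replaces them with the corresponding unified ideograph; for example, NFC turns U+F900 into U+8C48.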
An additional 59 compatibility ideographs are found from U+FA30 to U+FA6A. They are included in the Unicode Standard to provide full round-trip compatibility with the ideographic repertoire of JIS X 0213:2000 and should not be used for any other purpose.
An additional three compatibility ideographs are encoded at the range U+FA6B to U+FA6D. They are included in the Unicode Standard to provide full round-trip compatibility with the ideographic repertoire of the Japanese television standard, ARIB STD-B24, and should not be used for any other purpose.
An additional 106 compatibility ideographs are encoded at the range U+FA70 to U+FAD9. They are included in the Unicode Standard to provide full round-trip compatibility with the ideographic repertoire of KPS 10721-2000. They should not be used for any other purpose.
The names for the compatibility ideographs are also algorithmically derived. Thus the name for the compatibility ideograph U+F900 is CJK COMPATIBILITY IDEOGRAPH-F900. See the formal definition of the Name property in Section 4.8, Name.
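The derivation amounts to appending the code point, in uppercase hexadecimal, to a fixed prefix. A minimal sketch, with a hypothetical function name, checked against the names reported by Python's `unicodedata` module:

```python
import unicodedata


def compat_ideograph_name(code_point):
    """Derive the name of a CJK compatibility ideograph from its code point."""
    return f"CJK COMPATIBILITY IDEOGRAPH-{code_point:04X}"


# The derived names agree with the character database shipped with Python.
assert compat_ideograph_name(0xF900) == unicodedata.name("\uF900")
assert compat_ideograph_name(0x2F800) == unicodedata.name(chr(0x2F800))
```

The function does not check that its argument is actually a compatibility ideograph; callers would need to restrict it to the relevant ranges.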
All of the compatibility ideographs in this block, except for the 12 unified ideographs, have standardized variation sequences defined in StandardizedVariants.txt. See the discussion in Section 23.4, Variation Selectors for more details.
#18.1.11 CJK Compatibility Supplement: U+2F800–U+2FA1D
The CJK Compatibility Ideographs Supplement block consists of additional compatibility ideographs required for round-trip compatibility with CNS 11643-1992, planes 3, 4, 5, 6, 7, and 15. They should not be used for any other purpose and, in particular, may not be used in Ideographic Description Sequences.
All of the additional compatibility ideographs in this block have standardized variation sequences defined in StandardizedVariants.txt. See the discussion in Section 23.4, Variation Selectors for more details.
#18.1.12 Kanbun: U+3190–U+319F
This block contains a set of Kanbun marks that are used in Japanese literary texts to indicate the Japanese reading order of Classical Chinese poetry and prose. These marks, named for the Japanese word for Chinese writing (漢文), occur particularly in Japanese educational and scholastic texts. They are typically written in an annotation style, placed interlinearly at the left side of each line of vertically rendered original Chinese text. Typesetting Kanbun text is inherently complex, requiring some form of markup and special handling to achieve the desired layout results.
Fourteen of the Kanbun marks, in the range U+3192 ㆒ IDEOGRAPHIC ANNOTATION ONE MARK through U+319F ㆟ IDEOGRAPHIC ANNOTATION MAN MARK, have compatibility decompositions to a corresponding CJK unified ideograph. These marks are merely special-purpose variants of those CJK unified ideographs, used with a specialized meaning and layout rules in Kanbun text. The way the glyphs are shown in the code charts at reduced size and raised above the baseline is intended to mimic their appearance as formatted for use in annotations. This appearance is the reason the compatibility mappings have been assigned the tag <super>. The compatibility mappings do not imply that these characters are appropriate for use as superscript forms in ordinary Chinese text; the preferred means for that purpose are text styles or markup in rich text. (See Section 22.4, Superscript and Subscript Symbols for more information.) Common practice for existing Japanese fonts that support these characters is to provide their glyphs at full size, with the expectation that the layout engine will scale and position them accordingly, per the layout specification for Kanbun text in JIS X 4051.
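The compatibility decompositions of the Kanbun marks can be inspected with Python's `unicodedata` module, as in the following sketch (the function name is hypothetical, and results depend on the Unicode version bundled with the interpreter):

```python
import unicodedata


def kanbun_decompositions():
    """Map each Kanbun mark that decomposes to (tag, target code point)."""
    out = {}
    for cp in range(0x3190, 0x31A0):
        mapping = unicodedata.decomposition(chr(cp))
        if mapping:
            # Each mapping here has the form "<tag> XXXX" with one target.
            tag, target = mapping.split()
            out[cp] = (tag, int(target, 16))
    return out
```

Because these are compatibility mappings, NFKC and NFKD normalization replace a Kanbun mark with the plain ideograph and lose the annotation semantics, whereas NFC and NFD leave it unchanged.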
#18.1.13 Symbols Derived from Han Ideographs
A number of symbols derived from Han ideographs can be found in other blocks. See “Enclosed CJK Letters and Months: U+3200–U+32FF,” “CJK Compatibility: U+3300–U+33FF,” and “Enclosed Ideographic Supplement: U+1F200–U+1F2FF” in Section 22.10, Enclosed and Square.
#18.1.14 Kangxi Radicals and CJK Radicals Supplement: U+2F00–U+2FD5, U+2E80–U+2EF3
The Unicode Standard includes two blocks of Han ideographic radicals that are commonly used to index ideograph dictionaries: the Kangxi Radicals block (U+2F00..U+2FD5), which contains the 214 radicals as used in the eighteenth-century Kangxi dictionary, and the CJK Radicals Supplement block (U+2E80..U+2EF3), which contains variant forms of some Kangxi radicals, either when they occur as ideograph components or in simplified form according to conventions in China and Japan.
The term radical comes from the Latin radix, which means “root,” and refers to the part of an ideograph under which it is classified in most ideograph dictionaries. See “Radical-Stroke Indices” earlier in this section for a more detailed discussion of how ideographic radicals are used in radical-stroke indices.
Nearly all of the characters in the Kangxi Radicals and CJK Radicals Supplement blocks are equivalent to ideographs in the CJK Unified Ideographs blocks, but should not be used interchangeably. (See the “Semantics” subsection below.) Radicals that have one form as an independent ideograph and another as part of an ideograph are generally encoded in both forms in the CJK Unified Ideographs blocks, such as U+6C34 水 and U+6C35 氵 for the radical meaning “water.” See the Equivalent_Unified_Ideograph property in the Unicode Character Database for mappings of nearly all characters in these blocks to equivalent ideographs in the CJK Unified Ideographs blocks.
#Standards. CNS 11643-1992 included a block of radicals separate from its ideograph block, which included 213 of the 214 Kangxi radicals. The missing radical is the 34th one, which is encoded as U+2F21 ⼡ KANGXI RADICAL GO in the Unicode Standard. Amendment 1 of the CNS 11643:2007 standard, which was published in 2023, appended the missing radical to this block, which now includes all 214 Kangxi radicals.
#Chinese and Non-Chinese Simplified Radicals. Chinese is not the only language whose writing system uses simplified radicals. Japanese, and to some extent Vietnamese, also make use of simplified radicals. Among the simplified radicals, a small number are shared by Chinese and non-Chinese languages, such as U+2EA6 ⺦ CJK RADICAL SIMPLIFIED HALF TREE TRUNK and U+2EE8 ⻨ CJK RADICAL SIMPLIFIED WHEAT. Others have separate Chinese and Japanese forms, such as U+2EEE ⻮ CJK RADICAL C-SIMPLIFIED TOOTH and U+2EED ⻭ CJK RADICAL J-SIMPLIFIED TOOTH. Some simplified radicals are not included in the CJK Radicals Supplement block, such as U+9F21 鼡, which is the Japanese simplified form of U+2FCF ⿏ KANGXI RADICAL RAT. See Table 18-8 for a complete treatment of Chinese simplified and non-Chinese simplified radicals, together with their equivalent unified ideographs.
Radical | Traditional Radical | Traditional Ideograph | Chinese Simplified Radical | Chinese Simplified Ideograph | Non-Chinese Simplified Radical | Non-Chinese Simplified Ideograph |
---|---|---|---|---|---|---|
182 | U+2FB5 ⾵ | U+98A8 風 | U+2EDB ⻛ | U+98CE 风 | | U+322C4 𲋄 |
208 | U+2FCF ⿏ | U+9F20 鼠 | | | | U+9F21 鼡 |
210 | U+2FD1 ⿑ | U+9F4A 齊 | U+2EEC ⻬ | U+9F50 齐 | U+2EEB ⻫ | U+6589 斉 |
211 | U+2FD2 ⿒ | U+9F52 齒 | U+2EEE ⻮ | U+9F7F 齿 | U+2EED ⻭ | U+6B6F 歯 |
212 | U+2FD3 ⿓ | U+9F8D 龍 | U+2EF0 ⻰ | U+9F99 龙 | U+2EEF ⻯ | U+7ADC 竜 |
212 | | | | | | U+31DE5 𱷥 |
213 | U+2FD4 ⿔ | U+9F9C 龜 | U+2EF3 ⻳ | U+9F9F 龟 | U+2EF2 ⻲ | U+4E80 亀 |
#Semantics. Characters in the CJK Radicals Supplement and Kangxi Radicals blocks should not be used as ideographs, because they have different properties and semantics. For example, U+2F00 ⼀ KANGXI RADICAL ONE should not be used in lieu of U+4E00 一 CJK UNIFIED IDEOGRAPH-4E00. The former is to be treated as a symbol, and the latter is to be treated as a word or a part of a word. Except in circumstances where it is necessary to make a semantic distinction between an ideograph in its role as a radical and the same ideograph in its role as an ideograph, the characters in the CJK Unified Ideographs blocks should be used instead of the characters in these blocks.
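The difference in properties can be observed with Python's `unicodedata` module. The sketch below compares U+2F00 with U+4E00; the helper names are hypothetical, and the block test uses only the assigned ranges quoted in this section. Note that the compatibility decomposition reported here is a separate mechanism from the Equivalent_Unified_Ideograph property, which the standard library does not expose.

```python
import unicodedata

# Assigned ranges of the two radical blocks, as given in this section.
RADICAL_RANGES = ((0x2E80, 0x2EF3), (0x2F00, 0x2FD5))


def in_radical_block(ch):
    """True if ch lies in the Kangxi Radicals or CJK Radicals Supplement range."""
    return any(lo <= ord(ch) <= hi for lo, hi in RADICAL_RANGES)


def describe(ch):
    """Collect the properties that separate a radical from an ideograph."""
    return {
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        "in_radical_block": in_radical_block(ch),
    }
```

`describe("\u2F00")` reports General_Category So (a symbol) with a `<compat>` mapping to U+4E00, whereas `describe("\u4E00")` reports Lo (a letter) with no decomposition. A process that applies NFKC will therefore fold the radical into the ideograph, so text that relies on the distinction should avoid compatibility normalization.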
#Representative Glyphs. The Kangxi Radicals block uses representative glyphs that closely adhere to the forms as found in the Kangxi dictionary itself, which are independent of any particular regional convention. However, the CJK Radicals Supplement block includes regional variants whose representative glyphs are appropriate for the region. For example, U+2EEB ⻫ CJK RADICAL J-SIMPLIFIED EVEN and U+2EEF ⻯ CJK RADICAL J-SIMPLIFIED DRAGON adhere to conventions as used in Japan.
#18.1.15 CJK Additions from HKSCS and GB 18030
Several characters have been encoded because of developments in HKSCS-2001 (the Hong Kong Supplementary Character Set) and GB 18030-2000 (the PRC National Standard). Both of these encoding standards were published with mappings to Unicode Private Use Area code points. PUA ideographic characters that could not be remapped to non-PUA CJK ideographs were added to the existing block of CJK Unified Ideographs. Fourteen new ideographs (U+9FA6..U+9FB3) were added from HKSCS, and eight multistroke ideographic components (U+9FB4..U+9FBB) were added from GB 18030.
To complete the mapping to these two Chinese standards, a number of non-ideographic characters were encoded elsewhere in the standard. In particular, two symbol characters from HKSCS were added to the existing Miscellaneous Technical block: U+23DA EARTH GROUND and U+23DB FUSE. A new block, CJK Strokes (U+31C0..U+31EF), was created and populated with a number of stroke symbols from HKSCS. Another block, Vertical Forms (U+FE10..U+FE1F), was created for vertical punctuation compatibility characters from GB 18030.
#18.1.16 CJK Strokes: U+31C0–U+31EF
Characters in the CJK Strokes block are single-stroke components of CJK ideographs. The first characters assigned to this block were 16 HKSCS–2001 PUA characters that had been excluded from CJK Unified Ideographs Extension B on the grounds that they were not true ideographs. Further additions consist of traditionally defined stroke types attested in the representative forms appearing in the Unicode CJK ideograph code charts or occurring in pre-unification source glyphs. See the Equivalent_Unified_Ideograph property in the Unicode Character Database for mappings of most CJK strokes to equivalent CJK unified ideographs.
CJK strokes are used with highly specific semantics (primarily to index ideographs), but they may lack the monosyllabic pronunciations and logographic functions typically associated with independent ideographs. The strokes in this block are single strokes of well-defined types. For more information about these strokes, see Appendix F, Documentation of CJK Strokes.
#18.1.17 Ideographic Symbols and Punctuation: U+16FE0–U+16FFF
The Ideographic Symbols and Punctuation block covers historic and less common symbols and punctuation associated with various ideographic scripts. Included, for example, are iteration marks for Tangut, Nüshu, and old Chinese, as well as reading marks associated with Vietnamese use of Han characters.
#18.2 Ideographic Description Characters
#18.2.1 Ideographic Description Characters: U+2FF0–U+2FFF
Although the Unicode Standard includes nearly 100,000 CJK unified ideographs, thousands of extremely rare CJK ideographs have nevertheless been left unencoded. Research into cataloging additional ideographs for encoding continues, but it is anticipated that at no point will the entire set of potential, encodable ideographs be completely exhausted. In particular, ideographs continue to be coined and such new coinages will invariably be unencoded.
The 16 characters in the Ideographic Description Characters block plus the additional Ideographic Description character encoded at U+31EF provide a mechanism for the standard interchange of text that must reference unencoded ideographs. Unencoded ideographs can be described using these characters and encoded ideographs; the reader can then create a mental picture of the ideographs from the description.
This process is different from a formal encoding of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase “an ‘e’ with an acute accent on it” than to the character sequence <U+0065, U+0301>.
In particular, support for the characters in the Ideographic Description Characters block does not require the rendering engine to recreate the graphic appearance of the described character.
Note also that many of the ideographs that users might represent using the Ideographic Description characters will be formally encoded in future versions of the Unicode Standard.
The Ideographic Description Algorithm depends on the fact that virtually all CJK ideographs can be broken down into smaller pieces that are themselves ideographs. The broad coverage of the ideographs already encoded in the Unicode Standard implies that the vast majority of unencoded ideographs can be represented using the Ideographic Description characters.
Although Ideographic Description Sequences are intended primarily to represent unencoded ideographs and should not be used in data interchange to represent encoded ideographs, they also have pedagogical and analytic uses. A researcher, for example, may choose to represent the character U+86D9 蛙 as “⿰虫圭” in a database to provide a link between it and other characters sharing its phonetic, such as U+5A03 娃. The IRG is using Ideographic Description Sequences in this fashion to help provide a first-approximation, machine-generated set of unifications for its current work.
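The analytic use described here can be sketched as a small lookup table keyed by character. The table `IDS_TABLE` and the function name are hypothetical, and the grouping assumes flat three-character sequences in which the last component carries the phonetic, which does not hold in general.

```python
from collections import defaultdict

# Hypothetical research table: encoded character -> descriptive IDS.
IDS_TABLE = {
    "蛙": "⿰虫圭",
    "娃": "⿰女圭",
    "仁": "⿰亻二",
}


def group_by_final_component(table):
    """Group characters by the last component of a flat, three-character IDS."""
    groups = defaultdict(list)
    for char, ids in table.items():
        groups[ids[-1]].append(char)
    # Sort each group by code point so the output is deterministic.
    return {component: sorted(chars) for component, chars in groups.items()}
```

With this table, 蛙 and 娃 end up in the same group under 圭, which is the kind of link a researcher might want to query.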
#Applicability to Other Scripts. The characters in the Ideographic Description Characters block were originally derived from a Chinese standard and were encoded for use specifically in describing CJK ideographs. As a result, the following detailed description of Ideographic Description Sequences is specified entirely in terms of CJK unified ideographs and CJK radicals. However, there are several large, historic East Asian scripts whose writing systems were heavily influenced by the Han script. Like the Han script, those siniform historic scripts, which include Tangut, Jurchen, and Khitan, are logographic in nature. Furthermore, they built up characters using radicals and components, and with side-by-side and top-to-bottom stacking very similar in structure to the way CJK ideographs are composed.
The general usefulness of Ideographic Description Sequences for describing unencoded characters and the applicability of the characters in the Ideographic Description Characters block to description of siniform logographs mean that the syntax for Ideographic Description Sequences can be generalized to extend to additional East Asian logographic scripts.
#Ideographic Description Sequences. Ideographic Description Sequences are defined by the following grammar. The list of characters associated with the Ideographic and Radical properties can be found in the Unicode Character Database. In particular, the Ideographic property is intended to apply to other siniform ideographic systems, in addition to CJK ideographs. Nüshu ideographs, Tangut ideographs, and Tangut components can also be used as elements of an Ideographic Description Sequence.
IDS := Ideographic | Radical | CJK_Stroke | Private Use | U+FF1F | IDS_UnaryOperator IDS | IDS_BinaryOperator IDS IDS | IDS_TrinaryOperator IDS IDS IDS

CJK_Stroke := U+31C0 | ... | U+31E5

IDS_UnaryOperator := U+2FFE | U+2FFF

IDS_BinaryOperator := U+2FF0 | U+2FF1 | U+2FF4 | ... | U+2FFD | U+31EF

IDS_TrinaryOperator := U+2FF2 | U+2FF3
Previous versions of the Unicode Standard imposed various limits on the length of a sequence or parts of it, and restricted the use of IDSes to CJK Unified Ideographs. Those limits and restrictions are no longer imposed by the standard. Although not formally proscribed by the syntax, it is not a good idea to mix scripts in any given Ideographic Description Sequence. For example, it is not meaningful to mix CJK ideographs or CJK radicals with Tangut ideographs or components in a single description.
The operators indicate the relative graphic positions of the operands running from left to right, from top to bottom, or from enclosure to enclosed. A user wishing to represent an unencoded ideograph will need to analyze its structure to determine how to describe it using an Ideographic Description Sequence. As a rule, it is best to use the natural radical-phonetic division for an ideograph if it has one; beyond that, the shortest possible Ideographic Description Sequence is preferred. However, there is no requirement that these rules be followed.
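The grammar is a prefix notation in which each operator fixes the number of operands that follow, so it lends itself to a short recursive-descent parser. The sketch below is illustrative only. In particular, `OPERAND_RANGES` is a deliberately incomplete stand-in for the Ideographic and Radical properties (it omits most supplementary extensions, Tangut, and Nüshu, and includes a few unassigned code points), and all function names are hypothetical.

```python
UNARY = {0x2FFE, 0x2FFF}
TRINARY = {0x2FF2, 0x2FF3}
BINARY = {0x2FF0, 0x2FF1, 0x31EF} | set(range(0x2FF4, 0x2FFE))

# Assumption: rough code point ranges standing in for the real properties.
OPERAND_RANGES = (
    (0x2E80, 0x2EF3),    # CJK Radicals Supplement
    (0x2F00, 0x2FD5),    # Kangxi Radicals
    (0x31C0, 0x31E5),    # CJK_Stroke
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xE000, 0xF8FF),    # Private Use (BMP only)
    (0xFF1F, 0xFF1F),    # U+FF1F
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def _arity(cp):
    """Number of operands an operator takes; 0 if cp is not an operator."""
    if cp in UNARY:
        return 1
    if cp in BINARY:
        return 2
    if cp in TRINARY:
        return 3
    return 0


def parse_ids(text, pos=0):
    """Parse one IDS starting at text[pos]; return (tree, next position)."""
    if pos >= len(text):
        raise ValueError("unexpected end of sequence")
    ch = text[pos]
    cp = ord(ch)
    arity = _arity(cp)
    if arity == 0:
        if not any(lo <= cp <= hi for lo, hi in OPERAND_RANGES):
            raise ValueError(f"U+{cp:04X} is not allowed in an IDS")
        return ch, pos + 1
    pos += 1
    children = []
    for _ in range(arity):
        child, pos = parse_ids(text, pos)
        children.append(child)
    return (ch, *children), pos


def is_valid_ids(text):
    """True if text is exactly one well-formed IDS under this sketch."""
    try:
        _, end = parse_ids(text)
    except ValueError:
        return False
    return end == len(text)
```

For example, `parse_ids("⿱井⿰虫圭")` yields the nested tuple `("⿱", "井", ("⿰", "虫", "圭"))`. The parser imposes no length limit and does not check that a sequence stays within one script.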
Figure 18-8 provides an example IDS for each of the IDCs, along with annotated versions of the IDCs that indicate the order of their operands.
IDC Code Point | IDC | Example Code Point | Example | | IDS |
---|---|---|---|---|---|
U+2FF0 | ⿰ | U+4EC1 | 仁 | → | ⿰ 亻二 |
U+2FF1 | ⿱ | U+5409 | 吉 | → | ⿱ 士口 |
U+2FF2 | ⿲ | U+8857 | 街 | → | ⿲ 彳圭亍 |
U+2FF3 | ⿳ | U+58F9 | 壹 | → | ⿳ 士冖豆 |
U+2FF4 | ⿴ | U+56DE | 回 | → | ⿴ 囗口 |
U+2FF5 | ⿵ | U+51F0 | 凰 | → | ⿵ 几皇 |
U+2FF6 | ⿶ | U+51F6 | 凶 | → | ⿶ 凵㐅 |
U+2FF7 | ⿷ | U+5321 | 匡 | → | ⿷ 匚王 |
U+2FF8 | ⿸ | U+4EC4 | 仄 | → | ⿸ 厂人 |
U+2FF9 | ⿹ | U+5F0F | 式 | → | ⿹ 弋工 |
U+2FFA | ⿺ | U+8D85 | 超 | → | ⿺ 走召 |
U+2FFB | ⿻ | U+5DEB | 巫 | → | ⿻ 工从 |
U+2FFC | ⿼ | U+355A | 㕚 | → | ⿼ 叉丶 |
U+2FFD | ⿽ | U+6C37 | 氷 | → | ⿽ 水丶 |
U+2FFE | ⿾ | U+23944 | 𣥄 | → | ⿾ 正 |
U+2FFF | ⿿ | U+20114 | 𠄔 | → | ⿿ 予 |
U+31EF | ㇯ | U+2002A | 𠀪 | → | ㇯ 其㇒ |
U+31EF | ㇯ | U+5187 | 冇 | → | ㇯ 有𠄠 |
In contrast to the other IDCs, most of which are used to combine components, U+31EF IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION is used to describe the removal (or “subtraction”) of a stroke (or more complex component) from a target character. Its first argument is the ideograph (or component) from which a piece is to be deleted, and the second argument is the stroke (or component) that is to be removed. If the target character lacks the stroke or component to be removed, the sequence has no meaning. The typical use case for U+31EF would be in describing the many historical instances of Han naming taboo characters that exhibit removal of a stroke in the character to avoid the given name of an emperor or an emperor's ancestor. It might also be used to describe modern neologisms, such as the characters for 乒乓 pīngpāng, derived by removal of one stroke each from 兵.
Figure 18-9 illustrates the use of the IDS grammar to provide descriptions of encoded or unencoded ideographs. Examples 9 through 14 illustrate more complex Ideographic Description Sequences showing the use of some of the less common operators.
#Equivalence. Many unencoded ideographs can be described in more than one way using this algorithm, either because the pieces of a description can themselves be broken down further (examples 1 through 3 in Figure 18-9) or because duplications appear within the Unicode Standard (examples 5 through 8 in Figure 18-9).
The Unicode Standard does not define equivalence for two Ideographic Description Sequences that are not identical. Figure 18-9 contains numerous examples illustrating how different Ideographic Description Sequences might be used to describe the same ideograph.
In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.
#Interaction with the Ideographic Variation Mark. U+303E IDEOGRAPHIC VARIATION INDICATOR (IVI) normally occurs before a CJK unified ideograph, but it may also be placed before an Ideographic Description Sequence to indicate that the description is merely an approximation of the ideograph desired. The IVI is not considered a part of the Ideographic Description Sequence and does not invalidate the sequence.
#Rendering. Ideographic Description characters are visible characters and are not to be treated as control characters. Thus the sequence U+2FF1 U+4E95 U+86D9 must have a distinct appearance from U+4E95 U+86D9.
An implementation may render a valid Ideographic Description Sequence either by rendering the individual characters separately or by parsing the Ideographic Description Sequence and drawing the ideograph so described. In the latter case, the Ideographic Description Sequence should be treated as a ligature of the individual characters for purposes of hit testing, cursor movement, and other user interface operations. (See Section 5.11, Editing and Selection.)
#Character Boundaries. Ideographic Description characters are not combining characters, and there is no requirement that they affect character or word boundaries. Thus U+2FF1 U+4E95 U+86D9 may be treated as a sequence of three characters or even three words.
Implementations of the Unicode Standard may choose to parse Ideographic Description Sequences when calculating word and character boundaries. Note that such a decision will make the algorithms involved significantly more complicated and slower.
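As a rough illustration of what such parsing involves, the following Python sketch finds where an Ideographic Description Sequence ends by counting the arguments each operator still needs. The function names `idc_arity` and `ids_end` are hypothetical, and the sketch makes a simplifying assumption: any character that is not an Ideographic Description character is accepted as an operand, whereas the actual IDS grammar restricts which characters may appear.

```python
# Arity of each Ideographic Description Character (IDC).
UNARY = {0x2FFE, 0x2FFF}
TRINARY = {0x2FF2, 0x2FF3}
# U+2FF0..U+2FFD except the two trinary operators, plus U+31EF (subtraction).
BINARY = (set(range(0x2FF0, 0x2FFE)) - TRINARY) | {0x31EF}


def idc_arity(ch: str) -> int:
    """Return the number of arguments an IDC takes, or 0 for a non-IDC."""
    cp = ord(ch)
    if cp in UNARY:
        return 1
    if cp in BINARY:
        return 2
    if cp in TRINARY:
        return 3
    return 0


def ids_end(text: str, start: int = 0) -> int:
    """Return the index just past the IDS that begins at text[start].

    Assumption: every non-IDC character counts as one operand.
    Raises ValueError if the text ends before the sequence is complete.
    """
    pending = 1  # operands still required to finish the sequence
    i = start
    while pending:
        if i >= len(text):
            raise ValueError("truncated Ideographic Description Sequence")
        # An operand fills one slot; an operator fills one and opens `arity`.
        pending += idc_arity(text[i]) - 1
        i += 1
    return i
```

With this sketch, `ids_end` treats U+2FF1 U+4E95 U+86D9 as one unit of three characters, while U+4E95 U+86D9 without the operator yields a unit of one character. The extra pass over the text is one concrete source of the added cost noted above.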
#Standards. Most of the Ideographic Description characters are found in GBK—an extension to GB 2312-80 that added all 20,902 Unicode Version 1.1 ideographs not already in GB 2312-80. GBK is defined as a normative annex of GB 13000.1-93.
#18.3 Bopomofo
#18.3.1 Bopomofo: U+3100–U+312F, U+31A0–U+31BF
Bopomofo constitute a set of characters used to annotate or teach the phonetics of Chinese, primarily the standard Mandarin language. These characters are used in dictionaries and teaching materials, but not in the actual writing of Chinese text. The formal Chinese names for this alphabet are Zhuyin-Zimu (“phonetic alphabet”) and Zhuyin-Fuhao (“phonetic symbols”), but the informal term “Bopomofo” (analogous to “ABCs”) provides a more serviceable English name and is also used in China. The Bopomofo were developed as part of a populist literacy campaign following the 1911 revolution; thus they are acceptable to all branches of modern Chinese culture, although in the People’s Republic of China their function has been largely taken over by the Pinyin romanization system.
Bopomofo is a hybrid writing system—part alphabet and part syllabary. The letters of Bopomofo are used to represent either the initial parts or the final parts of a Chinese syllable. The initials are just consonants, as for an alphabet. The finals constitute either simple vowels, vocalic diphthongs, or vowels plus nasal consonant combinations. Because a number of Chinese syllables have no initial consonant, the Bopomofo letters for finals may constitute an entire syllable by themselves. More typically, a Chinese syllable is represented by one initial consonant letter, followed by one final letter. In some instances, a third letter is used to indicate a complex vowel nucleus for the syllable. For example, the syllable that would be written luan in Pinyin is segmented l-u-an in Bopomofo—that is, <U+310C, U+3128, U+3122>.
#Standards. The standard Mandarin set of Bopomofo is included in the People’s Republic of China standards GB 2312 and GB 18030, and in the Republic of China (Taiwan) standard CNS 11643.
#Mandarin Tone Marks. Small modifier letters used to indicate the five Mandarin tones are part of the Bopomofo system. In the Unicode Standard they have been unified into the Modifier Letter range, as shown in Table 18-9.
Tone | Character |
---|---|
first tone | U+02C9 MODIFIER LETTER MACRON |
second tone | U+02CA MODIFIER LETTER ACUTE ACCENT |
third tone | U+02C7 CARON |
fourth tone | U+02CB MODIFIER LETTER GRAVE ACCENT |
light tone | U+02D9 DOT ABOVE |
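The mapping in Table 18-9 can be written down directly as a lookup table. The Python sketch below is illustrative only: the names `TONE_MARKS` and `with_tone` are hypothetical, and it assumes the tone mark is stored immediately after the Bopomofo letters of the syllable, which is one possible plain-text convention rather than a requirement of the standard.

```python
# Mandarin tone marks from Table 18-9, keyed by tone number (5 = light tone).
TONE_MARKS = {
    1: "\u02C9",  # MODIFIER LETTER MACRON
    2: "\u02CA",  # MODIFIER LETTER ACUTE ACCENT
    3: "\u02C7",  # CARON
    4: "\u02CB",  # MODIFIER LETTER GRAVE ACCENT
    5: "\u02D9",  # DOT ABOVE
}


def with_tone(syllable: str, tone: int) -> str:
    """Append the tone mark for `tone` to a string of Bopomofo letters.

    Assumption: the mark follows the letters in storage order.
    """
    return syllable + TONE_MARKS[tone]


# The syllable "luan" from the example above, <U+310C, U+3128, U+3122>,
# here given the fourth tone.
LUAN_FOURTH = with_tone("\u310C\u3128\u3122", 4)
```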
#Standard Mandarin Bopomofo. The order of the Mandarin Bopomofo letters U+3105..U+3129 is standard worldwide. The code offset of the first letter U+3105 BOPOMOFO LETTER B from a multiple of 16 is included to match the offset in the ISO-registered standard GB 2312.
#Extended Bopomofo. To represent the sounds of Chinese dialects other than Mandarin, the basic Bopomofo set U+3105..U+3129 has been augmented by additional phonetic characters. These extensions are much less broadly recognized than the basic Mandarin set. The three extended Bopomofo characters U+312A..U+312C are cited in some standard reference works, such as the encyclopedia Xin Ci Hai. Another set of 24 extended Bopomofo, encoded at U+31A0..U+31B7, was designed in 1948 to cover additional sounds of the Minnan and Hakka dialects. The extensions are used together with the main set of Bopomofo characters to provide a complete phonetic orthography for those dialects. The four characters encoded at U+31BC..U+31BF were designed to represent additional sounds found in Cantonese.
The small characters encoded at U+31B4..U+31B7 and U+31BB represent syllable-final consonants not present in standard Mandarin or in Mandarin dialects. They have the same shapes as Bopomofo “b”, “d”, “k”, “h”, and “g,” respectively, but are rendered in a smaller form than the initial consonants; they are also generally shown close to the syllable medial vowel character. These final letters are encoded separately so that the Minnan and Hakka dialects can be represented unambiguously in plain text without having to resort to subscripting or other fancy text mechanisms to represent the final consonants. In Cantonese, final consonants not covered by the set of standard Bopomofo rhymes ending in -n or -ng are instead represented by full-sized letters for “p”, “t”, “k”, “m”, “n”, “ng”.
Three Bopomofo letters for sounds found in non-Chinese languages are encoded in the range U+31B8..U+31BA. These characters are used in the Hmu and Ge languages, members of the Hmong-Mien (or Miao-Yao) language family, spoken primarily in southeastern Guizhou. The characters are part of an obsolete orthography for Hmu and Ge devised by the missionary Maurice Hutton in the 1920s and 1930s. A small group of Hmu Christians are still using a hymnal text written by Hutton that contains these characters.
U+312E ㄮ BOPOMOFO LETTER O WITH DOT ABOVE, which was initially thought to be a CJK Unified Ideograph because it appears in Japan’s Dai Kan-Wa Jiten as a kanji, is the original form of U+311C ㄜ BOPOMOFO LETTER E. The Mandarin sound “e” was originally written as U+311B ㄛ BOPOMOFO LETTER O with a dot above. This dotted form was later replaced by a new character that uses a vertical stroke instead of a dot, which is U+311C ㄜ BOPOMOFO LETTER E.
#Extended Bopomofo Tone Marks. In addition to the Mandarin tone marks enumerated in Table 18-9, other tone marks appropriate for use with the extended Bopomofo transcriptions of Minnan and Hakka can be found in the Modifier Letter range, as shown in Table 18-10. The “departing tone” refers to the qusheng in traditional Chinese tonal analysis, with the yin variant historically derived from voiceless initials and the yang variant from voiced initials. Southern Chinese dialects in general maintain more tonal distinctions than Mandarin does.
Tone | Character |
---|---|
yin departing tone | U+02EA MODIFIER LETTER YIN DEPARTING TONE MARK |
yang departing tone | U+02EB MODIFIER LETTER YANG DEPARTING TONE MARK |
#Rendering of Bopomofo. Bopomofo is rendered from left to right in horizontal text, but also commonly appears in vertical text. It may be used by itself in either orientation, but typically appears in interlinear annotation of Chinese (Han character) text. Children’s books are often completely annotated with Bopomofo pronunciations for every character. This interlinear annotation is structurally quite similar to the system of Japanese ruby annotation, but it has additional complications that result from the explicit usage of tone marks with the Bopomofo letters.
U+3127 BOPOMOFO LETTER I has notable variation in rendering in horizontal and vertical layout contexts. In traditional typesetting, the stroke of the glyph was chosen to stand perpendicular to the writing direction. In that practice, the glyph is shown as a horizontal stroke in vertically set text, and as a vertical stroke in horizontally set text. However, modern digital typography has changed this practice. All modern fonts use a horizontal stroke glyph for U+3127, and that form is generally used in both horizontal and vertical layout contexts. In the Unicode Standard, the form in the charts follows the modern practice, showing a horizontal stroke for the glyph; the vertical stroke form is considered to be an occasionally occurring variant. Earlier versions of the standard followed traditional typographic practice, and showed a vertical stroke glyph in the charts.
In horizontal interlineation, the Bopomofo is generally placed above the corresponding Han character(s); tone marks, if present, appear at the end of each syllabic group of Bopomofo letters. In vertical interlineation, the Bopomofo is generally placed on the right side of the corresponding Han character(s); tone marks, if present, appear in a separate interlinear row to the right side of the vowel letter. When using extended Bopomofo for Minnan and Hakka, the tone marks may also be mixed with European digits 0–9 to express changes in actual tonetic values resulting from juxtaposition of basic tones.
#18.4 Hiragana and Katakana
#18.4.1 Hiragana: U+3040–U+309F
Hiragana is the cursive syllabary used to write Japanese words phonetically and to write sentence particles and inflectional endings. It is also commonly used to indicate the pronunciation of Japanese words. Hiragana syllables are phonetically equivalent to the corresponding Katakana syllables.
#Standards. The Hiragana block is based on the JIS X 0208-1990 standard, extended by the nonstandard syllable U+3094 HIRAGANA LETTER VU, which is included in some Japanese corporate standards. Some additions are based on the JIS X 0213:2000 standard.
#Combining Marks. Hiragana and the related script Katakana use U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK and U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK to generate voiced and semivoiced syllables from the base syllables, respectively. All common precomposed combinations of base syllable forms using these marks are already encoded as characters, and use of these precomposed forms is the predominant JIS usage. These combining marks must follow the base character to which they apply. Because most implementations and JIS standards treat these marks as spacing characters, the Unicode Standard contains two corresponding noncombining (spacing) marks at U+309B and U+309C.
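The relationship between a base syllable followed by a combining mark and its precomposed form can be checked with canonical normalization. The following Python sketch uses the standard `unicodedata` module; the helper name `precompose` is hypothetical, and the two syllables are chosen only as examples.

```python
import unicodedata

VOICED = "\u3099"       # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
SEMI_VOICED = "\u309A"  # COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK


def precompose(text: str) -> str:
    """Replace base + combining mark pairs with precomposed forms (NFC)."""
    return unicodedata.normalize("NFC", text)


# The combining mark follows the base character to which it applies.
GA = precompose("\u304B" + VOICED)       # HIRAGANA LETTER KA + voiced mark
PA = precompose("\u306F" + SEMI_VOICED)  # HIRAGANA LETTER HA + semi-voiced mark

# NFD goes the other way, from the precomposed form to base + mark.
GA_DECOMPOSED = unicodedata.normalize("NFD", "\u304C")
```

Here `GA` is U+304C HIRAGANA LETTER GA and `PA` is U+3071 HIRAGANA LETTER PA.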
#Iteration Marks. The two characters U+309D HIRAGANA ITERATION MARK and U+309E HIRAGANA VOICED ITERATION MARK are punctuation-like characters that denote the iteration (repetition) of a previous syllable according to whether the repeated syllable has an unvoiced or voiced consonant, respectively.
#Vertical Text Digraph. U+309F HIRAGANA DIGRAPH YORI is a digraph form which was historically used in vertical display contexts, but which is now also found in horizontal layout.
#18.4.2 Katakana: U+30A0–U+30FF
Katakana is the noncursive syllabary used to write non-Japanese (usually Western) words phonetically in Japanese. It is also used to write Japanese words with visual emphasis. Katakana syllables are phonetically equivalent to corresponding Hiragana syllables. Katakana contains two characters, U+30F5 KATAKANA LETTER SMALL KA and U+30F6 KATAKANA LETTER SMALL KE, that are used in special Japanese spelling conventions (for example, the spelling of place names that include archaic Japanese connective particles).
#Standards. The Katakana block is based on the JIS X 0208-1990 standard. Some additions are based on the JIS X 0213:2000 standard.
#Punctuation-like Characters. U+30FB KATAKANA MIDDLE DOT is used to separate words when writing non-Japanese phrases. U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN is a delimiter occasionally used in analyzed Katakana or Hiragana textual material.
U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK is used predominantly with Katakana and occasionally with Hiragana to denote a lengthened vowel of the previously written syllable. The two iteration marks, U+30FD KATAKANA ITERATION MARK and U+30FE KATAKANA VOICED ITERATION MARK, serve the same function in Katakana writing that the two Hiragana iteration marks serve in Hiragana writing.
#Vertical Text Digraph. U+30FF KATAKANA DIGRAPH KOTO is a digraph form which was historically used in vertical display contexts, but which is now also found in horizontal layout.
#18.4.3 Katakana Phonetic Extensions: U+31F0–U+31FF
These extensions to the Katakana syllabary are all “small” variants. They are used in Japan for phonetic transcription of Ainu and other languages. They may be used in combination with U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK and U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK to indicate modification of the sounds represented.
#Standards. The Katakana Phonetic Extensions block is based on the JIS X 0213:2000 standard.
#18.4.4 Small Kana Extension: U+1B130–U+1B16F
The Small Kana Extension block contains additional small variants for the Hiragana syllabary and the Katakana syllabary. A significant number of these small variant kana are attested from sources, which include phonetic transcription of non-Japanese terms in musical scores, maps showing place names, and other documents. The small kana variants currently included in this block cover the best attested subset, including forms used in Old Japanese. They are ordered so that gaps in the code chart may be filled in with further small variant kana, when their attestations are better documented.
#18.4.5 Kana Supplement: U+1B000–U+1B0FF
#Kana Extended-A: U+1B100–U+1B12F
The Kana Supplement and Kana Extended-A blocks are intended for the encoding of historic and variant forms of Japanese kana characters, including those variants collectively known as hentaigana (variant shaped kana) in Japanese.
The character U+1B000 KATAKANA LETTER ARCHAIC E is an obsolete form of U+30A8 KATAKANA LETTER E, which has not been used in Japanese orthography for about one thousand years. In its pre-10th century use, this character represented the syllable “e”, and U+30A8 KATAKANA LETTER E represented the syllable “ye”. The character U+1B001 HIRAGANA LETTER ARCHAIC YE was originally encoded to represent a long-obsolete syllable that would have come between U+3086 HIRAGANA LETTER YU and U+3088 HIRAGANA LETTER YO. This syllable merged with “e”, which is now represented by U+3048 HIRAGANA LETTER E. These relationships are illustrated in Figure 18-10.
The hentaigana 𛀁, which would have been named HENTAIGANA LETTER E-1, has been unified with the existing U+1B001 HIRAGANA LETTER ARCHAIC YE and is aliased accordingly. When sorting, U+1B001 HIRAGANA LETTER ARCHAIC YE should appear between U+1B00E HENTAIGANA LETTER U-5 and U+1B00F HENTAIGANA LETTER E-2.
The 285 remaining characters in these blocks are additional hentaigana that represent obsolete or nonstandard hiragana that were in use in Japan up until the script reform of 1900 that standardized the use of a single character for each syllable. Hentaigana are still in use today in Japan, but are limited to Japan’s family registry (koseki in Japanese) and specialized uses, such as business signage and other decor that are specifically designed to convey a feeling of nostalgia or traditional charm.
Each hentaigana is associated with a single parent unified ideograph, a cursive form of which served as the basis for its shape, and generally corresponds to a single syllable. Hentaigana that correspond to the same syllable, but that do not share the same parent unified ideograph have different shapes and are therefore encoded separately. For example, U+1B006 HENTAIGANA LETTER I-1 through U+1B009 HENTAIGANA LETTER I-4 all correspond to the same syllable i (U+3044 HIRAGANA LETTER I), but have parent unified ideographs U+4EE5 以, U+4F0A 伊, U+610F 意, and U+79FB 移, respectively, as shown in Figure 18-11.
Some hentaigana that correspond to the same syllable and share the same parent unified ideograph are also encoded separately because they have different shapes. For example, U+1B080 HENTAIGANA LETTER NA-3 through U+1B082 HENTAIGANA LETTER NA-5 correspond to the same syllable na (U+306A HIRAGANA LETTER NA) and share the same parent unified ideograph U+5948 奈, as shown in Figure 18-12.
A small number of hentaigana that share the same parent unified ideograph are associated with two or three different syllables reflected in their names, such as U+1B07D HENTAIGANA LETTER TO-RA that is associated with the syllables to (U+3068 HIRAGANA LETTER TO) and ra (U+3089 HIRAGANA LETTER RA), and U+1B11D HENTAIGANA LETTER N-MU-MO-1 that is associated with the syllables n (U+3093 HIRAGANA LETTER N), mu (U+3080 HIRAGANA LETTER MU), and mo (U+3082 HIRAGANA LETTER MO). Their parent unified ideographs are U+7B49 等 and U+65E0 无, respectively. These associations are also illustrated in Figure 18-12.
#18.4.6 Kana Extended-B: U+1AFF0–U+1AFFF
The Kana Extended-B block encodes tone marks used alongside furigana to annotate Minnan languages in an orthography known in Japanese as Taiwanese kana (台湾語仮名, taiwango kana). These character forms date back to the work of the Japanese linguist Naoyoshi Ogawa (小川尚義) in the early 20th century.
These characters are not, however, a mere historical curiosity. The linguist Âng Ûi-jîn (洪惟仁) produced a dictionary using them as recently as 1993, the Tâi-ji̍t Tōa Sû-tián (臺日大辭典).
The kana with their tone marks appear historically as interlinear annotations to the right of each ideographic character in vertical text. They are not historically attested in horizontally typeset documents. The tone marks in this block appear to the right of the kana characters, in some ways similar to the rendering of tone marks with Bopomofo characters, described in Section 18.3, Bopomofo. At most one tone mark from this block appears to the right of each syllabic group. Marks from other blocks may also appear above or below the kana.
The orthography contains two diacritics, which represent various sound changes depending on the Minnan language being annotated. U+0323 COMBINING DOT BELOW is used to represent the aspiration mark (送氣符, sàng-khì hû). U+0305 COMBINING OVERLINE is used to represent the line above which, depending on the dialect of Minnan being annotated, results in various sound changes (発音符, huat-im hû). COMBINING OVERLINE may occur over both the small and large versions of the vowels ウ and オ when the Quanzhou dialect (泉州話, Choân-chiu-oē) is being annotated.
Figure 18-13 shows an example with the annotated Minnan phrase 恬恬聽, which means “quietly listening,” typeset in vertical interlineation. Such interlinear text cannot be represented directly in plain text; higher level protocols must render the ideographic block characters and the furigana in separate runs. For this and subsequent examples, the CJK ideograph sequence is <606C 606C 807D>. The furigana annotation sequence in each case is <30C1 0305 30A1 30E0 1AFF5 30C1 0305 30A1 30E0 1AFF5 30C1 0305 0323 30A2 1AFF7>. The dialect of Minnan affects the annotation, so this is but one possible annotation of 恬恬聽, from a Taiwanese textbook for teaching Japanese published in 1902.
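Because the pairing of ideographs with their furigana is left to a higher-level protocol, an implementation needs some way to split the annotation into per-ideograph runs. The Python sketch below shows one possible approach for this particular example. All function names are hypothetical. It assumes that each syllabic group ends with a tone mark, which holds for the sequence given here but is not guaranteed in general, and it treats every code point in the block range U+1AFF0..U+1AFFF as a tone mark.

```python
def from_code_points(notation: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in notation.split())


def is_tone_mark(ch: str) -> bool:
    """True for code points in the Kana Extended-B block range."""
    return 0x1AFF0 <= ord(ch) <= 0x1AFFF


def syllabic_groups(annotation: str) -> list[str]:
    """Split an annotation after each tone mark (assumed to end a group)."""
    groups, current = [], ""
    for ch in annotation:
        current += ch
        if is_tone_mark(ch):
            groups.append(current)
            current = ""
    if current:  # trailing kana without a tone mark
        groups.append(current)
    return groups


HAN = from_code_points("606C 606C 807D")
FURIGANA = from_code_points(
    "30C1 0305 30A1 30E0 1AFF5 "
    "30C1 0305 30A1 30E0 1AFF5 "
    "30C1 0305 0323 30A2 1AFF7"
)
# One (ideograph, annotation run) pair per base character.
PAIRS = list(zip(HAN, syllabic_groups(FURIGANA)))
```

For this example the split yields three runs, one for each of the three ideographs, which a ruby-style markup layer could then attach to the base text.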
In non-interlinear vertical katakana text, the tone marks once again appear to the right side of the katakana, as shown in Figure 18-14. Historically, they were most often used for this purpose in pedagogical materials.
As most modern CJK documents are horizontally typeset, it may be convenient to include these furigana in horizontal interlineation. However, as there are neither historic nor widely accepted forms of the tone mark characters when displayed above ideographic characters, rather than to their right, the furigana may be rendered as if the text were vertical, but with the ideographic characters being written in horizontal order, as in Figure 18-15.
In non-interlinear horizontal text the recommended presentation is to display the tone marks after the katakana syllables, as shown in Figure 18-16. Horizontal text which uses Kana Extended-B characters is ahistorical, but still extant, as modern CJK languages are often written horizontally.
The characters of the Kana Extended-B block only annotate regular, fullwidth katakana characters. There are no historical examples of the annotation of halfwidth forms of katakana found in the block Halfwidth and Fullwidth Forms.
#18.5 Halfwidth and Fullwidth Forms
#18.5.1 Halfwidth and Fullwidth Forms: U+FF00–U+FFEF
In the context of East Asian coding systems, a double-byte character set (DBCS), such as JIS X 0208-1990 or KS X 1001:1998, is generally used together with a single-byte character set (SBCS), such as ASCII or a variant of ASCII. Text that is encoded with both a DBCS and SBCS is typically displayed such that the glyphs representing DBCS characters occupy two display cells—where a display cell is defined in terms of the glyphs used to display the SBCS (ASCII) characters. In these systems, the two-display-cell width is known as the fullwidth or zenkaku form, and the one-display-cell width is known as the halfwidth or hankaku form. While zenkaku and hankaku are Japanese terms, the display-width concepts apply equally to Korean and Chinese implementations.
Because of this mixture of display widths, certain characters often appear twice—once in fullwidth form in the DBCS repertoire and once in halfwidth form in the SBCS repertoire. To achieve round-trip conversion compatibility with such mixed-width encoding systems, it is necessary to encode both fullwidth and halfwidth forms of certain characters. This block consists of the additional forms needed to support conversion for existing texts that employ both forms.
In the context of conversion to and from such mixed-width encodings, all characters in the General Scripts Area should be construed as halfwidth (hankaku) characters if they have a fullwidth equivalent elsewhere in the standard or if they do not occur in the mixed-width encoding; otherwise, they should be construed as fullwidth (zenkaku). Specifically, most characters in the CJK Miscellaneous Area and the CJKV Ideograph Area, along with the characters in the CJK Compatibility Ideographs, CJK Compatibility Forms, and Small Form Variants blocks, should be construed as fullwidth (zenkaku) characters. For a complete description of the East Asian Width property, see Unicode Standard Annex #11, “East Asian Width.”
The characters in this block consist of fullwidth forms of the ASCII block (except SPACE), certain characters of the Latin-1 Supplement, and some currency symbols. In addition, this block contains halfwidth forms of the Katakana and Hangul Compatibility Jamo characters. Finally, a number of symbol characters are replicated here (U+FFE8..U+FFEE) with explicit halfwidth semantics.
#Unifications. The fullwidth form of U+0020 SPACE is unified with U+3000 IDEOGRAPHIC SPACE.
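The fullwidth forms of the ASCII characters U+0021..U+007E are encoded in the same order at U+FF01..U+FF5E, so conversion in that direction is a fixed offset, with SPACE handled separately because of the unification above. The following Python sketch illustrates this; the function names are hypothetical. For the reverse direction it uses NFKC normalization, which does fold width but also applies every other compatibility mapping, so it is shown only as an approximation of width folding.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFF01 - 0x21  # distance from "!" to U+FF01


def to_fullwidth(text: str) -> str:
    """Map ASCII characters to their fullwidth counterparts."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x20:
            out.append("\u3000")  # SPACE is unified with IDEOGRAPHIC SPACE
        elif 0x21 <= cp <= 0x7E:
            out.append(chr(cp + FULLWIDTH_OFFSET))
        else:
            out.append(ch)  # no fullwidth form in this sketch
    return "".join(out)


def fold_width_nfkc(text: str) -> str:
    """Fold fullwidth and halfwidth forms using NFKC.

    Caution: NFKC also changes characters unrelated to width.
    """
    return unicodedata.normalize("NFKC", text)


WIDE = to_fullwidth("A b")
# The East_Asian_Width property distinguishes the forms:
# "F" is fullwidth, "H" is halfwidth.
WIDTH_OF_FULLWIDTH_A = unicodedata.east_asian_width("\uFF21")
WIDTH_OF_HALFWIDTH_KA = unicodedata.east_asian_width("\uFF76")
```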
#18.6 Hangul
Korean Hangul may be considered a featural syllabic script. As opposed to many other syllabic scripts, the syllables are formed from a set of alphabetic components in a regular fashion. These alphabetic components are called jamo.
The name Hangul itself is just one of several terms that may be used to refer to the script. In some contexts, the preferred term is simply the generic Korean characters. Hangul is used more frequently in South Korea, whereas a basically synonymous term Choseongul is preferred in North Korea. A politically neutral term, Jeongum, may also be used.
The Unicode Standard contains both the complete set of precomposed modern Hangul syllable blocks and a set of conjoining Hangul jamo. The conjoining Hangul jamo can be used to represent all of the modern Hangul syllable blocks, as well as the obsolete syllable blocks composed of at least one Hangul jamo that the Korean orthographic standard in 1933 excluded from modern use. For a description of conjoining jamo behavior and precomposed Hangul syllables, see Section 3.12, Conjoining Jamo Behavior. For a discussion of the interaction of combining marks with jamo and Hangul syllables, see “Combining Marks and Korean Syllables” in Section 3.6, Combination. Note that the representation of Old Korean requires two combining tone marks for Hangul, U+302E and U+302F.
For other blocks containing characters related to Hangul, see “Enclosed CJK Letters and Months: U+3200–U+32FF” and “CJK Compatibility: U+3300–U+33FF” in Section 22.10, Enclosed and Square, as well as Section 18.5, Halfwidth and Fullwidth Forms.
#18.6.1 Hangul Jamo: U+1100–U+11FF
The Hangul Jamo block contains the most frequently used conjoining jamo. These include all of the jamo used in modern Hangul syllable blocks, as well as many of the jamo for Old Korean.
The Hangul jamo are divided into three classes: choseong (leading consonants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). Each class may, in turn, consist of one to three subunits. For example, a choseong syllable-initial character may either represent a single consonant sound, or a consonant cluster consisting of two or three consonant sounds. Likewise, a jungseong syllable-peak character may represent a simple vowel sound, or a complex diphthong or triphthong with onglide or offglide sounds. Each of these complex sequences of two or three sounds is encoded as a single conjoining jamo character. Therefore, a complete Hangul syllable can always be conceived of as a single choseong followed by a single jungseong and (optionally) a single jongseong.
This block also contains two invisible filler characters which act as placeholders for a missing choseong or jungseong in an incomplete syllable. These filler characters are U+115F HANGUL CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER.
#18.6.2 Hangul Jamo Extended-A: U+A960–U+A97F
This block is an extension of the conjoining jamo. It contains additional complex leading consonants (choseong) needed to complete the set of conjoining jamo for the representation of Old Korean.
#18.6.3 Hangul Jamo Extended-B: U+D7B0–U+D7FF
This block is an extension of the conjoining jamo. It contains additional complex vowels (jungseong) and trailing consonants (jongseong) needed to complete the set of conjoining jamo for the representation of Old Korean.
#18.6.4 Hangul Compatibility Jamo: U+3130–U+318F
This block consists of spacing, nonconjoining Hangul consonant and vowel (jamo) elements. These characters are provided solely for compatibility with the KS X 1001:1998 standard. Unlike the characters found in the Hangul Jamo block (U+1100..U+11FF), the jamo characters in this block have no conjoining semantics.
The characters of this block are considered to be fullwidth forms in contrast with the halfwidth Hangul compatibility jamo found at U+FFA0..U+FFDF.
#Standards. The Unicode Standard follows KS X 1001:1998 for Hangul Jamo elements.
#Normalization. When Hangul compatibility jamo are transformed with a compatibility normalization form, NFKD or NFKC, the characters are converted to the corresponding conjoining jamo characters. Where the characters are intended to remain in separate syllables after such transformation, they may require separation from adjacent characters. This separation can be achieved by inserting any non-Korean character.
- U+200B ZERO WIDTH SPACE is recommended where the characters are to allow a line break.
- U+2060 WORD JOINER can be used where the characters are not to break across lines.
Table 18-11 illustrates how two Hangul compatibility jamo can be separated in display, even after transforming them with NFKD or NFKC.
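The effect can be reproduced with any normalization library. The Python sketch below uses the standard `unicodedata` module; the constant names are illustrative, and the two jamo are chosen only as an example pair.

```python
import unicodedata

KIYEOK = "\u3131"  # HANGUL LETTER KIYEOK (compatibility jamo)
A = "\u314F"       # HANGUL LETTER A (compatibility jamo)
ZWSP = "\u200B"    # ZERO WIDTH SPACE, allows a line break
WJ = "\u2060"      # WORD JOINER, prevents a line break


def nfkc(text: str) -> str:
    """Apply the NFKC compatibility normalization form."""
    return unicodedata.normalize("NFKC", text)


# Without a separator, the two jamo become conjoining jamo and then
# compose into a single precomposed syllable.
JOINED = nfkc(KIYEOK + A)

# With a separator, the conjoining jamo remain apart.
SEPARATED_BREAKING = nfkc(KIYEOK + ZWSP + A)
SEPARATED_NONBREAKING = nfkc(KIYEOK + WJ + A)
```

`JOINED` is the single character U+AC00, whereas both separated results keep U+1100 and U+1161 on either side of the inserted character.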
#18.6.5 Hangul Syllables: U+AC00–U+D7AF
The Hangul script used in the Korean writing system consists of individual consonant and vowel letters (jamo) that are visually combined into square display cells to form entire syllable blocks. Hangul syllables may be encoded directly as precomposed combinations of individual jamo or as decomposed sequences of conjoining jamo.
Modern Hangul syllable blocks can be expressed with either two or three jamo, either in the form consonant + vowel or in the form consonant + vowel + consonant. There are 19 possible leading (initial) consonants (choseong), 21 vowels (jungseong), and 27 trailing (final) consonants (jongseong). Thus there are 399 possible two-jamo syllable blocks and 10,773 possible three-jamo syllable blocks, giving a total of 11,172 modern Hangul syllable blocks. This collection of 11,172 modern Hangul syllables encoded in this block is known as the Johab set.
#Standards. The Hangul syllables are taken from KS C 5601-1992, representing the full Johab set. This group represents a superset of the Hangul syllables encoded in earlier versions of Korean standards (KS C 5601-1987 and KS C 5657-1991).
#Equivalence. Each of the Hangul syllables encoded in this block may be represented by an equivalent sequence of conjoining jamo. The converse is not true because thousands of archaic Hangul syllables may be represented only as a sequence of conjoining jamo.
#Hangul Syllable Composition. The Hangul syllables can be derived from conjoining jamo by a regular process of composition. The algorithm that maps a sequence of conjoining jamo to the encoding point for a Hangul syllable in the Johab set is detailed in Section 3.12, Conjoining Jamo Behavior.
#Hangul Syllable Decomposition. Any Hangul syllable from the Johab set can be decomposed into a sequence of conjoining jamo characters. The algorithm that details the formula for decomposition is also provided in Section 3.12, Conjoining Jamo Behavior.
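As an informal illustration of the arithmetic behind composition and decomposition, the Python sketch below follows the counts given earlier in this section (19 leading consonants, 21 vowels, and 27 trailing consonants plus the case of no trailing consonant). The function names `compose` and `decompose` are hypothetical; Section 3.12 remains the authoritative statement of the algorithm.

```python
S_BASE = 0xAC00  # first precomposed Hangul syllable
L_BASE = 0x1100  # first modern choseong
V_BASE = 0x1161  # first modern jungseong
T_BASE = 0x11A7  # one before the first modern jongseong
L_COUNT, V_COUNT = 19, 21
T_COUNT = 28     # 27 trailing consonants + "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT  # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # total modern syllables


def compose(lead: str, vowel: str, trail: str = "") -> str:
    """Map modern conjoining jamo to the precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("not a modern choseong/jungseong pair")
    if trail and not 1 <= t_index < T_COUNT:
        raise ValueError("not a modern jongseong")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose(syllable: str) -> str:
    """Map a precomposed syllable to its conjoining jamo sequence."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail
```

For example, `compose("\u1112", "\u1161", "\u11AB")` gives U+D55C, and `decompose("\uD55C")` returns the same three jamo. The value of `S_COUNT` matches the 11,172 syllables of the Johab set.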
#Hangul Syllable Name. The character names for Hangul syllables are derived algorithmically from the decomposition. (For full details, see Section 3.12, Conjoining Jamo Behavior.)
#Hangul Syllable Representative Glyph. The representative glyph for a Hangul syllable can be formed from its decomposition based on the categorization of vowels shown in Table 18-12.
Vertical | | Horizontal | | Both | |
---|---|---|---|---|---|
1161 | A | 1169 | O | 116A | WA |
1162 | AE | 116D | YO | 116B | WAE |
1163 | YA | 116E | U | 116C | OE |
1164 | YAE | 1172 | YU | 116F | WEO |
1165 | EO | 1173 | EU | 1170 | WE |
1166 | E | | | 1171 | WI |
1167 | YEO | | | 1174 | YI |
1168 | YE | | | | |
1175 | I | | | | |
If the vowel of the syllable is based on a vertical line, place the preceding consonant to its left. If the vowel is based on a horizontal line, place the preceding consonant above it. If the vowel is based on a combination of vertical and horizontal lines, place the preceding consonant above the horizontal line and to the left of the vertical line. In every case, place a following consonant, if any, below the middle of the resulting group.
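The categorization in Table 18-12 amounts to a three-way lookup on the vowel. A minimal Python sketch of that lookup follows; the set and function names are hypothetical, and the returned strings are simply labels for the three placement rules described above, not a layout algorithm.

```python
# Modern jungseong grouped as in Table 18-12.
VERTICAL = set(range(0x1161, 0x1169)) | {0x1175}
HORIZONTAL = {0x1169, 0x116D, 0x116E, 0x1172, 0x1173}
BOTH = {0x116A, 0x116B, 0x116C, 0x116F, 0x1170, 0x1171, 0x1174}


def leading_consonant_placement(vowel: str) -> str:
    """Return where the preceding consonant goes relative to the vowel."""
    cp = ord(vowel)
    if cp in VERTICAL:
        return "left"
    if cp in HORIZONTAL:
        return "above"
    if cp in BOTH:
        return "above-left"
    raise ValueError("not a vowel listed in Table 18-12")
```

The three sets together cover all 21 modern vowels, U+1161..U+1175, with no overlap.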
In any particular font, the exact placement, shape, and size of the components will vary according to the shapes of the other characters and the overall design of the font.
#Collation. The unit of collation in Korean text is normally the Hangul syllable. The order of the syllables in the Hangul Syllables block reflects the preferred collation order used in the Republic of Korea. If sequences of Hangul syllables are collated with a simple binary comparison, the result will reflect that collation order. More sophisticated collation algorithms are required to obtain other collation orders, such as the one preferred in the Democratic People’s Republic of Korea.
When Korean text includes sequences of conjoining jamo, as for Old Korean, or mixtures of precomposed syllable blocks and conjoining jamo, the easiest approach for collation is to decompose the precomposed syllable blocks into conjoining jamo before comparing. Additional steps must be taken to ensure that comparison is then done for sequences of conjoining jamo that comprise complete syllables. See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for more discussion about the collation of Korean.
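The two approaches can be contrasted in a short Python sketch. Sorting strings in Python compares code points, which corresponds to the simple binary comparison described above. The hypothetical helper `jamo_key` performs only the decomposition step, using NFD; it does not implement the additional syllable-boundary steps or the Unicode Collation Algorithm, so it should be read as a starting point only.

```python
import unicodedata


def jamo_key(text: str) -> str:
    """Sort key that decomposes precomposed syllables into conjoining jamo.

    Only the decomposition step is shown; no syllable-boundary handling.
    """
    return unicodedata.normalize("NFD", text)


# Three precomposed syllables, deliberately out of order.
WORDS = ["\uB098", "\uAC00", "\uB2E4"]

# Simple binary (code point) comparison of precomposed syllables.
BINARY_SORTED = sorted(WORDS)

# Decomposing first lets precomposed and conjoining spellings compare equal.
SAME_SYLLABLE = jamo_key("\uAC00") == jamo_key("\u1100\u1161")
```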
#18.7 Yi
#18.7.1 Yi: U+A000–U+A4CF
The Yi syllabary encoded in Unicode is used to write the Liangshan dialect of the Yi language, a member of the Sino-Tibetan language family.
Yi is the Chinese name for one of the largest ethnic minorities in the People’s Republic of China. The Yi, also known historically and in English as the Lolo, do not have a single ethnonym, but refer to themselves variously as Nuosu, Sani, Axi or Misapo. According to the 1990 census, more than 6.5 million Yi live in southwestern China in the provinces of Sichuan, Guizhou, Yunnan, and Guangxi. Smaller populations of Yi are also to be found in Myanmar, Laos, and Vietnam. Yi is one of the official languages of the PRC, with between 4 and 5 million speakers.
The Yi language is divided into six major dialects. The Northern dialect, which is also known as the Liangshan dialect because it is spoken throughout the region of the Greater and Lesser Liangshan Mountains, is the largest and linguistically most coherent of these dialects. In 1991, there were about 1.6 million speakers of the Liangshan Yi dialect. The ethnonym of speakers of the Liangshan dialect is Nuosu.
#Traditional Yi Script. The traditional Yi script, historically known as Cuan or Wei, is an ideographic script. Unlike in other Chinese-influenced siniform scripts, however, the ideographs of Yi appear not to be derived from Han ideographs. One of the more widespread traditions relates that the script, comprising about 1,840 ideographs, was devised by someone named Aki during the Tang dynasty (618–907 CE). The earliest surviving examples of the Yi script are monumental inscriptions dating from about 500 years ago; the earliest example is an inscription on a bronze bell dated 1485.
There is no single unified Yi script, but rather many local script traditions that vary considerably with regard to the repertoire, shapes, and orientations of individual glyphs and the overall writing direction. The profusion of local script variants occurred largely because until modern times the Yi script was mainly used for writing religious, magical, medical, or genealogical texts that were handed down from generation to generation by the priests of individual villages, and not as a means of communication between different communities or for the general dissemination of knowledge. Although a vast number of manuscripts written in the traditional Yi script have survived to the present day, the Yi script was not widely used in printing before the 20th century.
Because the traditional Yi script is not standardized, a considerable number of glyphs are used in the various script traditions. According to one authority, there are more than 14,200 glyphs used in Yunnan, more than 8,000 in Sichuan, more than 7,000 in Guizhou, and more than 600 in Guangxi. However, these figures are misleading—most of the glyphs are simple variants of the same abstract character. For example, a 1989 dictionary of the Guizhou Yi script contains about 8,000 individual glyphs, but excluding glyph variants reduces this count to about 1,700 basic characters, which is quite close to the figure of 1,840 characters that Aki is reputed to have devised.
#Standardized Yi Script. There has never been a high level of literacy in the traditional Yi script. Usage of the traditional script has remained limited even in modern times because the traditional script does not accurately reflect the phonetic characteristics of the modern Yi language, and because it has numerous variant glyphs and differences from locality to locality.
To improve literacy in Yi, a scheme for representing the Liangshan dialect using the Latin alphabet was introduced in 1956. A standardized form of the traditional script used for writing the Liangshan Yi dialect was devised in 1974 and officially promulgated in 1980. The standardized Liangshan Yi script encoded in Unicode is suitable for writing only the Liangshan Yi dialect; it is not intended as a unified script for writing all Yi dialects. Standardized versions of other local variants of traditional Yi scripts do not yet exist.
The standardized Yi syllabary comprises 1,164 signs representing each of the allowable syllables in the Liangshan Yi dialect. There are 819 unique signs representing syllables pronounced in the high level, low falling, and midlevel tones, and 345 composite signs representing syllables pronounced in the secondary high tone. The signs for syllables in the secondary high tone consist of the sign for the corresponding syllable in the midlevel tone (or in three cases the low falling tone), plus a diacritical mark shaped like an inverted breve. For example, U+A001 YI SYLLABLE IX is the same as U+A002 YI SYLLABLE I plus a diacritical mark. In addition to the 1,164 signs representing specific syllables, a syllable iteration mark is used to indicate reduplication of the preceding syllable, which is frequently used in interrogative constructs.
#Standards. In 1991, a national standard for Yi was adopted by China as GB 13134-91. This encoding includes all 1,164 Yi syllables as well as the syllable iteration mark, and is the basis for the encoding in the Unicode Standard. The syllables in the secondary high tone, which are differentiated from the corresponding syllable in the midlevel tone or the low falling tone by a diacritical mark, are not decomposable.
#Naming Conventions and Order. The Yi syllables are named on the basis of the spelling of the syllable in the standard Liangshan Yi romanization introduced in 1956. The tone of the syllable is indicated by the final letter: “t” indicates the high level tone, “p” indicates the low falling tone, “x” indicates the secondary high tone, and an absence of final “t”, “p”, or “x” indicates the midlevel tone.
With the exception of U+A015, the Yi syllables are ordered according to their phonetic order in the Liangshan Yi romanization—that is, by initial consonant, then by vowel, and finally by tone (t, x, unmarked, and p). This is the order used in dictionaries of Liangshan Yi that are ordered phonetically.
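Because the tone is carried by the final letter of the character name, it can be read off programmatically. The Python sketch below assumes that the `unicodedata` module of the running interpreter includes the Yi Syllables block; the function name `yi_tone` and the returned tone labels are hypothetical. It treats U+A015 separately, because that character is a syllable iteration mark rather than a syllable.

```python
import unicodedata

# Final letter of the romanized syllable name -> tone.
TONE_BY_FINAL = {
    "T": "high level",
    "X": "secondary high",
    "P": "low falling",
}
ITERATION_MARK = "\uA015"  # named YI SYLLABLE WU, but not a real syllable


def yi_tone(ch):
    """Return the tone implied by a Yi syllable's character name."""
    if ch == ITERATION_MARK:
        return None  # the iteration mark carries no tone of its own
    name = unicodedata.name(ch)
    if not name.startswith("YI SYLLABLE "):
        raise ValueError("not a Yi syllable: %r" % ch)
    # No final T, X, or P means the midlevel tone.
    return TONE_BY_FINAL.get(name[-1], "midlevel")


# U+A000..U+A003 are IT, IX, I, IP: the same syllable in the four tones,
# encoded in the order t, x, unmarked, p.
for cp in range(0xA000, 0xA004):
    print("U+%04X" % cp, unicodedata.name(chr(cp)), yi_tone(chr(cp)))
```

The lookup relies on the observation that midlevel syllable names end in a vowel letter or "R", so a final "T", "X", or "P" can only be a tone letter.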
#Yi Syllable Iteration Mark. U+A015 YI SYLLABLE WU does not represent a specific syllable in the Yi language, but rather is used as a syllable iteration mark. Its character properties therefore differ from those for the rest of the Yi syllable characters. The misnomer of U+A015 as YI SYLLABLE WU derives from the fact that it is represented by the letter w in the romanized Yi alphabet, and from some confusion about the meaning of the gap in traditional Yi syllable charts for the hypothetical syllable “wu”.
The Yi syllable iteration mark is used to replace the second occurrence of a reduplicated syllable under all circumstances. It is very common in both formal and informal Yi texts.
#Punctuation. The standardized Yi script does not have any special punctuation marks, but relies on the same set of punctuation marks used for writing modern Chinese in the PRC, including U+3001 IDEOGRAPHIC COMMA and U+3002 IDEOGRAPHIC FULL STOP.
#Rendering. The traditional Yi script used a variety of writing directions—for example, right-to-left in the Liangshan region of Sichuan, and top-to-bottom in columns running from left to right in Guizhou and Yunnan. The standardized Yi script follows the writing rules for Han ideographs, so characters are generally written from left to right or occasionally from top to bottom. There is no typographic interaction between individual characters of the Yi script.
#Yi Radicals. To facilitate the lookup of Yi characters in dictionaries, sets of radicals modeled on Han radicals have been devised for the various Yi scripts. (For information on Han radicals, see “CJK and Kangxi Radicals” in Section 18.1, Han.) The traditional Guizhou Yi script has 119 radicals; the traditional Liangshan Yi script has 170 radicals; and the traditional Yunnan Sani Yi script has 25 radicals. The standardized Liangshan Yi script encoded in Unicode has a set of 55 radical characters, which are encoded in the Yi Radicals block (U+A490..U+A4C6). Each radical represents a distinctive stroke element that is common to a subset of the characters encoded in the Yi Syllables block. The name used for each radical character is that of the corresponding Yi syllable closest to it in shape.
#18.8 Nüshu
#18.8.1 Nüshu: U+1B170–U+1B2FF
Nüshu is a siniform script devised by women to write the local Chinese dialect of Jiangyong county in the Xiaoshui Valley of southeastern Hunan province in China. Nüshu means “women’s writing,” and was originally used only by women, many of whom could not write Chinese Han characters. The script appeared in handwritten cloth-bound booklets of poems and songs, called San Chao Shu (三朝書), that were passed down from one “sworn sister” to another upon marriage. Nüshu also was used for other purposes, and on different media. By the late twentieth century, very few women fluent in the script were still alive. National and international attention to Nüshu has led to active efforts to study and preserve the script.
#Structure. Nüshu is written vertically in columns which are laid out from right to left. Although largely based on Chinese Han characters, Nüshu characters typically represent the phonetic values of syllables, with many characters representing several homophonous words. Some signs are used as ideographs.
#Names. Nüshu characters are named sequentially by prefixing the string “NUSHU CHARACTER-” to the code point. The diaeresis is not included in this prefix because of the constraints on letters that can be used in character names.
#Order. The Nüshu characters are ordered by stroke count, then by vowel, consonant, and tone.
#Punctuation. Nüshu has one punctuation mark, U+16FE1 NUSHU ITERATION MARK, located in the Ideographic Symbols and Punctuation block.
#Sources. The Unicode Character Database contains a source data file for Nüshu called NushuSources.txt. This data file contains normative information on the source references for each Nüshu character. NushuSources.txt also contains an informative reading value for each character.
#18.9 Lisu
#18.9.1 Lisu: U+A4D0–U+A4FF
Sometime between 1908 and 1914, a Karen evangelist from Myanmar named Ba Thaw modified the shapes of Latin characters and created the Lisu script. Afterwards, British missionary James Outram Fraser and some Lisu pastors revised and improved the script. The script is commonly known in the West as the Fraser script. It is also sometimes called the Old Lisu script, to distinguish it from newer, Latin-based orthographies for the Lisu language.
There are 630,000 Lisu people in China, mainly in the regions of Nujiang, Diqing, Lijiang, Dehong, Baoshan, Kunming, and Chuxiong in Yunnan Province. Another 350,000 Lisu live in Myanmar, Thailand, and India. Other user communities are mostly Christians from the Dulong, the Nu, and the Bai nationalities in China.
At present, about 200,000 Lisu in China use the Lisu script and about 160,000 in the other countries are literate in it. The Lisu script is widely used in China in education, publishing, the media and religion. Various schools and universities at the national, provincial and prefectural levels have been offering Lisu courses for many years. Globally, the script is also widely used in a variety of Lisu literature.
#Structure. There are 40 letters in the Lisu alphabet. These consist of 30 consonants and 10 vowels. Each letter was originally derived from the capital letters of the Latin alphabet. Twenty-five of them look like sans-serif Latin capital letters (all but “Q”) in upright positions; the other 15 are derived from sans-serif Latin capital letters rotated 180 degrees.
Although the letters of the Lisu script clearly derived originally from the Latin alphabet, the Lisu script is distinguished from the Latin script. The Latin script is bicameral, with case mappings between uppercase and lowercase letters. The Lisu script is unicameral; it has no casing, and the letters do not change form. Furthermore, typography for the Lisu script is rather sharply distinguished from typography for the Latin script. There is not the same range of font faces as for the Latin script, and Lisu typography is typically monospaced and heavily influenced by the conventions of Chinese typography.
Consonant letters have an inherent [ɑ] vowel unless followed by an explicit vowel letter. Three letters sometimes represent a vowel and sometimes a consonant: U+A4EA LISU LETTER WA, U+A4EC LISU LETTER YA, and U+A4ED LISU LETTER GHA.
#Tone Letters. The Lisu script has six tone letters which are placed after the syllable to mark tones. These tone letters are listed in Table 18-13, with the tones identified in terms of their pitch contours.
Code | Glyph | Name | Tone |
---|---|---|---|
A4F8 | ꓸ | mya ti | 55 |
A4F9 | ꓹ | na po | 35 |
A4FA | ꓺ | mya cya | 44 |
A4FB | ꓻ | mya bo | 33 |
A4FC | ꓼ | mya na | 42 |
A4FD | ꓽ | mya jeu | 31 |
Each of the six tone letters represents one simple tone. Although the tone letters clearly derive from Western punctuation marks (full stop, comma, semicolon, and colon), they do not function as punctuation at all. Rather, they are word-forming modifier letters.
The first four tone letters can be used in combination with the last two to represent certain combination tones. Of the various possibilities, only “,;” is still in use; the rest are now rarely seen in China. In monospaced fonts where all letters have the same advance width (for example, one em), it is desirable to fit such a combination of tone letters into the advance width of a simple tone letter.
#Other Modifier Letters. Nasalized vowels are denoted by a nasalization mark following the vowel. This word-forming character is not encoded separately in the Lisu script, but is represented by U+02BC MODIFIER LETTER APOSTROPHE, which has the requisite shape and properties (General_Category = Lm) and is used in similar contexts.
A glide based on the vowel A, pronounced as [ɑ] without an initial glottal stop (and normally bearing a 31 low falling pitch), is written after a verbal form to mark various aspects. This word-forming modifier letter is represented by U+02CD MODIFIER LETTER LOW MACRON. In a Lisu font, this modifier letter should be rendered on the baseline, to harmonize with the position of the tone letters.
#Digits and Separators. There are no unique Lisu digits. The Lisu use European digits for counting. The thousands separator and the decimal point are represented with U+002C COMMA and U+002E FULL STOP, respectively. To separate chapter and verse numbers, U+003A COLON and U+003B SEMICOLON are used. These can be readily distinguished from the similar-appearing tone letters by their numerical context.
#Punctuation. U+A4FE “꓾” LISU PUNCTUATION COMMA and U+A4FF “꓿” LISU PUNCTUATION FULL STOP are punctuation marks used respectively to denote a lesser and a greater degree of finality. These characters are similar in appearance to sequences of Latin punctuation marks, but are not unified with them.
Over time various other punctuation marks from European or Chinese traditions have been adopted into Lisu orthography. Table 18-14 lists all known adopted punctuation, along with the respective contexts of use.
Code | Glyph | Name | Context |
---|---|---|---|
002D | - | hyphen-minus | syllable separation in names |
003F | ? | question mark | questions |
0021 | ! | exclamation mark | exclamations |
0022 | " | quotation mark | quotations |
0028/0029 | ( ) | parentheses | parenthetical notes |
300A/300B | 《 》 | double angle brackets | book titles |
2026 | … | ellipsis | omission of words (always doubled in Chinese usage) |
U+2010 HYPHEN may be preferred to U+002D HYPHEN-MINUS for the dash used to separate syllables in names, as its semantics are less ambiguous than U+002D.
The use of the U+003F “?” QUESTION MARK replaced the older Lisu tradition of using a tone letter combination to represent the question prosody, followed by a Lisu full stop: “..:=”
#Line Breaking. A line break is not allowed within an orthographic syllable in Lisu. A line break is also prohibited before a punctuation mark, even if it is preceded by a space. In general there is no hyphenation of words across line breaks, except for proper nouns, where a break is allowed after the hyphen used as a syllable separator.
#Word Separation. The Lisu script separates syllables using a space or, for proper names, a hyphen. In the case of polysyllabic words, it can be ambiguous as to which syllables join together to form a word. Thus for most text processing at the character level, a syllable (starting after a space or punctuation and ending before another space or punctuation) is treated as a word except for proper names—where the occurrence of a hyphen holds the word together.
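The character-level rule above can be sketched as a simple tokenizer. The Python example below is an illustration only: the function name `lisu_words` is hypothetical, the separator set is limited to white space, the two Lisu punctuation marks, and the adopted punctuation listed in Table 18-14, and the sample string is an arbitrary sequence of Lisu letters rather than real Lisu text. The hyphen is deliberately left out of the separator set so that hyphenated proper names stay together.

```python
import re

# White space, LISU PUNCTUATION COMMA / FULL STOP, and adopted punctuation.
# U+002D and U+2010 are intentionally absent: a hyphen holds a name together.
_SEPARATORS = re.compile(r'[\s\uA4FE\uA4FF?!"()\u300A\u300B\u2026]+')


def lisu_words(text):
    """Split Lisu text into word-like units for character-level processing."""
    return [token for token in _SEPARATORS.split(text) if token]


# Hypothetical sample: one syllable with a tone letter, then a hyphenated
# two-syllable unit followed by a Lisu full stop.
sample = "\uA4E1\uA4F2\uA4F8 \uA4D0\uA4F2-\uA4D1\uA4F3\uA4FF"
print(lisu_words(sample))
```

The tone letters U+A4F8..U+A4FD are not separators even though they resemble Western punctuation, so they remain attached to their syllables.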
#18.10 Miao
#18.10.1 Miao: U+16F00–U+16F9F
The Miao script, also called Lao Miaowen (“Old Miao Script”) in Chinese, was created in 1904 by Samuel Pollard and others, to write the Northeast Yunnan Miao language of southern China. The script has also been referred to as the Pollard script, but that usage is no longer preferred. The Miao script was created by an adaptation of Latin letter variants, English shorthand characters, Miao pictographs, and Cree syllable forms. (See Section 20.2, Canadian Aboriginal Syllabics.) Today, the script is used to write various Miao dialects, as well as languages of the Yi and Lisu nationalities in southern China.
The script was reformed in the 1950s by Yang Rongxin and others, and was later adopted as the “Normalized” writing system of Kunming City and Chuxiong Prefecture. The main difference between the pre-reformed and the reformed orthographies is in how they mark tones. Both orthographies can be correctly represented using the Miao characters encoded in the Unicode Standard.
#Implementation Guidelines. Extensive guidelines for the implementation of the Miao script can be found in Unicode Technical Note #56, Representing Miao in Unicode. That document provides information on the encoding order of syllables, on rendering, and on glyph variants. (Unicode Technical Notes do not have normative status for the Unicode Standard.)
#Encoding Principles. The script is written left to right. The basic syllabic structure contains an initial consonant or consonant cluster and a final. The final consists of either a vowel or vowel cluster, an optional final nasal, plus a tone mark. The initial consonant may be preceded by U+16F50 MIAO LETTER NASALIZATION, and can be followed by combining marks for voicing (U+16F52 MIAO SIGN REFORMED VOICING) or aspiration (U+16F51 MIAO SIGN ASPIRATION and U+16F53 MIAO SIGN REFORMED ASPIRATION).
The Gan Yi variety of Miao has an additional combining mark, U+16F4F MIAO SIGN CONSONANT MODIFIER BAR. That mark is only applied to two consonants, U+16F0E MIAO LETTER TTA or U+16F10 MIAO LETTER NA, indicating a distinct place of articulation. The mark follows the consonant in logical order, as for all combining marks, but is rendered with a small vertical bar at the lower left-hand side of the modified consonant.
#Tone Marks. In the Chuxiong reformed orthography, vowels and final nasals appear on the baseline. If no explicit tone mark is present, this indicates the default tone 3. An additional tone mark, encoded in the range U+16F93..U+16F99, may follow the vowel to indicate other tones. A set of archaic tone marks used in the reformed orthography is encoded in the range U+16F9A..U+16F9F.
In the pre-reformed orthography, such as that used for the language Ahmao (Northern Hmong), the tone marks are represented in a different manner, using one of five shifter characters. These are represented in sequence following the vowel or vowel sequence and indicate where the vowel letter is to be rendered in relation to the consonant. If more than one vowel letter appears before the shifter, all of the vowel glyphs are moved together to the appropriate position.
#Rendering of “wart”. Several Miao consonants appear in the code charts with a “wart” attached to the glyph, usually on the left-hand side. In the Chuxiong orthography, a dot appears instead of the wart on these consonants. Because the user communities consider the appearance of the wart or dot to be a different way to write the same characters and not a difference of the character’s identity, the differences in appearance are a matter of font style.
#Ordering. The order of Miao characters in the code charts derives from a reference ordering widely employed in China, based in part on the order of Bopomofo phonetic characters. The expected collation order for Miao strings varies by language and user communities, and requires tailoring. See Unicode Technical Standard #10, “Unicode Collation Algorithm.”
#Digits. Miao uses European digits.
#Punctuation. The Miao script employs a variety of punctuation marks, both from the East Asian typographical tradition and from the Western typographical tradition. There are no script-specific punctuation marks.
#18.11 Tangut
#18.11.1 Tangut: U+17000–U+187FF
#Tangut Supplement: U+18D00–U+18D7F
Tangut, also known as Xixia, is a large, historic siniform ideographic script used to write the Tangut language, a Tibeto-Burman language spoken from about the 11th century CE until the 16th century in the area of present-day northwestern China. The Tangut script was created under the first emperor of Western Xia about 1036 CE. After the fall of the Western Xia to the Mongols, the script continued to be used during the Yuan and Ming dynasties, but it had become obsolete by the end of the Ming dynasty. Tangut was rediscovered in the late 19th century, and has been largely deciphered, thanks to the groundbreaking work done in the early 20th century by N. A. Nevskij. Tangut is found in thousands of official, private, and religious texts, including books and sutras, inscriptions, and manuscripts. Today the study of Tangut is a separate discipline, with scholars in China, Japan, Russia, and other countries publishing works on Tangut language and culture.
#Structure. Tangut characters superficially resemble Chinese ideographs; however, the script is unique and unrelated to Chinese ideographs. Tangut was originally written top to bottom, with columns laid out right to left, in the same manner as Chinese was traditionally written. In current practice, the script is written horizontally left to right. Most Tangut characters are made up of 8 to 15 strokes. The script has no combining characters.
#Encoding Principles. The repertoire of Tangut characters is intended to cover all Tangut characters used as head entries or index entries in the major works of modern Tangut lexicography and scholarship. A number of principles have been adopted to handle variant glyph shapes, because Tangut characters are often written with different glyph shapes in the primary sources. When character variants are not used contrastively in a single source reference, they are unified as a single character, typically using the glyph found in Li Fanwen 2008. However, if a single source includes two or more variants as separate head or index entries, then the variants are encoded as separate characters. In cases where two characters with the same shape are cataloged separately in a single source, but have different pronunciations or meanings, only one character is encoded. Also, a few erroneous or “ghost” characters in modern dictionaries are separately encoded.
The Tangut Supplement block contains additional Tangut ideographs that did not fit within the initial allocation range for the Tangut block. In some cases, these additional ideographs are disunifications resulting from scholarly analysis of some components that have very closely related graphical appearances.
#Character Names. The names for the Tangut characters are algorithmically derived by prefixing the code point with the string “TANGUT IDEOGRAPH-”. Hence the name for U+17000 is TANGUT IDEOGRAPH-17000.
#Punctuation. Contemporary sources use U+16FE0 TANGUT ITERATION MARK, located in the Ideographic Symbols and Punctuation block. There are no other script-specific punctuation marks.
#Sources. The Unicode Character Database contains a source data file for Tangut called TangutSources.txt. This data file contains normative information on the source references for each Tangut character. TangutSources.txt also contains the informative radical-stroke values for each character. The data in TangutSources.txt shares the same format as the Unihan data files in the UCD. The Tangut code chart also indicates the source reference and the radical-stroke value for each character.
#Sorting. No universally accepted or standard character sort order exists for Tangut. All extant Tangut dictionaries dating to the Western Xia period (1038–1227) base their ordering on phonetic principles, which do not help in locating specific characters. Almost all modern Tangut dictionaries and glossaries order characters by radical and stroke count. However, the radical/stroke indices in modern handbooks all differ from one another. The radical system adopted in the Tangut block is based on that of Han Xiaomang 2004, with some modifications. In the Tangut block, signs are grouped by radical, and radicals are ordered by stroke count and stroke order. Within each radical, signs are ordered by stroke count and stroke order.
#Stroke Order. Because present-day Tangut dictionaries do not provide information on how Tangut characters should be written or on their stroke count, modern scholars have reconstructed stroke count and stroke order by analogy with Chinese characters. The stroke order used by scholars may not reflect the actual stroke order used by Tangut scribes.
#18.11.2 Tangut Components: U+18800–U+18AFF
Tangut characters are composed of structural elements called components. The components and stroke order are used by scholars to index Tangut ideographs in modern dictionaries and glossaries. The components are also used to describe and analyze Tangut ideographs.
Because there is no single standard set of components, different scholars have devised their own systems. The Tangut Components block represents a unification of seventeen Chinese, Japanese, Russian, and English language dictionaries of Tangut and other publications. All components used in important recent Tangut dictionaries are included, as well as an additional 24 components required for describing Tangut ideographs. The components can be used in Ideographic Description Sequences (IDS) to describe Tangut ideographs.
#Repertoire. A total of 755 components are encoded. Of these, 505 components function as radicals under which the Tangut ideographs are ordered. Some sources use single strokes to describe or to index characters. In some cases, these single strokes are encoded as components (U+18900..U+18909), but other single strokes may be represented using the corresponding character from the CJK Strokes block instead.
#Names. The characters in the Tangut Components block are named sequentially by prefixing the string “TANGUT COMPONENT-” to a three digit numerical sequence code. Hence, the names range from TANGUT COMPONENT-001 through TANGUT COMPONENT-755.
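The two Tangut naming conventions differ in what follows the prefix: ideograph names embed the code point, whereas component names embed a sequence number. The Python sketch below illustrates both. The function names are hypothetical, and the code assumes that sequence numbers run contiguously from the first code point of the Tangut Components block. It checks only block membership, not whether a given code point is actually assigned.

```python
TANGUT_BLOCKS = ((0x17000, 0x187FF), (0x18D00, 0x18D7F))  # Tangut, Supplement
COMPONENTS_FIRST, COMPONENTS_LAST = 0x18800, 0x18AFF      # Tangut Components


def tangut_ideograph_name(cp):
    """Derive an ideograph name by appending the code point in hexadecimal."""
    if not any(first <= cp <= last for first, last in TANGUT_BLOCKS):
        raise ValueError("U+%04X is outside the Tangut blocks" % cp)
    return "TANGUT IDEOGRAPH-%04X" % cp


def tangut_component_name(cp):
    """Derive a component name from its three-digit sequence number."""
    if not COMPONENTS_FIRST <= cp <= COMPONENTS_LAST:
        raise ValueError("U+%04X is outside Tangut Components" % cp)
    # Assumption: component 001 sits at the first code point of the block.
    return "TANGUT COMPONENT-%03d" % (cp - COMPONENTS_FIRST + 1)


print(tangut_ideograph_name(0x17000))   # TANGUT IDEOGRAPH-17000
print(tangut_component_name(0x18800))   # TANGUT COMPONENT-001
```

Under the same assumption, the 755th component falls at U+18AF2, which leaves the remainder of the block free.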
#Order. The Tangut components are ordered by stroke count and stroke order.
#Radical-Stroke Values. The Unicode Character Database contains the Tangut radical-stroke values for each character in the data file TangutSources.txt. This data is informative, and is in the same format as Unihan. The Tangut code chart also indicates the source reference and the radical-stroke value for each character.
#18.12 Khitan Small Script
#18.12.1 Khitan Small Script: U+18B00–U+18CFF
Khitan Small Script was one of two scripts used by the Khitan people of Northern China to write the Khitan language during the Liao dynasty (907–1125 CE), the Qara Khitai empire (or Western Liao dynasty, 1124–1218), and the Jin dynasty (1115–1234). The other script is known as Khitan Large Script. Both scripts are only partially deciphered today but were used over the same time period, in the same geographical area, and for the same functions.
Khitan Small Script was created about 925 by Yelü Diela, and its creation is said to have been inspired by the Uyghur script, although there appear to be few similarities between the two scripts. The main sources of texts in Khitan Small Script are funerary epitaphs engraved on stone tablets and buried with members of Khitan royalty and aristocracy. The script also appears on walls and monuments, as well as on bronze mirrors, tallies, non-circulation coins, and a single jade cup.
#Structure. The Khitan Small Script contains logograms and phonograms written in vertical columns, running right to left, similar to how Chinese is traditionally written. The logograms generally appear on their own, and phonograms typically combine into clusters of two to eight characters to represent an individual word.
A small number of frequently occurring logograms represent numbers, calendrical terms, kinship terms, and so on. Some of these may appear with dotted and undotted forms. The dotted forms are thought to indicate masculine gender, while the undotted forms indicate feminine gender or are gender-neutral.
Most Khitan words are written phonetically with characters that represent consonants, vowels, diphthongs or syllables. The phonetic values of many of the phonograms have been reconstructed, but many values are still unknown. A few characters seem to act both as logograms and phonograms.
#Character Names. The Khitan Small Script characters are named sequentially by prefixing “KHITAN SMALL SCRIPT CHARACTER-” to the code point, with the exception of one format control character, U+16FE4 KHITAN SMALL SCRIPT FILLER. The filler character is located in the Ideographic Symbols and Punctuation block.
#Phonogram Clusters. Phonograms may occur in isolation, but typically, two or more phonograms combine into a cluster representing a single word of one or more syllables. Within the cluster, the characters are ordered from left to right and then from top to bottom. Less often, a phonogram cluster starts with a single centered character at the top. Some logograms may take a grammatical suffix and therefore appear as the first character in a phonogram cluster.
There are two cluster patterns in Khitan Small Script. The prevalent pattern, Type A, starts with two side-by-side adjacent Khitan Small Script characters, and ends with either a single centered character or two additional side-by-side adjacent characters. The alternate pattern, Type B, occurs occasionally. It starts with a single, centered Khitan Small Script character at the top, usually followed by two, sometimes three, and very rarely more than three characters, as shown in Figure 18-17. The two patterns seem to be a stylistic choice, rather than a semantic distinction.
The original Khitan Small Script texts show a narrow gap between clusters, between clusters and standalone characters, and often between adjacent standalone characters. Modern scholarly transcriptions of texts generally show a clear gap between standalone characters and sequences of characters. To indicate the gap, U+0020 SPACE should be used.
Clusters of Type A are predominant. A rendering system should lay out clusters of this type automatically, by default. To indicate clusters of Type B, the format character U+16FE4 KHITAN SMALL SCRIPT FILLER is used, placed directly after the first character.
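A text-processing layer can tell the two cluster types apart from the encoded sequence alone. The Python sketch below is a minimal illustration: it assumes that clusters and standalone characters are delimited by U+0020 SPACE, as recommended above, and the function names and returned labels are hypothetical. The sample sequences use arbitrary code points from the Khitan Small Script block and are not attested words.

```python
FILLER = "\U00016FE4"  # KHITAN SMALL SCRIPT FILLER


def split_clusters(text):
    """Split a run of text into clusters and standalone characters."""
    return [unit for unit in text.split(" ") if unit]


def cluster_type(unit):
    """Classify one space-delimited unit of Khitan Small Script text."""
    if len(unit) == 1:
        return "standalone"
    # The filler directly after the first character requests Type B layout.
    if unit[1] == FILLER:
        return "B"
    return "A"  # default layout: starts with two side-by-side characters


# Arbitrary sample: a Type A cluster, a Type B cluster, a standalone character.
sample = ("\U00018B01\U00018B02\U00018B03 "
          "\U00018B04" + FILLER + "\U00018B05\U00018B06 "
          "\U00018B07")
print([cluster_type(unit) for unit in split_clusters(sample)])
```

Actual glyph arrangement remains the job of the rendering system; this classification only identifies which layout the encoded text requests.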
Additional rendering support is required to lay out Khitan Small Script in the various attested orientations: in clusters within vertical text, in clusters within left-to-right horizontal text, or simply character-by-character in a horizontal, linear format.
#Iteration Mark. Khitan Small Script contains an iteration mark, U+18B00 KHITAN SMALL SCRIPT CHARACTER-18B00. This mark indicates that the preceding cluster is repeated in reading.
#Obscured or Missing Characters. Occasionally a Khitan Small Script character may be obscured or missing in source materials, often as a result of damage to inscriptions. In such cases, U+18CFF KHITAN SMALL SCRIPT CHARACTER-18CFF can be used to represent the obscured or missing character. The representative glyph for U+18CFF is a white square box, but it may also be shown with dashed or dotted edges. This symbolic indicator of a missing character participates in Khitan Small Script cluster rendering behavior, and so the aspect and/or size of the box may vary, depending on how the clusters are rendered in context.