Chinese and Japanese
Q: What does the term "CJK" mean?
It is a commonly used acronym for "Chinese, Japanese, and Korean". The term "CJK character" generally refers to "Chinese characters", or more specifically, the Chinese (= Han) ideographs used in the writing systems of the Chinese and Japanese languages, occasionally for Korean, and historically in Vietnam.
Q: Are Chinese characters used in Korean?
Yes, but mostly for older and traditional literary materials. Modern Korean is written almost entirely with a separate system of Hangul characters constructed of smaller pieces called jamo letters.
Q: Where can I find out more about Hangul and jamo characters for Korean?
There is a separate FAQ on Korean dealing with Hangul and jamo characters.
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?
There is a lot of misinformation floating around about the support of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.
Unicode supports over 80,000 CJK characters right now, and work is underway to encode further additions. The International Standard ISO/IEC 10646 and the Unicode Standard are completely synchronized in repertoire and content. And that means that Unicode has the same repertoire as GB 18030, since that also is synchronized with ISO 10646 — although with a different ordering and byte format.
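As a rough illustration of the point that coverage does not depend on the encoding form, the following Python sketch round-trips one ideograph through several encodings. The sample character (U+20B9F) is an arbitrary choice made for this example, and the codec names are those of Python's standard library; none of this comes from the Unicode Standard itself.

```python
# Round-trip one supplementary-plane CJK ideograph through several encodings.
# The sample character is an arbitrary choice made for this illustration.
sample = "\U00020B9F"


def round_trips(text: str, codec: str) -> bool:
    """Return True if text survives an encode/decode cycle with the codec."""
    return text.encode(codec).decode(codec) == text


codecs_to_check = ["utf-8", "utf-16", "utf-32", "gb18030"]
results = {codec: round_trips(sample, codec) for codec in codecs_to_check}

for codec, ok in results.items():
    # The byte length differs per encoding; the character itself does not.
    print(f"{codec}: {len(sample.encode(codec))} bytes, round-trips: {ok}")
```

The byte sequences differ from one encoding to the next, but the same character comes back out of each.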
Q: Who is responsible for future CJK characters?
The development and extension of the CJK characters is being done by the Ideographic Research Group (IRG), which includes official representatives of China, Hong Kong (SAR), Macao (SAR), Singapore, Japan, South Korea, North Korea, Taiwan and Vietnam, plus a representative from the Unicode Consortium. For more information, see the IRG home page.
The IRG is very carefully cataloging, reviewing, and assessing CJK characters for inclusion into the standard. The only real limitation on the number of CJK characters in the standard is the ability of this group to process them, because the characters are increasingly obscure (no person knows more than a fraction of the set already encoded).
Q: What is the process for proposing new CJK unified ideographs?
Newly proposed CJK unified ideographs are first submitted to the IRG through national bodies or liaison organizations, and are then assembled into a new "IRG Working Set" that goes through several rounds of detailed review and scrutiny before being approved for standardization as a new CJK unified ideographs extension. Individuals who wish to propose the encoding of new CJK unified ideographs are encouraged to work with their respective country's national body.
Q: Does the Unified Han character encoding in Unicode mean that I only need one CJK font for Asia, or do I have to allow for choices between different styles of CJK fonts for different countries?
Broadly speaking, there are four traditions for character shapes in East Asia: traditional Chinese (used primarily in Taiwan, Hong Kong, and overseas Chinese communities), simplified Chinese (used primarily in mainland China and Singapore), Japanese, and Korean. Using a single font for all four locales allows the characters to be legible, but means that some characters may look odd. For optimal results a system localized for use in Japan, for example, should use a font designed explicitly for use with Japanese, rather than a generic Unihan font. [JJ]
Q: If the character shapes are different in different parts of East Asia, why were the characters unified?
The Unicode Standard is designed to encode characters, not glyphs. Even where there are substantial variations in the standard way of writing a character from locale to locale, if the fundamental identity of the character is not in question, then a single character is encoded in Unicode.
This principle applies to East Asian scripts as well as to those of other parts of the world. It is well-recognized that the Han characters involved are the same, even when used in different countries to write different languages. In the overwhelming majority of cases where a Han character is written differently in different locales, readers from one locale would recognize the form used in another; in all cases, experts from throughout East Asia would recognize the fundamental unity of the character.
As a rule, the differences in writing style between the different East Asian locales are within the general range of allowable differences within each typographic tradition.
E.g., the official "Taiwanese" glyph for 草 U+8349 ("grass") per ISO/IEC 10646 uses four strokes for the "grass" radical, whereas the PRC, Japanese, and Korean glyphs use three. As it happens, Apple's LiSung Light font for Big Five (which follows the "Taiwanese" typographic tradition) uses three strokes, as shown here:
Japanese users prefer to see Japanese text written with "Japanese" glyphs.
There are occasional instances of unified characters whose typical Chinese glyph and typical Japanese glyph are distinct enough that the Chinese glyph will be unfamiliar to the typical Japanese reader, e.g., 直 U+76F4. To prevent legibility problems for Japanese readers, it is advisable to use a Japanese-style font when presenting Unihan text to Japanese readers.
It is also typical for Japanese users to see Chinese text written with "Japanese" glyphs. For example:
A standard Japanese dictionary which quotes Chinese authors (e.g., Mencius) uses "Japanese" glyphs, not Chinese ones.
In particular, it is perfectly acceptable within Japanese typography for stretches of Chinese quoted in a predominantly Japanese text to be written with "Japanese" glyphs.
Han Unification is designed to preserve legibility. Documents typically can be simply displayed in the font preferred by the user. Where a distinction in style needs to be made (for example, Chinese-style vs. Japanese-style glyphs in the same document), appropriate fonts should be applied to the specific text as needed.
Because of limitations in existing fonts, it may occasionally happen that a rare kanji will be displayed using a Chinese-style glyph where a Japanese-style glyph would be preferred. This is a font issue, not a character encoding issue, and the same problem can occur with other character encoding standards.
For more information, see On the Encoding of Latin, Greek, Cyrillic, and Han. [JJ]
Q: How can I recognize from the 32-bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
It's basically impossible and largely meaningless. It's the equivalent of asking if "a" is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it's traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable.
The phonetic data in the Unihan Database should not be used for this purpose. A blank in the phonetic data means that nobody's supplied a reading, not that a reading doesn't exist. Because updating the Unihan Database is an ongoing process, these fields will be increasingly filled out as time goes on, but they should never be taken as absolutely complete. In particular, there are obscure characters where it is known that there is a reading, but since the character does not occur in standard dictionaries, we are unable to supply it (e.g., 䃟 U+40DF in Cantonese).
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
The only proper mechanism, as for determining whether "chat" is spelled correctly in English or French, is to use a higher-level protocol. [JJ]
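The whole-text heuristic mentioned above might be sketched in Python as follows. This is only an approximation: the function name, the code point ranges chosen (the Hiragana and Katakana blocks, and the precomposed Hangul syllables), and the 10% threshold are all assumptions made for this example, and the result is a guess rather than a substitute for a higher-level protocol.

```python
def guess_language(text: str) -> str:
    """Guess 'ja', 'ko', or 'unknown' from the share of kana or hangul."""
    # Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    # Precomposed Hangul syllables (U+AC00..U+D7A3).
    hangul = sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return "unknown"
    # The 10% threshold is an arbitrary assumption, not a recommended value.
    if kana / total >= 0.10 and kana >= hangul:
        return "ja"
    if hangul / total >= 0.10:
        return "ko"
    # Han-only text cannot be attributed to a language this way.
    return "unknown"


print(guess_language("これは日本語の文章です"))
print(guess_language("한국어 문장입니다"))
print(guess_language("中文文本"))
```

Text consisting only of Han characters deliberately yields "unknown", in line with the answer above.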
Q: How does character input on a keyboard work for Chinese characters?
This is a complicated question. For answers, see How are Chinese characters input?
Q: Why is Unicode missing some characters from the Big Five character set?
The "Big Five" character set is an industrial standard commonly used for traditional Chinese. There are, however, several versions of the Big Five in common use, generally representing extensions of the formal standard. There are two main versions, "plain Big Five" and "ETEN Big Five" as well as numerous vendor- or platform-specific extensions. In recent years, there have been further extensions such as the Hong Kong Extension to Big Five and Big Five Plus.
The initial, un-extended Big Five was the standard version of the character set at the time that the Unicode Standard, Version 1.0, was under development, and Unicode was designed to cover its ideographic repertoire completely. This is reflected in the data files supplied by the Unicode Consortium. Some vendors provide vendor-specific tables showing mapping data for their custom Big Five extensions and Unicode. The Unicode Consortium does not, however, provide data on every known dialect of the Big Five, so it is possible that a particular dialect of the Big Five is not included in the tables provided by Unicode. [JJ]
Q: I hear that certain characters from the GB18030 encoding are not mapped to any code points in Unicode, and need to be mapped to characters in the Private Use Area instead. Is this true? And if so, is the issue being dealt with in the near future?
That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB18030, the missing characters were added as of Unicode 4.1, so of course they are also in Unicode 5.0 and later versions.
You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters. These can be found in the ranges: U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB (for various CJK characters and components).
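A minimal Python sketch of a membership test for the two ranges quoted above follows. The constant and function names are hypothetical, chosen for this example.

```python
# The two ranges quoted above; the names used here are illustrative only.
GB18030_ADDED_RANGES = [
    (0x31C0, 0x31CF),  # CJK strokes
    (0x9FA6, 0x9FBB),  # various CJK characters and components
]


def in_gb18030_added_ranges(ch: str) -> bool:
    """Return True if the single character ch falls in either range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in GB18030_ADDED_RANGES)


print(in_gb18030_added_ranges("\u31c0"))  # first code point of the strokes range
print(in_gb18030_added_ranges("\u4e00"))  # an ordinary unified ideograph
```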
Q: Isn't it true that some Japanese can't write their own names in Unicode?
There are some situations where an individual prefers their name be written with a specific glyph, as in the West we have John and Jon, Mark and Marc, Cathy and Kathy. In most cases, variation sequences in the UTS #37 Unicode Ideographic Variation Database can be used to provide the required representation in plain text. In other cases, the variant forms have been encoded in Unicode as distinct characters. The IRG also may consider where the encoding of new variant characters is justified.
It should be noted that this is not a problem of Han unification per se, as it is often represented. Unicode is a superset of the major Japanese character encoding standards. The various JIS standards and ISO 2022-based encodings have the same limitation. [JJ]
Q: Where can I find a Unicode mapping for EACC?
EACC is an American National Standard, East Asian Character Code for Bibliographic Use (ANSI/NISO Z39.64), developed by the library community. The Library of Congress specifies use of EACC for CJK data in MARC 21 records that do not use UTF-8. The Unicode-EACC mapping approved by the MARBI Committee of the American Library Association is available on the MARC 21 Web site. [JA]
Q: Why doesn't the Unihan database include mappings for all EACC characters?
The Unihan database covers only the ideographs in the Unicode Standard. EACC also includes characters such as Japanese kana and Korean hangul that are outside the scope of the Unihan database. [JA]
Q: What is JIS X0213?
JIS X0213, 7-bit and 8-bit double byte coded extended Kanji sets for information interchange, is a new Japanese national standard coded character set established by JISC (Japanese Industrial Standards Committee). It was established in January 2000, then revised in February 2004. It enumerates 11,233 characters, which extends the 6,879 characters of the JIS X0208 standard. It consists of 10,050 Kanji (ideographic) characters and 1,183 non-Kanji (non-ideographic) characters. These characters are arranged in two planes of a 94-row-by-94-cell matrix. Also, as an informative annex, three encoding methods are defined as extensions of existing de facto encodings, that is, Shift JIS, EUC-JP, and ISO-2022-JP. [TO]
Q: How is JIS X0213 related to some existing JIS standards?
There are several JIS coded character set standards. JIS X0201 is the single-byte coded character set which adapts the ISO/IEC 646 standard in Japan. JIS X0208, JIS X0212 and JIS X0213 are the double-byte coded character sets, and JIS X0221 is the multi-byte coded character set which corresponds to ISO/IEC 10646. JIS X0208 is the primary double-byte coded character set used for Japanese. Although both JIS X0212 and JIS X0213 Kanji standards have been established as the supplement to JIS X0208 standard, the scopes of their source character sets are different. [TO]
Q: How is JIS X0213 related to Unicode / ISO/IEC 10646?
Almost all characters in JIS X0213 have corresponding characters in Unicode / ISO/IEC 10646. Only a few non-Kanji characters are represented by composite sequences in Unicode / ISO/IEC 10646. Kanji characters are mapped to one of the blocks of CJK Unified Ideographs, CJK Unified Ideographs Extension A, or CJK Unified Ideographs Extension B in Unicode 4.0 (or later versions) and corresponding versions of ISO/IEC 10646; or are mapped to CJK Compatibility Ideographs. [TO]
Q: Where can I get more information about JIS X0213?
For more information about the JIS X0213 standard, contact the Japanese Standards Association. [TO]
Q: I have heard there are problems in Japanese and other East Asian mapping tables. Where can I find information about these problems?
There are many well-known mapping problems and discrepancies. For example:
Shift-JIS bytes <0x81 0x5C> can be mapped to U+2014 or U+2015, which look almost the same.
Shift-JIS bytes <0x87 0x82> and <0xFA 0x59> can both be mapped to U+2116, but the primary roundtrip mapping may be different between platforms. That is, what U+2116 maps back to may be different.
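The second discrepancy can be observed with a short Python sketch. It assumes that Python's built-in "cp932" codec follows the Microsoft variant of Shift-JIS; other platforms and converters may behave differently, which is exactly the problem described above.

```python
# Two different Shift-JIS byte sequences, as listed above.
first_bytes = b"\x87\x82"
second_bytes = b"\xfa\x59"

# Decode both with Python's cp932 codec (assumed to follow Microsoft's table).
decoded_first = first_bytes.decode("cp932")
decoded_second = second_bytes.decode("cp932")

# Encoding back can produce only one of the two byte sequences.
back = decoded_first.encode("cp932")

print(f"U+{ord(decoded_first):04X}", f"U+{ord(decoded_second):04X}")
print("encodes back to:", back.hex())
```

Because both byte sequences decode to the same character, a decode followed by an encode cannot preserve both; which one survives depends on the converter.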
Sometimes the standard is ill-defined, and each vendor has a choice in how to implement the Unicode mapping table. Examples include the Big5-HKSCS and several other codepages. Sometimes the mapping table varies, even on the same platform. For example, Windows-950 is either Big5 or Big5-HKSCS, and the latter depends on the user applying a Windows-specific patch. Implementations of ISO 2022 encodings like ISO-2022-JP differ not only in the mapping tables for the sub-encodings but also in the supported sets of escape sequences and their invocation pattern.
The W3C has an extensive technical report "XML Japanese Profile" which lists a number of known mapping problems. Of special interest to people with mapping problems are Appendix C, Ambiguities in conversion from Shift-JIS to Unicode and Appendix D, Ambiguities in conversion from Japanese EUC to Unicode.
The ICU project contains many mapping tables for a variety of standards. See the ICU User Guide, particularly the section on Conversion Data. The page Character Set Mapping Tables shows a detailed comparison between a number of different charsets, based on data collected on different platforms.
The obsolete, unmaintained East Asian Mapping Tables on the Unicode website also contain some notes about specific discrepancies. There is an extensive article at Debian by Tomohiro Kubota on these problems: Conversion tables differ between vendors. The article contains a table of discrepancies in various Japanese encodings.
For more information on character mappings and roundtripping issues, see UTS #22: Unicode Character Mapping Markup Language. [GR]
Q: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?
The Han ideographic script is largely compositional in nature. The overwhelming majority of characters created over the centuries (and still being coined) are made by adjoining two or more old characters in simple geometric relationships. For example, the Cantonese-specific character U+55F0 嗰 was created by adjoining the two older characters, U+53E3 口 and U+500B 個, one next to the other.
The compositional nature of the script—and, more to the point, the fact that this compositional nature is well-known—means that over time tens of thousands of ideographs have been created, and these are currently encoded in Unicode by using one code point per ideograph. The result is that over 80,000 code points are used for ideographs in the Unicode Standard—over two-thirds of the characters encoded.
The compositional nature of the script makes it attractive to propose a compositional encoding model, such as can be used for Hangul. Such a mechanism would result in the savings of thousands of code points and relieve the IRG from the burden of having to examine potential candidates for encoding.
Unfortunately, there are some difficulties involved with a compositional model for Han.
First of all, while the rules for drawing composed Jamos as Hangul syllables are relatively straightforward, those for Han are surprisingly complex. To use U+55F0 嗰 as an example again, although it is built structurally out of two pieces, the left piece occupies far less than 50% of the character's horizontal space. This reduction in size is a result of the nature of U+53E3 口 itself and doesn't apply to other characters. Either the rendering process would have to be sophisticated enough to take such ideographic idiosyncrasies into account, or the encoding model would have to provide more information than just the geometric relationship between the composing pieces. (This is the main reason why the existing Ideographic Description Sequence mechanism is inadequate even for drawing described ideographs.)
Even more difficult is the problem of normalization, which would be necessary for operations such as comparison or searching. A normalization algorithm would first have to parse the sequence of composing Han for validity, and then make sure that all substrings are normalized. It should also be able to recognize a "canonical" form for a sequence of composing Han. Thus, U+55F0 嗰 could be spelled using three pieces (U+53E3 口, U+4EBB 亻, U+56FA 固) as well as with two. Indeed, since U+4EBB 亻 is a well-known variant form of U+4EBA 人, it could be spelled using that character, as well. Providing a canonical representation would have to take these multiple spellings into account.
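The multiple-spellings problem can be made concrete with a small Python sketch. It uses Ideographic Description Sequences, with U+2FF0 as the left-to-right description character, as a stand-in for a hypothetical compositional encoding; the variable names are illustrative only. Existing Unicode normalization treats every spelling as a distinct string.

```python
import unicodedata

encoded = "\u55f0"                                # 嗰 as a single code point
two_piece = "\u2ff0\u53e3\u500b"                  # ⿰口個
three_piece = "\u2ff0\u53e3\u2ff0\u4ebb\u56fa"    # ⿰口⿰亻固
variant_piece = "\u2ff0\u53e3\u2ff0\u4eba\u56fa"  # ⿰口⿰人固

spellings = [encoded, two_piece, three_piece, variant_piece]

# Normalize every spelling and collect the distinct results.
normalized = {unicodedata.normalize("NFC", s) for s in spellings}

# Nothing is unified: each spelling remains its own string.
print(len(normalized), "distinct strings after NFC")
```

A compositional model would need a new normalization algorithm that maps all of these to one canonical form, which is the difficulty described above.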
The open-ended nature of the script and possibilities for ambiguous spelling make it virtually impossible to guarantee that two characters made up by two different people would be treated as equivalent even if they look exactly the same and are intended to be equivalent.
Other computer processes such as machine-based translation or text-to-speech would probably have to skip such characters when they occur in plain text, because there is no simple, authoritative way for these processes to be able to determine even approximate definitions or pronunciations from the visual representation alone. Even if the data are available, the need to parse strings of variable length before looking them up creates complications.
Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text.
While the number of encodable ideographs has proven far greater than Unicode had originally anticipated, the standard is in no danger of running out of room for them any time soon. 80,000 ideographs encoded in 25 years amounts to just over 3,200 ideographs per year. At this rate, it would take over 250 years to fill up the available space in Unicode with ideographs.
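The arithmetic behind these figures can be checked with a few lines of Python. The amount of free space is not stated above, so the sketch only back-computes how many code points 250 years at this rate would consume, and compares that with the size of the whole Unicode codespace (17 planes of 65,536 code points) as an upper bound.

```python
ideographs_encoded = 80_000
years_elapsed = 25

# Average encoding rate in ideographs per year.
rate_per_year = ideographs_encoded // years_elapsed

# Code points that would be consumed in 250 years at that rate.
consumed_in_250_years = rate_per_year * 250

# Size of the entire codespace, U+0000..U+10FFFF, as an upper bound.
total_code_points = 17 * 65_536

print(rate_per_year, consumed_in_250_years, total_code_points)
```

At 3,200 ideographs per year, 250 years corresponds to 800,000 code points, which is less than the 1,114,112 code points of the full codespace.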
And while the number of unencoded but useful ideographs is larger than originally anticipated, it is also finite and probably smaller than the number of ideographs already encoded. The bulk of useful unencoded forms is likely to come from placenames, personal names, or characters needed for Chinese dialects other than Mandarin and Cantonese. Many unencoded forms occurring in existing texts are actually variants of encoded characters and would best be represented as such.
While it currently takes several years for the IRG to fully process proposed ideographs so that they can be encoded, steps are being taken to streamline this, and further steps will be possible in the future should they prove necessary. Indeed, the bulk of the work currently done by the IRG would still have to be done for composed ideographs in order to provide support for them beyond rendering. [JJ]
Q: Why does Unicode use the term "ideograph" when it is linguistically incorrect?
The characters used to write Chinese are traditionally called "Chinese characters" in the various East Asian languages (hanzi in Mandarin, kanji in Japanese and hanja in Korean). In English, they are generally referred to by names such as "ideograph" or "pictogram," even though these don't accurately reflect what the characters are or how they are used. Indeed, no single linguistic term adequately describes these characters because they have such varied origins and uses. The only possible exception would be "sinogram," which is Latin for "Chinese character" and rarely found.
Unicode originally adopted the word "ideograph" as representing common English usage. The term is now so pervasive in the standard that it cannot be abandoned. [JJ]
Q: What is the difference between the Unicode character properties "Ideographic" and "Unified_Ideograph"?
The Unified_Ideograph property (short name: UIdeo) is used to specify the exact set of CJK Unified Ideographs in the standard. In other words, it is applied only to unified ideographs, and not to compatibility ideographs or other characters that behave like ideographs, and it is only used specifically for the CJK ideographs—characters in the Han script.
The Ideographic property (short name: Ideo), on the other hand, is extended to all CJK ideographs—not just the unified ones. It also applies to certain characters, such as U+3007 IDEOGRAPHIC NUMBER ZERO, which behave like CJK ideographs, even though they are not formally considered part of the CJK Unified Ideographs. Furthermore, the Ideographic property is not constrained to apply only to Han script characters.
Q: What's the best way to find out how to pronounce an ideograph in Mandarin?
Most Chinese characters have only one pronunciation in any given variety of spoken Chinese; and roughly half of those with multiple pronunciations have only one in general use. Exceptions are common enough, however, that text-to-speech engines dealing with extended runs of text had best take semantics and context into account when determining the correct pronunciations.
In situations where sufficient context is unavailable, the best pronunciation to use with Mandarin is the one indicated with the kMandarin field. This is derived algorithmically from the kHanyuPinlu, kXHC1983, and kHanyuPinyin fields, with corrections and additions hand-supplied by experts in China. Where the kMandarin field provides multiple readings, the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). [JJ]
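The selection rule for multiple kMandarin readings might be sketched in Python as follows. The function name is hypothetical, the sketch assumes that multiple readings are separated by spaces, and the sample values are placeholders rather than data taken from the Unihan Database.

```python
def preferred_mandarin(field_value: str, locale: str) -> str:
    """Pick a reading from a kMandarin value for the given locale tag."""
    # Assumption: multiple readings are separated by whitespace.
    readings = field_value.split()
    if not readings:
        raise ValueError("empty kMandarin value")
    # With two readings, the second is preferred for zh-Hant (TW).
    if len(readings) > 1 and locale.startswith("zh-Hant"):
        return readings[1]
    # Otherwise use the first (or only) reading.
    return readings[0]


# Placeholder values, not real Unihan data.
print(preferred_mandarin("first second", "zh-Hans"))
print(preferred_mandarin("first second", "zh-Hant"))
```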
Q: How does this relate to the pinyin ordering and transliteration in CLDR?
The kMandarin field is used for pinyin ordering and transliteration in CLDR.
Q: Why are DPRK (North Korean == kIRG_KPSource) glyphs missing from some CJK code charts?
A font is not currently available for representation of kIRG_KPSource glyphs in the main CJK Unified Ideographs block, Ext A, and Ext B. It is possible that KP-Source glyphs will appear in future code charts, if suitable font data becomes available. [RC]