Chinese and Japanese
Q: What does the term "CJK" mean?
A: It is a commonly used acronym for "Chinese, Japanese, and Korean". The term "CJK character" generally refers to
"Chinese characters", or more specifically, the Chinese (= Han) ideographs used in the writing systems of the Chinese and Japanese languages, occasionally for Korean, and historically in Vietnam.
Q: Are Chinese characters used in Korean?
A: Yes, but mostly for older and traditional literary materials. Modern Korean is written almost entirely with a separate system of Hangul characters constructed of smaller pieces called jamo letters.
Q: Where can I find out more about Hangul and jamo characters for Korean?
A: There is a separate FAQ on Korean dealing with Hangul and jamo characters.
Q: I have heard that UTF-8 does not support
some Japanese characters. Is this correct?
A: There is a lot of misinformation floating around about the
support of Chinese, Japanese and Korean (CJK) characters. The Unicode
Standard supports all of the CJK characters from JIS X 0208, JIS X
0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true
no matter which encoding form of Unicode is used: UTF-8, UTF-16, or
UTF-32.
Unicode supports over 80,000 CJK characters right now, and work is
underway to encode further additions. The International Standard ISO/IEC
10646 and the Unicode Standard are completely synchronized in repertoire
and content. That means that Unicode has the same repertoire as GB 18030,
since GB 18030 is also synchronized with ISO/IEC 10646, although it uses a
different ordering and byte format.
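To illustrate, here is a minimal Python sketch showing one ideograph encoded
in all three encoding forms (漢 U+6F22, a JIS X 0208 character, is just an
arbitrary example):

```python
# The same ideograph in each Unicode encoding form.
kanji = "\u6F22"  # 漢 (U+6F22)
print(kanji.encode("utf-8"))      # bytes E6 BC A2
print(kanji.encode("utf-16-be"))  # bytes 6F 22
print(kanji.encode("utf-32-be"))  # bytes 00 00 6F 22

# Supplementary-plane ideographs work the same way, e.g. the first
# character of CJK Unified Ideographs Extension B:
ext_b = "\U00020000"
print(ext_b.encode("utf-8"))      # bytes F0 A0 80 80
print(ext_b.encode("utf-16-be"))  # bytes D8 40 DC 00 (a surrogate pair)
```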
Q: Who is responsible for future CJK
characters?
A: The development and extension of the CJK characters is
handled by the Ideographic Research Group (IRG), which includes
official representatives of China, Hong Kong (SAR), Macao (SAR),
Singapore, Japan, South Korea, North Korea, Taiwan, and Vietnam, plus a
representative from the Unicode Consortium. For more information, see the
IRG home page.
The IRG is very carefully cataloging, reviewing, and
assessing CJK characters for inclusion into the standard. The only real
limitation on the number of CJK characters in the standard is the ability
of this group to process them, because the characters are increasingly
obscure (no person knows more than a fraction of
the set already encoded).
Q: What is the process for proposing new CJK unified ideographs?
A: Newly proposed CJK unified ideographs are first submitted to the IRG through national bodies or liaison organizations. They are then assembled into a new "IRG Working Set," which goes through several rounds of detailed review and scrutiny before being approved for standardization as a new CJK unified ideographs extension. Individuals who wish to propose the encoding of new CJK unified ideographs are encouraged to work with their respective country's national body.
Q: Does the Unified Han character encoding in
Unicode mean that I only need one CJK font for Asia, or do I have to allow
for choices between different styles of CJK fonts for different countries?
A: Broadly speaking, there are four traditions for character
shapes in East Asia: traditional Chinese (used primarily in Taiwan, Hong
Kong, and overseas Chinese communities), simplified Chinese (used
primarily in mainland China and Singapore), Japanese, and Korean. Using a
single font for all four locales allows the characters to be legible, but
means that some characters may look odd. For optimal results a system
localized for use in Japan, for example, should use a font designed
explicitly for use with Japanese, rather than a generic Unihan font.
[JJ]
Q: If the character shapes are different in
different parts of East Asia, why were the characters unified?
A: The Unicode Standard is designed to encode characters, not
glyphs. Even where there are substantial variations in the standard way of
writing a character from locale to locale, if the fundamental identity of
the character is not in question, then a single character is encoded in
Unicode.
This principle applies to East Asian scripts as well as to
those of other parts of the world. It is well-recognized that the Han
characters involved are the same, even when used in different
countries to write different languages. In the overwhelming majority of
cases where a Han character is written differently in different locales,
readers from one locale would recognize the form used in another; in all
cases, experts from throughout East Asia would recognize the fundamental
unity of the character.
As a rule, the differences in writing style between the
different East Asian locales are within the general range of allowable
differences within each typographic tradition.
- E.g., the official "Taiwanese" glyph for 草 U+8349 ("grass") per ISO/IEC
10646 uses four strokes for the "grass" radical, whereas the PRC, Japanese,
and Korean glyphs use three. As it happens, Apple's LiSung Light font for
Big Five (which follows the "Taiwanese" typographic tradition) uses three
strokes.
Japanese users prefer to see Japanese text written with
"Japanese" glyphs.
- There are occasional instances of unified characters whose typical
Chinese glyph and typical Japanese glyph are distinct enough that the
Chinese glyph will be unfamiliar to the typical Japanese reader, e.g.,
直 U+76F4. To prevent legibility problems for Japanese readers, it is
advisable to use a Japanese-style font when presenting Unihan text to
Japanese readers.
It is also typical for Japanese users to see Chinese text
written with "Japanese" glyphs. For example:
- A standard Japanese dictionary which quotes Chinese authors (e.g.,
Mencius) uses "Japanese" glyphs, not Chinese ones.
- In particular, it is perfectly acceptable within Japanese typography for
stretches of Chinese quoted in a predominantly Japanese text to be written
with "Japanese" glyphs.
Han Unification is designed to preserve legibility. Documents
typically can be simply displayed in the font preferred by the user. Where
a distinction in style needs to be made (for example, Chinese-style vs.
Japanese-style glyphs in the same document), appropriate fonts should be
applied to the specific text as needed.
Because of limitations in existing fonts, it may occasionally happen
that a rare kanji will be displayed using a Chinese-style glyph where
a Japanese-style glyph would be preferred. This is a font issue, not
a character encoding issue, and the same problem can occur with other
character encoding standards.
For more information, see
On the Encoding of Latin, Greek, Cyrillic, and Han.
[JJ]
Q: How can I recognize from the 32-bit value of
a Unicode character whether it is a Chinese, Korean or Japanese character?
A: It's basically impossible and largely meaningless. It's the equivalent
of asking if "a" is an English letter or a French one. There are
some characters where one can guess based on the source information
in the Unihan Database that it's traditional Chinese, simplified Chinese,
Japanese, Korean, or Vietnamese, but there are too many exceptions to
make this really reliable.
The phonetic data in the Unihan Database should not be used for this purpose. A
blank in the phonetic data means that nobody's supplied a reading, not
that a reading doesn't exist. Because updating the Unihan Database is an
ongoing process, these fields will be increasingly filled out as time goes on,
but they should never be taken as absolutely complete. In particular, there are
obscure characters where it is known that there is a reading, but since the character does not occur in
standard dictionaries, we are unable to supply it (e.g., 䃟 U+40DF in
Cantonese).
A better solution is to look at the text as a whole: if there's a fair
amount of kana, it's probably Japanese, and if there's a fair amount of
hangul, it's probably Korean.
The only proper mechanism, just as for determining whether "chat" is an
English word or a French one, is to use a higher-level protocol.
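A rough sketch of such a whole-text heuristic follows; the ranges and
wording are illustrative only, and the result is a guess rather than a
determination:

```python
# Guess the language of a run of CJK text from the script mix,
# not from individual Han code points.
def guess_cjk_language(text: str) -> str:
    has_kana = any("\u3040" <= c <= "\u30FF" for c in text)    # hiragana/katakana
    has_hangul = any("\uAC00" <= c <= "\uD7A3" for c in text)  # hangul syllables
    if has_kana:
        return "probably Japanese"
    if has_hangul:
        return "probably Korean"
    return "unknown; use a higher-level protocol such as a language tag"

print(guess_cjk_language("日本語のテキスト"))  # probably Japanese
print(guess_cjk_language("한국어 텍스트"))      # probably Korean
print(guess_cjk_language("中文文本"))          # unknown; ...
```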
[JJ]
Q: How does character input on a keyboard work for Chinese characters?
A: This is a complicated question. For answers, see
How are Chinese characters input?
Q: Why is Unicode missing some characters
from the Big Five character set?
A: The "Big Five" character set is an industrial standard
commonly used for traditional Chinese. There are, however, several
versions of the Big Five in common use, generally representing extensions
of the formal standard. There are two main versions, "plain Big Five" and
"ETEN Big Five" as well as numerous vendor- or platform- specific
extensions. In recent years, there have been further extensions such as
the Hong Kong Extension to Big Five and Big Five Plus.
A: The initial, unextended Big Five was the standard version of
the character set at the time that the Unicode Standard, Version 1.0, was
under development, and Unicode was designed to cover its ideographic
repertoire completely. This is reflected in the data files supplied by the
Unicode Consortium. Some vendors provide vendor-specific tables showing
mapping data for their custom Big Five extensions and Unicode. The Unicode
Consortium does not, however, provide data on every known dialect of the
Big Five, so it is possible that a particular dialect of the Big Five is
not included in the tables provided by Unicode.
[JJ]
Q: I hear that certain characters from the GB 18030 encoding are not mapped to any code points in Unicode,
and need to be mapped to characters in the Private Use Area instead. Is this true? And if so, is the issue being dealt with in the near future?
A: That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into
Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB 18030, the missing characters were added as of
Unicode 4.1, so of course they are also in Unicode 5.0 and later versions.
You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters.
These can be found in the ranges U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB (for various CJK characters and components).
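As a quick check, Python's built-in gb18030 codec can confirm that these
characters now round-trip between GB 18030 and Unicode without recourse to
the PUA (a library still using the older GB 18030-2000 tables may behave
differently for the two-byte GB code positions):

```python
# CJK strokes (U+31C0..U+31CF) and the U+9FA6..U+9FBB additions now have
# regular, non-PUA code points that round-trip through GB 18030.
for cp in (0x31C0, 0x9FA6, 0x9FBB):
    ch = chr(cp)
    encoded = ch.encode("gb18030")
    assert encoded.decode("gb18030") == ch
    assert not (0xE000 <= cp <= 0xF8FF)  # not in the Private Use Area
    print(f"U+{cp:04X} {ch!r} <-> {encoded!r}")
```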
Q: Isn't it true that some Japanese can't write their own names in Unicode?
A: There are some situations where an individual prefers their
name to be written with a specific glyph, much as in the West we have John
and Jon, Mark and Marc, Cathy and Kathy. In most cases, variation sequences
in the UTS #37 Unicode Ideographic Variation Database can be used to provide
the required representation in plain text. In other cases, the variant forms
have been encoded in Unicode as distinct characters. The IRG may also
consider whether the encoding of new variant characters is justified.
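In plain text, a variation sequence is simply the base ideograph followed
by a variation selector. A minimal sketch (the sequence <U+845B, U+E0100>
is cited here as an example of a registered IVD sequence for 葛; whether
the variant glyph actually appears depends on font support):

```python
# An ideographic variation sequence: base character + variation selector.
base = "\u845B"      # 葛 (U+845B)
vs17 = "\U000E0100"  # VARIATION SELECTOR-17
sequence = base + vs17

# Two code points, but one "character" as far as the reader is concerned.
print([f"U+{ord(c):04X}" for c in sequence])  # ['U+845B', 'U+E0100']
```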
It should be noted that this is not a problem of Han unification per
se, as it is often represented. Unicode is a superset of the major
Japanese character encoding standards. The various JIS standards and
ISO 2022-based encodings have the same limitation. [JJ]
Q: Where can I find a Unicode mapping for
EACC?
A: EACC is an American National Standard, East Asian
Character Code for Bibliographic Use (ANSI/NISO Z39.64), developed by the
library community. The Library of Congress specifies use of EACC for CJK
data in MARC 21 records that do not use UTF-8. The Unicode-EACC mapping
approved by the MARBI Committee of the American Library Association is
available on the
MARC 21
Web site. [JA]
Q: Why doesn't the Unihan database include
mappings for all EACC characters?
A: The Unihan database covers only the ideographs in the
Unicode Standard. EACC also includes characters such as Japanese kana and
Korean hangul that are outside the scope of the Unihan database.
[JA]
Q: What is JIS X0213?
A: JIS X0213, 7-bit and 8-bit double byte coded extended Kanji
sets for information interchange, is a Japanese national standard coded
character set established by JISC (Japanese Industrial Standards Committee).
It was first established in January 2000, then revised in February 2004. It
enumerates 11,233 characters, extending the 6,879-character repertoire of
the JIS X0208 standard. It consists of 10,050 Kanji (ideographic) characters
and 1,183 non-Kanji (non-ideographic) characters, arranged in two planes of
a 94-row-by-94-cell matrix. In addition, as an informative annex, three
encoding methods are defined as extensions of existing de facto encodings:
Shift JIS, EUC-JP, and ISO-2022-JP.
[TO]
Q: How is JIS X0213 related to some existing
JIS standards?
A: There are several JIS coded character set standards. JIS
X0201 is the single-byte coded character set which adapts the ISO/IEC 646
standard in Japan. JIS X0208, JIS X0212 and JIS X0213 are the double-byte
coded character sets, and JIS X0221 is the multi-byte coded character set
which corresponds to ISO/IEC 10646. JIS X0208 is the primary double-byte
coded character set used for Japanese. Although both the JIS X0212 and JIS
X0213 Kanji standards were established as supplements to the JIS X0208
standard, the scopes of their source character sets are different.
[TO]
Q: How is JIS X0213 related to Unicode / ISO/IEC
10646?
A: Almost all characters in JIS X0213 have corresponding
characters in Unicode / ISO/IEC 10646. Only a few non-Kanji characters are
represented by composite sequences in Unicode / ISO/IEC 10646. Kanji
characters are mapped to one of the blocks of CJK Unified Ideographs, CJK
Unified Ideographs Extension A, or CJK
Unified Ideographs Extension B in Unicode 4.0 (or later versions) and
corresponding versions of ISO/IEC 10646; or are mapped to CJK
Compatibility Ideographs. [TO]
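For instance, the Ainu kana カ゚ from JIS X 0213 has no precomposed Unicode
code point; it maps to a composite sequence, as this minimal sketch shows:

```python
import unicodedata

# JIS X 0213's カ゚ maps to base kana + combining semi-voiced sound mark.
seq = "\u30AB\u309A"
print([unicodedata.name(c) for c in seq])
# ['KATAKANA LETTER KA', 'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK']
```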
Q: Where to get more information about JIS
X0213?
A: For more information about JIS X0213 standard, contact the
Japanese Standards
Association. [TO]
Q: I have heard there are problems in
Japanese and other East Asian mapping tables. Where can I find
information about these problems?
A: There are many well-known mapping problems and
discrepancies. For example:
- Shift-JIS byte 0x5C can be mapped to U+005C or U+00A5, which are
different, unrelated characters with unrelated glyphs.
- Shift-JIS bytes <0x81 0x5C> can be mapped to U+2014 or U+2015, which look
almost the same.
- Shift-JIS bytes <0x87 0x82> and <0xFA 0x59> can both be mapped to U+2116,
but the primary roundtrip mapping may differ between platforms. That is,
what U+2116 maps back to may be different.
Sometimes the standard is ill-defined, and each vendor has
a choice in how to implement the Unicode mapping table. Examples include
Big5-HKSCS and several other codepages. Sometimes the mapping table
varies, even on the same platform. For example, Windows-950 is either
Big5 or Big5-HKSCS, where the latter depends on the user having applied a
Windows-specific patch. Implementations of ISO 2022 encodings like
ISO-2022-JP differ not only in the mapping tables for the sub-encodings
but also in the supported sets of escape sequences and their invocation
pattern.
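This class of problem is easy to observe on a platform that ships more than
one Shift-JIS-family mapping table. Here is a minimal sketch using Python's
bundled codecs; the wave dash at <0x81 0x60>, another well-known discrepancy
of the same kind, is used because the two bundled tables actually differ on
it:

```python
# The same Shift-JIS byte sequence decodes differently depending on
# whose mapping table is in use.
data = b"\x81\x60"  # the "wave dash" position
print(hex(ord(data.decode("shift_jis"))))  # 0x301c WAVE DASH (JIS-based table)
print(hex(ord(data.decode("cp932"))))      # 0xff5e FULLWIDTH TILDE (Microsoft table)

# Single byte 0x5C: both of Python's tables pick U+005C, but other
# JIS-based tables map it to U+00A5 YEN SIGN, as noted above.
print(b"\x5c".decode("shift_jis"))  # backslash (U+005C)
```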
The W3C has an extensive technical report "XML
Japanese Profile" which lists a number of known mapping problems. Of
special interest to people with mapping problems are
Appendix
C, Ambiguities in conversion from Shift-JIS to Unicode and
Appendix D,
Ambiguities in conversion from Japanese EUC to Unicode.
The ICU project contains many mapping tables for a variety of
standards. See the ICU User Guide,
particularly the section on
Conversion Data. The page
Character Set
Mapping Tables shows a detailed comparison between a number of
different charsets, based on data collected on different platforms.
The obsolete, unmaintained
East
Asian Mapping Tables on the Unicode website also contain some notes
about specific discrepancies. There is an extensive article at Debian by
Tomohiro Kubota on these problems:
Conversion tables differ between vendors. The
article contains a table of discrepancies in various Japanese encodings.
For more information on character mappings and
roundtripping issues, see
UTS #22, Unicode Character
Mapping Markup Language.
[GR]
Q: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs?
Wouldn't that save a large number of code points?
A: The Han ideographic script is largely compositional in nature. The overwhelming majority of characters created over
the centuries (and still being coined) are made by adjoining two or more older characters in simple geometric relationships. For example,
the Cantonese-specific character U+55F0 嗰 was created by adjoining the two older characters, U+53E3 口 and U+500B 個, one next to the other.
The compositional nature of the script—and, more to the point, the fact that this compositional nature is well-known—means that over time tens of thousands of ideographs have been created, and these are currently encoded in Unicode by using one code
point per ideograph. The result is that over 80,000 code points are used for ideographs in the Unicode Standard—over two-thirds of
the characters encoded.
The compositional nature of the script makes it attractive to propose a compositional encoding model, such as the one used
for Hangul. Such a mechanism would save thousands of code points and relieve the IRG of the burden of
examining potential candidates for encoding.
Unfortunately, there are some difficulties involved with a compositional model for Han.
First of all, while the rules for drawing composed jamos as
Hangul syllables are relatively straightforward, those for Han
are surprisingly complex. To use U+55F0 嗰 as an example again, although it is built structurally out of two pieces, the left piece
occupies far less than 50% of the character's horizontal space. This reduction in size is a result of the nature of U+53E3 口 itself and
doesn't apply to other characters. Either the rendering process would have to be sophisticated enough to take such ideographic idiosyncrasies
into account, or the encoding model would have to provide more information than just the geometric relationship between the composing pieces.
(This is the main reason why the existing Ideographic Description Sequence mechanism is inadequate even for drawing described ideographs.)
Even more difficult is the problem of normalization, which would be necessary for operations such as comparison or searching.
A normalization algorithm would first have to parse the sequence of composing Han for validity, and then make sure that all substrings are
normalized. It should also be able to recognize a "canonical" form for a sequence of composing Han. Thus, U+55F0 嗰 could be spelled
using three pieces (U+53E3 口, U+4EBB 亻, U+56FA 固) as well as with two. Indeed, since U+4EBB 亻 is a well-known variant form of U+4EBA
人, it could be spelled using that character, as well. Providing a canonical representation would have to take these multiple spellings into account.
The open-ended nature of the script and possibilities for ambiguous spelling make it virtually impossible to guarantee that two characters
made up by two different people would be treated as equivalent even if they look exactly the same and are intended to be equivalent.
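The ambiguity is easy to see with the existing Ideographic Description
Characters, which describe a character's structure without composing a new
glyph. A minimal sketch:

```python
import unicodedata

# Two plausible "spellings" of 嗰 (U+55F0) as ideographic description
# sequences, using U+2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT.
spelling_a = "\u2FF0\u53E3\u500B"              # ⿰ 口 個 (two pieces)
spelling_b = "\u2FF0\u53E3\u2FF0\u4EBB\u56FA"  # ⿰ 口 ⿰ 亻 固 (three pieces)

# Both describe the same character, but no existing Unicode normalization
# form treats them as equivalent; a compositional encoding would need its
# own, far more complex, canonicalization rules.
print(unicodedata.normalize("NFC", spelling_a) ==
      unicodedata.normalize("NFC", spelling_b))  # False
```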
Other computer processes such as machine-based translation or text-to-speech would probably have to skip such characters when they
occur in plain text, because there is no simple, authoritative way for these processes to be able to determine even approximate definitions or
pronunciations from the visual representation alone. Even if the data are available, the need to parse strings of variable length before looking
them up creates complications.
Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining
of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious
inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text.
While the number of encodable ideographs has proven far greater than Unicode had originally anticipated, the standard is in no danger
of running out of room for them any time soon. 80,000 ideographs encoded in 25 years amounts to just over 3,200 ideographs per year. At this rate,
it would take over 250 years to fill up the available space in Unicode with ideographs.
And while the number of unencoded but useful ideographs is larger than originally anticipated, it is also finite and probably smaller
than the number of ideographs already encoded. The bulk of useful unencoded forms is likely to come from placenames, personal names, or characters
needed for Chinese dialects other than Mandarin and Cantonese. Many unencoded forms occurring in existing texts are actually variants of encoded
characters and would best be represented as such.
While it currently takes several years for the IRG to fully process proposed ideographs so that they can be encoded, steps are being
taken to streamline this, and further steps will be possible in the future should they prove necessary. Indeed, the bulk of the work currently
done by the IRG would still have to be done for composed ideographs in order to provide support for them beyond rendering. [JJ]
Q: Why does Unicode use the term "ideograph" when it is linguistically incorrect?
A: The characters used to write Chinese are traditionally called "Chinese characters" in the various East Asian languages (hanzi in Mandarin, kanji in Japanese, and hanja in Korean). In English, they are generally referred to by names such as "ideograph" or "pictogram," even though these don't accurately reflect what the characters are or how they are used. Indeed, no single linguistic term adequately describes these characters, because they have such varied origins and uses. The only possible exception would be "sinogram," which simply means "Chinese character" but is rarely used.
Unicode originally adopted the word "ideograph" as representing common English usage. The term is now so pervasive in the standard that it cannot be abandoned. [JJ]
Q: What is the difference between the Unicode character properties "Ideographic" and "Unified_Ideograph"?
A: The Unified_Ideograph property (short name: UIdeo) is used to specify the exact set of CJK Unified Ideographs in the standard. In other words, it is applied only to unified ideographs, and not to compatibility ideographs or other characters that behave like ideographs, and it is only used specifically for the CJK ideographs—characters in the Han script.
The Ideographic property (short name: Ideo), on the other hand, is extended to all CJK ideographs—not just the unified ones. It also applies to certain characters, such as U+3007 IDEOGRAPHIC NUMBER ZERO, which behave like CJK ideographs, even though they are not formally considered part of the CJK Unified Ideographs. Furthermore, the Ideographic property is not constrained to apply only to Han script characters.
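These two properties can be checked programmatically. The following sketch
assumes the third-party regex module (Python's built-in re module does not
support \p{...} property syntax), and assumes the installed version exposes
both binary properties:

```python
import regex  # third-party module: pip install regex

# U+4E2D 中 is a CJK unified ideograph; U+3007 〇 IDEOGRAPHIC NUMBER ZERO
# is Ideographic but not a unified ideograph.
for ch in ("\u4E2D", "\u3007"):
    ideo = bool(regex.match(r"\p{Ideographic}", ch))
    uideo = bool(regex.match(r"\p{Unified_Ideograph}", ch))
    print(f"U+{ord(ch):04X}: Ideographic={ideo}, Unified_Ideograph={uideo}")

# Expected: U+4E2D -> True, True; U+3007 -> True, False
```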
Q: What's the best way to find out how to pronounce an ideograph in Mandarin?
A: Most Chinese characters have only one pronunciation in any given variety of spoken Chinese, and roughly half of those with multiple pronunciations have only one in general use. Exceptions are common enough, however, that text-to-speech engines dealing with extended runs of text had best take semantics and context into account when determining the correct pronunciations.
In situations where sufficient context is unavailable, the best pronunciation to use with Mandarin is the one indicated with the kMandarin field. This is derived algorithmically from the kHanyuPinlu, kXHC1983, and kHanyuPinyin fields, with corrections and additions hand-supplied by experts in China. Where the kMandarin field provides multiple readings, the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). [JJ]
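The kMandarin field can be read straight out of the Unihan database's data
files. A minimal sketch, assuming a local copy of Unihan_Readings.txt
(extracted from Unihan.zip on the Unicode site):

```python
# Extract kMandarin readings from the Unihan database.
# Data lines have the form:  U+4E2D<tab>kMandarin<tab>zhōng
mandarin = {}
with open("Unihan_Readings.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        codepoint, field, value = line.rstrip("\n").split("\t", 2)
        if field == "kMandarin":
            # Multiple readings are space-separated; per the answer above,
            # the first is preferred for zh-Hans, the second for zh-Hant.
            mandarin[chr(int(codepoint[2:], 16))] = value.split()

print(mandarin.get("\u4E2D"))  # illustrative output: ['zhōng']
```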
Q: How does this relate to the pinyin ordering and transliteration in CLDR?
A: The kMandarin field is used for pinyin ordering and transliteration in CLDR.
Q: Why are DPRK (North Korean, i.e., kIRG_KPSource) glyphs missing from some CJK code charts?
A: A font is not currently available for representation of kIRG_KPSource glyphs in the main CJK Unified Ideographs block, Ext A, and Ext B. It is possible that KP-Source glyphs will appear in future code charts, if suitable font data becomes available. [RC]