Basic Questions
Q: What is Unicode?
A: Unicode is the universal character encoding, maintained by
the Unicode Consortium.
This encoding standard provides the basis for processing, storage and interchange of text data in any language in
all modern software and information technology protocols. See "What is Unicode?" for a short explanation of
what Unicode is all about. That page is translated into more than 50 languages,
to illustrate the use of the standard. See for yourself!
Q: What is the scope of Unicode?
A: Unicode covers all the characters for all the writing
systems of the world, modern and ancient. It also includes technical
symbols, punctuation, and many other characters used in writing text.
The Unicode Standard is intended to support the needs of all types of
users, whether in business or academia, using mainstream or minority
scripts.
Q: How many languages are covered by
Unicode?
A: It's hard to say, because Unicode encodes scripts for
languages, rather than languages per se. Many scripts (especially the
Latin script) are used to write a large number of languages. The easiest
answer is that Unicode covers all of the languages that can be written in
the following scripts:
Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari,
Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic,
Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian,
Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, and Yi.
Unicode also includes many historic scripts used to write long-dead
languages, as well as lesser-used regional scripts that may be used as a
second (or even third) way to write a particular language. See
Supported Scripts
for the full list. See also the list of
Languages and Scripts.
[MD] &
[KW]
Q: Does Unicode encode scripts or languages?
A: The Unicode Standard encodes characters on a per script
basis. So, for example, there is only one set of Latin characters defined,
despite the fact that the Latin script is used for the alphabets of
thousands of different languages. The same principle applies for any other
script (Cyrillic, Arabic, Ethiopic, Devanagari, ...) which is used for
writing many different languages. However, the Unicode Standard does not
encode scripts per se. For a listing of script names, see
UTR #24,
Unicode Script Property. For the ISO standard for script codes, see
ISO/IEC 15924,
Code for the Representation of Names of Scripts. For the ISO standard for
language codes, see ISO 639, Code for the Representation of Names of
Languages. [KW]
Q: Why does Unicode unify Chinese, Japanese, and Korean ideographs, but not unify the Latin, Greek, and Cyrillic alphabets?
A: For a detailed answer, see
http://www.unicode.org/notes/tn26/.
Q: What's the connection between Unicode and the International Standard,
ISO/IEC 10646?
A: Both 10646 and Unicode specify the same character encoding: they
contain the same characters at the same locations. They remain fully
synchronized even as they are extended to cover additional characters. See the
Unicode and ISO
10646 FAQ and
Appendix
C of the Unicode Standard for a more extensive explanation of their
relationship.
Q: I think my company might want to get involved in Unicode. Is there any material that I can use to present the case to my management?
A: Yes, there is a white paper outlining the overall value proposition of a Unicode membership to an organization.
See Why Join
and How to Join.
Q: Where can I purchase the Unicode software or the Unicode font?
A: The Unicode Standard is not a software program, nor is
it a font. It is a character encoding system, like ASCII, designed
to help developers who want to create software applications that work in
any language in the world.
If all you need is to create a multilingual text or write a
document or send e-mail in another language, then a Unicode-compliant text
editor, mail program, or word processing package will do the job. Please see the following pages on our web site for further information
about the standard and where to look for help:
If you are a developer starting to learn about using Unicode,
you should get a copy of the
latest version of the
Unicode Standard to find
out more about Unicode. In addition to the pages listed above, please see:
Q: I can't find my character in Unicode. Where do I look?
A: Look at "Where is my character?"
Q: Where can I find resources to help me with Unicode?
A: Here's a short table that suggests links to information that can answer typical questions.
[MD]
Q: What does Unicode conformance require?
A: Chapter 3, Conformance discusses this in detail. Here's a very informal
version:
- Unicode characters don't fit in 8 bits; deal with it.
- Byte order is only an issue in I/O.
- If you don't know, assume big-endian.
- Loose surrogates have no meaning.
- Neither do U+FFFE and U+FFFF.
- Leave the unassigned codepoints alone.
- It's OK to be ignorant about a character, but not plain wrong.
- Subsets are strictly up to you.
- Canonical equivalence matters.
- Don't garble what you don't understand.
- Process UTF-* by the book.
- Ignore illegal encodings.
- Right-to-left scripts have to go by bidi rules.
[JC]
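Two of the informal rules above ("If you don't know, assume big-endian" and "Canonical equivalence matters") can be illustrated with a minimal Python sketch. The function names decode_utf16 and canonically_equivalent are hypothetical helpers invented for this illustration, not part of the standard or of any library; only the codecs and unicodedata modules from the Python standard library are assumed.

```python
import codecs
import unicodedata


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes (hypothetical helper).

    A byte order mark, if present, decides the byte order.
    Without one, big-endian is assumed, per the informal rule above.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# "é" as one precomposed code point vs. "e" plus a combining acute accent.
print(canonically_equivalent("\u00e9", "e\u0301"))
print(decode_utf16(b"\x00A"))
```

The two strings in the usage lines differ code point by code point, yet a process that honors canonical equivalence should treat them as the same text.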
Q: Can applications simply use unassigned
characters as they wish?
A: No! No conformant Unicode implementation can use the
un-encoded values outside of the private use area.
Only the values in the private use areas (E000..F8FF,
F0000..FFFFD, and 100000..10FFFD) are legal for private assignment.
However, this is over 137,000 code points, which should be more than
ample for the vast majority of implementations.
[MD]
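As a rough illustration of the ranges quoted above, the following Python sketch tests whether a code point falls in a private use area and adds up the total. The names PRIVATE_USE_RANGES, is_private_use, and private_use_count are hypothetical and exist only for this example.

```python
# Private use ranges as listed in the answer above (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),       # private use area in the BMP
    (0xF0000, 0xFFFFD),     # plane 15
    (0x100000, 0x10FFFD),   # plane 16
)


def is_private_use(cp: int) -> bool:
    """Return True if the code point may be privately assigned."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_count() -> int:
    """Total number of code points available for private assignment."""
    return sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)


print(is_private_use(0xE000))   # True
print(is_private_use(0x0378))   # False: outside every private use range
print(private_use_count())      # 137468
```

The total of 137,468 matches the "over 137,000" figure given in the answer.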
Q: Are surrogate characters the same as
supplementary characters?
A: This question shows a common confusion. It is very
important to distinguish surrogate code points (in the range
U+D800..U+DFFF) from supplementary code points (in the completely
different range, U+10000..U+10FFFF). Surrogate code points are reserved
for use, in pairs, in representing supplementary code points in UTF-16.
There are supplementary characters (i.e. encoded characters
represented with a single supplementary code point), but there are not and
will never be surrogate characters (i.e. encoded characters represented
with a single surrogate code point).
[MD]
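The distinction can be made concrete with a short Python sketch of how UTF-16 maps one supplementary code point to a pair of surrogate code units. The function names here are hypothetical, chosen for this illustration; the arithmetic is the standard UTF-16 mapping.

```python
def is_surrogate(cp: int) -> bool:
    """True for surrogate code points, U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_supplementary(cp: int) -> bool:
    """True for supplementary code points, U+10000..U+10FFFF."""
    return 0x10000 <= cp <= 0x10FFFF


def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not is_supplementary(cp):
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Note that the input is a supplementary code point and the outputs are surrogate code units; neither output is a character on its own.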
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is what a Unicode implementation was up to Unicode
1.1, before surrogate code points and UTF-16 were added as concepts to
Version 2.0 of the standard. This term should now be avoided.
When interpreting what people have meant by "UCS-2" in past
usage, it is best thought of as not a data format, but as an indication
that an implementation does not interpret any supplementary characters. In
particular, for the purposes of data exchange, UCS-2 and UTF-16 are
identical formats. Both are 16-bit, and have exactly the same code unit
representation.
The effective difference between UCS-2 and UTF-16 lies at a
different level, when one is interpreting a sequence of code units as code
points or as characters. In that case, a UCS-2 implementation would not
handle processing like character properties, codepoint boundaries,
collation, etc. for supplementary characters.
[MD] &
[KW]
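A small Python sketch may help show where that difference appears: the 16-bit code units are identical, and only the interpretation step differs. The helper names code_units, code_points_utf16, and code_points_ucs2 are hypothetical, and the "UCS-2" reading is modeled here simply as treating every code unit as a code point.

```python
import struct


def code_units(text: str) -> list:
    """Return the 16-bit code units of the text (big-endian UTF-16)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def code_points_utf16(units: list) -> list:
    """Interpret code units as UTF-16, combining surrogate pairs."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP code unit, or an unpaired surrogate left as-is
            i += 1
    return out


def code_points_ucs2(units: list) -> list:
    """Model of a UCS-2 style reading: each code unit stands alone."""
    return list(units)


units = code_units("a\U0001F600")
print([f"{u:04X}" for u in units])                     # ['0061', 'D83D', 'DE00']
print([f"{c:04X}" for c in code_points_utf16(units)])  # ['0061', '1F600']
print([f"{c:04X}" for c in code_points_ucs2(units)])   # ['0061', 'D83D', 'DE00']
```

Both readings consume exactly the same data; only the UTF-16 reading recognizes the last two code units as one supplementary character.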
Q: What can I do if I think there is an error in the Unicode Standard or other specification?
A: Request a correction, clarification or change to the relevant specification by submitting feedback or a formal proposal to the
corresponding technical committee (UTC or
CLDR-TC). See
Public Review Issues for an explanation of how to do this. (The methods
are different for the two committees and the type of change requested.)