Unicode Frequently Asked Questions

Basic Questions

Q: What is Unicode?

A: Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols. See "What is Unicode?" for a short explanation of what Unicode is all about. That page is translated into more than 50 languages, to illustrate the use of the standard. See for yourself!

Q: What is the scope of Unicode?

A: Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuation marks, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.

Q: How many languages are covered by Unicode?

A: It's hard to say, because Unicode encodes scripts for languages, rather than languages per se. Many scripts (especially the Latin script) are used to write a large number of languages. The easiest answer is that Unicode covers all of the languages that can be written in the following scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian, Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, and Yi. Unicode also includes many historic scripts used to write long-dead languages, as well as lesser-used regional scripts that may be used as a second (or even third) way to write a particular language. See Supported Scripts for the full list. See also the list of Languages and Scripts. [MD] & [KW]

Q: Does Unicode encode scripts or languages?

A: The Unicode Standard encodes characters on a per-script basis. So, for example, there is only one set of Latin characters defined, despite the fact that the Latin script is used for the alphabets of thousands of different languages. The same principle applies to any other script (Cyrillic, Arabic, Ethiopic, Devanagari, ...) that is used for writing many different languages. However, the Unicode Standard does not encode scripts per se. For a listing of script names, see UTR #24, Unicode Script Property. For the ISO standard for script codes, see ISO/IEC 15924, Code for the Representation of Names of Scripts. For the ISO standard for language codes, see ISO 639, Code for the Representation of Names of Languages. [KW]

Q: Why does Unicode unify Chinese, Japanese, and Korean ideographs, but not unify the Latin, Greek, and Cyrillic alphabets?

A: For a detailed answer, see http://www.unicode.org/notes/tn26/.

Q: What's the connection between Unicode and the International Standard, ISO/IEC 10646?

A: Both 10646 and Unicode specify the same character encoding: they contain the same characters at the same locations. They remain fully synchronized even as they are extended to cover additional characters. See the Unicode and ISO 10646 FAQ and Appendix C of the Unicode Standard for a more extensive explanation of their relationship.

Q: I think my company might want to get involved in Unicode. Is there any material that I can use to present the case to my management?

A: Yes, there is a white paper outlining the overall value proposition of a Unicode membership to an organization. See Why Join and How to Join.

Q: Where can I purchase the Unicode software or the Unicode font?

A: The Unicode Standard is not a software program, nor is it a font. It is a character encoding system, like ASCII, designed to help developers who want to create software applications that work in any language in the world.

If all you need is to create a multilingual text or write a document or send e-mail in another language, then a Unicode-compliant text editor, mail program, or word processing package will do the job. Please see the following pages on our web site for further information about the standard and where to look for help:

If you are a developer starting to learn about using Unicode, you should get a copy of the latest version of the Unicode Standard to find out more about Unicode. In addition to the pages listed above, please see:

Q: I can't find my character in Unicode. Where do I look?

A: Look at "Where is my character?"

Q: Where can I find resources to help me with Unicode?

A: Here's a short table that suggests links to information that can answer typical questions.

Question:
  • What is in each particular version of Unicode?
  • What is in the latest version of Unicode?
Reference: Versions of the Unicode Standard; Enumerated Versions

Question:
  • What is the meaning of a special term?
Reference: Unicode Glossary, or Terminology for translations of terms

Question:
  • Where can I find code libraries, commercial or open-source, for the following?
    • character conversion
    • collation
    • date, time, number, and message formatting
    • normalization
    • and the other features mentioned under "What level of support should I look for?"
Reference: Unicode Resources page, specifically the tab on Internationalization Libraries

Question:
  • What should regular expressions do with Unicode?
  • Can I transmit Unicode text on EBCDIC systems?
  • How should a word-processor break lines in Unicode text?
  • Are there ways to normalize Unicode text?
  • For the Far East, how do I decide which characters should use wide glyphs and which ones narrow?
  • How should I sort Unicode text?
  • Is there an update to the BIDI algorithm?
  • How can I compress Unicode text?
Reference: Unicode Technical Reports; also FAQ: Specifications

Question:
  • I want to get online data for implementing Unicode. Where can I find data for:
    • Character properties?
    • Upper/lower/titlecasing?
    • Decompositions?
    • Normalization?
    • Conversion to other character encodings?
    • Code for Kanji code conversion with compressed tables?
Reference: Online Data

Question:
  • Are there conferences or seminars where we can find out more about Unicode?
Reference: Unicode Conferences

Question:
  • Who are the current members of the consortium?
  • I am interested in joining the consortium. Where can I find out more?
Reference: Membership Information; Our Members

[MD]

Q: What does Unicode conformance require?

A: Chapter 3, Conformance, discusses this in detail. Here's a very informal version:

  • Unicode characters don't fit in 8 bits; deal with it.

  • Byte order is only an issue in I/O.

  • If you don't know, assume big-endian.

  • Loose surrogates have no meaning.

  • Neither do U+FFFE and U+FFFF.

  • Leave the unassigned codepoints alone.

  • It's OK to be ignorant about a character, but not plain wrong.

  • Subsets are strictly up to you.

  • Canonical equivalence matters.

  • Don't garble what you don't understand.

  • Process UTF-* by the book.

  • Ignore illegal encodings.

  • Right-to-left scripts have to go by bidi rules. [JC]
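Two of the points above ("If you don't know, assume big-endian" and "Canonical equivalence matters") lend themselves to a short illustration. The following Python sketch is not taken from the standard; the function names `decode_utf16` and `canonically_equivalent` are made up for this example, and only the standard `unicodedata` module and the built-in codecs are used.

```python
import unicodedata


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, falling back to big-endian when no BOM is present.

    Hypothetical helper for illustration: it only looks at the first two bytes.
    """
    if data[:2] == b"\xff\xfe":          # BOM U+FEFF serialized little-endian
        return data[2:].decode("utf-16-le")
    if data[:2] == b"\xfe\xff":          # BOM U+FEFF serialized big-endian
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")      # no BOM: assume big-endian


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print(decode_utf16(b"\x00A"))                       # no BOM, read as big-endian
    print(canonically_equivalent("\u00e9", "e\u0301"))  # precomposed vs. combining form
```

The comparison treats the precomposed U+00E9 and the sequence U+0065 U+0301 as the same text, even though their code point sequences differ.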

Q: Can applications simply use unassigned characters as they wish?

A: No! A conformant Unicode implementation cannot use unassigned code points outside of the private use areas.

Only the code points in the private use areas (U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD) are legal for private assignment. However, this is over 137,000 code points, which should be more than ample for the vast majority of implementations. [MD]
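As a quick check of the figure quoted above, the three private use ranges can be written down and counted. This is a minimal Python sketch; the names `PRIVATE_USE_RANGES`, `is_private_use`, and `private_use_count` are illustrative and not part of any Unicode API.

```python
# Private use ranges as listed in the answer above (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # private use area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # supplementary private use area, plane 15
    (0x100000, 0x10FFFD),  # supplementary private use area, plane 16
]


def is_private_use(code_point: int) -> bool:
    """Return True if the code point falls in one of the private use ranges."""
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_count() -> int:
    """Total number of code points available for private assignment."""
    return sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)


if __name__ == "__main__":
    print(private_use_count())     # 6400 + 65534 + 65534 = 137468
    print(is_private_use(0xE000))  # True
    print(is_private_use(0xFFFE))  # False: a noncharacter, not private use
```

The total works out to 137,468, which matches the "over 137,000" in the answer.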

Q: Are surrogate characters the same as supplementary characters?

A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point). [MD]
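The relationship between the two ranges can be made concrete with the UTF-16 arithmetic that maps one supplementary code point to a pair of surrogate code units. This is a sketch in Python; the function names are chosen for this example only.

```python
from typing import Tuple


def is_surrogate(code_point: int) -> bool:
    """Surrogate code points: U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def is_supplementary(code_point: int) -> bool:
    """Supplementary code points: U+10000..U+10FFFF."""
    return 0x10000 <= code_point <= 0x10FFFF


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not is_supplementary(code_point):
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00
```

The supplementary code point U+1F600 is a character; the two values D83D and DE00 are only code units that represent it in UTF-16.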

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 describes what a Unicode implementation was up to Unicode 1.1, before surrogate code points and UTF-16 were added as concepts in Version 2.0 of the standard. This term should now be avoided.

When interpreting what people have meant by "UCS-2" in past usage, it is best thought of not as a data format, but as an indication that an implementation does not interpret any supplementary characters. In particular, for the purposes of data exchange, UCS-2 and UTF-16 are identical formats. Both are 16-bit, and have exactly the same code unit representation.

The effective difference between UCS-2 and UTF-16 lies at a different level, when one is interpreting a sequence of code units as code points or as characters. In that case, a UCS-2 implementation would not handle processing like character properties, code point boundaries, collation, etc. for supplementary characters. [MD] & [KW]
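To illustrate the difference in interpretation, the sketch below takes the same 16-bit code units and counts them two ways: one unit per "character" (roughly how a UCS-2-style implementation would see them) and with surrogate pairs combined (the UTF-16 view). The helper names are hypothetical and the code is only an approximation of what real implementations do.

```python
import struct
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of the text's UTF-16 representation."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def count_code_points(units: List[int]) -> int:
    """Count code points, treating a high+low surrogate pair as one code point."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


if __name__ == "__main__":
    units = utf16_code_units("a\U0001F600")
    print([f"{u:04X}" for u in units])  # ['0061', 'D83D', 'DE00']
    print(len(units))                   # 3 code units (UCS-2-style view)
    print(count_code_points(units))     # 2 code points (UTF-16 view)
```

The data is identical in both cases; only the interpretation of the last two code units differs.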

Q: What can I do if I think there is an error in the Unicode Standard or other specification?

A: Request a correction, clarification or change to the relevant specification by submitting feedback or a formal proposal to the corresponding technical committee (UTC or CLDR-TC). See Public Review Issues for an explanation of how to do this. (The methods are different for the two committees and the type of change requested.)

