Frequently Asked Questions

Basic Questions

Q: What is Unicode?

A: Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols. See "What is Unicode?" for a short explanation of what Unicode is all about. That page is translated into more than 50 languages, to illustrate the use of the standard. See for yourself!

Q: What is the scope of Unicode?

A: Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuation, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.

Q: How many languages are covered by Unicode?

A: It's hard to say, because Unicode encodes scripts for languages, rather than languages per se. Many scripts (especially the Latin script) are used to write a large number of languages. The easiest answer is that Unicode covers all of the languages that can be written in the following scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian, Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, and Yi. Unicode also includes many historic scripts used to write long-dead languages, as well as lesser-used regional scripts that may be used as a second (or even third) way to write a particular language. See Supported Scripts for the full list. See also the list of Languages and Scripts.

Q: Does Unicode encode scripts or languages?

A: The Unicode Standard encodes characters on a per-script basis. So, for example, there is only one set of Latin characters defined, despite the fact that the Latin script is used for the alphabets of thousands of different languages. The same principle applies to any other script (Cyrillic, Arabic, Ethiopic, Devanagari, ...) that is used for writing many different languages. However, the Unicode Standard does not encode scripts per se. For a listing of script names, see UTR #24, Unicode Script Property. For the ISO standard for script codes, see ISO/IEC 15924, Code for the Representation of Names of Scripts. For the ISO standard for language codes, see ISO 639, Code for the Representation of Names of Languages.

Q: Why does Unicode unify Chinese, Japanese, and Korean ideographs, but not unify the Latin, Greek, and Cyrillic alphabets?

A: For a detailed answer, see http://www.unicode.org/notes/tn26/.

Q: What's the connection between Unicode and the International Standard, ISO/IEC 10646?

A: Both 10646 and Unicode specify the same character encoding: they contain the same characters at the same locations. They remain fully synchronized even as they are extended to cover additional characters. See the Unicode and ISO 10646 FAQ and Appendix C of the Unicode Standard for a more extensive explanation of their relationship.

Q: I think my company might want to get involved in Unicode. Is there any material that I can use to present the case to my management?

A: Yes, there is a white paper outlining the overall value proposition of a Unicode membership to an organization. See Why Join and How to Join.

Q: Where can I purchase the Unicode software or the Unicode font?

A: The Unicode Standard is not a software program, nor is it a font. It is a character encoding system, like ASCII, designed to help developers who want to create software applications that work in any language in the world.

If all you need is to create a multilingual text or write a document or send e-mail in another language, then a Unicode-compliant text editor, mail program, or word processing package will do the job. Please see the following pages on our web site for further information about the standard and where to look for help:

If you are a developer starting to learn about using Unicode, you should read the latest version of the Unicode Standard to find out more about Unicode. In addition to the pages listed above, please see:

Q: My computer cannot display some of the latest Unicode symbols I need. I tried downloading and extracting the latest Unicode data files from the Unicode web site, but it has no effect on the characters my computer can display or type. How can I display and type the latest Unicode characters?

A: The Unicode data files do not function like a software patch, and cannot automatically update existing fonts or applications, so downloading the files will not help in displaying and typing the Unicode characters needed. The reason you don't see the characters as expected is most likely because you need to install a font that covers the set of Unicode characters you are trying to see. Other possible reasons might be that:

  • your operating system needs to be updated (older operating systems such as Windows XP, which came out in 2001, don't provide expected support for some new characters)

  • your application doesn't support Unicode properly (though most do)

If you need to install a font to resolve the problem, free fonts can be downloaded for many Unicode ranges. See Font Resources, or search in your browser for the name of the font you need. Fonts typically cover only one script, or sometimes a range of scripts. Often fonts haven't been updated to render the most recent additions to the Unicode character set. See also Display Problems?

Q: I can't find my character in Unicode. Where do I look?

A: Look at "Where is my Character?"

Q: Where do I find information on the use of characters for a given writing system or script?

A: The block introductions found in Chapters 7 through 20 of the Unicode Standard are a good place to start. Another place to look is the comments contained in the names lists, which accompany the code charts, although the comments are not intended to be encyclopedic. The data files in the Unicode Character Database provide information, often in machine-readable form, on character properties, linebreaking, wordbreaking, and so on.
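
As a rough illustration of what "machine-readable" character properties look like in practice, the sketch below reads a few properties through Python's standard unicodedata module, which ships with a copy of the Unicode Character Database. The helper name describe is hypothetical, and the properties shown are only a small sample of what the data files contain.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few Unicode Character Database properties for one character.

    'describe' is a hypothetical helper for illustration only.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),
    }

# U+00E9 LATIN SMALL LETTER E WITH ACUTE
info = describe("\u00e9")
for key, value in info.items():
    print(f"{key}: {value}")
```

The version of the database bundled with a given Python build is available as unicodedata.unidata_version, so results for recently added characters may differ between installations.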

Q: Are script descriptions in the block introductions complete?

A: No. They cover the information necessary to define the encoded characters, but issues such as usage conventions, layout behavior and glyph design are usually covered only as far as needed to help establish the identity of an encoded character.

Q: Where do I go to find more information about characters for a given script?

A: Consult the bibliography in the References section of the Unicode Standard (section R.3). Also check the original proposals to encode the scripts. Those are the documents in which the characters were proposed for encoding. While the proposals are not authoritative and do not have any formal status, they were used in the process of committee deliberation. They often contain useful information, including examples or lists of references.

Q: Where do I find script proposals for a specific script? 

A: Most proposals are available in the UTC Document Registry. You can also search for specific topics on the Unicode website to find proposals. Many proposals are also available on the JTC 1/SC2/WG2 website. Individually maintained websites may also include links to particular script proposals.

Q: Where can I find resources to help me with Unicode?

A: Here's a short table that suggests links to information that can answer typical questions.

Each group of questions below is followed by its reference.

  • What is in each particular version of Unicode?

  • What is in the latest version of Unicode?

    Reference: Versions of the Unicode Standard; Enumerated Versions

  • What is the meaning of a special term?

    Reference: Unicode Glossary or Terminology for translations of terms

  • Where can I find code libraries, commercial or open-source, for the following?

    • character conversion

    • collation

    • date, time, number, and message formatting

    • normalization

    • and the other features mentioned under "What level of support should I look for?"

    Reference: Unicode Resources page, specifically the tab on Internationalization Libraries

  • What should regular expressions do with Unicode?

  • Can I transmit Unicode text on EBCDIC systems?

  • How should a word-processor break lines in Unicode text?

  • Are there ways to normalize Unicode text?

  • For the Far East, how do I decide which characters should use wide glyphs and which ones narrow?

  • How should I sort Unicode text?

  • Is there an update to the BIDI algorithm?

  • How can I compress Unicode text?

    Reference: Unicode Technical Reports, also Specifications FAQ

  • I want to get online data for implementing Unicode. Where can I find data for:

    • Character properties?

    • Upper/lower/titlecasing?

    • Decompositions?

    • Normalization?

    • Conversion to other character encodings?

    • Code for Kanji code conversion with compressed tables?

    Reference: Online Data

  • Are there conferences or seminars where we can find out more about Unicode?

    Reference: Unicode Conferences

  • Who are the current members of the Consortium?

  • I am interested in joining the Consortium. Where can I find out more?

    Reference: Membership Information; Our Members

Q: What does Unicode conformance require?

A: Chapter 3, Conformance, discusses this in detail. Here's a very informal version:

  • Unicode characters don't fit in 8 bits; deal with it.

  • Byte order is only an issue in I/O.

  • If you don't know, assume big-endian.

  • Loose surrogates have no meaning.

  • Neither do U+FFFE and U+FFFF.

  • Leave the unassigned codepoints alone.

  • It's OK to be ignorant about a character, but not plain wrong.

  • Subsets are strictly up to you.

  • Canonical equivalence matters.

  • Don't garble what you don't understand.

  • Process UTF-* by the book.

  • Ignore illegal encodings.

  • Right-to-left scripts have to go by bidi rules. [JC]
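
Two of the informal rules above lend themselves to a short illustration: "Canonical equivalence matters" and "If you don't know, assume big-endian." The following is only a sketch under stated assumptions, not a conformance implementation. The function names canonically_equivalent and decode_utf16 are hypothetical, and comparing NFD forms is one possible way to test canonical equivalence.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD).

    Hypothetical helper: strings whose NFD forms match are treated
    as canonically equivalent.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes read from I/O.

    Uses the byte order mark when present; otherwise assumes
    big-endian, following the informal rule in the list above.
    """
    if data[:2] == b"\xff\xfe":          # little-endian BOM
        return data[2:].decode("utf-16-le")
    if data[:2] == b"\xfe\xff":          # big-endian BOM
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")      # no BOM: assume big-endian

# Precomposed U+00E9 versus "e" followed by U+0301 COMBINING ACUTE ACCENT
print(canonically_equivalent("\u00e9", "e\u0301"))
print(decode_utf16(b"\x00A"))
```

Byte order only appears at the boundary where bytes are read or written; once decoded, the string has no endianness at all.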

Q: Can applications simply use unassigned characters as they wish?

A: No! No conformant Unicode implementation can use the un-encoded values outside of the private use area.

Only the values in the private use areas (U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD) are legal for private assignment. However, this is over 137,000 code points, which should be more than ample for the vast majority of implementations.
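
The arithmetic behind that figure can be checked with a few lines of code. The sketch below hard-codes the three private use ranges (U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD), counts their code points, and offers a membership test. The names PRIVATE_USE_RANGES and is_private_use are hypothetical and chosen only for this example.

```python
# The three private use ranges, as inclusive (first, last) pairs.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)

def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in one of the private use ranges."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)

# Inclusive ranges, so each contributes last - first + 1 code points.
total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
print(total)  # 6400 + 65534 + 65534 = 137468
```

Note that the last two code points of planes 15 and 16 (for example U+FFFFE and U+FFFFF) fall outside the ranges, so is_private_use returns False for them.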

Q: Are surrogate characters the same as supplementary characters?

A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).
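
To make the distinction concrete, here is a minimal sketch of how a single supplementary code point maps to a pair of surrogate code points in UTF-16. The function names are hypothetical; the arithmetic (subtract 0x10000, then split the 20-bit result into two 10-bit halves) is the standard UTF-16 encoding form.

```python
from typing import Tuple

def is_surrogate(cp: int) -> bool:
    """True for surrogate code points, U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_supplementary(cp: int) -> bool:
    """True for supplementary code points, U+10000..U+10FFFF."""
    return 0x10000 <= cp <= 0x10FFFF

def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Map one supplementary code point to its UTF-16 surrogate pair."""
    if not is_supplementary(cp):
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00
```

The two ranges never overlap: a code point that is_surrogate accepts is always rejected by is_supplementary, and vice versa.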

Q: What can I do if I think there is an error in the Unicode Standard or other specification?

A: Request a correction, clarification or change to the relevant specification by submitting feedback or a formal proposal to the corresponding technical committee (UTC or CLDR-TC). See Public Review Issues for an explanation of how to do this. (The methods are different for the two committees and the type of change requested.)

