
Fonts and Keyboards

Fonts and Unicode

Glyph Variations

Input by Hexadecimal Code

Inputting Chinese Characters

Fonts and Unicode

Q: Is Unicode a font?

A: No. Unicode is not a font. See Basic Questions. However, fonts are built to use the Unicode Standard.

Q: Can I use Unicode characters without asking for permission from the Unicode Consortium?

A: You don't need any special licensing or permission to use Unicode characters. This includes using them in products, in data, or in any other context.

Q: The Unicode Standard is copyrighted. Does this mean that you have the copyright on my script?

A: No. The text of the Unicode Standard is copyrighted, but not the characters or writing systems.

Q: Can I get the glyphs for your characters and use them? For example, using fonts from your charts?

A: No. You cannot extract the glyphs from the PDF code charts and use them in products. The fonts used in the PDF code charts on our website are licensed by their owners for chart usage only, and they may not be re-used without permission of the font suppliers. Please see http://www.unicode.org/charts/fonts.html for a list of suppliers.

Q: Then how can I get glyphs for the character I need?

A: You can design your own fonts, you can purchase the license for a font designed by someone else, or you can search the web for the many fonts which have been placed in the public domain or which have free licenses. For help with font resources, please see Fonts. Or you can contact the font vendors who contributed to the production of our code charts, listed on our font supplier page: Font Contributors Acknowledgements.

Q: I'm a software developer. Is there anything else I need to know about terms of use before using Unicode characters?

A: Before using any part of the standard, you should read all of our documentation and the Unicode Terms of Use. If you are interested in using the code charts, please see Character Code Charts Help and Links and the terms of use found on the first page of each of the code chart files.

Q: What is a Unicode-conformant font?

A: A font is never used in isolation: it is one of the components used in text rendering systems. Therefore, it is not strictly meaningful to ask if a font is Unicode-conformant; this question is more pertinent for the rendering system as a whole.

Nevertheless, most rendering systems involve some kind of mapping from characters to glyphs, stored in fonts. In sfnt-based fonts, such as TrueType, OpenType and Graphite fonts, default glyph mappings are stored in the 'cmap' table; additional tables may substitute alternate glyphs based on context. A Unicode-conformant font can be defined as a font which contains a mapping from Unicode characters and that maps characters to glyphs in a way that is consistent with character semantics defined in the Unicode Standard.

For example, a font that includes a character-to-glyph mapping based only on the JIS (Japanese Industrial Standard) character encoding would not be Unicode-conformant. (Note, however, that such a font may still be used within a text rendering system that can handle conversions between legacy encodings and Unicode to display text in a Unicode-conformant way.) For another example, a TrueType font that includes a Windows Unicode 'cmap' table but that maps characters in the Latin-1 block to glyphs for Cyrillic characters is not a Unicode-conformant font.
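The second example above can be sketched as a simple consistency check. This is only an illustration: the `cmap` used here is a hypothetical dictionary from code point to the character that the mapped glyph actually depicts (a real 'cmap' table maps to glyph IDs), and comparing the first word of the Unicode character names is a crude stand-in for “consistent with character semantics”.

```python
import unicodedata


def script_word(ch):
    """Return the first word of the character's Unicode name, e.g. 'LATIN'."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else ""


def inconsistent_mappings(cmap):
    """List code points whose glyph depicts a character from another script.

    cmap is a hypothetical {code point: depicted character} dictionary,
    not the layout of a real sfnt 'cmap' table.
    """
    return [
        cp
        for cp, depicted in sorted(cmap.items())
        if script_word(chr(cp)) != script_word(depicted)
    ]


# U+00C0 (Latin-1 block) is wrongly mapped to a glyph for CYRILLIC CAPITAL LETTER A.
toy_cmap = {0x0041: "A", 0x00C0: "\u0410", 0x00E9: "\u00E9"}
print([f"U+{cp:04X}" for cp in inconsistent_mappings(toy_cmap)])
```

A font for which such a check reports no mismatches is not thereby shown to be conformant; the sketch only flags the kind of gross inconsistency described in the example.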

Note that the Unicode Consortium does not review or evaluate fonts for their compliance to the Unicode Standard.

The best place to find information about Unicode-compliant fonts is our Unicode Resources page. You might also want to check the Display Problems? page for guidance on installing fonts and using them with a browser. [EM] & [DA]

Q: What factors influence how I can display characters in Java applications?

A: Displaying Unicode correctly in Java depends on three factors:

1. physical fonts
2. composite fonts in the font.properties file
3. Swing and AWT components.

Fonts store glyphs. You must have an appropriate font containing the glyphs for the character that you want to display. You can use a physical font name or a virtual “composite” font name in your text components.

Composite fonts map a logical font name to physical fonts on your system. When you set the font on a text component, you can use either a physical font name or a composite font name. If you use a composite font name, you must make sure that the composite font is correctly configured in your font.properties file. This file maps a composite or logical font name to one or more physical fonts. At least one of the physical fonts in the mapping must contain the appropriate glyphs for the characters you want to display.
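The lookup that a composite font implies can be sketched as follows. This is an illustration in Python rather than Java's actual implementation; the font names, the `composite_fonts` mapping and the `coverage` sets are all hypothetical and do not reflect the real font.properties syntax.

```python
def resolve_font(logical_name, text, composite_fonts, coverage):
    """Pick, per character, the first physical font that has a glyph for it."""
    result = []
    for ch in text:
        chosen = None
        for physical in composite_fonts.get(logical_name, []):
            if ord(ch) in coverage.get(physical, set()):
                chosen = physical
                break
        result.append((ch, chosen))  # None means no listed font can display ch
    return result


# Hypothetical configuration and glyph coverage, for illustration only.
composite_fonts = {"serif": ["ExampleLatin", "ExampleCJK"]}
coverage = {
    "ExampleLatin": set(range(0x20, 0x7F)),
    "ExampleCJK": {0x9AA8},
}

for ch, font in resolve_font("serif", "a\u9AA8\u0416", composite_fonts, coverage):
    print(f"U+{ord(ch):04X} -> {font}")
```

In this sketch U+0416 resolves to no font at all, which corresponds to the case where none of the physical fonts in the mapping contains the needed glyph.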

AWT components first convert the Unicode characters to the host's native character set encoding. If the target character set does not have the needed Unicode character, a substitute character is often used to represent the original character. AWT components are not typically flexible enough to display wide ranges of multilingual text because of their dependence on a single, rather limited charset or codepage.

On the other hand, Swing components do not suffer from the same limitations as AWT components. Because Swing components do not convert a Unicode character to the host's native charset or codepage, these components can typically display a wide range of multilingual text. [JO]
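The effect of converting to a single native charset can be reproduced outside Java. The following sketch uses Python's codecs as a stand-in for the host conversion; the choice of code page 1252 is an assumption made only for the example.

```python
text = "\u03A9mega \u00A3"  # GREEK CAPITAL LETTER OMEGA, then "mega ", then POUND SIGN

# Convert to a single legacy code page, substituting what cannot be encoded.
native_bytes = text.encode("cp1252", errors="replace")
round_tripped = native_bytes.decode("cp1252")

print(round_tripped)          # the omega has been replaced by "?"
print(round_tripped == text)  # False: information was lost in the conversion
```

The pound sign survives because it exists in the target code page, while the omega is replaced by a substitute character, which is the behavior described above for AWT components.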

Q: How can I make an OpenType font?

A: The following are some pointers for creating OpenType fonts:

  1. http://www.microsoft.com/typography/tt/tt.htm
    This has links to the OpenType specification, as well as the specification to create Arabic and Indic script fonts.

  2. http://www.microsoft.com/typography/developers/volt/default.htm
    This has resources for using the Visual OpenType Layout Tool (VOLT), which can be used to add layout tables to fonts. You might want to join the VOLT users community listed there. Many members of this community are developing OpenType fonts.

  3. http://www.microsoft.com/typography/tools/vtt.htm
    Visual TrueType (VTT), a tool to add hints to fonts containing TrueType outlines, is available at this URL. The page also has a link to additional VTT resources.

  4. http://www.microsoft.com/typography/otspec/otlist.htm
    This contains information about the OpenType discussion forum.

  5. http://partners.adobe.com/asn/tech/type/otfdk/index.jsp
    The Adobe Font Development Kit for OpenType contains a set of tools used by Adobe font developers for wrapping up PostScript® fonts as OpenType/CFF font files, and adding OpenType layout features. [AJ] & [EM]

Q: How can I make an AAT font?

A: http://developer.apple.com/fonts/AddingAAT/AddingAAT.html is a basic tutorial on adding AAT (Apple Advanced Typography) support to a font. Tools to do this are available at http://developer.apple.com/fonts/Tools/index.html. [JJ]

Q: How can I make a Graphite font?

A: Graphite fonts are TrueType fonts with supplemental Graphite tables added. A Graphite font is created by writing a description of the script behavior (the character-to-glyph transformations) using the Graphite Description Language (GDL), and compiling that into the TrueType font.
The following are helpful links:

  1. http://scripts.sil.org/cms/scripts/page.php?site_id=projects&item_id=graphite_home
    This contains general information related to Graphite, with links to documentation, mailing lists, and open source code.

  2. http://scripts.sil.org/cms/scripts/page.php?site_id=projects&item_id=graphite_devFont
    This provides a detailed discussion of the Graphite Description Language.

  3. http://scripts.sil.org/GraphiteCompilerDownload
    This provides a link to a downloadable software package (Windows) containing the GDL compiler for creating Graphite-enabled fonts.

  4. http://scripts.sil.org/cms/scripts/page.php?site_id=projects&item_id=graphite_apps
    This provides links to a list of applications that support Graphite rendering in Graphite-enabled fonts. [PC]

Glyph Variations

Q: There seems to be a lot of variation in the glyphs for some characters. As a font maker, I want to know the acceptable range of glyphs for some common cases. Where can I go?

A: One place to start is the Microsoft Typography web site. Some of the questions and answers below may also give you an idea of the range of allowable variations. If you scroll down, there is a table of variations to which several of these questions refer.

Q: Are the glyphs in the Unicode Standard normative?

A: No. See for example row 9 of the accompanying table (below) showing two glyphs for “numero”. Sometimes, the shape depends on the posture of the font, as with the letters “a” and “g” shown in rows 11 and 12 of the table. Common variations may be seen in italic and sans-serif fonts. The “y with hook” letters U+01B3 and U+01B4 have two common variations as shown in row 13 of the table. Some fonts show the curl on one side for capitals and the other for small letters; some fonts have the curls on the same side.

Q: Does a font have to show the same glyphs as in the standard?

A: No. There are several examples of acceptable glyphs in the table, such as rows 9 and 10. The upsilon sometimes has straight arms, sometimes curly arms, depending on font design.

Q: Can the shapes of diacritical marks move around and still mean the same thing?

A: Yes, sometimes. If you look at the variations on lower-case “g” in the table (row 1), you can see that the accent moves in different ways depending on language or orthography.

Q: What about letters with commas and cedillas?

A: Some languages prefer commas to cedillas, or vice versa, as in rows 2 and 3 of the table. Many times, these are encoded by one pre-composed character in the standard, which may be displayed with various glyphs. However, for compatibility and legacy reasons, some such variations are encoded as separate characters.
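As one instance of a variation that is encoded separately, the code point U+0162 from row 2 of the table can be compared with U+021A, which is not listed in the table and is brought in here only for contrast. The sketch uses Python's `unicodedata` module, so the names and decompositions shown come from whatever version of the Unicode Character Database ships with the interpreter.

```python
import unicodedata


def describe(cp):
    """Return the character name and canonical decomposition for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.decomposition(ch)


for cp in (0x0162, 0x021A):
    name, decomposition = describe(cp)
    print(f"U+{cp:04X} {name} -> {decomposition}")
```

The two characters decompose to the same base letter followed by different combining marks (U+0327 COMBINING CEDILLA versus U+0326 COMBINING COMMA BELOW), so at the encoding level they are distinct characters rather than glyph variants of one character.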

Q: How about haceks and apostrophes; are those variants of each other? And what is a caron anyway?

A: An apostrophe above and to the right is a common variation for the hacek (caron) on some letters such as “d” and “t”, as shown in rows 4, 5, 6, 7 of the table. (“Caron” is just standardese for “hacek”, and there is another FAQ about that word.)

Q: What about Han characters? Are the CJK glyphs in the Unicode Standard normative?

A: This is a deep and complicated subject, and there is a separate FAQ page on Han and CJK issues. There are some variations in Han characters that are merely stylistic, others that are encoded. For example, the ideograph for “bone” in row 14 of the table has two common variants.

Strictly speaking, the identity of a character in Unihan is not established by the representative glyph appearing in the Unicode code charts, but by its source mappings in the Unihan Database. Designers interested in creating a CJK font for any given locale must consider the Unicode code chart glyph in the context of the Unihan Database mappings relevant to their specific locale.

The representative unified glyph appearing in a Unihan code chart is determined in the encoding process, based on the submitted source glyphs and their associated mappings. (Recent versions of the code charts show multiple, locale-specific representative glyphs.) The characteristic features of a representative unified glyph, such as its stroke types, stroke count, and certain other features, make it distinct in the encoding model used in the encoding process. The source glyphs behind the unified glyph, that is, the bitmaps (derived from specific print sources) contributed by IRG members, may or may not agree with the unified glyph in terms of stroke count, stroke types, and fine positioning of strokes and components. In fact, source glyphs often do not harmonize with each other stylistically at all.

CJK unification is possible (and largely practical) because abstract distinctive features (and assemblages of distinctive features) for Han ideographs are seen as common across locales (sources). This does not mean that all features are shared or distinctive in all locales. Font developers may decide to treat some Unihan distinctions as non-distinctive for their specific purpose. Just as developers must determine (on the basis of the Unihan Database mappings) which code points are suitable for inclusion in their typefaces, so too they are free to choose something like one of the explicitly unified glyphs for their typeface (on the basis of the relevant source mappings), or something else altogether (hopefully within reason).

Q: Where can I read more about the topic of glyph variations?

A: Glyph variations for the Latin script are discussed in Section 7.1, Latin of the Unicode Standard. Glyph variations for the Han script are discussed in Section 12.1, Han. For character/glyph relations, see also UTR #17, Unicode Character Encoding Model, and a paper posted on the Apple developer site on the Character Glyph Model by John Jenkins. Glyph variations in mathematical context are discussed in UTR #25, Unicode Support for Mathematics. See also the Variation Sequences FAQ.

Q: What are some examples of the possible range of glyph variations?

A: See the table below. Several questions above refer to the glyphs depicted in the table.

Examples of Glyph Variations

 

Row   Codepoint   Some Acceptable Glyphs   Comments

1 U+0123 The rotated comma above is used in Latvian typography to avoid an extra long descent.
2 U+0162 The comma below or cedilla are common variants on many letters in various languages.
3 U+0163
4 U+010F Apostrophe or hacek (caron) are common variants on many letters in various languages.
5 U+0165
6 U+013D
7 U+013E
8 U+03A5 Greek capital upsilon can have straight or curved arms, sometimes with a curl.
9 U+2116 The placement of the “o” in numero can vary, and sometimes it has no underline.
10 U+00BC Glyphs for vulgar fractions may have slanted bars, or horizontal bars.
11 U+0061 The right hand form is commonly seen in italic and sans-serif fonts.
12 U+0067 The right hand form is commonly seen in sans-serif fonts.
13 U+01B3, U+01B4 Y with Hook can have the hook on the left or right. The examples are, left to right: Gentium, Lucida Sans Unicode, and Code2000
14 U+9AA8  Variations in Han ideographs are complex, and this is one example of thousands.
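The code points in the table can be tied back to their character names with a short script. This is a convenience sketch only; the list `TABLE_CODE_POINTS` simply repeats the code points from the table, and the names printed are those in the Unicode Character Database bundled with the Python interpreter in use.

```python
import unicodedata

# Code points copied from the table above, in row order.
TABLE_CODE_POINTS = [
    0x0123, 0x0162, 0x0163, 0x010F, 0x0165, 0x013D, 0x013E,
    0x03A5, 0x2116, 0x00BC, 0x0061, 0x0067, 0x01B3, 0x01B4, 0x9AA8,
]


def label(cp):
    """Format a code point as 'U+XXXX NAME'."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


for cp in TABLE_CODE_POINTS:
    print(label(cp))
```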

Character Input by Hexadecimal Code

Q: How can I input any Unicode character if I know its hexadecimal code?

A: Some platforms have methods of hexadecimal entry; others have only decimal entry.

On Windows, there is a decimal input method: hold down the Alt key while typing decimal digits on the numeric keypad. Without a prefix, the ALT+decimal method takes the code from the encoding used by the command prompt. To enter Unicode decimal values, you have to prefix the number with a 0 (zero). For example, ALT+0163 is the pound sign (“£”), whose code U+00A3 is 163 in decimal.
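If only the hexadecimal code is known, the decimal digits to type can be computed. The helper below is a hypothetical convenience function, not part of any Windows API, and it says nothing about which range of values a given application accepts.

```python
def alt_decimal(hex_code):
    """Convert a hexadecimal code such as '00A3' to the digits typed after ALT."""
    value = int(hex_code, 16)
    return "0" + str(value)  # leading zero, as required by the method described above


print(alt_decimal("00A3"))
```

For the pound sign this yields the digits 0163, matching the example in the text.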

There is a hex-to-Unicode entry method that works with WordPad 2000, Office 2000 edit boxes, RichEdit controls in general, and Microsoft Word 2002. To use it, type a character's hexadecimal code (in ASCII), making corrections if needed, and then type Alt+x after it; in some program versions, however, such as MS Word (German), you must type Alt+c instead. The hexadecimal code is replaced by the corresponding Unicode character. The Alt+x (or Alt+c, respectively) can be a toggle (as in Microsoft Office XP). That is, type it once to convert the hex code to a character and type it again to convert the character back to a hex code. If the hex code is preceded by one or more hexadecimal digits, you will need to “select” the code so that the preceding hexadecimal characters aren't included in the code. The code can range up to the value 0x10FFFF (which is the highest code point in the 17 planes of Unicode).
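The two directions of such a toggle, and the reason preceding hexadecimal digits matter, can be sketched as follows. The function names are hypothetical, and the code only mirrors the conversion itself, not how any particular editor implements Alt+x.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the 17 planes


def hex_to_char(hex_code):
    """Convert a hexadecimal code such as '2116' to the corresponding character."""
    value = int(hex_code, 16)
    if value > MAX_CODE_POINT:
        raise ValueError(f"{hex_code} is beyond U+10FFFF")
    return chr(value)


def char_to_hex(ch):
    """Convert a single character back to its hexadecimal code."""
    return f"{ord(ch):04X}"


print(hex_to_char("2116"))               # the numero sign
print(char_to_hex(hex_to_char("2116")))  # back to the hex code

# A letter that is also a hex digit changes the result when it directly
# precedes the code: "A2116" is read as one code, not as "A" plus "2116".
print(hex_to_char("A2116") == "A" + hex_to_char("2116"))
```

The last line prints False, which is why the text above advises selecting just the intended digits before converting.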

Recent versions of Windows also ship with the “NeiMa” input method for the Simplified Chinese language; this IME supports the input of Unicode characters via their scalar value expressed as four hexadecimal digits (and is therefore limited to BMP characters). However, using this input method may have the undesirable side-effect of tagging your text as “Simplified Chinese”, even if you use non-Chinese characters.

On the Macintosh with OS X, after activating the Hex input method, simply hold down the option key when typing the codes. After each fourth one, you get the character inserted in the document, and in newer software, the “Last Resort” font will be used if there is no regular font available for the character.
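Because this method consumes codes in groups of four hexadecimal digits, it can help to see how a character breaks down into 16-bit units. The sketch below lists the UTF-16 code units of a character; whether a particular hex input method accepts the two-unit form for characters beyond U+FFFF is an assumption that should be checked against that method's documentation.

```python
def utf16_hex_groups(ch):
    """Return the UTF-16 code units of a character as four-digit hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


print(utf16_hex_groups("\u9AA8"))      # a BMP character: one group
print(utf16_hex_groups("\U00020000"))  # a supplementary character: two groups
```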

On Mac OS X 10.2 or later, there is a Unicode character palette, which lets you click on and insert any Unicode

Inputting Chinese Characters

Q: How are Chinese characters input?

A: All keyboards, no matter what symbols appear on the keycaps themselves, convert individual key presses into intermediate electronic signals that are then interpreted by low-level layers of software into sequences of input characters (or commands). Characters themselves are not hard-wired into keys.

Because the set of Chinese characters is so huge, it is highly impractical (and for any practical keyboard, impossible) to try to map each character to a single key. Therefore, all keyboards for inputting Chinese characters make use of schemes involving sequences of key presses to select specific Chinese characters or sequences of characters from the available repertoire supported. [RC]

Q: Is there a common name for these schemes to input Chinese characters?

A: Yes, they are generally referred to as Input Method Editors, or IME's for short. Sometimes they are called simply “input methods.” Depending on what particular method they use for enabling the user to input their choices and select particular characters, IME's often have particular names. They may also differ in strategy between inputting Chinese characters for the Chinese language and Chinese characters for the Japanese language (kanji), based on different linguistic expectations of the users and differences in the particular repertoire of characters that needs to be supported. [RC]

Q: Are IME's part of the operating system?

A: When an operating system is prepared for use in East Asia, it always has one or more IME's built in, to make it practical for users to input their characters. However, applications sometimes provide their own input methods as well, which may provide alternative input strategies or which may be better suited to that particular application. Provision of a well-designed IME in an East Asian market may be a competitive advantage for a particular application in that market.  [RC]

Q: What kinds of IME's are used for Chinese?

A: The most commonly seen input methods for Chinese make use of some kind of romanization. Others make use of CJK character component and stroke-based methods. Some may also allow direct input of hexadecimal character values. In addition to keyboard-based input methods, there are also handwriting-recognition systems that take input from a stylus, voice-recognition systems taking spoken input, and optical character recognition systems taking input from scans of handwritten or printed pages.  [RC]

Q: How does a romanization IME work for Chinese?

A: The most commonly used romanization today is 漢語拼音 Hànyǔ Pīnyīn, or just “pinyin” for short. Pinyin represents each syllable of Beijing Chinese (PRC Modern Standard) by means of a combination of Latin characters, optionally modified by tone marks. The tone marks consist either of numbers at the end of the syllable or diacritics placed on the main vowel.

A given syllable as romanized in pinyin may correspond to one or, more often, to many particular Chinese characters. The user types in the pinyin syllable as a sequence of Latin characters (and the tone indicators). When the syllable is to be converted to the correct Chinese character for input, the input method presents the user with a palette of characters having that pronunciation, from which to make the appropriate selection by keyboard (or mouse) action.

Single syllable pronunciations involve lots of homophones in Chinese (and even more so in Japanese), but disyllabic word combinations are much less ambiguous. So if the input method supports disyllabic or polysyllabic input, storing up romanized input for more than one syllable at a time before it is converted to Chinese characters, then the number of possible choices corresponding to that pronunciation is greatly reduced, and input can often be made much more efficient.

IME's may also make use of statistical information, to increase the speed of input by sorting choices so that the more common or likely ones appear at the beginning of the selection lists.  [RC]
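The behavior described in this answer can be sketched with a toy lexicon. Everything in the example is hypothetical: the entries are a tiny hand-picked sample, tone indicators are ignored, and the frequency counts are made up purely to show the sorting step. A real IME would rely on a large dictionary and a much richer statistical model.

```python
# Made-up frequency counts, deliberately listed out of order.
SYLLABLE_CANDIDATES = {
    "ma": [("马", 400), ("吗", 900), ("骂", 50), ("妈", 500), ("麻", 100)],
}
WORD_CANDIDATES = {
    ("ma", "ma"): [("妈妈", 800)],
    ("ma", "shang"): [("马上", 700)],
}


def candidates(syllables):
    """Return candidate strings for the typed syllables, most frequent first."""
    key = tuple(syllables)
    if len(key) == 1:
        entries = SYLLABLE_CANDIDATES.get(key[0], [])
    else:
        entries = WORD_CANDIDATES.get(key, [])
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [text for text, _count in ranked]


print(candidates(["ma"]))           # several homophones to choose from
print(candidates(["ma", "shang"]))  # disyllabic input narrows the choice
```

With a single syllable the user still has to pick from a palette of homophones, whereas the two-syllable input in this toy lexicon leaves only one candidate.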

Q: How do component- and stroke-based input methods work?

A: IME's based on components and strokes work by using the shape of a character, rather than romanization of its pronunciation. Users learn keys or key combinations for basic strokes and common component chunks of Chinese characters, or choose strokes and/or components by clicking on items in a palette.

Once the user has made a selection of character components, the IME seeks to identify characters in the repertoire matching those criteria. In this respect, component-based input is rather like a regular expression search, which can be as loose or as tight as the IME allows. Component and stroke input methods share, in some regards, the idea of a syntax for a systematic graphic description of Chinese characters, similar to that of Unicode Ideographic Description Characters. (See the text about Ideographic Description in Chapter 12 of the Unicode Standard.)
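The search-like nature of component-based input can be illustrated with a toy decomposition table. The `COMPONENTS` dictionary is a hypothetical, hand-written sample and does not follow any actual IME's key assignments or the Ideographic Description syntax.

```python
from collections import Counter

# Hypothetical decompositions of a few characters into components.
COMPONENTS = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
    "明": ["日", "月"],
    "林": ["木", "木"],
    "森": ["木", "木", "木"],
}


def match(selected):
    """Return characters that contain every selected component, with multiplicity."""
    wanted = Counter(selected)
    return [
        ch
        for ch, parts in COMPONENTS.items()
        if not (wanted - Counter(parts))  # nothing wanted is left unmatched
    ]


print(match(["女"]))              # loose query: several characters match
print(match(["木", "木", "木"]))  # tighter query: a single match remains
```

Selecting more components tightens the query, which is the sense in which the text compares component input to a search that can be as loose or as tight as the IME allows.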

However, practical input methods are optimized to make it easier for the user to memorize the required key sequences and to minimize the number of key presses needed for inputting particular characters. For more information on component-based input and the descriptions of Chinese characters upon which they are based, see Wenlin's CDL XML application for describing Han (CJKV) characters.  [RC]

Q: How about hexadecimal input of Chinese characters?

A: Some applications permit direct input of Chinese characters by means of the Unicode hexadecimal code point for that character. This approach isn't particularly efficient, but it works as a fallback when an input method doesn't support a particular character or when a user is unfamiliar with that IME. The user can always look up the Unicode code point for a character in the radical/stroke index to the Unicode code charts, and then simply input the hexadecimal sequence by whatever convention the IME supports. See also this entry in the present FAQ. [RC]

Q: Where can I find out more about Chinese input methods?

A: For Mozilla, check out Seamonkey Input Method Specification. For general information, try googling “input method editor”. For information about specific vendors' IME's for particular languages, you can search on “Chinese input method” or “Japanese input method”. For general pages of links to links, try such locations as this. [RC]