
Character Properties, Case Mappings & Names FAQ

Case Mapping

Character Properties & Names

Case Mapping

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide the one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard.
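As a rough illustration of why a single-character field is not enough, the following Python sketch collects every character whose full uppercase mapping is longer than one character. It assumes Python's built-in str.upper(), which applies the default full mappings of whatever Unicode version the interpreter ships with. The name one_to_many is purely illustrative.

```python
import sys

# Collect characters whose full uppercase mapping expands to several characters.
one_to_many = {}
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    up = ch.upper()
    if len(up) > 1:
        one_to_many[ch] = up

print(one_to_many["\u00df"])  # sharp s expands to "SS"
print(len(one_to_many))       # total depends on the Unicode version in use
```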

Q: What is the difference between case mapping and case folding?

A: Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
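The distinction can be sketched in Python, assuming that str.lower() applies the default lowercase mapping and str.casefold() applies default full case folding. The helper caseless_equal is a hypothetical name, and the sketch ignores normalization, which a real caseless comparison would usually also need.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using default case folding (no normalization)."""
    return a.casefold() == b.casefold()

print("Stra\u00dfe".lower())     # lowercase mapping keeps the sharp s
print("Stra\u00dfe".casefold())  # case folding maps the sharp s to "ss"
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

The folded form is only compared here, never shown to the user, in line with the advice above.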

Q: Do all scripts have an uppercase and a lowercase?

A: No. In addition to modern scripts such as Latin, Greek, Armenian, and Cyrillic, a few historic or archaic scripts have case, but the vast majority of scripts do not have case distinctions.

Q: What is titlecase? How is it different from uppercase?

A: Titlecase takes its name from the case format used when forming a title, in which the initial letter in a word is capitalized and the rest are not. Titlecase is also used in forming a sentence by capitalizing the first word, and for forming proper names. The titlecase mapping in the Unicode Standard is the mapping applied to the initial character in a word.

The titlecase mapping in Unicode differs from the uppercase mapping in that a number of characters require special handling. These are chiefly ligatures and digraphs such as 'fl', 'dz', and 'lj', plus a number of polytonic Greek characters. For example, U+01C7 (LJ) maps to U+01C8 (Lj) rather than to U+01C9 (lj).
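A small Python sketch of the digraph example, assuming that str.upper() and str.title() apply the default uppercase and titlecase mappings:

```python
import unicodedata

small_lj = "\u01c9"  # LATIN SMALL LETTER LJ

upper = small_lj.upper()  # both parts of the digraph become capital
title = small_lj.title()  # only the first part becomes capital

print(f"U+{ord(upper):04X}", unicodedata.name(upper))
print(f"U+{ord(title):04X}", unicodedata.name(title), unicodedata.category(title))
```

The titlecase form has General_Category Lt, which is how such characters can be recognized.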

Q: Does the default case mapping work for every language? What about the default case folding?

A: The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text.

By contrast, case folding, which is based on the lowercase mapping, is intended to be language-neutral. Since the case folding rules do not vary by language or context, this makes them unsuitable as the basis for displaying or transforming text for human consumption.

To make case mapping language sensitive, the Unicode Standard specifically allows implementations to tailor the mappings for each language, but it does not provide the data needed to do so. The file SpecialCasing.txt is included in the Standard as a guide to a few of the more important individual character mappings needed for specific languages, notably the Greek script and the Turkic languages. However, for most language-specific mappings and tailoring, users should refer to CLDR and other resources.

Q: What is 'tailoring' and how might it affect case mapping?

A: Tailoring is the modification of the case mapping rules to meet the specific needs of a given language, culture, or orthography. For example, while the default uppercase mapping of "a" is "A" and the default mapping of "à" is "À", the uppercase conversion of "je vais à Paris" in some forms of French might be "JE VAIS A PARIS". Notice how the "à" is uppercased as "A" in this case.

Similarly, in English, one of Proust's novels is rendered in titlecase as "In Search of Lost Time". Notice that the 'o' in 'of' is not capitalized, although the remainder of the words follow the Unicode Standard's definition of titlecase: this is an English-specific tailoring of titlecase. The original French title of this work is rendered in titlecase as "À la recherche du temps perdu". Here, only the first word is in the default titlecase; the others follow rules specific to a particular French convention.
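The French uppercase example can be approximated with a short Python sketch. The helper upper_without_accents is hypothetical and deliberately crude: it removes every combining mark after uppercasing, which is an assumption made for illustration and not a description of any published French tailoring.

```python
import unicodedata

def upper_without_accents(text: str) -> str:
    """Hypothetical tailoring: uppercase, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

sentence = "je vais \u00e0 Paris"
print(sentence.upper())                 # default mapping keeps the grave accent
print(upper_without_accents(sentence))  # tailored result drops it
```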

Q: Why isn't there an "Ij" character encoded to serve as the titlecase for U+0132 LATIN CAPITAL LETTER IJ and U+0133 LATIN SMALL LETTER IJ?

A: The Unicode Standard encodes these two compatibility characters to support round-trip conversion of the Dutch letter 'ij' in certain very rare legacy (non-Unicode) character encodings. It is strongly preferred (and far more common) to use the two-character ASCII sequence 'ij' to represent this letter instead.

In Dutch, the letter 'ij' behaves like the other single letters, so the correct titlecase mapping of U+0133 (ij) is U+0132 (so a word such as "ijsje" titlecases as "IJsje"). That is, the titlecase mapping for both of these characters is U+0132 and no additional character is needed.
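When the letter is written as the preferred ASCII sequence, the default titlecase mapping capitalizes only the 'i', so a Dutch tailoring is needed. The sketch below uses a hypothetical helper, titlecase_dutch_word, and assumes that Python's str.title() applies the default mappings.

```python
def titlecase_dutch_word(word: str) -> str:
    """Hypothetical Dutch tailoring: capitalize a leading 'ij' as one unit."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word[:1].title() + word[1:].lower()

print("\u0133sje".title())            # compatibility character U+0133 maps to U+0132
print("ijsje".title())                # default mapping on the ASCII sequence
print(titlecase_dutch_word("ijsje"))  # tailored result
```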

Q: Are case mappings reversible?

A: No, case mapping loses information and thus does not allow for a round trip. For example, when the string "Mark" is lowercased, the original form cannot be recovered; it might have been "mark" or "MARK" instead. Some strings contain contextual case distinctions that are not preserved by case mapping. Consider the English word "anglo-American", the Italian word "vederLa", or the German words "haben" and "Haben". Once you uppercase, lowercase or titlecase these strings, you can't recover the original just by performing the reverse operation.

Q: What about individual characters? Aren't these reversible?

A: No, many of the individual character case mappings cannot be reversed. For example:

  • Some characters have multiple characters that map to them. For example, in the Greek script, capital sigma (U+03A3) is the uppercase form of both the regular (U+03C3) and final (U+03C2) lowercase sigma.
  • Some character mappings result in a decomposition. For example, the uppercase mapping of the 'fl' ligature (U+FB02 LATIN SMALL LIGATURE FL) maps to 'F' followed by 'L'.
  • Some case mappings depend on language or locale. For example, in Turkish, the lowercase letter 'i' maps to an uppercase dotted I (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), while the uppercase letter 'I' maps to the dotless lowercase i (U+0131).
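The first two cases in the list above can be checked with a short Python sketch. It assumes Python's default (untailored) str.upper() and str.lower(); round_trips is a hypothetical helper name.

```python
def round_trips(ch: str) -> bool:
    """True if lowercasing the uppercase form gives back the original."""
    return ch.upper().lower() == ch

samples = {
    "regular sigma": "\u03c3",
    "final sigma": "\u03c2",
    "fl ligature": "\ufb02",
}
for label, ch in samples.items():
    print(label, round_trips(ch))
```

Only the regular sigma survives the round trip. The final sigma comes back as a regular sigma, and the ligature comes back as two separate letters.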

Q: Does uppercasing of a string eliminate all of the lowercase letters in it?

A: No. Some letters (notably those in the IPA block) have no matching case equivalent. As a result, uppercasing a string may not eliminate all of the lowercase letters in it.

Q: Why is there no unique uppercase character for ſ — U+017F LATIN SMALL LETTER LONG S (and about one hundred other characters)?

A: There are over 100 lowercase letters in the Unicode Standard that have no direct uppercase equivalent. For example, the uppercase form for long s is an ordinary capital S. Another example would be the LATIN SMALL LETTER DOTLESS J: the capital J is already dotless, so an extra letter isn't needed as an uppercase mapping. Some of the other characters with no uppercase equivalent are compatibility characters. Many of these, such as 'fl' (U+FB02 LATIN SMALL LIGATURE FL), decompose to two or more characters when casing is applied. Finally, others are characters that are only used in lowercase, such as many characters used for IPA and other phonetic systems. Text in IPA, like that in many other phonetic systems, should never be case converted, even when it contains IPA characters that do have an uppercase equivalent.
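A rough way to explore this in Python is to list the characters with General_Category Ll that uppercasing leaves unchanged. This is only a sketch: the result depends on the Unicode version of the interpreter, and because it selects by General_Category it also picks up characters such as mathematical alphanumeric symbols, so the count need not match the figure quoted above.

```python
import sys
import unicodedata

# Lowercase letters (General_Category Ll) that uppercasing leaves unchanged.
unchanged = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Ll" and chr(cp).upper() == chr(cp)
]

print(len(unchanged))         # depends on the Unicode version in use
print("\u0138" in unchanged)  # LATIN SMALL LETTER KRA has no uppercase form
print("\u017f".upper())       # long s uppercases to an ordinary "S"
```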

Q: Why aren't there extra characters encoded to support locale-independent casing for Turkish?

A: The Turkish language, like other Turkic languages, distinguishes a dotted letter 'i' from a dotless letter 'ı' (U+0131 LATIN SMALL LETTER DOTLESS I). In these languages, each has an equivalent uppercase mapping: U+0131 maps to the ordinary letter 'I', while 'i' maps to U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).

Historically, users generally did not distinguish between the ASCII letters and their Turkish equivalents, so legacy character encodings, such as ISO 8859-9, which support the Turkic languages, did not separately encode characters to serve as the basis for locale-independent casing. These character encodings are often used for both Turkish and non-Turkish text. Transcoding this data to Unicode would be intolerably difficult if users had to somehow identify which 0x49 characters (for example) were ordinary "I" and which were CAPITAL LETTER DOTLESS I. In addition, because users are not used to making the distinction, it is unlikely that they would input the "correct" additional letters, even if they existed.
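Because the same code points serve both Turkish and non-Turkish text, the Turkic behavior has to be supplied as a tailoring. A minimal Python sketch follows; upper_tr and lower_tr are hypothetical helpers, Python's built-in methods are assumed to apply only the default mappings, and decomposed sequences such as 'I' followed by U+0307 are not handled.

```python
def upper_tr(text: str) -> str:
    """Hypothetical Turkic uppercase: i -> U+0130, dotless i -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

def lower_tr(text: str) -> str:
    """Hypothetical Turkic lowercase: U+0130 -> i, I -> dotless i."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

city = "diyarbak\u0131r"
print(upper_tr(city))            # dotted capital for "i", plain "I" for dotless i
print(lower_tr(upper_tr(city)))  # back to the original spelling
print(city.upper())              # default mapping: both become plain "I"
```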

Q: Why does ß (U+00DF LATIN SMALL LETTER SHARP S) not uppercase to U+1E9E LATIN CAPITAL LETTER SHARP S by default?

A: In standard German orthography, the sharp s ("ß") is uppercased to a sequence of two capital S characters. This is a longstanding practice, and is reflected in the default case mappings in Unicode. A capital form of ß is attested in a number of instances, and has thus been encoded in the Unicode Standard. However, this character is not widely used, and is not recognized in the official orthography as the uppercase form of ß. Therefore, the original mapping to "SS" is retained in the Unicode character properties.
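An implementation that does want the capital form can tailor the mapping. The Python sketch below assumes the built-in default mappings; upper_with_capital_sharp_s is a hypothetical helper, not part of any library.

```python
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"    # LATIN SMALL LETTER SHARP S

print(small_sharp_s.upper())                     # default mapping: "SS"
print(capital_sharp_s.lower() == small_sharp_s)  # the capital form lowercases to sharp s

def upper_with_capital_sharp_s(text: str) -> str:
    """Hypothetical tailoring that prefers U+1E9E over "SS"."""
    return text.replace(small_sharp_s, capital_sharp_s).upper()

print(upper_with_capital_sharp_s("stra\u00dfe"))
```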

Q: Why does the Greek letter sigma require special handling?

A: Near the end of SpecialCasing.txt, there are two commented-out lines pertaining to the Greek letter sigma. At first glance, they may look a bit odd:

# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA

Both of these lines refer to conditional case mappings (column 5). In normal Greek text, a U+03C3 (non-final sigma) should be written as U+03C2 (final sigma) if it is at the end of a word, and a U+03C2 (final sigma) should be written as a U+03C3 (non-final sigma) if it is not at the end of a word. This is what these two lines would mean if they were uncommented. The reason that they are commented out is that the SpecialCasing file is not intended to normalize the appearance of a lowercase sigma.
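The effect of leaving those lines commented out can be observed in Python. The sketch assumes that the interpreter's str.lower() applies the final-sigma condition when lowercasing a capital sigma, as CPython does, while leaving lowercase sigmas alone.

```python
word_upper = "\u039f\u0394\u039f\u03a3"  # Greek capitals ending in capital sigma
lowered = word_upper.lower()
print([f"U+{ord(c):04X}" for c in lowered])  # the last code point is the final sigma

# A lowercase non-final sigma at the end of a word is not "normalized".
odd = "\u03bf\u03b4\u03bf\u03c3"
print(odd.lower() == odd)
```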

Q: Is case folding stable between Unicode versions?

A: Yes. Beginning with Unicode 5.0, the definition of case folding was stabilized. Any string that is case-folded according to the rules in version 5.0 or later is guaranteed to remain case-folded in any subsequent version of Unicode. One side effect of this is that no new lowercase character will be introduced for an existing upper- or titlecase character that has no lowercase pairing. The reverse is not true: a character that lacks an uppercase mapping or a titlecase mapping could acquire one in some future version.

Note that the stability guarantee only applies to assigned code points. New scripts or characters added to Unicode can include additional case mapping pairs, as long as no existing code point ever changes its lowercase mapping. That is, when encoding a new script, a new lowercase letter might be introduced for an uppercase character introduced at the same time. For additional stability guarantees, see the Unicode Character Encoding Stability Policy.
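In practical terms, "remains case-folded" means that folding the string again changes nothing. The Python sketch below checks that property within a single Unicode version only; is_case_folded is a hypothetical helper, and the guarantee across versions comes from the stability policy, not from this code.

```python
def is_case_folded(text: str) -> bool:
    """True if default case folding leaves the text unchanged."""
    return text.casefold() == text

folded = "Stra\u00dfe".casefold()
print(is_case_folded(folded))         # folding is idempotent
print(is_case_folded("Stra\u00dfe"))  # the original is not in folded form
```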


Character Names and Properties

Q: What are character names for and why are some characters named in unusual ways?

A: Character names are defined so that a mnemonic string can be used to uniquely identify a character, rather than representing it with just a hexadecimal code. Characters can have multiple uses or multiple common names, so a single identifier cannot provide a natural name for all users and all purposes. Sometimes, names are deliberately chosen to describe the appearance of a character, rather than its meaning or function, because the character is used in many competing contexts. Such use of descriptive names is particularly common for symbols.

Because character names are identifiers, there are some additional restrictions and conventions, which govern the way they are assigned and provide some uniformity in naming. In many instances, descriptive comments and informative aliases are added to the listing of the character names in the code charts to make it easier for users to select the right character for the right purpose.

Q: Can the name of a character be changed to better reflect the way it is used?

A: Once a character name has been given, it cannot be changed. Because names are identifiers, for which stability is very important, the Unicode Character Encoding Stability Policy explicitly prevents character names from being changed. Character names, however, can be annotated in the code charts. For example, U+0674 ARABIC LETTER HIGH HAMZA is annotated as being used for Kazakh, not Arabic.

Q: Should I be concerned if the name of a script, block or character doesn't reflect the way it is used?

A: Script, block, and character names are used by Unicode solely as identifiers; that is, their purpose is to distinguish entities and not to describe them. Changes to these names create extensive interoperability and backward compatibility issues. There is usually a relationship between the name of a block, the name of the script that uses characters in that block, and the names of the characters themselves in order to ease identification.

The use of a particular name as an identifier for a script in the Unicode Standard does not imply an endorsement of that name as the best alternative for general use. The Unicode Consortium does not make recommendations on how to refer to scripts in other contexts.

Q: How are script and block names related to character names?

A: Many character names contain a script designator. For example, many characters in the Telugu script contain the word "TELUGU" in the first part of their names. This script designator is based on the name of the script, in this case "Telugu". For consistency, the script name is also reflected in block names, whenever blocks contain characters primarily of one script.

Q: What are the script names in the Unicode Standard based on?

A: In nearly all cases, the script names are based on common English usage. When there are important alternative names for scripts, they are often provided as annotations in the code charts and documentation. For example, the New Tai Lue script is referred to in China as Xishuang Banna Dai, which is listed as an alternative name in the code charts. The local name for a script may differ from English usage. Translated versions of the character names list would use translations of the script names and designators and follow local usage.

Q: Can I determine the script of a character by the character or block name?

A: No, not at all. The character names and block names are not reliable indicators of the script of a character. The Script property should be used instead to determine the script of any particular character. For example, as of Unicode 6.0 there were the following mismatches between Script property value and character or block names for Latin and Greek:

  • 149 characters have the Latin Script property value, but do not have "LATIN" in their character names.
  • 280 characters that have "LATIN" in their character names do not have the Latin Script property value.
  • 17 characters have the Greek Script property value, but are not in blocks that have "Greek" in the block name.
  • 66 characters that are in blocks that have "Greek" in the block name do not have the Greek Script property value.

For more information, see UAX #24, Unicode Script Property.

Q: Are there any tools available to convert character values to character names, or to tell me the script of a character?

A: Yes, there are several such tools listed on the Online Tools page of the Unicode site. Here are a few you might like to try.

Web based lookup:

  • Mark Davis' CLDR tools handle many kinds of conversions, etc.
  • Richard Ishida's String Analyser tells you the name and script block of one or more characters. If you are starting with character codes, use the "View Names" button above the "Characters" text box in his Unicode Code Converter or use UniView.
  • Andrew West’s What Unicode character is this? JavaScript tool converts Unicode characters to Unicode character names.
  • Johannes Bergerhausen's DecodeUnicode allows you to look up information about characters.

Downloadable application code:

  • The uniname utility in Bill Poser’s unidesc package tells you the name of a character.
  • Tom Christiansen's Unicode::Tussle (download) is a distribution of curious and various Unicode utilities.
  • The orphaned program of a student of Janusz S. Bien also handles named sequences.

A simple standard Perl program may be what you want, for example to view the name of character U+1234:
      $ perl -e 'use charnames();print charnames::viacode(0x1234),"\n"'
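
A comparable lookup can be sketched in Python with the standard unicodedata module. The helper describe is a hypothetical name. Note that this module reports names but does not expose the Script property, so script lookup would need another tool.

```python
import unicodedata

def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME', or a placeholder if the character has no name."""
    name = unicodedata.name(chr(code_point), "<no name>")
    return f"U+{code_point:04X} {name}"

print(describe(0x1234))
print(unicodedata.lookup("LATIN SMALL LETTER SHARP S") == "\u00df")
```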

See the Online Tools page for more links.

Q: The character name alias for the control character U+0082 is BREAK PERMITTED HERE. Does that mean I have to interpret that control character in that way?

A: The Unicode Standard does not define U+0082 to mean "BREAK PERMITTED HERE". Formally this character is simply one of 65 control codes, one which in ISO 6429 has the name and meaning of "BREAK PERMITTED HERE". Implementers of the Unicode Standard are not required to interpret the character U+0082 in accordance with ISO 6429 (or to interpret it at all).

The standard does assign particular properties and semantics for certain controls commonly used in text files including tab, carriage return, line feed, form feed, and next line. However, it does not give the majority of control codes any semantics at all; that is left to a higher-level protocol.

The character names for control characters are actually undefined; however, name aliases such as "BREAK PERMITTED HERE" have been defined. These aliases are based on ISO 6429 and can be used to identify specific controls, for example in regular expressions. For other control characters, see http://www.unicode.org/charts/PDF/U0080.pdf.
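The difference between a character name and a name alias shows up in Python's unicodedata module. The sketch assumes Python 3.3 or later, where unicodedata.lookup() also accepts name aliases.

```python
import sys
import unicodedata

# Control codes are the characters with General_Category Cc.
controls = [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Cc"]
print(len(controls))

print(unicodedata.name("\u0082", "<no name>"))                 # no formal character name
print(unicodedata.lookup("BREAK PERMITTED HERE") == "\u0082")  # the alias still resolves
```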

Q: Where can I find formal definitions of the terms used in character names? In particular designations like "turned", "inverse", "inverted", "reversed", "rotated".

A: These terms are basically typographical rather than Unicode-specific.

A turned character is one that has been rotated 180 degrees around its center. A turned "e" winds up with the opening in the upper left portion. U+0259 LATIN SMALL LETTER SCHWA is a turned "e".
An inverted character has been flipped along the horizontal axis. An inverted "e" winds up with the opening in the upper right portion. There is no Unicode character representing an inverted "e".
A reversed character has been flipped along the vertical axis. A reversed "e" winds up with the opening in the lower left portion. U+0258 LATIN SMALL LETTER REVERSED E is a reversed "e".
A rotated character has been rotated 90 degrees, but one can't tell which way without looking at the glyph. U+213A ROTATED CAPITAL Q is a "Q" that has been rotated counterclockwise.
"Inverse" means that the white parts of the glyph are made black, and vice versa. An inverse "e" looks like a normal "e" but is white on a black background. There is no Unicode character representing an inverse "e". [JC]

Q: Are any unassigned characters or reserved characters given default properties?

A: Yes, default values are defined for all character properties. For a discussion of how this works and details about particular default values for properties, see UAX #44, Unicode Character Database.

Q: Unicode now treats the SOFT HYPHEN as a format control (Cf) character, whereas formerly it was a punctuation character (Pd). Doesn't this break ISO 8859-1 compatibility?

A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen" (section 6.3.3), but does not specify details of how or when it is to be displayed, nor other details of its semantics. The soft hyphen has had a long history of legacy implementation in two or more incompatible ways.

Unicode clarifies the semantics of this character for Unicode implementations, but this does not affect its usage in ISO 8859-1 implementations. Processes that convert back and forth may need to pay attention to semantic differences between the standards, just as for any other character.

In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances. The change in semantics of the Unicode character does not require that implementations of terminal emulators in other environments, such as ISO 8859-1, make any change in their current behavior.

Q: Where can I find the numerical values of characters with the hexadecimal digit (Hex_Digit) property?

A: The Unicode Standard provides the Hex_Digit property, which specifies which characters are hexadecimal digits: 0-9, A-F, a-f, and their fullwidth equivalents. (The ASCII_Hex_Digit property specifies the intersection of the Hex_Digit property and the Basic Latin block.) There is no table in the UCD mapping the hexadecimal digit characters to their values, analogous to the Numeric_Value property. The table linked here removes this real, if trivial, gap. [JC]
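Such a table is also easy to build by hand. The Python sketch below is one possible construction, assuming the set of Hex_Digit characters described above (the ASCII hex digits and their fullwidth counterparts) and relying on NFKC normalization to map each fullwidth form to its ASCII equivalent. The name hex_value is illustrative.

```python
import unicodedata

# ASCII hex digits plus their fullwidth counterparts
# (U+FF10..U+FF19, U+FF21..U+FF26, U+FF41..U+FF46).
ASCII_HEX = "0123456789ABCDEFabcdef"
FULLWIDTH_HEX = "".join(chr(ord(c) + 0xFEE0) for c in ASCII_HEX)

# NFKC maps each fullwidth form to its ASCII equivalent, which int() can parse.
hex_value = {
    ch: int(unicodedata.normalize("NFKC", ch), 16)
    for ch in ASCII_HEX + FULLWIDTH_HEX
}

print(len(hex_value))
print(hex_value["F"], hex_value["\uff26"])
```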

Q: How does Unicode cope with hexadecimal digits?

A: The hexadecimal number system, used in computing, is not that special: you can base a number system on any natural number greater than 1. The most widely used base is 10, but 2, 8, and 12 have also seen extensive use as number bases, whether in computing or archaic mathematics. Hence, it is not wise to define a particular set of digits for every number system somebody might wish to use.

Rather, Unicode, much like its predecessors, assumes that hexadecimal numbers are written with the ordinary (decimal) digits (representing zero through nine) and the letters A through F (representing ten through fifteen). Only context makes clear whether a string of digits is meant as a number and, if so, in which number system.

Most applications define particular syntax rules to help distinguish decimal, octal, or hexadecimal numbers from other input tokens. For example, in some programming languages, “2010” is a decimal number, “0x7DA” is a hexadecimal number, and “thisYear” is an identifier. In the absence of such syntactic hints, you could use the Hex_Digit property from the Unicode Character Database to identify candidate hexadecimal numbers; however, a string of Hex_Digit characters, such as “bed”, is not necessarily meant to be read as a hexadecimal number.
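A minimal Python sketch of such a syntax rule follows. It assumes a C-like convention in which a "0x" prefix marks hexadecimal; parse_number is a hypothetical helper.

```python
def parse_number(token: str) -> int:
    """Parse a token using a C-like convention: a '0x' prefix means hexadecimal."""
    if token.lower().startswith("0x"):
        return int(token[2:], 16)
    return int(token, 10)

print(parse_number("2010"))   # decimal
print(parse_number("0x7DA"))  # hexadecimal spelling of the same value
print(int("bed", 16))         # "bed" is a number only if the caller decides it is
```

Without the prefix, nothing in the string "bed" itself says whether it is a word or a number; the decision rests with the surrounding syntax.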

Whenever it is important that hexadecimal numbers in a table align vertically, you should choose a fixed-pitch font for them by means of a higher-level protocol. Some fonts will also show the uppercase hexadecimal digits at the same height as the digits. Such a font is used in the Unicode code charts to give 4- and 5-digit hexadecimal numbers a nice rectangular appearance. [OS]

Q: Why is the hacek accent called "caron" in Unicode?

A: Nobody knows.

Legend has it that the term was first spotted in one of the 'giant books' from the '30s at Mergenthaler Linotype Company in Brooklyn, NY; but no one has been able to confirm that.

More accurate reports trace the term back to the mid '80s where we do have documented sightings of "caron" in publications such as:

  • The TypEncyclopedia by Frank Romano, ISBN: 0-835-21925-9, Libraries Unlimited; 1984
    p. 6 shows the mark with the notation "caron/hacek/clicka"

  • IBM's Green Book which has an original copyright date of 1986.
    "Caron Accent" appears on p. K-432, in a table entitled "Diacritic Mark Special Graphic Characters."
    National Language Support Reference Manual. 4th ed. 1994. (National Language Design Guide, 2)

  • SGML documentation from ISO 8879:1986, see isodia.html

Unicode and the ISO 8859 series of standards just carried the tradition along.

In an article published in 2001, "Orthographic diacritics and multilingual computing", J.C. Wells, a linguist at University College London, writes:
"The term ‘caron’, however, is wrapped in mystery. Incredibly, it seems to appear in no current dictionary of English, not even the OED."

Whoever the originators were, we suspect that they have probably taken their secrets to the grave by now.

Q: Where are private-use characters used, and how should they be handled?

A: This is the topic of the Private-Use Characters FAQ, which answers many questions about the handling of private-use characters.