Draft Proposed Update
Unicode Technical Report #21
Case Mappings
Version 5
Authors Mark Davis (mark.davis@us.ibm.com)
Date 2001.11.03
This Version http://www.unicode.org/unicode/reports/tr21/tr21-5
Previous Version http://www.unicode.org/unicode/reports/tr21/tr21-4.3
Latest Version http://www.unicode.org/unicode/reports/tr21
Tracking Number 5
Summary
This document presents implementation guidelines for case operations: case conversion, case detection, and caseless matching.
Status
This document is a draft proposed update to an existing Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard and how to reference this document, see http://www.unicode.org/unicode/standard/versions/.
1 Introduction
Case is a normative property of characters in specific alphabets (Latin, Greek, Cyrillic, Armenian, and archaic Georgian) whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Alphabets with case differences are called bicameral; those without are called unicameral.
Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z. Thus the three case forms are UPPERCASE, Titlecase, and lowercase.
Note: The term titlecase can also be used to refer to words where the first letter is an uppercase or titlecase letter, and the rest of the letters are lowercase. However, not all words in the title of a document or first words in a sentence will be titlecase.
The choice of which words to titlecase is language-dependent. For example, "Taming of the Shrew" would be the appropriate capitalization in English, not "Taming Of The Shrew". Moreover, the determination of what actually constitutes a word is also language-dependent. For example, l'arbre might be considered two words in French, while can't is considered one word in English.
Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian.
The case mappings in the Unicode Character Database (UCD) are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, U+0041 "A" is normatively uppercase, but its lowercase mapping to U+0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character, but case mappings between characters are occasionally influenced by local conventions.
There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.
- In most cases, the titlecase is the same as the uppercase, but not always. For example, the titlecase of U+01F1 "DZ" capital dz is U+01F2 "Dz" capital d with small z.
- Case mappings may produce strings of different length than the original.
- For example, the German character U+00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping, such as with U+0149 "ʼn" latin small letter n preceded by apostrophe.
- There are some characters that require special handling, such as U+0345 combining iota subscript.
- Characters may also have different case mappings, depending on the context.
- For example, U+03A3 "Σ" capital sigma lowercases to U+03C3 "σ" small sigma if it is followed by another letter, but lowercases to U+03C2 "ς" small final sigma if it is not.
- Characters may have case mappings that depend on the locale.
- For example, in Turkish the letter U+0049 "I" capital letter i lowercases to U+0131 "ı" small dotless i.
- Since many characters are really caseless (most of the IPA block, for example) and have no matching uppercase, the process of uppercasing a string does not mean that it will no longer contain any lowercase letters.
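Several of the complications listed above can be observed with a short Python sketch. It assumes that Python 3's built-in `str.upper`, `str.lower`, and `str.title` apply the default, locale-independent Unicode mappings; the variable names are illustrative only.

```python
# Illustrative only: Python's str methods stand in for the default mappings.

# Titlecase is not always the same as uppercase.
dz_upper = "\u01F1".upper()        # capital DZ stays U+01F1
dz_title = "\u01F1".title()        # capital DZ titlecases to U+01F2

# Case mappings may change the length of a string.
sharp_s_upper = "\u00DF".upper()   # sharp s expands to "SS"
n_apos_upper = "\u0149".upper()    # no precomposed uppercase: U+02BC + "N"

# Capital sigma lowercases differently depending on context.
final_sigma = "\u039F\u0394\u039F\u03A3".lower()   # ends in U+03C2
initial_sigma = "\u03A3\u039F".lower()             # starts with U+03C3

# No locale tailoring is applied here, so "I" lowercases to "i",
# not to the Turkish dotless i (U+0131).
default_i = "I".lower()

print(dz_title, sharp_s_upper, n_apos_upper, final_sigma, initial_sigma, default_i)
```

A Turkish-aware implementation would need locale data that this sketch does not consult.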
1.1 Reversibility
It is important to note that no casing operations are reversible. For example,
upper(lower(“John Brown”)) → “JOHN BROWN”
lower(upper(“John Brown”)) → “john brown”.
There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. Once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation. There are also single characters that do not have reversible mappings, such as the Greek sigmas above.
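A minimal sketch of this loss of information, assuming Python's built-in string methods as stand-ins for the casing operations:

```python
# Illustrative only: once a casing operation is applied, a second casing
# operation cannot recover the original string.
name = "McGowan"

round_trips = {
    "upper(lower(s))": name.lower().upper(),
    "lower(upper(s))": name.upper().lower(),
    "title(lower(s))": name.lower().title(),
}

for label, result in round_trips.items():
    print(label, "->", result, "(matches original)" if result == name else "(differs)")

# A single character can also fail to round-trip: final sigma uppercases to
# capital sigma, which (in isolation) lowercases to the non-final form.
sigma_round_trip = "\u03C2".upper().lower()
```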
For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string, and return to it in the sequence of keys. The user interface would produce the following results in response to a series of command-keys. Notice that the original string is restored every fourth time.
The quick brown
THE QUICK BROWN
the quick brown
The Quick Brown
The quick brown (repeating from here on)
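The toggle behavior above might be sketched as follows. The class name `CaseCycler` and its `press` method are hypothetical, and Python's built-in string methods stand in for the casing operations; the key point is that the original string is saved rather than derived from the current display.

```python
class CaseCycler:
    """Cycles a selection through casings, returning to the saved original."""

    def __init__(self, original: str) -> None:
        self.original = original   # never overwritten
        self._presses = 0

    def press(self) -> str:
        """Return the text to display after one more command-key press."""
        self._presses += 1
        step = self._presses % 4
        if step == 0:
            return self.original   # every fourth press restores the original
        transform = (str.upper, str.lower, str.title)[step - 1]
        return transform(self.original)


cycler = CaseCycler("The quick brown")
sequence = [cycler.press() for _ in range(4)]
print(sequence)
```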
Uppercase, titlecase, and lowercase can be represented in a word processor by using a character style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.
1.2 Data
The Unicode Character Database contains four files with information that is relevant to case mapping:
UnicodeData.txt Contains the case mappings that map to a single character. These do not increase the length of strings, and do not contain context-dependent mappings. Only legacy implementations that cannot handle case mappings that increase string lengths use UnicodeData case mappings alone. The single-character mappings are insufficient for languages such as German.
SpecialCasing.txt Contains additional case mappings that map to more than one character, such as "ß" to "SS". It also contains context-dependent mappings, with flags to distinguish them from the normal mappings. There are some characters that have a "best" single-character mapping in UnicodeData and also have a full mapping in SpecialCasing.
CaseFolding.txt Contains data for performing locale-independent case-folding, as described in 2.3 Caseless Matching.
PropList.txt Contains definitions of the properties Other_Lowercase and Other_Uppercase.
A set of charts that show the Unicode 3.0 case mappings is also available online. The index page is ordered by general category and script. The codepoints are sorted by lowercased NFKC, to place related characters next to one another.
In addition, Normalization Form D (NFD) from UAX #15, "Unicode Normalization Forms", is used in the definitions for case mapping.
The full case mappings for Unicode characters are obtained by using the mappings from SpecialCasing plus the mappings from UnicodeData, excluding any latter mappings that would conflict. Any character that does not have a mapping in these files is considered to map to itself. In this document, the full case mappings are referred to as UCD_lower(x), UCD_title(x), and UCD_upper(x).
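The combination of the two files can be sketched roughly as follows. This is a minimal illustration, not a complete loader: the sample lines are a handful of entries written in the files' semicolon-separated layout, the function names (`parse_simple`, `parse_special`, `ucd_map`) are hypothetical, and conditional (context- or locale-dependent) entries are simply skipped.

```python
# Sample lines only; a real implementation would read the full UCD files.
UNICODE_DATA_SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5
"""

SPECIAL_CASING_SAMPLE = """\
# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""


def parse_simple(text):
    """Single-character mappings: fields 12 (upper), 13 (lower), 14 (title)."""
    maps = {"lower": {}, "title": {}, "upper": {}}
    for line in text.splitlines():
        f = line.split(";")
        ch = chr(int(f[0], 16))
        # An empty title field defaults to the uppercase mapping
        # (per the current UCD documentation).
        for kind, value in (("upper", f[12]), ("lower", f[13]), ("title", f[14] or f[12])):
            if value:
                maps[kind][ch] = chr(int(value, 16))
    return maps


def parse_special(text):
    """Multi-character mappings; entries with a condition field are skipped."""
    maps = {"lower": {}, "title": {}, "upper": {}}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        f = [part.strip() for part in data.split(";")]
        if len(f) > 4 and f[4]:
            continue  # context-dependent mapping: needs separate handling
        ch = chr(int(f[0], 16))
        for kind, idx in (("lower", 1), ("title", 2), ("upper", 3)):
            maps[kind][ch] = "".join(chr(int(h, 16)) for h in f[idx].split())
    return maps


def ucd_map(maps, kind, ch):
    """Characters without a mapping map to themselves."""
    return maps[kind].get(ch, ch)


simple = parse_simple(UNICODE_DATA_SAMPLE)
special = parse_special(SPECIAL_CASING_SAMPLE)
# SpecialCasing entries take precedence over conflicting UnicodeData entries.
FULL = {kind: {**simple[kind], **special[kind]} for kind in simple}
print(ucd_map(FULL, "upper", "\u00DF"))
```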
1.2.1 Context-Dependent Mappings
The context-dependent case mappings are used in all of these functions, although they affect very few characters. The conditions are described in detail in the header of the SpecialCasing file.
Because there are very few context-dependent case mappings, implementations may choose to hard-code the treatment of these characters rather than use data-driven code based on the UCD. When this is done, every time the implementation is upgraded to a new version of Unicode, the code must be checked for consistency with the updated data.
2 Guidelines
There are a number of fine points in case operations that programmers need to be aware of in doing case conversion, case detection, and caseless matching.
Detection of case and case mapping requires more than just the general category values (Lu, Lt, Ll). The following definitions are used:
D1. A character X is defined to be cased if it meets any of the following criteria:
- The general category of X is
- Uppercase Letter (Lu), or
- Lowercase Letter (Ll), or
- Titlecase Letter (Lt)
- In PropList.txt, X has one of the properties
- Other_Uppercase, or
- Other_Lowercase
- Given Y = NFD(X), then it is not the case that:
- Y = UCD_lower(Y) = UCD_upper(Y) = UCD_title(Y)
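Definition D1 can be approximated in Python as shown below. This is a sketch under stated assumptions: the standard `unicodedata` module does not expose the PropList.txt properties, so the Other_Uppercase and Other_Lowercase test is omitted, and Python's per-character `str` methods stand in for UCD_lower, UCD_upper, and UCD_title. The function name `is_cased` is illustrative.

```python
import unicodedata


def is_cased(ch: str) -> bool:
    """Approximate D1 for a single character."""
    # Criterion 1: general category Lu, Ll, or Lt.
    if unicodedata.category(ch) in {"Lu", "Ll", "Lt"}:
        return True
    # Criterion 2 (Other_Uppercase / Other_Lowercase) is omitted here:
    # PropList.txt is not available through the standard library.
    # Criterion 3: some case mapping changes the NFD form of the character.
    y = unicodedata.normalize("NFD", ch)
    lower = "".join(c.lower() for c in y)
    upper = "".join(c.upper() for c in y)
    title = "".join(c.title() for c in y)
    return not (y == lower == upper == title)


for sample in ("A", "a", "\u01C5", "\u0345", "1"):
    print(f"U+{ord(sample):04X}", is_cased(sample))
```

In this approximation, the third criterion catches some characters whose general category is not a letter category, such as U+0345 combining iota subscript.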
D2. A character X is defined to be titlecase-ignorable if it meets any of the following criteria:
- The general category of X is
- Nonspacing Mark (Mn), or
- Enclosing Mark (Me), or
- Format Control (Cf), or
- Modifier Letter (Lm)
- X is one of the following characters
- U+0027 APOSTROPHE
- U+00AD SOFT HYPHEN (SHY)
- U+2019 RIGHT SINGLE QUOTATION MARK
(the preferred character for apostrophe)
2.1 Case Conversion of Strings
upper(X)
- Map each character X to UCD_upper(X).
- Remember to use the context-dependent mappings.
lower(X)
- Map each character X to UCD_lower(X).
- Remember to use the context-dependent mappings above.
title(X)
- For each character X, find the preceding character Y.
- ignore any intervening titlecase-ignorable characters when finding Y.
- If Y exists, and is cased
- map X to UCD_lower(X)
- Otherwise,
- map X to UCD_title(X)
- Remember to use the context-dependent mappings above, and consider the titlecase caveats.
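The title(X) procedure might be sketched in Python as follows. The helper names are hypothetical, the cased test is a simplified version of D1 (no PropList.txt data), and Python's per-character `str.lower` and `str.title` stand in for UCD_lower and UCD_title. Context-dependent mappings such as final sigma are not applied in this sketch.

```python
import unicodedata

IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm"}
IGNORABLE_CHARS = {"\u0027", "\u00AD", "\u2019"}


def is_titlecase_ignorable(ch: str) -> bool:
    """D2: marks, format controls, modifier letters, and apostrophe-like characters."""
    return ch in IGNORABLE_CHARS or unicodedata.category(ch) in IGNORABLE_CATEGORIES


def is_cased(ch: str) -> bool:
    """Simplified D1: letter categories, or any case mapping changes the character."""
    if unicodedata.category(ch) in {"Lu", "Ll", "Lt"}:
        return True
    return ch.lower() != ch or ch.upper() != ch


def title_string(text: str) -> str:
    out = []
    prev_is_cased = False  # is the preceding non-ignorable character cased?
    for ch in text:
        out.append(ch.lower() if prev_is_cased else ch.title())
        if not is_titlecase_ignorable(ch):
            prev_is_cased = is_cased(ch)
    return "".join(out)


print(title_string("john's book"))   # the apostrophe does not start a new word
print("john's book".title())         # the built-in method treats it as a break
```

Skipping titlecase-ignorable characters is what keeps the letter after an apostrophe lowercase; the built-in `str.title` does not do this.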
2.2 Case Detection for Strings
The simplest mechanisms for determining the case of a string are based upon the case conversion operations. Given a string X, and a Y = NFD(X), then:
- X is lowercase if lower(Y) = Y
- X is uppercase if upper(Y) = Y
- X is titlecase if title(Y) = Y
- X is cased if it is not the case that:
- Y = lower(Y) = upper(Y) = title(Y)
While these are the logical definitions, actual implementations can optimize the detection of case.
Examples:
Lowercase "a", "john smith", "a1", "1"
Uppercase "A", "JOHN SMITH", "A1", "1"
Titlecase "A", "John Smith", "A1", "1"
As seen from the examples, these conditions are not exclusive.
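These logical definitions translate almost directly into code. The sketch below assumes Python's built-in `str.lower`, `str.upper`, and `str.title` as stand-ins for lower, upper, and title; note that `str.title` does not skip titlecase-ignorable characters, so it is only an approximation of title(X) for strings that contain combining marks or apostrophes. The function names are illustrative.

```python
import unicodedata


def _nfd(x: str) -> str:
    return unicodedata.normalize("NFD", x)


def is_lowercase(x: str) -> bool:
    y = _nfd(x)
    return y.lower() == y


def is_uppercase(x: str) -> bool:
    y = _nfd(x)
    return y.upper() == y


def is_titlecase(x: str) -> bool:
    y = _nfd(x)
    return y.title() == y  # stand-in; see the caveat above


def is_cased(x: str) -> bool:
    y = _nfd(x)
    return not (y == y.lower() == y.upper() == y.title())


for s in ("a", "john smith", "A1", "John Smith", "1"):
    print(repr(s), is_lowercase(s), is_uppercase(s), is_titlecase(s), is_cased(s))
```

A string such as "1" satisfies all three case predicates while not being cased, which illustrates that the conditions are not exclusive.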
2.3 Caseless Matching
Caseless matching is commonly implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly.
Note: normally the original source string is not replaced by the folded string, since that may erase important information. For example, the name "Marco di Silva" would be folded to "marco di silva", losing the information as to which letters are capitalized. What is typically done is that the original string is stored along with a case-folded version for fast comparisons.
The CaseFolding.txt file in the Unicode Character Database is used for performing locale-independent case-folding. This file is generated from the case mappings in the Unicode Character Database, using both the single-character mappings and the multi-character mappings. It folds all characters having different case forms together into a common form. To compare two strings for caseless matching, you can fold each string using this data, and then use a binary comparison.
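As a minimal sketch, Python's `str.casefold` can stand in for folding with CaseFolding.txt data (it implements the default, locale-independent folding). The `CaselessIndex` class is hypothetical and illustrates keeping the original string alongside its folded key.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Fold both strings, then compare the results directly."""
    return a.casefold() == b.casefold()


class CaselessIndex:
    """Stores original strings under their case-folded keys."""

    def __init__(self) -> None:
        self._by_key = {}

    def add(self, original: str) -> None:
        self._by_key[original.casefold()] = original  # original is preserved

    def lookup(self, query: str):
        return self._by_key.get(query.casefold())


# The Greek example from the text, written with escapes for clarity.
mixed = "\u039C\u03AC\u03CA\u03BF\u03C2"
upper = "\u039C\u0386\u03AA\u039F\u03A3"
print(caseless_equal(mixed, upper))

# Folding is more than lowercasing: sharp s folds to "ss".
print(caseless_equal("Stra\u00DFe", "STRASSE"), "Stra\u00DFe".lower() == "STRASSE".lower())

index = CaselessIndex()
index.add("Marco di Silva")
print(index.lookup("MARCO DI SILVA"))
```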
For those concerned with the details, case-folding logically involves a set of equivalence classes, constructed from the Unicode Character Database case mappings as follows.
For each character X in Unicode:
1. If X is already in an equivalence class, continue to next character.
2. Otherwise, form a new equivalence class, and add X.
3. Then add whatever upper-, lower- or titlecases to anything in the set.
4. Then add whatever anything in the set upper-, lower- or titlecases to.
5. Repeat #3 and #4 until nothing further is added.
Each equivalence class is completely disjoint from all the others, and together they form a partition of the entire Unicode code space. From each class, one representative element (a single lowercase letter where possible) is chosen to be the common form. CaseFolding.txt thus contains the mappings from other characters in the equivalence classes to their common forms.
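The construction of the equivalence classes can be sketched as a closure computation. This illustration makes several assumptions: it covers only a small repertoire (code points below U+0400), it uses Python's per-character `str` methods in place of the UCD mappings, and it considers only single-character mapping results. The function names are hypothetical, and the resulting classes are not guaranteed to match CaseFolding.txt.

```python
from collections import defaultdict


def case_images(ch: str) -> set:
    """Single-character upper-, lower-, and titlecase images of ch."""
    return {m for m in (ch.upper(), ch.lower(), ch.title()) if len(m) == 1}


def build_classes(repertoire):
    # Inverse map: which characters upper-, lower-, or titlecase to a given one.
    inverse = defaultdict(set)
    for ch in repertoire:
        for image in case_images(ch):
            inverse[image].add(ch)

    class_of = {}
    for ch in repertoire:
        if ch in class_of:
            continue                    # already placed in a class
        members = {ch}                  # form a new class containing X
        pending = [ch]
        while pending:                  # repeat until nothing further is added
            current = pending.pop()
            neighbors = case_images(current) | inverse.get(current, set())
            for other in neighbors:
                if other not in members:
                    members.add(other)
                    pending.append(other)
        for member in members:
            class_of[member] = members
    return class_of


REPERTOIRE = [chr(cp) for cp in range(0x0400)]
classes = build_classes(REPERTOIRE)
print(sorted(f"U+{ord(c):04X}" for c in classes["\u03C3"]))
```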
Generally, where case distinctions are not important, other distinctions between Unicode characters (in particular, compatibility distinctions) are ignored as well. In such circumstances, text can be normalized to Normalization Form KC or KD after case-folding, to produce a normalized form that erases both compatibility distinctions and case distinctions. (See UAX #15: Unicode Normalization Forms for more information.) However, such normalization should generally only be done on a restricted repertoire, such as identifiers (alphanumerics).
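A minimal sketch of folding followed by compatibility normalization, assuming Python's `str.casefold` and `unicodedata.normalize`; the function name `identifier_key` is hypothetical. It performs a single pass only, which is intended for a restricted repertoire such as identifiers and may not be stable for arbitrary input.

```python
import unicodedata


def identifier_key(s: str) -> str:
    """Case-fold, then normalize to NFKC, for loose identifier comparison."""
    return unicodedata.normalize("NFKC", s.casefold())


# Fullwidth "A" followed by "BC" compares equal to plain "abc".
print(identifier_key("\uFF21BC") == identifier_key("abc"))
# The "fi" ligature followed by "LE" compares equal to "file".
print(identifier_key("\uFB01LE") == identifier_key("FILE"))
```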
Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Where locale-sensitive case matching is used, this information can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.
However, in most environments, such as in file systems, text is not and cannot be tagged with locale information. In such cases, the locale-specific mappings must not be used. Otherwise, data structures such as B-trees might be built based on one set of case-foldings and used based on a different set. This will cause those data structures to become corrupt. For such environments, a constant, locale-independent case-folding is required.
Modifications
The following summarizes modifications from the previous versions of this document.
5
- Expanded definitions for the new Other_Lowercase and Other_Uppercase properties. This also allowed the definitions to be simplified.
- Minor editing
4.3
- Defined the sets lower, title, upper, and uniqueUpper instead of relying on the general category.
- Introduced UCD_title, UCD_upper, UCD_lower notation.
- Reordered sections of text for clarity
- Minor editing
4.2
- Fixed pointer for CaseFolding.txt to point to the UCD
- Added text to describe the CaseFolding.txt generation in terms of equivalence classes
- Added Modification section
- Minor editing
Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.