Unicode Technical Report #15
Unicode Normalization Forms

 
Revision 16
Authors Mark Davis (mark@unicode.org), Martin Dürst (duerst@w3.org)
Date 1999-08-16
This Version http://www.unicode.org/unicode/reports/tr15/tr15-16.html
Previous Version http://www.unicode.org/unicode/reports/tr15/tr15-15.html
Latest Version http://www.unicode.org/unicode/reports/tr15
Unicode Technical Reports http://www.unicode.org/unicode/reports/

Summary

This document describes specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations.

Status of this document

This document contains informative material and normative specifications that have been considered and approved by the Unicode Technical Committee for publication as a Technical Report and as part of the Unicode Standard, Version 3.0 (forthcoming). Any reference to version 3.0 of the Unicode Standard automatically includes this technical report.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions/ for more information.

This technical report may undergo further editorial work before the release of the Unicode Standard, Version 3.0. Please mail corrigenda and other comments to the authors.

§1 Introduction

The Unicode Standard, Version 3.0 describes several forms of normalization in Section 5.7 (Section 5.9 in Version 2.0). Two of these forms are precisely specified in Section 3.6. In particular, the standard defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence with the original unnormalized text.

The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence with the original unnormalized text. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.

Both of these formats are normalizations to decomposed characters. While Section 3.6 also discusses a normalization to composite characters (also known as decomposable or precomposed characters), it does not precisely specify the format. Because of the nature of the precomposed forms in the Unicode Standard, there is more than one possible specification for a normalized form with composite characters. This document provides a unique specification for those forms, and a label for each normalized form.

The four normalization forms are labeled as follows.

Title | Description | Specification
Normalization Form D | Canonical Decomposition | Sections 3.6, 3.10, and 3.11 of The Unicode Standard, also summarized under Decomposition
Normalization Form C | Canonical Decomposition, followed by Canonical Composition | see Specification
Normalization Form KD | Compatibility Decomposition | Sections 3.6, 3.10, and 3.11 of The Unicode Standard, also summarized under Decomposition
Normalization Form KC | Compatibility Decomposition, followed by Canonical Composition | see Specification

As with decomposition, there are two forms of normalization to composite characters, Form C and Form KC. The difference between these depends on whether the resulting text is to be a canonical equivalent to the original unnormalized text, or is to be a compatibility equivalent to the original unnormalized text. (In KC and KD, a K is used to stand for compatibility to avoid confusion with the C standing for canonical.) Both types of normalization can be useful in different circumstances.

Normalization Form C is basically the form of text which uses canonical composite characters where possible, and maintains the distinction between characters that are compatibility equivalents. Typical strings of composite accented Unicode characters are already in Normalization Form C. Implementations of Unicode which restrict themselves to a repertoire containing no combining marks (such as those that declare themselves to be implementations at Level 1 as defined in ISO/IEC 10646-1) are already typically using Normalization Form C. (Implementations of later versions of 10646 need to be aware of the versioning issues--see Versioning.) This is also the form of normalization currently chosen for use in W3C specifications; see the W3C Character Model document (http://www.w3.org/TR/WD-charmod) and the W3C Character Requirements document (http://www.w3.org/TR/WD-charreq).

Normalization Form KC additionally levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will normalize to the same strings, as will Roman Numerals and their letter equivalents. More complete examples are provided below. However, there is loss of information when text is transformed into Normalization Form KC, so it is not recommended for all circumstances.
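
As a rough illustration of the difference between Forms C and KC, the following sketch uses the java.text.Normalizer class of the Java standard library. That class is not part of this report, and it follows the Unicode version of the Java runtime rather than the composition version fixed in this document. The class name CompatibilityDemo is hypothetical.

```java
import java.text.Normalizer;

class CompatibilityDemo {
    /** Normalization Form C, as computed by the Java library. */
    static String toC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Normalization Form KC, as computed by the Java library. */
    static String toKC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "o\uFB03ce";   // "o", ffi ligature, "ce"
        String roman = "Henry \u2163";   // ends with the Roman numeral four character

        // Form C keeps compatibility characters from the source text.
        System.out.println(toC(ligature).equals(ligature));
        System.out.println(toC(roman).equals(roman));

        // Form KC replaces them with their compatibility equivalents.
        System.out.println(toKC(ligature).equals("office"));
        System.out.println(toKC(roman).equals("Henry IV"));
    }
}
```

The loss of information mentioned above is visible here: once the text is in Form KC, nothing records that the source used a ligature or a Roman numeral character.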

To summarize the treatment of compatibility characters that were in the source text:

Normalization Form KC does not attempt to map characters to compatibility composites. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a character that is the compatibility equivalent of the sequence of three characters 'ffi'.
Neither of the composition normalization forms C and KC is closed under string concatenation. For example, the strings "a" and "^" (combining circumflex) are both in form C, but the concatenation of the two ("a" + "^" => "a^") is not: the normalized form is the precomposed character "â". Without limiting the repertoire, there is no way to produce a composition normalized form that is closed under simple string concatenation. If desired, however, a specialized function could be constructed that produces a normalized concatenation. The decomposition normalization forms D and KD are closed under string concatenation and substringing.
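
The concatenation case can be sketched as follows, again assuming the java.text.Normalizer class of the Java standard library rather than anything defined in this report. The class ConcatDemo and its method names are hypothetical. For simplicity the sketch renormalizes the whole result; a specialized function would only need to examine the text around the join.

```java
import java.text.Normalizer;

class ConcatDemo {
    /** Tests whether a string is already in Normalization Form C. */
    static boolean isC(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    /** Concatenates two strings and renormalizes the result to Form C. */
    static String concatC(String left, String right) {
        return Normalizer.normalize(left + right, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String a = "a";
        String circumflex = "\u0302";   // combining circumflex accent

        System.out.println(isC(a));                // each part is in Form C
        System.out.println(isC(circumflex));
        System.out.println(isC(a + circumflex));   // the plain concatenation is not
        System.out.println(concatC(a, circumflex).equals("\u00E2"));
    }
}
```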

All of the definitions in this document depend on the rules for equivalence and decomposition found in Chapter 3 of The Unicode Standard and the decomposition mappings in the Unicode Character Database.

Decomposition must be done in accordance with these rules. In particular, the decomposition mappings found in the Unicode Character Database must be applied recursively, and then the string put into canonical order.

§2 Notation

We will use the following notation for brevity:

§3 Versioning

Because additional composite characters may be added to future versions of the Unicode Standard, composition is less stable than decomposition. Therefore, it is necessary to specify a fixed version for the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.

Decomposition is only unstable if an existing character's decomposition mapping changes. The Unicode Technical Committee has the policy of carefully reviewing proposed corrections in character decompositions, and only making changes where the benefits very clearly outweigh the drawbacks.

The fixed version of the composition process is defined by reference to a particular version of the Unicode Character Database, called the composition version. That composition version is specified to be version 3.0.0. For more information, see:

To see what difference the composition version makes, suppose that Unicode 4.0 adds the composite Q-caron. For an implementation that uses Unicode 4.0, strings in Normalization Forms C or KC will continue to contain the sequence Q + caron, and not the new character Q-caron, since a canonical composition for Q-caron was not defined in the composition version.

§5 Conformance

A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in this document.

A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in this document.

The specifications for Normalization Forms are written in terms of a process for producing a decomposition or composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result. Similarly, testing for a particular Normalization Form does not require applying the process of normalization, so long as the result of the test is equivalent to applying normalization and then testing bit-for-bit identity.
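
The equivalence between a direct test and the reference definition (normalize, then compare) can be sketched as below. The sketch assumes the java.text.Normalizer class of the Java standard library, which is not defined by this report; the class FormTest and its method names are hypothetical.

```java
import java.text.Normalizer;

class FormTest {
    /** Reference test: apply normalization, then compare for identity. */
    static boolean isFormCByDefinition(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).equals(s);
    }

    /** Library test, which is free to use a faster internal mechanism. */
    static boolean isFormCByLibrary(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String[] samples = { "\u00E2", "a\u0302", "\u212B", "abc" };
        for (String s : samples) {
            // A conformant test must agree with the reference definition.
            System.out.println(isFormCByDefinition(s) == isFormCByLibrary(s));
        }
    }
}
```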

§6 Specification

All combining character sequences start with a character of canonical class zero. For simplicity, we define a term for such characters:

D1. A character S is a starter if it has a canonical class of zero in the Unicode Character Database.

Because of the definition of canonical equivalence, the order of combining characters with the same canonical class makes a difference. For example, a-macron-breve is not the same as a-breve-macron. Characters can not be composed if that would change the canonical order of the combining characters.

D2. In any character sequence beginning with a starter S, a character C is blocked from S just in case there is some character B between S and C, and either B is a starter or it has the same canonical class as C.

When B blocks C, changing the order of B and C would result in a character sequence that is not canonically equivalent to the original. See Section 3.9 Canonical Ordering Behavior in the Unicode Standard.
If a combining character sequence is in canonical order, then testing whether a character is blocked only requires looking at the immediately preceding character.
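
Definition D2 and the shortcut for canonically ordered text can be sketched as follows. This is an illustration only: the sequence is represented by an array of canonical classes whose first entry belongs to the starter S, and the class Blocking and its method names are hypothetical. The shortcut additionally assumes that S is the last starter before C.

```java
class Blocking {
    /**
     * Direct reading of D2. classes[0] is the class of the starter S,
     * and classes[cIndex] is the class of the character C.
     */
    static boolean isBlocked(int[] classes, int cIndex) {
        int cClass = classes[cIndex];
        for (int b = 1; b < cIndex; ++b) {
            // B blocks C if B is a starter or has the same class as C.
            if (classes[b] == 0 || classes[b] == cClass) return true;
        }
        return false;
    }

    /**
     * Shortcut, assuming the sequence is in canonical order and no
     * starter lies between S and C: only the preceding character matters.
     */
    static boolean isBlockedInCanonicalOrder(int[] classes, int cIndex) {
        if (cIndex <= 1) return false;          // C directly follows S
        int previous = classes[cIndex - 1];
        return previous == 0 || previous == classes[cIndex];
    }

    public static void main(String[] args) {
        int[] macronBreve = { 0, 230, 230 };    // a + macron + breve
        int[] belowAbove = { 0, 220, 230 };     // a + dot_below + dot_above
        System.out.println(isBlocked(macronBreve, 2));  // breve is blocked
        System.out.println(isBlocked(belowAbove, 2));   // dot_above is not
    }
}
```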

The process of forming a composition in Normalization Form C or KC involves:

Figure 1 shows a sample of how this works in principle. The green cubes represent starters, and the gray cubes represent non-starters. In the first step, the string is fully decomposed, and reordered. In the second step, each character is checked against the last starter and the last non-starter, and combined if all the conditions are met. Examples are provided in Annex 1: Examples, and a code sample is provided in Annex 5: Code Sample.

Figure 1

More formally, we require a precise notion of when an unblocked character can be composed with a starter. This uses the following two definitions.

D3. A primary composite is a character that has a canonical decomposition mapping in the Unicode Character Database (or is a canonical Hangul decomposition) but is not in the Composition Exclusion Table.

Hangul syllable decomposition is considered a canonical decomposition. See Technical Report #8: The Unicode Standard, Version 2.1 (http://www.unicode.org/unicode/reports/tr8.html) or The Unicode Standard, Version 3.0.

D4. A character X can be primary combined with a character Y just in case there is a primary composite Z which is canonically equivalent to the sequence <X, Y>.

Based upon these definitions, we can provide an explicit definition of the two Normalization forms that compose characters.

Normalization Form C

The Normalization Form C for a string S is obtained by applying the following process, or any other process that leads to the same result:

  1. Generate the canonical decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through each character C in that decomposition, from first to last. If C is not blocked from the last starter L, and it can be primary combined with L, then replace L by the composite L-C, and remove C.

The result of this process is a new string S' which is in Normalization Form C.

Normalization Form KC

The Normalization Form KC for a string S is obtained by applying the following process, or any other process that leads to the same result:

  1. Generate the compatibility decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through each character C in that decomposition, from first to last. If C is not blocked from the last starter L, and it can be primary combined with L, then replace L by the composite L-C, and remove C.

The result of this process is a new string S' which is in Normalization Form KC.

§7 Composition Exclusion Table

In the Unicode Character Database, two characters may have the same canonical decomposition. Here is an example of this:

Source => Decomposition
212B ('Å' ANGSTROM SIGN) => 0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)
00C5 ('Å' LATIN CAPITAL LETTER A WITH RING ABOVE) => 0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)

However, in such cases, the Unicode Character Database will first decompose one of the characters to the other, and then decompose from there. That is, one of the characters (in this case ANGSTROM SIGN) will have a singleton decomposition. These singleton decompositions are some of the decompositions excluded from primary composition.

The characters having excluded decompositions are included in Unicode essentially for compatibility with certain pre-existing standards. They fall into four classes:

  1. Script-specifics: precomposed characters that are generally not the preferred form for particular scripts. These cannot be computed from information in the Unicode Character Database.
  2. Post Composition Version: precomposed characters that are added to Unicode after the composition version is fixed. This set is currently empty, but will be updated with each subsequent version of Unicode. These cannot be computed from information in the Unicode Character Database.
  3. Singletons: precomposed characters whose decompositions are single characters (as described above). These can be computed from information in the Unicode Character Database.
  4. Non-starter decompositions: precomposed characters whose decompositions start with a non-starter. These can be computed from information in the Unicode Character Database.

A machine-readable form of the Composition Exclusion Table is found in ftp://ftp.unicode.org/Public/3.0-Update/.


Annex 1: Examples

This annex provides some detailed examples of the results of applying each of the composing normalization forms.

Normalization Form C and KC Examples:

The following examples are cases where the Forms C and KC are identical.

Row | Original | Form D, KD | Form C, KC | Notes
a | D-dot_above | D + dot_above | D-dot_above | Both decomposed and precomposed canonical sequences produce the same result.
b | D + dot_above | D + dot_above | D-dot_above |
c | D-dot_below + dot_above | D + dot_below + dot_above | D-dot_below + dot_above | By the time we have gotten to dot_above, it cannot be combined with the base character. There may be intervening combining marks (see f), so long as the result of the combination is canonically equivalent.
d | D-dot_above + dot_below | D + dot_below + dot_above | D-dot_below + dot_above |
e | D + dot_above + dot_below | D + dot_below + dot_above | D-dot_below + dot_above |
f | D + dot_above + horn + dot_below | D + horn + dot_below + dot_above | D-dot_below + horn + dot_above |
g | E-macron-grave | E + macron + grave | E-macron-grave | Multiple combining characters are combined with the base character.
h | E-macron + grave | E + macron + grave | E-macron-grave |
i | E-grave + macron | E + grave + macron | E-grave + macron | Characters will not be combined if they would not be canonical equivalents because of their ordering.
j | angstrom_sign | A + ring | A-ring | Since Å (A-ring) is the preferred composite, it is the form produced for both characters.
k | A-ring | A + ring | A-ring |
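
A few of these rows can be spot-checked with the sketch below. It assumes the java.text.Normalizer class of the Java standard library, which follows the Unicode version of the Java runtime rather than the composition version of this report; the class FormCExamples is hypothetical.

```java
import java.text.Normalizer;

class FormCExamples {
    static String toD(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String toC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String rowD = "\u1E0A\u0323";        // D-dot_above + dot_below
        String rowF = "D\u0307\u031B\u0323"; // D + dot_above + horn + dot_below
        String rowH = "\u0112\u0300";        // E-macron + grave
        String rowI = "\u00C8\u0304";        // E-grave + macron
        String rowJ = "\u212B";              // angstrom_sign

        System.out.println(toD(rowD).equals("D\u0323\u0307"));
        System.out.println(toC(rowD).equals("\u1E0C\u0307"));
        System.out.println(toC(rowF).equals("\u1E0C\u031B\u0307"));
        System.out.println(toC(rowH).equals("\u1E14"));
        System.out.println(toC(rowI).equals(rowI));
        System.out.println(toC(rowJ).equals("\u00C5"));
    }
}
```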

Normalization Form C Examples:

The following are examples of Form C that illustrate how it differs from Form KC.

Row | Original | Form D | Form C | Notes
l | "Äffin" | "A\u0308ffin" | "Äffin" | The ffi_ligature (U+FB03) is not decomposed, since it has a compatibility mapping, not a canonical mapping. (See Normalization Form KC Examples.)
m | "Ä\uFB03n" | "A\u0308\uFB03n" | "Ä\uFB03n" |
n | "Henry IV" | "Henry IV" | "Henry IV" | Similarly, the ROMAN NUMERAL IV (U+2163) is not decomposed.
o | "Henry \u2163" | "Henry \u2163" | "Henry \u2163" |
p | ga | ka + ten | ga | Different compatibility equivalents of a single Japanese character will not result in the same string in Normalization Form C.
q | ka + ten | ka + ten | ga |
r | hw_ka + hw_ten | hw_ka + hw_ten | hw_ka + hw_ten |
s | ka + hw_ten | ka + hw_ten | ka + hw_ten |
t | hw_ka + ten | hw_ka + ten | hw_ka + ten |
u | kaks | ki + am + ksf | kaks | Hangul syllables are maintained under normalization.

Normalization Form KC Examples

The following are examples of Form KC that illustrate how it differs from Form C.

Row | Original | Form KD | Form KC | Notes
l' | "Äffin" | "A\u0308ffin" | "Äffin" | The ffi_ligature (U+FB03) is decomposed in Normalization Form KC (where it is not in Normalization Form C).
m' | "Ä\uFB03n" | "A\u0308ffin" | "Äffin" |
n' | "Henry IV" | "Henry IV" | "Henry IV" | Similarly, the resulting strings here are identical in Normalization Form KC.
o' | "Henry \u2163" | "Henry IV" | "Henry IV" |
p' | ga | ka + ten | ga | Different compatibility equivalents of a single Japanese character will result in the same string in Normalization Form KC.
q' | ka + ten | ka + ten | ga |
r' | hw_ka + hw_ten | ka + ten | ga |
s' | ka + hw_ten | ka + ten | ga |
t' | hw_ka + ten | ka + ten | ga |
u' | kaks | ki + am + ksf | kaks | Hangul syllables are maintained under normalization. (In earlier versions of Unicode, jamo characters like ksf had compatibility mappings to kf + sf. These mappings were removed in Unicode 2.1.9 to ensure that Hangul syllables are maintained.)
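
The katakana and Hangul rows can be spot-checked in the same way. The sketch assumes the java.text.Normalizer class of the Java standard library (not defined by this report) and uses the code points U+30AB (ka), U+3099 (ten), U+30AC (ga), U+FF76 (hw_ka), and U+FF9E (hw_ten); the class FormKCExamples is hypothetical.

```java
import java.text.Normalizer;

class FormKCExamples {
    static String toC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String toKD(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    static String toKC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String halfWidth = "\uFF76\uFF9E";   // hw_ka + hw_ten
        String mixed = "\u30AB\uFF9E";       // ka + hw_ten
        String jamo = "\u1100\u1161\u11AA";  // ki + am + ksf

        // Form C leaves the half-width characters alone.
        System.out.println(toC(halfWidth).equals(halfWidth));
        // Forms KD and KC level the half-width and full-width characters.
        System.out.println(toKD(halfWidth).equals("\u30AB\u3099"));
        System.out.println(toKC(halfWidth).equals("\u30AC"));
        System.out.println(toKC(mixed).equals("\u30AC"));
        // The jamo sequence composes to a single Hangul syllable.
        System.out.println(toKC(jamo).equals("\uAC03"));
    }
}
```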

Annex 2: Design Goals

The following are the design goals for the specification of the normalization forms, and are presented here for reference.

Goal 1: Uniqueness

The first, and by far the most important, design goal for the normalization forms is uniqueness: two equivalent strings will have precisely the same normalized form. More explicitly,

Goal 2: Stability

The second major design goal for the normalization forms is stability of characters that are not involved in the composition or decomposition process.

  1. If X contains a character with a compatibility decomposition, then D(X) and C(X) still contain that character.
     
  2. If the only decomposable characters in X are composites (see D3) and there are no combining characters, then C(X) = X.

There were four exceptions to Goal 2.2 in the Unicode Standard, Version 2.1. Four new characters were added to the Unicode Standard, Version 3.0 to remedy this situation. These are:

0226 LATIN CAPITAL LETTER A WITH DOT ABOVE
0227 LATIN SMALL LETTER A WITH DOT ABOVE
0228 LATIN CAPITAL LETTER E WITH CEDILLA
0229 LATIN SMALL LETTER E WITH CEDILLA

Goal 3: Efficiency

The third major design goal for the normalization forms is that they allow for efficient implementations.

  1. It is possible to implement efficient code for producing the Normalization Forms. In particular, it should be possible to produce Normalization Form C very quickly from strings that are already in Normalization Form C or are in Normalization Form D.

Annex 3: Implementation Notes

Efficiency

There are a number of optimizations that can be made in programs that produce Normalization Form C. Rather than first decomposing the text fully, a quick check can be made on each character. If it is already in the proper precomposed form, then no work has to be done. Only if the current character is combining or in the Composition Exclusion Table does a slower code path need to be invoked. (This code path will need to look at previous characters, back to the last starter. See Trailing Characters for more information.)

The majority of the cycles spent in doing composition is spent looking up the appropriate data. The data lookup for Normalization Form C can be very efficiently implemented, since it only has to look up pairs of characters, not arbitrary strings. First a multi-stage table (as discussed in Chapter 5 of the Unicode Standard) is used to map a character c to a small integer i in a contiguous range from 0 to n. The code for doing this looks like:

i = data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];

Then a pair of these small integers is simply mapped through a two-dimensional array to get a resulting value. This yields much better performance than a general-purpose string lookup in a hash table.
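
One way such a two-stage table might be built is sketched below. The block size of 128, the use of int arrays, and the class TwoStageTable are assumptions made for illustration; only the lookup expression is taken from the text above. Identical blocks are stored once, which is where the space saving comes from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageTable {
    static final int BLOCKSHIFT = 7;               // assumed: 128 characters per block
    static final int BLOCKSIZE = 1 << BLOCKSHIFT;
    static final int BLOCKMASK = BLOCKSIZE - 1;

    private final int[] index;   // block number -> offset of that block in data
    private final int[] data;    // the distinct blocks, stored back to back

    /** Builds the table from a flat array with one entry per character. */
    TwoStageTable(int[] flat) {
        index = new int[flat.length >> BLOCKSHIFT];
        Map<String, Integer> seen = new HashMap<>();
        List<int[]> blocks = new ArrayList<>();
        for (int b = 0; b < index.length; ++b) {
            int[] block = Arrays.copyOfRange(flat, b << BLOCKSHIFT, (b + 1) << BLOCKSHIFT);
            String key = Arrays.toString(block);
            Integer offset = seen.get(key);
            if (offset == null) {                  // first time this block content occurs
                offset = blocks.size() * BLOCKSIZE;
                seen.put(key, offset);
                blocks.add(block);
            }
            index[b] = offset;
        }
        data = new int[blocks.size() * BLOCKSIZE];
        for (int i = 0; i < blocks.size(); ++i) {
            System.arraycopy(blocks.get(i), 0, data, i * BLOCKSIZE, BLOCKSIZE);
        }
    }

    /** The lookup from the text: maps a character to its small integer. */
    int get(char c) {
        return data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];
    }

    int dataLength() {
        return data.length;
    }

    public static void main(String[] args) {
        int[] flat = new int[0x10000];
        flat['a'] = 1;          // small integer for 'a'
        flat[0x0302] = 2;       // small integer for the combining circumflex
        TwoStageTable table = new TwoStageTable(flat);
        System.out.println(table.get('a') + " " + table.get((char) 0x0302)
            + " " + table.dataLength());
    }
}
```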

Hangul

Since the Hangul compositions and decompositions are algorithmic, memory storage can be significantly reduced if the corresponding operations are done in code rather than by simply storing the data in the general purpose tables. Here is sample code illustrating algorithmic Hangul canonical decomposition and composition done according to the specification in Section 3.11 Combining Jamo Behavior. Although coded in Java, the same structure can be used in other programming languages.

Common Constants

    static final int
        SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        LCount = 19, VCount = 21, TCount = 28,
        NCount = VCount * TCount,   // 588
        SCount = LCount * NCount;   // 11172

Hangul Decomposition

    public static String decomposeHangul(char s) {
        int SIndex = s - SBase;
        if (SIndex < 0 || SIndex >= SCount) {
            return String.valueOf(s);
        }
        StringBuffer result = new StringBuffer();
        int L = LBase + SIndex / NCount;
        int V = VBase + (SIndex % NCount) / TCount;
        int T = TBase + SIndex % TCount;
        result.append((char)L);
        result.append((char)V);
        if (T != TBase) result.append((char)T);
        return result.toString();
    }
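
As a worked instance of the arithmetic, the sketch below computes the jamo indices of the syllable U+D4DB and compares the resulting jamo with the canonical decomposition produced by the java.text.Normalizer class of the Java standard library. The class HangulDecompositionCheck and the method jamoIndices are hypothetical, and the method assumes that its argument is a precomposed Hangul syllable.

```java
import java.text.Normalizer;

class HangulDecompositionCheck {
    static final int
        SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        VCount = 21, TCount = 28,
        NCount = VCount * TCount;   // 588

    /** Returns {LIndex, VIndex, TIndex} for a precomposed Hangul syllable. */
    static int[] jamoIndices(char s) {
        int SIndex = s - SBase;
        return new int[] {
            SIndex / NCount,
            (SIndex % NCount) / TCount,
            SIndex % TCount
        };
    }

    public static void main(String[] args) {
        char syllable = (char) 0xD4DB;
        int[] idx = jamoIndices(syllable);   // SIndex is 10459
        String jamo = "" + (char) (LBase + idx[0])
                         + (char) (VBase + idx[1])
                         + (char) (TBase + idx[2]);
        String library = Normalizer.normalize(String.valueOf(syllable), Normalizer.Form.NFD);
        System.out.println(idx[0] + " " + idx[1] + " " + idx[2]);
        System.out.println(jamo.equals(library));
    }
}
```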

Hangul Composition

Notice an important feature of Hangul composition: whenever the source string is not in Normalization Form D, you can not just detect character sequences of the form <L, V> and <L, V, T>. You also must catch the sequences of the form <LV, T>. To guarantee uniqueness, these sequences must also be composed. This is illustrated in Step 2 below.

    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuffer result = new StringBuffer();
        char last = source.charAt(0);            // copy first char
        result.append(last);

        for (int i = 1; i < len; ++i) {
            char ch = source.charAt(i);

            // 1. check to see if two current characters are L and V

            int LIndex = last - LBase;
            if (0 <= LIndex && LIndex < LCount) {
                int VIndex = ch - VBase;
                if (0 <= VIndex && VIndex < VCount) {

                    // make syllable of form LV

                    last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // 2. check to see if two current characters are LV and T

            int SIndex = last - SBase;
            if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
                int TIndex = ch - TBase;
                // TIndex 0 means "no trailing consonant", so only 1..TCount-1 are valid
                if (0 < TIndex && TIndex < TCount) {

                    // make syllable of form LVT

                    last += TIndex;
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // if neither case was true, just add the character

            last = ch;
            result.append(ch);
        }
        return result.toString();
    }
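
The two pairwise cases can also be isolated in a small helper, which makes the boundary conditions easy to check. This is a sketch for illustration: the class HangulPair, the method composePair, and the convention of returning -1 when the pair does not compose are assumptions, not part of the sample above.

```java
class HangulPair {
    static final int
        SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        LCount = 19, VCount = 21, TCount = 28,
        NCount = VCount * TCount,
        SCount = LCount * NCount;

    /** Returns the composed syllable for <L, V> or <LV, T>, or -1 if none. */
    static int composePair(char first, char second) {
        // Case 1: leading consonant followed by vowel gives an LV syllable.
        int LIndex = first - LBase;
        int VIndex = second - VBase;
        if (0 <= LIndex && LIndex < LCount && 0 <= VIndex && VIndex < VCount) {
            return SBase + (LIndex * VCount + VIndex) * TCount;
        }
        // Case 2: LV syllable followed by trailing consonant gives LVT.
        int SIndex = first - SBase;
        int TIndex = second - TBase;
        if (0 <= SIndex && SIndex < SCount && SIndex % TCount == 0
                && 0 < TIndex && TIndex < TCount) {
            return first + TIndex;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(composePair((char) 0x1100, (char) 0x1161)));
        System.out.println(Integer.toHexString(composePair((char) 0xAC00, (char) 0x11AA)));
        System.out.println(composePair((char) 0xAC01, (char) 0x11A8));  // LVT + T
    }
}
```

A syllable that already has a trailing consonant (an LVT syllable) is never combined with a further trailing consonant, which is why the second case tests SIndex % TCount == 0.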

Additional transformations can be performed on sequences of Hangul jamo for various purposes. For example, to regularize sequences of Hangul jamo into standard syllables, the choseong and jungseong fillers can be inserted, as described in Chapter 3. (In the text of the 2.0 version of the Unicode Standard, these standard syllables were called canonical syllables, but this has nothing to do with canonical composition or decomposition.) For keyboard input, additional compositions may be performed. For example, the trailing consonants kf + sf may be combined into ksf. In addition, some Hangul input methods do not require a distinction on input between initial and final consonants, and change between them on the basis of context. For example, in the keyboard sequence mi + em + ni + si + am, the consonant ni would be reinterpreted as nf, since there is no possible syllable nsa. This results in the two syllables men and sa.

However, none of these additional transformations are considered part of the Unicode Normalization Formats.

Hangul Character Names

Hangul decomposition is also used to form the character names for the Hangul syllables. While the sample code that illustrates this process is not directly related to normalization, it is worth including here.

    public static String getHangulName(char s) {
        int SIndex = s - SBase;
        if (0 > SIndex || SIndex >= SCount) {
            throw new IllegalArgumentException("Not a Hangul Syllable: " + s);
        }
        StringBuffer result = new StringBuffer();
        int LIndex = SIndex / NCount;
        int VIndex = (SIndex % NCount) / TCount;
        int TIndex = SIndex % TCount;
        return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex]
          + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];
    }

    static private String[] JAMO_L_TABLE = {
        "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
        "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
    };

    static private String[] JAMO_V_TABLE = {
        "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
        "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
        "YU", "EU", "YI", "I"
    };

    static private String[] JAMO_T_TABLE = {
        "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
        "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
        "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
    };

Annex 4: Decomposition

For those accessing this document without access to the Unicode Standard, the following summarizes the canonical decomposition process. For a complete discussion, see Sections 3.6, 3.10 and 3.11 of the Unicode Standard.

Canonical decomposition is the process of taking a string, recursively replacing composite characters using the Unicode canonical decomposition mappings (including the algorithmic Hangul canonical decomposition mappings), and putting the result in canonical order.

Compatibility decomposition is the process of taking a string, replacing composite characters using both the Unicode canonical decomposition mappings and the Unicode compatibility decomposition mappings, and putting the result in canonical order.

A string is put into canonical order by repeatedly replacing any exchangeable pair by the pair in reversed order. When there are no remaining exchangeable pairs, then the string is in canonical order. Note that the replacements can be done in any order.

A sequence of two adjacent characters in a string is an exchangeable pair if the combining class (from the Unicode Character Database) for the first character is greater than the combining class for the second and the second is not a starter; that is, if CC(first) > CC(second) > 0.

Examples:

Sequence | Combining classes | Status
<acute, cedilla> | 230, 202 | exchangeable, since 230 > 202
<a, acute> | 0, 230 | not exchangeable, since 0 <= 230
<diaeresis, acute> | 230, 230 | not exchangeable, since 230 <= 230
<acute, a> | 230, 0 | not exchangeable, since the second class is zero.

Example:

  1. Take the string with the characters "ác´¸" (a-acute, c, acute, cedilla)
  2. The data file contains the following relevant information:
    code; name; ... canonical class; ... decomposition.
    0061;LATIN SMALL LETTER A;...0;...
    0063;LATIN SMALL LETTER C;...0;...
    00E1;LATIN SMALL LETTER A WITH ACUTE;...0;...0061 0301;...
    0107;LATIN SMALL LETTER C WITH ACUTE;...0;...0063 0301;...
    0301;COMBINING ACUTE ACCENT;...230;...
    0327;COMBINING CEDILLA;...202;...
  3. Applying the canonical decomposition mappings, we get "a´c´¸" (a, acute, c, acute, cedilla).
    • This is because 00E1 (a-acute) has a canonical decomposition mapping to 0061 0301 (a, acute)
  4. Applying the canonical ordering, we get "a´c¸´" (a, acute, c, cedilla, acute).
    • This is because cedilla has a lower canonical ordering value (202) than acute (230) does. The positions of 'a' and 'c' are not affected, since they are starters.
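
The canonical ordering step of this example can be sketched as follows. The combining class lookup is a toy table covering only the characters of the example, and the class CanonicalOrder with its method names is hypothetical; a real implementation would read the classes from the Unicode Character Database.

```java
class CanonicalOrder {
    /** Toy lookup: combining classes for the characters of the example only. */
    static int combiningClass(char ch) {
        switch (ch) {
            case '\u0301': return 230;   // combining acute accent
            case '\u0327': return 202;   // combining cedilla
            default:       return 0;     // everything else is treated as a starter
        }
    }

    /** True if CC(first) > CC(second) > 0. */
    static boolean isExchangeable(char first, char second) {
        int secondClass = combiningClass(second);
        return combiningClass(first) > secondClass && secondClass > 0;
    }

    /** Swaps exchangeable pairs until none remain. */
    static String reorder(String s) {
        StringBuilder sb = new StringBuilder(s);
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 0; i + 1 < sb.length(); ++i) {
                char a = sb.charAt(i);
                char b = sb.charAt(i + 1);
                if (isExchangeable(a, b)) {
                    sb.setCharAt(i, b);
                    sb.setCharAt(i + 1, a);
                    swapped = true;
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301c\u0301\u0327";   // a, acute, c, acute, cedilla
        System.out.println(reorder(decomposed).equals("a\u0301c\u0327\u0301"));
    }
}
```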

Annex 5: Code Sample

This section provides a code sample for the four different normalization forms. The sample is written in Java, though for accessibility it avoids the use of object-oriented techniques. (For a live demonstration of the code, see http://www.macchiato.com/mark/compose/. [Ed note: to be moved to the Unicode site.]) Equivalent Perl code written by Martin Dürst is available on the W3C site, at http://www.w3.org/International/charlint/.

For clarity, this sample is not optimized. The implementation transforms a string in two passes: first decomposing, then recomposing that result by successively composing each unblocked character with the last starter.

In some implementations, people may be working with streaming interfaces that read and write small amounts at a time. In those implementations, the text back to the last starter needs to be buffered. Whenever a second starter would be added to that buffer, the buffer can be flushed.

/**
 * Normalization Form Selector
 */

public static final byte 
    D = 0 , 
    C = COMPOSITION_MASK ,
    KD = COMPATIBILITY_MASK,
    KC = COMPATIBILITY_MASK + COMPOSITION_MASK;

/**
 * Normalizes text according to the chosen form, 
 * replacing contents of the target buffer.
 * @param   form        the normalization form selector (D, C, KD, or KC)
 * @param   source      the original text, unnormalized
 * @param   target      the resulting normalized text
 */

public void normalize(byte form, String source, StringBuffer target) {

    // First decompose the source into target,
    // then compose if the form requires.
    
    internalDecompose(form, source, target);
    if ((form & COMPOSITION_MASK) != 0) {
        internalCompose(target);
    }
}

// INTERNAL METHODS

/**
 * Decomposes text, either canonical or compatibility,
 * replacing contents of the target buffer.
 * @param   form        the normalization form. If COMPATIBILITY_MASK
 *                      bit is on in this byte, then selects the recursive 
 *                      compatibility decomposition, otherwise selects
 *                      the recursive canonical decomposition.
 * @param   source      the original text, unnormalized
 * @param   target      the resulting normalized text
 */

void internalDecompose(byte form, String source, StringBuffer target) {
    target.setLength(0);
    for (int i = 0; i < source.length(); ++i) {
        getDecomposition(form, source.charAt(i), buffer);
        
        // add all of the characters in the decomposition.
        // (may be just the original character, if there was
        // no decomposition mapping)
        
        for (int j = 0; j < buffer.length(); ++j) {
            char ch = buffer.charAt(j);
            int chClass = combiningClass(ch);
            int k = target.length(); // insertion point
            if (chClass != 0) {

                // bubble-sort combining marks as necessary

                for (; k > 0; --k) {
                    if (combiningClass(target.charAt(k-1)) <= chClass) break;
                }
            }
            target.insert(k, ch);
        }
    }
}

/**
 * Composes text in place. Target must already
 * have been decomposed.
 * @param   target      input: decomposed text.
 *                      output: the resulting normalized text.
 */

void internalCompose(StringBuffer target) {
    
    // Nothing to compose in an empty buffer; charAt(0) would throw.
    
    if (target.length() == 0) return;
    
    int starterPos = 0, compPos = 1;
    char starterCh = target.charAt(0);
    int lastClass = combiningClass(starterCh);
    
    // Loop on the decomposed characters, combining where possible
    
    for (int decompPos = 1; decompPos < target.length(); ++decompPos) {
        char ch = target.charAt(decompPos);
        int chClass = combiningClass(ch);
        char composite = pairwiseCombines(starterCh, ch);
        if (composite != NOT_COMPOSITE
          && (lastClass < chClass || lastClass == 0)) {
            target.setCharAt(starterPos, composite);
            starterCh = composite;
        } else {
            if (chClass == 0) {
                starterPos = compPos;
                starterCh  = ch;
            }
            lastClass = chClass;
            target.setCharAt(compPos++, ch);
        }
    }
    target.setLength(compPos);
}

// helper functions for accessing Unicode data

/**
 * Gets recursive decomposition of a character from the 
 * Unicode Character Database.
 * @param   form    the normalization form. If COMPATIBILITY_MASK
 *                  bit is on in this byte, then selects the recursive 
 *                  compatibility decomposition, otherwise selects
 *                  the recursive canonical decomposition.
 * @param   ch      the source character
 * @param   buffer  buffer to be filled with the decomposition
 */

void getDecomposition(byte form, char ch, StringBuffer buffer) {...}

/**
 * Gets the combining class of a character from the
 * Unicode Character Database.
 * @param   ch      the source character
 * @return          value from 0 to 255
 */

int combiningClass(char ch) {...}

/**
 * Returns the composite of the two characters. If the two
 * characters don't combine, returns NOT_COMPOSITE.
 * @param   first   first character (e.g. 'c')
 * @param   second  second character (e.g. '¸' cedilla)
 * @return          composite (e.g. 'ç')
 */

char pairwiseCombines(char first, char second) {...}

/**
 * Constant for use in pairwiseCombines
 */

public static final char NOT_COMPOSITE = '\uFFFF';

/**
 * Constant for use in distinguishing forms
 */ 

static final byte COMPATIBILITY_MASK = 1, COMPOSITION_MASK = 2;
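
The sample above leaves its data-access helpers elided, so it cannot be run as listed. As a rough way to observe the same decompose-then-compose behavior, the following sketch uses the java.text.Normalizer class from the Java standard library. That class is an assumption of this example and is not part of the sample code; the class name NormalizeDemo and its method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class. It relies on java.text.Normalizer rather than
// on the sample code's normalize() method, whose helpers are elided.
final class NormalizeDemo {

    // Canonical decomposition followed by canonical reordering (Form D).
    static String toFormD(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Canonical decomposition followed by canonical composition (Form C).
    static String toFormC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Renders a string as space-separated hex code units, e.g. "0063 0327".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }
}
```

For example, hex(toFormD("\u00E7")) should give "0063 0327". Applying toFormC to the sequence 0061 0301 0323 should give 1EA1 0301: the dot below (combining class 220) is reordered ahead of the acute (class 230) and then composes with the base letter, while the acute has no composite with the result.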

Annex 6: Legacy Encodings

While the Normalization Forms are specified for Unicode text, they can also be extended to non-Unicode (legacy) character encodings. This is based on mapping the legacy character set strings to and from Unicode.

D4. An invertible transcoding T for a legacy character set L is a mapping from strings encoded in L to strings in Unicode that has an associated mapping T⁻¹ such that for any string S in L, T⁻¹(T(S)) = S.

Typically there is a single accepted invertible transcoding for a given legacy character set. In a few cases there may be multiple invertible transcodings: for example, Shift-JIS may have two different mappings used in different circumstances, one to preserve the '\' semantics of byte 5C (hexadecimal) and one to preserve the '¥' semantics. If you implement transcoders for legacy character sets, it is recommended that you ensure that the result is in Normalization Form C where possible.

The character indexes in the legacy character set string may be very different from the character indexes in the Unicode equivalent. For example, if a legacy string uses visual encoding for Hebrew, then its first character might be the last character in the Unicode string.

D5. Given a string S encoded in L and an invertible transcoding T for L, the Normalization Form X of S under T is defined to be the result of mapping to Unicode, normalizing to Unicode Normalization Form X, and mapping back to the legacy character encoding, that is, T⁻¹(X(T(S))). Where there is a single accepted invertible transcoding for that character set, we can simply speak of the Normalization Form X of S.
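
Definition D5 can be sketched in Java as a decode, normalize, encode pipeline. This is a minimal illustration under stated assumptions: a java.nio.charset.Charset stands in for the transcoding T, java.text.Normalizer stands in for Normalization Form X, and the class name LegacyNormalizer is hypothetical.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

// Hypothetical helper: a Charset plays the role of the transcoding T.
final class LegacyNormalizer {

    // Computes T^-1(X(T(S))) for legacy bytes S.
    static byte[] normalize(byte[] legacy, Charset t, Normalizer.Form x) {
        String unicode = new String(legacy, t);                // T(S)
        String normalized = Normalizer.normalize(unicode, x);  // X(T(S))
        // String.getBytes substitutes the charset's replacement bytes for
        // characters that the legacy set cannot represent, so a lossy
        // result means S has no Form X representation under this charset.
        return normalized.getBytes(t);                         // T^-1(...)
    }
}
```

With StandardCharsets.ISO_8859_1 and Normalizer.Form.NFC, the byte E7 (c with cedilla) should come back unchanged. With Form.NFD the combining cedilla has no ISO 8859-1 encoding, so the output no longer round-trips.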

Legacy character sets fall into three categories based on their normalization behavior with accepted transcoders.

Annex 7: Programming Language Identifiers

The Unicode Standard provides a recommended syntax for identifiers for programming languages that allow the use of non-ASCII characters in code. It is a natural extension of the identifier syntax used in C and other programming languages:

<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*

<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]

<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]

That is, the first character of an identifier can be an uppercase letter, lowercase letter, titlecase letter, modifier letter, other letter, or letter number. The subsequent characters of an identifier can be any of those, plus non-spacing marks, spacing combining marks, decimal numbers, connector punctuation, and formatting codes. (Normally the formatting codes are filtered out before storing or comparing identifiers.)
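
The grammar above can be checked with the general categories exposed by java.lang.Character. The following is a sketch only: the class and method names are hypothetical, and the categories come from whatever Unicode version the running Java platform implements, which is not necessarily Unicode 3.0.

```java
// Hypothetical checker for the <identifier> grammar given above.
final class IdentifierSyntax {

    // <identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
    static boolean isIdentifierStart(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:   // Lu
            case Character.LOWERCASE_LETTER:   // Ll
            case Character.TITLECASE_LETTER:   // Lt
            case Character.MODIFIER_LETTER:    // Lm
            case Character.OTHER_LETTER:       // Lo
            case Character.LETTER_NUMBER:      // Nl
                return true;
            default:
                return false;
        }
    }

    // <identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
    static boolean isIdentifierExtend(int cp) {
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:        // Mn
            case Character.COMBINING_SPACING_MARK:  // Mc
            case Character.DECIMAL_DIGIT_NUMBER:    // Nd
            case Character.CONNECTOR_PUNCTUATION:   // Pc
            case Character.FORMAT:                  // Cf
                return true;
            default:
                return false;
        }
    }

    // One start character, then any mix of start and extend characters.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!isIdentifierStart(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isIdentifierStart(cp) && !isIdentifierExtend(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Under this grammar "foo_bar1" is accepted, while "1abc" and "_x" are rejected, since neither a decimal number nor connector punctuation is an identifier start.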

Normalization as described in this report can be used to avoid problems where apparently identical identifiers are not treated equivalently. Such problems can appear during both compilation and linking, particularly across different programming languages. To avoid such problems, programming languages should normalize identifiers before storing or comparing them, preferably in Normalization Form KC (while Normalization Form C can be used, KC eliminates variations that are probably not relevant to the specification of identifiers). This process is generally most reliable if the identifiers are normalized as they are parsed.

In addition, those programming languages with case-insensitive identifiers should also use the case mappings described in Unicode Technical Report #21, Case Mappings, to produce a case-insensitive normalized form.
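
As an illustration of the recommendation above, the following sketch derives comparison keys for identifiers. It assumes java.text.Normalizer for Normalization Form KC and uses String.toLowerCase(Locale.ROOT) as a rough stand-in for the case mappings of Technical Report #21, which it does not fully implement. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper that derives comparison keys for identifiers.
final class IdentifierKey {

    // Key for case-sensitive languages: Normalization Form KC.
    static String normalized(String id) {
        return Normalizer.normalize(id, Normalizer.Form.NFKC);
    }

    // Key for case-insensitive languages. Lowercasing is only an
    // approximation of the case mappings; the result is normalized
    // again because a case mapping can produce unnormalized text.
    static String caseInsensitive(String id) {
        String lower = normalized(id).toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFKC);
    }
}
```

With these keys, an identifier spelled with the ligature U+FB01 compares equal to one spelled "fi", and U+212B ANGSTROM SIGN compares equal to U+00C5.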

Annex 8: Trailing Characters

The Trailing Characters table lists the characters in Unicode 3.0 that may occur in a canonical decomposition of a character, but not as the first character of that decomposition. The inclusion of this table here is informative: the table can be generated from the Unicode Character Database.

If a string does not contain characters in the Trailing Characters table or in the Composition Exclusion Table, then none of its characters participate in compositions, so the only processing required for Normalization Form C is to make sure that the characters are in canonical order. The Other Non-Starters table contains all of the Unicode 3.0 non-starters that are neither in the Trailing Characters table nor in the Composition Exclusion table. If a string contains no characters from any of these three tables, then it is in Normalization Form C already.
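
The shortcut described above can be sketched as a single pass over the string. In this illustration the three tables are passed in as java.util.BitSet values indexed by code point; building them from the Unicode Character Database is left out. The class name, the enum, and the parameter names are all hypothetical.

```java
import java.util.BitSet;

// Hypothetical pre-scan that decides how much work Form C needs.
final class NfcPreScan {

    enum Work { NONE, REORDER_ONLY, FULL_NORMALIZATION }

    // trailing, excluded and otherNonStarters stand for the Trailing
    // Characters table, the Composition Exclusion Table and the Other
    // Non-Starters table, each indexed by code point.
    static Work classify(String s, BitSet trailing, BitSet excluded,
                         BitSet otherNonStarters) {
        boolean sawOtherNonStarter = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (trailing.get(cp) || excluded.get(cp)) {
                // Some character may take part in a composition.
                return Work.FULL_NORMALIZATION;
            }
            if (otherNonStarters.get(cp)) {
                sawOtherNonStarter = true;
            }
            i += Character.charCount(cp);
        }
        // No compositions are possible; at most reordering is needed.
        return sawOtherNonStarter ? Work.REORDER_ONLY : Work.NONE;
    }
}
```

A result of NONE means the string is already in Normalization Form C under the rule above. REORDER_ONLY means that only canonical ordering has to be checked, and FULL_NORMALIZATION means that the complete algorithm should be run.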

Trailing Characters
0300 COMBINING GRAVE ACCENT
..0304 COMBINING MACRON
0306 COMBINING BREVE
..030C COMBINING CARON
030F COMBINING DOUBLE GRAVE ACCENT
0311 COMBINING INVERTED BREVE
0313 COMBINING COMMA ABOVE
0314 COMBINING REVERSED COMMA ABOVE
031B COMBINING HORN
0323 COMBINING DOT BELOW
..0328 COMBINING OGONEK
032D COMBINING CIRCUMFLEX ACCENT BELOW
032E COMBINING BREVE BELOW
0330 COMBINING TILDE BELOW
0331 COMBINING MACRON BELOW
0338 COMBINING LONG SOLIDUS OVERLAY
0342 COMBINING GREEK PERISPOMENI
0345 COMBINING GREEK YPOGEGRAMMENI
05B4 HEBREW POINT HIRIQ
05B7 HEBREW POINT PATAH
..05B9 HEBREW POINT HOLAM
05BC HEBREW POINT DAGESH OR MAPIQ
05BF HEBREW POINT RAFE
05C1 HEBREW POINT SHIN DOT
05C2 HEBREW POINT SIN DOT
093C DEVANAGARI SIGN NUKTA
09BC BENGALI SIGN NUKTA
09BE BENGALI VOWEL SIGN AA
09D7 BENGALI AU LENGTH MARK
0A3C GURMUKHI SIGN NUKTA
0B3C ORIYA SIGN NUKTA
0B3E ORIYA VOWEL SIGN AA
0B56 ORIYA AI LENGTH MARK
0B57 ORIYA AU LENGTH MARK
0BBE TAMIL VOWEL SIGN AA
0BD7 TAMIL AU LENGTH MARK
0C56 TELUGU AI LENGTH MARK
0CC2 KANNADA VOWEL SIGN UU
0CD5 KANNADA LENGTH MARK
0CD6 KANNADA AI LENGTH MARK
0D3E MALAYALAM VOWEL SIGN AA
0D57 MALAYALAM AU LENGTH MARK
0DCA SINHALA SIGN AL-LAKUNA
0DCF SINHALA VOWEL SIGN AELA-PILLA
0DDF SINHALA VOWEL SIGN GAYANUKITTA
0E32 THAI CHARACTER SARA AA
0EB2 LAO VOWEL SIGN AA
0F71 TIBETAN VOWEL SIGN AA
0F74 TIBETAN VOWEL SIGN U
0F80 TIBETAN VOWEL SIGN REVERSED I
0FB5 TIBETAN SUBJOINED LETTER SSA
0FB7 TIBETAN SUBJOINED LETTER HA
102E MYANMAR VOWEL SIGN II
1161 HANGUL JUNGSEONG A
..1175 HANGUL JUNGSEONG I
11A8 HANGUL JONGSEONG KIYEOK
..11C2 HANGUL JONGSEONG HIEUH
3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Other Non-Starters
0305 COMBINING OVERLINE
030D COMBINING VERTICAL LINE ABOVE
030E COMBINING DOUBLE VERTICAL LINE ABOVE
0310 COMBINING CANDRABINDU
0312 COMBINING TURNED COMMA ABOVE
0315 COMBINING COMMA ABOVE RIGHT
..031A COMBINING LEFT ANGLE ABOVE
031C COMBINING LEFT HALF RING BELOW
..0322 COMBINING RETROFLEX HOOK BELOW
0329 COMBINING VERTICAL LINE BELOW
..032C COMBINING CARON BELOW
032F COMBINING INVERTED BREVE BELOW
0332 COMBINING LOW LINE
..0337 COMBINING SHORT SOLIDUS OVERLAY
0339 COMBINING RIGHT HALF RING BELOW
..033F COMBINING DOUBLE OVERLINE
0344 COMBINING GREEK DIALYTIKA TONOS
0346 COMBINING BRIDGE ABOVE
..034E COMBINING UPWARDS ARROW BELOW
0360 COMBINING DOUBLE TILDE
..0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW
0483 COMBINING CYRILLIC TITLO
..0486 COMBINING CYRILLIC PSILI PNEUMATA
0591 HEBREW ACCENT ETNAHTA
..05A1 HEBREW ACCENT PAZER
05A3 HEBREW ACCENT MUNAH
..05B3 HEBREW POINT HATAF QAMATS
05B5 HEBREW POINT TSERE
05B6 HEBREW POINT SEGOL
05BB HEBREW POINT QUBUTS
05BD HEBREW POINT METEG
05C4 HEBREW MARK UPPER DOT
064B ARABIC FATHATAN
..0655 ARABIC HAMZA BELOW
0670 ARABIC LETTER SUPERSCRIPT ALEF
06D6 ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
..06DC ARABIC SMALL HIGH SEEN
06DF ARABIC SMALL HIGH ROUNDED ZERO
..06E4 ARABIC SMALL HIGH MADDA
06E7 ARABIC SMALL HIGH YEH
06E8 ARABIC SMALL HIGH NOON
06EA ARABIC EMPTY CENTRE LOW STOP
..06ED ARABIC SMALL LOW MEEM
0711 SYRIAC LETTER SUPERSCRIPT ALAPH
0730 SYRIAC PTHAHA ABOVE
..074A SYRIAC BARREKH
094D DEVANAGARI SIGN VIRAMA
0951 DEVANAGARI STRESS SIGN UDATTA
..0954 DEVANAGARI ACUTE ACCENT
09CD BENGALI SIGN VIRAMA
0A4D GURMUKHI SIGN VIRAMA
0ABC GUJARATI SIGN NUKTA
0ACD GUJARATI SIGN VIRAMA
0B4D ORIYA SIGN VIRAMA
0BCD TAMIL SIGN VIRAMA
0C46 TELUGU VOWEL SIGN E
0C4D TELUGU SIGN VIRAMA
0C55 TELUGU LENGTH MARK
0CCD KANNADA SIGN VIRAMA
0D4D MALAYALAM SIGN VIRAMA
0E38 THAI CHARACTER SARA U
..0E3A THAI CHARACTER PHINTHU
0E48 THAI CHARACTER MAI EK
..0E4B THAI CHARACTER MAI CHATTAWA
0E4D THAI CHARACTER NIKHAHIT
0EB8 LAO VOWEL SIGN U
0EB9 LAO VOWEL SIGN UU
0EC8 LAO TONE MAI EK
..0ECB LAO TONE MAI CATAWA
0ECD LAO NIGGAHITA
0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
0F39 TIBETAN MARK TSA -PHRU
0F72 TIBETAN VOWEL SIGN I
0F7A TIBETAN VOWEL SIGN E
..0F7D TIBETAN VOWEL SIGN OO
0F82 TIBETAN SIGN NYI ZLA NAA DA
..0F84 TIBETAN MARK HALANTA
0F86 TIBETAN SIGN LCI RTAGS
0F87 TIBETAN SIGN YANG RTAGS
1037 MYANMAR SIGN DOT BELOW
1039 MYANMAR SIGN VIRAMA
17B5 KHMER VOWEL INHERENT AA
17D2 KHMER SIGN COENG
18A9 MONGOLIAN LETTER AG DAGALGA
20D0 COMBINING LEFT HARPOON ABOVE
..20DC COMBINING FOUR DOTS ABOVE
20E1 COMBINING LEFT RIGHT ARROW ABOVE
302A IDEOGRAPHIC LEVEL TONE MARK
..302F HANGUL DOUBLE DOT TONE MARK
FB1E HEBREW POINT JUDEO-SPANISH VARIKA
FE20 COMBINING LIGATURE LEFT HALF
..FE23 COMBINING DOUBLE TILDE RIGHT HALF

Annex 9: Conformance Testing

Implementations must be thoroughly tested for conformance to the normalization specification, especially for Normalization Form C. The following conditions should be tested in any implementation.

For every character X in Unicode, let the string Y be D(X), and the string Z be C(D(X)). Check that the following conditions for these strings are true:

  1. If X does not have a canonical decomposition mapping in the Unicode Character Database, then X = Y = Z.

Otherwise:

  2. Y and Z must be in canonical order
  3. X ≠ Y
  4. No character in Y can have a canonical decomposition mapping in the Unicode Character Database
  5. If X is in the CompositionExclusions table, then:
  6. If X is not in the CompositionExclusions table, then:

To test for canonical order in a string S, check that for each character index i in the string (except the first), if CC(S[i-1]) > CC(S[i]), then CC(S[i]) = 0. If this condition fails, the string is not in canonical order.
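
The canonical order test can be sketched as follows. Because the Java standard library does not expose combining classes directly, the function CC is supplied by the caller as an IntUnaryOperator. That parameter and the class name CanonicalOrder are assumptions of this example, and the check works on UTF-16 code units in the same way as the sample code.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical checker for the canonical order condition.
final class CanonicalOrder {

    // cc maps a character to its combining class (0 to 255).
    static boolean isCanonicallyOrdered(String s, IntUnaryOperator cc) {
        for (int i = 1; i < s.length(); ++i) {
            int prev = cc.applyAsInt(s.charAt(i - 1));
            int cur = cc.applyAsInt(s.charAt(i));
            // A drop in combining class is allowed only onto a starter.
            if (prev > cur && cur != 0) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, with a cc that returns 230 for U+0301 and 220 for U+0323, the sequence 0061 0323 0301 passes the test and 0061 0301 0323 fails it.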

Annex 10: Intellectual Property

Transcript of letter regarding disclosure of IBM Technology
(Hard copy is on file with the Chair of UTC and the Chair of NCITS/L2)
Transcribed on 1998-03-10

February 26, 1999

 

The Chair, Unicode Technical Committee

Subject: Disclosure of IBM Technology - Unicode Normalization Forms

The attached document entitled "Unicode Normalization Forms" does not require IBM technology, but may be implemented using IBM technology that has been filed for US Patent. However, IBM believes that the technology could be beneficial to the software community at large, especially with respect to usage on the Internet, allowing the community to derive the enormous benefits provided by Unicode.

This letter is to inform you that IBM is pleased to make the Unicode normalization technology that has been filed for patent freely available to anyone using them in implementing to the Unicode standard.

Sincerely,

 

W. J. Sullivan,
Acting Director of National Language Support
and Information Development

 


Copyright

Copyright © 1998-1999 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.