L2/01-371

Normalization Form FCD

Markus W. Scherer
2001-oct-07

Editorial Notes 2001-oct-08:

This document was prepared for discussion at the UTC meeting in November 2001.
The goal is to include such text in UAX 15 (Normalization) and to mention it in UTS 10 (Collation).
Note that although the topic is Normalization, FCD is not strictly a "Normalization Form" because it lacks uniqueness. Possible changes in language to avoid confusion may need to be discussed.
Mark mentioned recently that there might be better names for this form than "FCD".

Form "Fast C or D" (FCD) was developed so that collation and similar string processing operations can work with most strings without normalizing them.
Algorithms such as collation and string search, which treat all canonically equivalent forms of a string the same, can be easily implemented by first canonically decomposing (NFD) each input string. Because no precomposed characters remain after that step, the implementation of the algorithm can omit the properties of precomposed characters from its data.
However, if an implementation does include equivalent data for both precomposed and decomposed forms, then it can avoid the normalization step in most cases. It will work properly without NFD normalization if input strings are in form FCD.

Definition: A string is in FCD if the concatenation of the canonical decompositions (NFD) of each of its characters is canonically ordered.
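The definition can be read almost literally as code. The following is a minimal sketch in Python, assuming the standard unicodedata module (whose Unicode version depends on the Python build); the function names are hypothetical and are not part of any specification or of ICU.

```python
import unicodedata


def is_canonically_ordered(s: str) -> bool:
    """True if no non-zero combining class is lower than the one before it."""
    prev_cc = 0
    for ch in s:
        cc = unicodedata.combining(ch)
        if cc != 0 and cc < prev_cc:
            return False
        prev_cc = cc
    return True


def is_fcd_by_definition(s: str) -> bool:
    """Decompose each code point on its own, concatenate, then check the order."""
    decomposed = "".join(unicodedata.normalize("NFD", ch) for ch in s)
    return is_canonically_ordered(decomposed)
```

For example, U+00E1 followed by U+0320 (combining class 220) is not in FCD under this check, because the decomposition of U+00E1 ends in U+0301 (combining class 230).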

FCD has the following properties:

A string in NFD is also in FCD.
There may be many canonically equivalent forms of a string that are all in FCD.
Most NFC strings are in FCD.
Texts in many languages are always in FCD because they use at most one combining character (with non-zero combining class) at a time.
Testing for FCD can be very fast.
It is possible to efficiently generate an FCD form of a string in fewer steps than an NFD form.
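The last property can be illustrated with a rough sketch that decomposes only those segments of a string that fail the FCD test and leaves everything else untouched. This is an illustration under stated assumptions, not the ICU unorm_makeFCD() algorithm: the segmentation rule (start a new segment at every code point whose decomposition begins with combining class zero) and the names make_fcd and _lead_trail are hypothetical, and Python's unicodedata module stands in for real normalization data.

```python
import unicodedata


def _lead_trail(ch: str):
    """Combining classes of the first and last code points of NFD(ch)."""
    d = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(d[0]), unicodedata.combining(d[-1])


def make_fcd(s: str) -> str:
    """Return a canonically equivalent string that is in FCD."""
    out = []
    start = 0        # index where the current segment begins
    ok = True        # has the current segment passed the FCD test so far?
    prev_cc = 0
    for i, ch in enumerate(s):
        lead, trail = _lead_trail(ch)
        if lead == 0 and i > start:
            # Segment boundary: keep the finished segment if it was fine,
            # otherwise replace it with its NFD form.
            seg = s[start:i]
            out.append(seg if ok else unicodedata.normalize("NFD", seg))
            start, ok = i, True
        elif lead != 0 and lead < prev_cc:
            ok = False
        prev_cc = trail
    seg = s[start:]
    out.append(seg if ok else unicodedata.normalize("NFD", seg))
    return "".join(out)
```

In this sketch a string such as U+00E1 followed by "b" is returned unchanged, while full NFD would decompose the U+00E1; only a segment like U+00E1 U+0320 is rewritten (to "a", U+0320, U+0301).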

Testing for FCD (pseudo-code):

boolean isFCD(String s) {
    String d;                           // holds decompositions
    UChar32 c;                          // code point value
    uint8_t prevCC=0, leadCC, trailCC;  // variables for combining classes

    for each Unicode code point c in s {
        d=NFD(c);
        leadCC=getCombiningClass(first code point in d);
        trailCC=getCombiningClass(last code point in d);
        if(leadCC!=0 && leadCC<prevCC) {
            return false;               // out of canonical order
        }
        prevCC=trailCC;
    }
    return true;
}
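A runnable rendering of this pseudo-code might look like the following Python sketch. It assumes the standard unicodedata module for NFD() and getCombiningClass(); the name is_fcd is chosen here for illustration only.

```python
import unicodedata


def is_fcd(s: str) -> bool:
    """Single pass over s, comparing each leadCC with the previous trailCC."""
    prev_cc = 0
    for ch in s:
        d = unicodedata.normalize("NFD", ch)    # decomposition of one code point
        lead_cc = unicodedata.combining(d[0])   # class of its first code point
        trail_cc = unicodedata.combining(d[-1]) # class of its last code point
        if lead_cc != 0 and lead_cc < prev_cc:
            return False                        # out of canonical order
        prev_cc = trail_cc
    return True
```

With this sketch, both "a" + U+0301 and the precomposed U+00E1 pass the test, which illustrates that several canonically equivalent forms of a string can be in FCD.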

This test can be made extremely fast by storing the precomputed leadCC and trailCC for each Unicode code point in a table. Further optimizations include:

All code points below U+00c0 (US-ASCII plus Latin-1 punctuation) have zero leadCC and trailCC. Their combining classes do not need to be looked up. (Fast for US-ASCII.)
All code points below U+0300 have zero leadCC. Their trailCC only needs to be looked up if a code point at or above U+0300 follows. (Fast for all Latin texts.)
A fast and very small loop can handle the simple cases of c<0xc0, c<0x300, and leadCC(c)==0 && trailCC(c)==0. The canonical order needs to be checked only if either the leadCC or the trailCC is not zero. (Fast for CJK ideographs, Korean, etc.)
In UTF-16, most of the test can operate on 16-bit code units. Only the few lead surrogates for supplementary characters that have non-zero leadCC/trailCC need to be marked in the table to be able to leave the fast loop. This avoids testing for surrogates in the fast loop, and avoids assembling 21-bit code point values in almost all cases.
Trail surrogates and all other lead surrogates are marked with zeroes in the lookup table. In Unicode 3.1, there is only one lead surrogate that has a non-zero value. (Fast for BMP characters and all supplementary characters with leadCC==trailCC==0.)
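The first two optimizations can be sketched as follows. This is only an approximation in Python: a memoized function (the hypothetical lead_trail) stands in for the precomputed lookup table, the code iterates over code points rather than UTF-16 code units, and the surrogate handling described above is not modeled.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def lead_trail(ch: str):
    """Stand-in for the table: (leadCC, trailCC) of one code point."""
    d = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(d[0]), unicodedata.combining(d[-1])


def is_fcd_fast(s: str) -> bool:
    prev_cc = 0
    pending = None  # code point in [U+00C0, U+0300) whose trailCC is not yet looked up
    for ch in s:
        c = ord(ch)
        if c < 0x300:
            # leadCC is zero here, so canonical order cannot be violated.
            # Below U+00C0 the trailCC is zero as well; otherwise defer the lookup.
            pending = ch if c >= 0xC0 else None
            prev_cc = 0
            continue
        if pending is not None:
            prev_cc = lead_trail(pending)[1]  # the deferred trailCC is needed now
            pending = None
        lead_cc, trail_cc = lead_trail(ch)
        if lead_cc != 0 and lead_cc < prev_cc:
            return False
        prev_cc = trail_cc
    return True
```

For pure ASCII or Latin-1 input this loop never performs a lookup; the deferred lookup happens only when a code point at or above U+0300 follows a code point in the range U+00C0..U+02FF.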

Form "Fast C or D" was developed and first documented by Mark E. Davis in 2000 for the collation implementation in ICU 1.8. See the ICU collation design document for the original description, and the ICU normalization implementation for a test function (unorm_checkFCD()) and a "makeFCD" function (unorm_makeFCD()).