Markus W. Scherer
2001-oct-07
Editorial Notes 2001-oct-08:
Form "Fast C or D" was developed for collation and similar string
processing operations to work with most strings without normalizing them.
Algorithms — like collation and string search — that treat all canonically
equivalent forms of a string the same can be easily implemented by first
canonically decomposing (NFD) each input string. Since there will be no
precomposed characters after that, the implementation of the algorithm can omit
properties for many characters from its data.
However, if an implementation does include equivalent data for both precomposed
and decomposed forms, then it can avoid the normalization step in most cases.
It will work properly without NFD normalization if input strings are in form
FCD.
Definition: A string is in FCD if the concatenation of the canonical decompositions (NFD) of each of its characters is canonically ordered.
FCD has the following properties:
Testing for FCD (pseudo-code):
boolean isFCD(String s) {
String d; // holds decompositions
UChar32 c; // code point value
uint8_t prevCC=0, leadCC, trailCC; // variables for combining classes
for each Unicode code point c in s {
d=NFD(c);
leadCC=getCombiningClass(first code point in d);
trailCC=getCombiningClass(last code point in d);
if(leadCC!=0 && leadCC<prevCC) {
return false;
}
prevCC=trailCC;
}
}
This test can be made extremely fast by storing the precomputed leadCC and trailCC for each Unicode code point in a table. Further optimizations include:
c<0xc0, c<0x300, leadCC(c)==0 &&
trailCC(c)==0
. Only if either the leadCC or trailCC is not
zero does the canonical order need to be checked. (Fast for CJK
ideographs, Korean, etc.) Form "Fast C or D" was developed and first documented by Mark E. Davis in 2000 for the collation implementation in ICU 1.8. See the ICU collation design document for the original description, and the ICU normalization implementation for a test function (unorm_checkFCD()) and a "makeFCD" function (unorm_makeFCD()).