This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Mon Aug 8 11:21:20 CDT 2011Forwarding a question from our Thai expert Sirilappanich (email: natta@th.ibm.com) ----------- Regarding to NFC, NFD, NFKC and NFKD. Thai Character has following characteristic: NFKC() = NFKD() = But NFC() = NFD() = NFKC() = NFKD() = Should NFC() and NFD() produce ? By current logic, if text got normalized to either NFKC or NFKD, it is impossible to normalize back to NFC nor NFD except using external tool, e.g. regex.
Date/Time: Sun Sep 4 08:02:24 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: UAX15-D4 (Stream-Safe Text Process) optimization
UAX#15 describes in rule D4 a "Stream-Safe Text Process" with the following algorithm: _________ 1. If the input string is empty, return an empty output string. 2. Set nonStarterCount to zero. 3. For each code point C in the input string: 3.a. Produce the NFKD decomposition S. 3.b. If nonStarterCount plus the number of initial non-starters in S is greater than 30, append a CGJ to the output string and set the nonStarterCount to zero. 3.c. Append C to the output string. 3.d. If there are no starters in S, increment nonStarterCount by the number of code points in S; otherwise, set nonStarterCount to the number of trailing non-starters in S (which may be zero). 4. Return the output string. _________ But I note that this algorithm is not optimal because step 3.a. requires getting though possibly multiple decomposition mappings, and returning multiple characters, only to determine the rest of step 3, which is most often driven only by the type of the first character in C and not its NFKD decomposition S. For example, in step 3.b, the "number of initial non-starters in S" will most often be zero, and this will be the case only if S starts by a starter, i.e. if C is itself a starter. When C is a starter (even if it is decomposable), you will not insert any CGJ, but will immediately insert C. Then only if isNFKD(C)==FALSE, you'll have to get S in order to count the number of trailing non- starters to set the new value of nonStarterCount. When C is not a starter, it has a minimum composition class different from zero, which is the composition class of the first character in S=NFKD(C), and all other characters, if any, are also non-starters. In other words, you don't need to check if all characters in S are "trailing non-starters" in step 3.d, you just need to know how many characters are in S. As this case occurs only with a few combining marks C that are NFKD-decomposable, they have the property isNFKD(C)==FALSE, and C also has a (canonical or not) decomposition mapping. In the existing UCD, this seconde case only occurs with very few decomposable combining marks with a non-zero combining class (for example one Greek combining mark, tonos-dialitika, which is also excluded from recomposition). You just need to access one decomposition mapping to get its NFKC immediately. So the algorithm can be optimized this way: _________ 1. If the input string is empty, return an empty output string. 2. Set nonStarterCount to zero. 3. For each code point C in the input string: 3.a. If C is a starter (this may be a CGJ for example) then: 3.a.i. Append C to the output string. 3.a.ii. If isNFD(C)==TRUE, reset nonStarterCount to zero. Otherwise we have: isNFD(C)==FALSE (it cannot be MAYBE), so reset nonStarterCount to the number of trailing non-starters in NFKD(C) : The value to reset nonStarterCount after C can be precomputed and retrieved from a small trie (whose default value if not found is zero and will never be more than two, for characters not excluded from composition, and not much more if the character is excluded; in fact in the current UCD, this cannot be more than 2 in both cases). 3.b. Otherwise C is a non-starter: 3.b.i. Determine trailingCount = 1 or 2 as follows: If isNFD(C)==MAYBE, then we know that NFKD(C)==NFD(C)==1, and C has no decomposition (other than itself): set trailingCount=1. Otherwise we have isNFD(C)==FALSE (it cannot be TRUE), and C has a canonical decomposition of 1 non-starter (a singleton: C is a compatibility combining character, excluded from recomposition) or of 2 non-starters (the only canonical decomposition mapping in the UCD which is not further decomposable): set trailingCount=1 or 2 accordingly. 3.b.ii. So if nonStartersCount plus trailings is above 30, output CGJ to the ouput string and reset nonStartCount=trailingCount (==1 or 2 from step 3.b.i.). 3.b.iii. Otherwise add trailingCount (==1 or 2 from step 3.b.i.) to nonStarterCount. 4. Return the output string. _________ Nowhere you'll need to effectively get the full decomposition string NFKD(C), you just need, sometimes, to know its length as a very small integer, to know where to insert a CGJ in long sequences of non- trailers.