Accumulated Feedback on PRI #191

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Aug 8 11:21:20 CDT 2011
Contact: umavs@ca.ibm.com
Name: V.S. Umamaheswaran
Report Type: Public Review Issue
Opt Subject: 191 - UAX #15 - Normalization Forms
Forwarding a question from our Thai expert Sirilappanich (email: natta@th.ibm.com)
-----------

Regarding to NFC, NFD, NFKC and NFKD.

Thai Character  has following characteristic:
NFKC() = NFKD() =  
But
NFC() = NFD() = NFKC() = NFKD() = 
Should NFC() and NFD() produce  ?
By current logic, if text got normalized to either NFKC or NFKD, it is impossible to normalize back to NFC nor NFD except using external tool, e.g. regex.

Date/Time: Sun Sep 4 08:02:24 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: UAX15-D4 (Stream-Safe Text Process) optimization


UAX#15 describes in rule D4 a "Stream-Safe Text Process" with the following algorithm:

_________

 1. If the input string is empty, return an empty output string.

 2. Set nonStarterCount to zero.

 3. For each code point C in the input string:

    3.a. Produce the NFKD decomposition S.

    3.b. If nonStarterCount plus the number of initial non-starters in S
	 is greater than 30, append a CGJ to the output string and set 
	the nonStarterCount to zero.

    3.c. Append C to the output string.

    3.d. If there are no starters in S, increment nonStarterCount by 
	the number of code points in S; otherwise, set nonStarterCount 
	to the number of trailing non-starters in S (which may be zero).

 4. Return the output string.

_________

But I note that this algorithm is not optimal because step 3.a.
requires getting though possibly multiple decomposition mappings, and
returning multiple characters, only to determine the rest of step 3,
which is most often driven only by the type of the first character in
C and not its NFKD decomposition S.

For example, in step 3.b, the "number of initial non-starters in S"
will most often be zero, and this will be the case only if S starts by
a starter, i.e. if C is itself a starter.

When C is a starter (even if it is decomposable), you will not insert
any CGJ, but will immediately insert C. Then only if isNFKD(C)==FALSE,
you'll have to get S in order to count the number of trailing non-
starters to set the new value of nonStarterCount.

When C is not a starter, it has a minimum composition class different
from zero, which is the composition class of the first character in
S=NFKD(C), and all other characters, if any, are also non-starters. In
other words, you don't need to check if all characters in S are
"trailing non-starters" in step 3.d, you just need to know how many
characters are in S. As this case occurs only with a few combining
marks C that are NFKD-decomposable, they have the property
isNFKD(C)==FALSE, and C also has a  (canonical or not) decomposition
mapping.

In the existing UCD, this seconde case only occurs with very few
decomposable combining marks with a non-zero combining class (for
example one Greek combining mark, tonos-dialitika, which is also
excluded from recomposition). You just need to access one
decomposition mapping to get its NFKC immediately.

So the algorithm can be optimized this way:

_________


 1. If the input string is empty, return an empty output string.

 2. Set nonStarterCount to zero.

 3. For each code point C in the input string:

    3.a. If C is a starter (this may be a CGJ for example) then:

         3.a.i.  Append C to the output string.

         3.a.ii. If isNFD(C)==TRUE, reset nonStarterCount to zero.
		Otherwise we have: isNFD(C)==FALSE (it cannot be MAYBE),
		so reset nonStarterCount to the number of trailing non-starters
		in NFKD(C) :
		The value to reset nonStarterCount after C can be precomputed 
		and retrieved from a small trie (whose default value if not found
		is zero and will never be more than two, for characters not
		excluded from composition, and not much more if the character is
		excluded; in fact in the current UCD, this cannot be more than 2
		in both cases).

  3.b. Otherwise C is a non-starter:

       3.b.i.	Determine trailingCount = 1 or 2 as follows:
		If isNFD(C)==MAYBE, then we know that NFKD(C)==NFD(C)==1,
		and C has no decomposition (other than itself): set trailingCount=1.
		Otherwise we have isNFD(C)==FALSE (it cannot be TRUE), and C has a
		canonical decomposition of 1 non-starter (a singleton: C is a
		compatibility combining character, excluded from recomposition) or
		of 2 non-starters (the only canonical decomposition mapping in the
		UCD which is not further decomposable): set trailingCount=1 or 2 
		accordingly.

      3.b.ii.	So if nonStartersCount plus trailings is above 30, output CGJ
		to the ouput string and reset nonStartCount=trailingCount (==1 or
		2 from step 3.b.i.).

      3.b.iii. Otherwise add trailingCount (==1 or 2 from step 3.b.i.) to nonStarterCount.

 4. Return the output string.

_________


Nowhere you'll need to effectively get the full decomposition string
NFKD(C), you just need, sometimes, to know its length as a very small
integer, to know where to insert a CGJ in long sequences of non-
trailers.