Re: Compatibility Casefold Equivalence from Asmus Freytag via Unicode on 2018-11-24 (Unicode Mail List Archive)

From: Asmus Freytag via Unicode <unicode_at_unicode.org>
Date: Sat, 24 Nov 2018 14:33:15 -0800

On 11/22/2018 11:58 AM, Carl via Unicode wrote:

(It looks like my HTML email got scrubbed, sorry for the double post)

Hi,


In Chapter 3 Section 13, the Unicode spec defines D146:


"A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"


I am trying to understand the "if and only if" part of this. Specifically, why is the outermost NFKD necessary? Could it also be an NFKC normalization? Is wrapping the outer NFKD in an NFC or NFKC on both sides of the equation okay?


My use case is that I am trying to store user-provided tags in a database. I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146. However, because decomposition can result in much larger strings, I would prefer to keep the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).

Carl,

you may find that some of the complications are limited to a small number of code points. In particular, classical (polytonic) Greek has some gnarly behavior wrt case; and some compatibility characters have odd edge cases.

I'm personally not a fan of allowing every single Unicode code point in things like usernames (or other types of identifiers). Especially, if including some code points makes the "general case" that much more complex, my personal recommendation would be to simply disallow / reject a small set of troublesome characters; especially if they aren't part of some widespread modern orthography.

While Unicode is about being able to digitally represent all written text, identifiers don't follow the same rules. The main reason why people often allow "anything" is because it's easy in terms of specification. Sometimes, you may not have control over what to accept; for example if tags are generated from headers in a document, it would require some transform to handle disallowed code points.

Case is also only one of the types of duplication you may encounter. In many South and South East Asian scripts you may encounter cases where two sequences of characters, while different, will normally render identically. Arabic also has instances of that. Finally, you may ask yourself whether your system should treat simplified and traditional Chinese ideographs as separate or as a variant not unlike the way you treat case.

About storing your tag data: you can obviously store them as NFC, if you like: in that case, you will have to run the operations both on the stored and on the new tag.
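A minimal sketch of what that could look like in Python, assuming the built-in str.casefold() is an acceptable stand-in for toCasefold and that the Unicode version behind the unicodedata module suits your data. The names compat_caseless_key and is_duplicate are made up for illustration:

```python
import unicodedata


def compat_caseless_key(s: str) -> str:
    """Apply NFKD(toCasefold(NFKD(toCasefold(NFD(s))))) from D146.

    Assumption: str.casefold() is used as toCasefold.
    """
    s = unicodedata.normalize("NFD", s)
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = s.casefold()
    return unicodedata.normalize("NFKD", s)


def is_duplicate(stored_nfc_tag: str, new_tag: str) -> bool:
    """Run the same chain on the stored (NFC) tag and on the new tag."""
    return compat_caseless_key(stored_nfc_tag) == compat_caseless_key(new_tag)


if __name__ == "__main__":
    stored = unicodedata.normalize("NFC", "Straße")
    print(is_duplicate(stored, "STRASSE"))
    print(is_duplicate(stored, "Strasbourg"))
```

Here the stored tag stays in NFC, and both it and the incoming tag go through the full chain only at comparison time.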

Finally, there are some cases where you can tell that two strings match without actually carrying out the full set of operations:

Y = X

NFC(Y) = NFC(X)

and so on. (If any one of these conditions is true, the full condition above must also be true). For example, let's apply

NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))

on both sides of

NFC(Y) = NFC(X)

First:

NFD(NFC(Y)) = NFD(NFC(X))

Because the two sides are equal, applying toCasefold results in equal strings, and so on all the way to the outer NFKD.

In other words, you can stop the comparison at any point where the two sides are equal. From that point on, the outer operations cannot add anything.
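As a rough illustration, the early exit could be sketched in Python like this. It assumes str.casefold() stands in for toCasefold; the function name compat_caseless_match and the step list are made up for the example:

```python
import unicodedata


def _nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def _nfkd(s: str) -> str:
    return unicodedata.normalize("NFKD", s)


# The D146 chain, innermost operation first.
# Assumption: str.casefold() is used as toCasefold.
_STEPS = (_nfd, str.casefold, _nfkd, str.casefold, _nfkd)


def compat_caseless_match(x: str, y: str) -> bool:
    """Compare stage by stage, stopping as soon as both sides are equal."""
    if x == y:
        return True
    for step in _STEPS:
        x, y = step(x), step(y)
        if x == y:
            # Equal inputs give equal outputs for every remaining step.
            return True
    # Still different after the outer NFKD: not a match.
    return False


if __name__ == "__main__":
    print(compat_caseless_match("Tag", "tAG"))
    print(compat_caseless_match("tag", "tab"))
```

The result is the same as comparing the fully transformed strings; the loop only saves work when the two sides become equal early.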

A./

Received on Sat Nov 24 2018 - 16:33:30 CST
