RE: FCD and Collation

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Tue, 12 Feb 2013 01:17:45 +0000

> Does anyone feel up to rigorously justifying revisions to the concepts
> and algorithms of FCD and canonical closure? Occasionally one will
> encounter cases where the canonical closure is infinite - in these
> cases, normalisation will be necessary regardless of the outcome of the
> FCD check.

Personally, no. One of the reasons I resisted incorporating canonical closure into the basic UCA algorithm and into the DUCET table is its infinitesimal ROI. It complicates the table and its processing substantially, all in service of "fixing" edge cases of edge cases, which have to be dealt with in tailorings anyway.

I think the current wording of Section 6.5 in UCA is appropriate as is. It doesn't say you must or should use FCD, but rather that you should do the right thing for strings that are in FCD, even if you are not normalizing. If that is hard or impossible for some edge-case tailorings, or for the weird (and deprecated) sequences in Tibetan, then those are the edge cases I am talking about, which aren't worth handling in the basic algorithm.
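To make "strings that are in FCD" concrete: a string is in FCD (ICU's "fast C or D" form) when decomposing each code point in place, with no reordering across code point boundaries, already yields a canonically ordered combining-class sequence. A minimal sketch of that check in Python, using only the standard unicodedata module (is_fcd is just an illustrative name, not anyone's published API):

    import unicodedata

    def is_fcd(s: str) -> bool:
        # A string is in FCD if, for each adjacent pair of code points,
        # the trailing combining class of the first one's canonical
        # decomposition is <= the leading combining class of the next
        # one's decomposition (or that leading class is 0).
        prev_trail_ccc = 0
        for ch in s:
            decomp = unicodedata.normalize('NFD', ch)
            lead_ccc = unicodedata.combining(decomp[0])
            trail_ccc = unicodedata.combining(decomp[-1])
            if lead_ccc != 0 and lead_ccc < prev_trail_ccc:
                return False
            prev_trail_ccc = trail_ccc
        return True

For example, "a" + U+0302 + U+0323 fails the check, because NFD would have to reorder the dot below (ccc 220) in front of the circumflex (ccc 230).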

>
> Perhaps one could merely revise the definition of FCD, and devise a test
> for the adequacy of the current canonical closure. If the collation
> fails this adequacy test, then again disabling normalisation should be
> prohibited. (I would suggest that in these cases the normalisation
> setting should be overridden with only the gentlest of chidings.)

FCD isn't part of the Unicode Standard, or of UCA, for that matter. It is an implementation optimization promulgated in ICU. So tweaking its definition would be a matter for ICU, in my opinion.
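The optimization it buys is easy to state, though: pay for the normalization pass only when the FCD check fails. A sketch of the pattern, reusing the hypothetical is_fcd() from above:

    import unicodedata

    def nfd_if_needed(s: str) -> str:
        # FCD input can be fed to collation weight lookup without a
        # separate normalization pass; only non-FCD input pays for
        # the full NFD conversion. Assumes the is_fcd() helper
        # sketched earlier in this message.
        return s if is_fcd(s) else unicodedata.normalize('NFD', s)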

As regards the normalization on/off parameter: although UCA mentions it as a possible tailoring one could do, it goes no further than that. The details of its definition now belong to LDML and the CLDR-TC, and to their use of it in defining locales. Personally, I think it should stay that way.
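For reference, in CLDR's collation tailoring syntax that parameter surfaces as a setting inside the rules themselves; an illustrative fragment (not taken from any actual locale):

    [normalization on]
    & a < â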

I don't doubt that there are real issues in some collation tailorings defined in CLDR (or prospective problems for tailorings that someone might want to *add* to CLDR), but the issues around those should be handled in the CLDR-TC, I think.

>
> A lazy option would be to wait (how long?) and then remove the option of no
> normalisation on the ground that sufficient computing power is
> available.

Unfortunately, I don't think that is ever going to be an option. This year, in 2013, I still know engineers who are busy tweaking database code for speed because the C and C++ library implementations of memmove() are not fast enough for their taste! Anything as time-critical as basic string comparison in sorting is always going to attract attention for optimization.

--Ken

>
> Thoughts, anyone?
>
> Richard.