From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Sep 20 2007 - 06:49:22 CDT
On 9/19/2007 7:01 PM, Philippe Verdy wrote:
> Asmus Freytag wrote:
>
>> The "K" series of normalization forms, by default, just "k"orrupt the
>>
> data.
>
> I do agree with that,
Glad to hear that.
> but I wonder why NFKC/NFKD have been integrated into
> the standard for conformance, given that they cause many well-known problems.
>
Some of the problems, I'm sure, were not well known in advance. Like the
compatibility decompositions that these forms are based on, their main
field of applicability was seen in identifier matching, a realm that
traditionally supports only a subset of ordinary language.
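To make that tradeoff concrete, here is a minimal sketch in Python (whose
standard unicodedata module implements the normalization forms; the strings
are just illustrative examples). Compatibility decomposition folds
presentation variants together, which helps identifier matching but is
lossy for ordinary text:

    import unicodedata

    # NFKD applies compatibility decompositions:
    print(unicodedata.normalize('NFKD', '\u00b2'))  # SUPERSCRIPT TWO -> '2'
    print(unicodedata.normalize('NFKD', '\ufb01'))  # LATIN SMALL LIGATURE FI -> 'fi'
    # Fine for matching identifiers; in running text, 'x\u00b2' and 'x2'
    # become indistinguishable -- the distinction is simply erased.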
> They should at best have been just a non-mandatory recommendation, allowing
> tailoring (even IDN no longer refers to them directly, and had to redefine
> its own foldings).
>
That's because IDN is morphing beyond simple identifiers as
traditionally understood for programming languages and the like. IDN is
attempting to be closer to ordinary language, and that's why the
limitations of NFKD/NFKC become apparent.
> Anyway, why are NFKD/NFKC frozen, as well as the compatibility mappings in the
> UCD? Making these immutable with the stability policy was not necessary. For
> me, such mappings in the UCD are just informative, to document why these
> compatibility characters were also encoded separately and how they differ
> from the other characters referenced by the mapping.
>
If you offer a specification, it's always useful not to allow options.
Every option multiplies the set of legal, or valid, mappings between
input and output. Multiple options exponentially increase that set. With
that, you not only increase the implementation and testing effort, but you
also increase the chance that two parties in an interchange do not support
a compatible set of options.
Normalization is about making interchange more reliable, by removing
options. For example, applying NFD removes precomposed characters,
reducing the number of ways in which the same information can be
encoded. Adding options to the normalization forms undoes one
of their major benefits for reliable interchange.
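As a small illustration (Python's unicodedata module again; the specific
characters are just examples):

    import unicodedata

    precomposed = '\u00e9'    # 'e' with acute as a single code point
    decomposed  = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT
    print(precomposed == decomposed)                     # False
    # After NFD, only the decomposed spelling remains:
    print(unicodedata.normalize('NFD', precomposed) ==
          unicodedata.normalize('NFD', decomposed))      # True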
> NFKC/NFKD forms should have been specified like other foldings, even less
> normative than case mappings/foldings (which cause far fewer problems). They
> are not even good, as they do not preserve linguistic differences, generate
> severe corruption of texts, and are not tailorable.
>
>
A better way to say this is that, for many implementations, using
*normalization* to deal with compatibility characters and other
ignorable, though real, distinctions is not the best approach. The
Unicode Consortium realized this several years ago when work began on
UTS#30, Character Foldings. That is the direction in which
implementers should turn.
The existing NFKC/NFKD should be limited to specific uses in the context
of identifier matching, for which they were originally intended.
It would be very counterproductive to pursue the discussion as if the
goal were to improve these forms in the context of *normalization*. That
would be a giant step backwards; in fact, it would negate the more
recent development of a framework that treats *character foldings* as the
core concept. Such a framework, for the first time, is also able to
seamlessly integrate case folding, something *normalization* cannot, and
should not, do.
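To sketch what such a folding-based comparison might look like (an
illustrative pipeline only, not the UTS#30 algorithm; the function name is
made up for the example):

    import unicodedata

    def fold_for_matching(s: str) -> str:
        # Illustrative folding pipeline: compatibility-fold, then
        # case-fold, then renormalize to a stable comparison key.
        s = unicodedata.normalize('NFKD', s)
        s = s.casefold()
        return unicodedata.normalize('NFC', s)

    # Case folding and compatibility folding combine seamlessly:
    print(fold_for_matching('\u2168') == fold_for_matching('ix'))  # True
    # (U+2168 ROMAN NUMERAL NINE folds to 'ix')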
Insofar as your current message is framed as if it were arguing
for an improvement of NFKC/NFKD, it is doing a disservice by
distracting from the more productive, and more recent, developments.
A./