From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 19:15:40 CDT
Jonathan Pool wrote:
> In my work on another of these applications, I'm tentatively planning to
> normalize all input to NFKD. I'm concerned, though, that (1) some valid
> distinctions might thereby be erased, (2) some invalid distinctions may
> survive, and (3) some user agents may misrender decomposed strings.
Regarding point (1), nothing general can be said about which distinctions
are significant enough that erasing them with NFKD or NFKC would be a
problem. It all depends on the source data you have collected and the
conventions those sources were effectively using.
The best you can do is to compare, for each data source you have, the
differences between the NFKD and NFD forms, or equivalently between the
NFKC and NFC forms; a quick way to list the affected strings is sketched
below.
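For instance, here is a small Python sketch of such a comparison, using
the standard unicodedata module (the sample strings are only illustrative):

    import unicodedata

    def compat_sensitive(strings):
        # Keep only the strings whose NFD and NFKD forms differ, i.e. the
        # strings that compatibility normalization would actually alter.
        return [s for s in strings
                if unicodedata.normalize('NFD', s)
                   != unicodedata.normalize('NFKD', s)]

    # U+FB01 (fi ligature) and U+00B2 (superscript two) are compatibility
    # characters; the accented string is only canonically decomposable.
    sample = ['\uFB01le', 'm\u00B2', 'cafe\u0301', 'plain ascii']
    print(compat_sensitive(sample))   # ['ﬁle', 'm²']

Running this over each of your sources tells you exactly which strings
(and so which source conventions) the compatibility mappings would touch.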
Regarding point (3), it should not be a problem. If you fear that some
agents will not render the decomposed NFKD form correctly, use NFKC
instead, because it is canonically equivalent.
So the real question to ask yourself is not NFKD versus NFKC, or NFD
versus NFC: the composed NF?C forms always result from automatically
recomposing the decomposed NF?D forms in an intermediate step of the
algorithm, using only the canonical equivalences in the UCD and the
associated composition exclusions list, so the composed and decomposed
variants of each form carry the same information. The real question is
whether to apply the compatibility (NFK?) mappings at all.
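You can check this recomposition property directly; in Python, for any
string s (the sample below is arbitrary):

    import unicodedata as ud

    s = '\uFB01ance\u0301'   # fi ligature + 'ance' + combining acute accent

    # NFC is the canonical recomposition of NFD...
    assert ud.normalize('NFC', ud.normalize('NFD', s)) \
           == ud.normalize('NFC', s)
    # ...and NFKC is the canonical recomposition of NFKD, so composed vs.
    # decomposed never changes which distinctions survive:
    assert ud.normalize('NFC', ud.normalize('NFKD', s)) \
           == ud.normalize('NFKC', s)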
Note that using NFKC or NFKD, you will lose some distinctions between
characters that are not canonically equivalent (unlike with NFC and NFD).
Some of these distinctions are necessary for some types of texts, notably:
* mathematical texts using font style distinctions
* Asian composed symbols that will become simple strings
* digits in superscript/subscript that will become normal digits: a
problem in strings like "<number1><superscript number2>", which will
become "<number1><number2>" and be interpreted as a different number (see
the sketch after this list). I suggest keeping the superscript/subscript
distinctions, at least by converting them into some upper-level rich-text
format (HTML, XML, RTF...) if you can and if your target is to produce
rich-text documents from your unified corpus.
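To make the superscript hazard concrete, a minimal Python illustration:

    import unicodedata

    s = '10\u00B2'                      # '10²', i.e. ten squared (100)
    flat = unicodedata.normalize('NFKD', s)
    print(flat)                         # '102' -- one hundred and two
    print(s == flat)                    # False: the meaning changed silently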
As a first step, you could opt to preserve all these distinctions while
still getting documents free of compatibility characters, i.e. texts in
NFKC or NFKD form (which by definition are also in NFC or NFD form,
respectively). Given your point (3), I tend to think that NFKC will be
the better choice for you.
This means that the converted corpus should use some rich-text format
compatible with many other tools for further processing (like full-text
search engines, which are now well tuned to handle XML/HTML-like
documents). It also means that you may need a document model for your
corpus that allows keeping some optional distinctions, at least as
out-of-band annotations. Such document models already exist: HTML is one
common choice, but there are others like ODF and DocBook.
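As a rough sketch of that conversion idea (the <sup> tag is HTML's; the
function name and its scope are my own illustrative choices, not a
prescribed scheme), you could wrap superscript characters in markup
before the compatibility mapping flattens them:

    import unicodedata

    def protect_superscripts(text):
        # Wrap runs of superscript characters in <sup>...</sup> and replace
        # each one by its NFKC (plain-digit) form; everything else is kept
        # as-is, so a later NFKC pass over the result loses nothing here.
        out, in_sup = [], False
        for ch in text:
            is_sup = unicodedata.decomposition(ch).startswith('<super>')
            if is_sup and not in_sup:
                out.append('<sup>')
            elif not is_sup and in_sup:
                out.append('</sup>')
            in_sup = is_sup
            out.append(unicodedata.normalize('NFKC', ch) if is_sup else ch)
        if in_sup:
            out.append('</sup>')
        return ''.join(out)

    print(protect_superscripts('E = mc\u00B2'))   # E = mc<sup>2</sup>

The same pattern extends to '<sub>' for subscripts, or to any other
compatibility tag you decide is worth preserving as markup.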