RE: Normalization in panlingual application

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 19:15:40 CDT


    Jonathan Pool wrote:
    > In my work on another of these applications, I'm tentatively planning to
    > normalize all input to NFKD. I'm concerned, though, that (1) some valid
    > distinctions might thereby be erased, (2) some invalid distinctions may
    > survive, and (3) some user agents may misrender decomposed strings.

    Regarding point (1), there is no general answer as to which distinctions
    are significant enough that erasing them with NFKD or NFKC would
    constitute a problem. It all depends on the source data you have
    collected and the conventions those sources were actually using.

    The best you can do is to compare, for each data source you have, the
    differences between the NFKD and NFD forms, or equivalently between the
    NFKC and NFC forms.
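
    As a minimal sketch of such a comparison (in Python, using the standard
    unicodedata module; the sample strings are only illustrations), you could
    scan each source for strings whose compatibility decomposition differs
    from their canonical one:

    import unicodedata

    def compatibility_changes(strings):
        """Yield the strings whose NFKD form differs from their NFD form,
        i.e. the strings that compatibility mappings would actually alter."""
        for s in strings:
            if unicodedata.normalize("NFKD", s) != unicodedata.normalize("NFD", s):
                yield s

    # The "fi" ligature and the superscript 2 are altered; the accented "é"
    # is not, because its decomposition is purely canonical.
    sample = ["\ufb01le", "10\u00b2", "caf\u00e9"]
    print(list(compatibility_changes(sample)))  # ['ﬁle', '10²']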

    Regarding point (3), it should not be a problem. If you fear that some
    agents will not render the decomposed NFKD form correctly, use NFKC
    instead, because it is canonically equivalent.
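
    To illustrate that equivalence (the sample string is arbitrary):
    canonically recomposing an NFKD string yields exactly the NFKC string.

    import unicodedata

    s = "\u2460 caf\u00e9 \ufb01n"  # circled digit one, café, "fi" ligature
    nfkd = unicodedata.normalize("NFKD", s)
    nfkc = unicodedata.normalize("NFKC", s)

    # Canonical recomposition of the decomposed form gives the composed form:
    assert unicodedata.normalize("NFC", nfkd) == nfkc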

    So the right question to ask yourself is not whether to compose or
    decompose (NFKC vs. NFKD, or NFC vs. NFD): the composed NF?C forms always
    result from the automated recomposition of the decomposed NF?D forms in
    an intermediate step of the algorithm, using only the canonical
    equivalences in the UCD and the associated list of composition
    exclusions. The real question is whether to apply the compatibility
    mappings (NFK?) or only the canonical ones (NF?).
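
    A quick Python illustration of that intermediate step (U+0958 is a
    documented composition exclusion; the other sample string is arbitrary):

    import unicodedata

    # NFC is NFD followed by canonical recomposition:
    s = "e\u0301le\u0300ve"  # "élève" typed with combining accents
    assert unicodedata.normalize("NFC", s) == \
           unicodedata.normalize("NFC", unicodedata.normalize("NFD", s))

    # ...except for composition exclusions, which stay decomposed. U+0958
    # DEVANAGARI LETTER QA is excluded, so NFC leaves it as KA + NUKTA:
    print(ascii(unicodedata.normalize("NFC", "\u0958")))  # '\u0915\u093c'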

    Note that with NFKC or NFKD you will lose some distinctions between
    characters that are not canonically equivalent (unlike with NFC and NFD).
    Some of these distinctions are necessary for some types of texts, notably:
    * mathematical texts using font-style distinctions
    * Asian composed symbols that will become plain strings (e.g. U+339E
    SQUARE KM becoming "km")
    * digits in superscript or subscript that will become plain digits, a
    problem in strings like "<number1><superscript number2>", which becomes
    "<number1><number2>" and is read as a different number (see the sketch
    after this list). I suggest keeping the superscript/subscript
    distinctions, at least by converting them into some upper-level rich-text
    format (HTML, XML, RTF...) if you can and if your target is to produce
    rich-text documents from your unified corpus.
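
    A minimal Python demonstration of two of these losses (the sample
    characters are only illustrations):

    import unicodedata

    print(unicodedata.normalize("NFKD", "10\u00b2"))  # "102": a different number
    print(unicodedata.normalize("NFD", "10\u00b2"))   # "10²": distinction kept

    bold_a = "\U0001d400"  # MATHEMATICAL BOLD CAPITAL A
    print(unicodedata.normalize("NFKD", bold_a))      # plain "A": style lost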

    As a first step, you could opt to preserve all the distinctions while
    still getting documents that use no compatibility characters, i.e. texts
    in NFKC or NFKD form (which are, by definition, also in NFC or NFD form
    respectively). I tend to think that, given your point (3), NFKC will be
    the better choice for you.

    This means that the converted corpus should use some rich-text format
    compatible with many other tools for further processing (such as
    full-text search engines, which are now well tuned to process
    XML/HTML-like documents). It also means that you may need a document
    model for your corpus that allows keeping some optional distinctions, at
    least as out-of-band annotations. Such document models already exist:
    HTML is one common choice, but there are others such as ODF and DocBook.
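
    As a sketch of that approach (the KEEP table, the <sup> tag choice and
    the function name are my own assumptions, not a fixed recipe): replace
    the characters whose distinction you want to keep with HTML markup, and
    normalize everything else to NFKC.

    import re
    import unicodedata
    from html import escape

    # Hypothetical mapping of distinctions to keep as out-of-band HTML
    # markup; extend it with whatever your corpus analysis shows you must
    # preserve.
    KEEP = {"\u00b9": "<sup>1</sup>", "\u00b2": "<sup>2</sup>",
            "\u00b3": "<sup>3</sup>"}
    KEEP_RE = re.compile("[%s]" % "".join(KEEP))

    def to_annotated_nfkc(text):
        """Return HTML where the kept characters become markup and the
        remaining runs of text are normalized to NFKC."""
        out, last = [], 0
        for m in KEEP_RE.finditer(text):
            out.append(escape(unicodedata.normalize("NFKC", text[last:m.start()])))
            out.append(KEEP[m.group()])  # the distinction survives as markup
            last = m.end()
        out.append(escape(unicodedata.normalize("NFKC", text[last:])))
        return "".join(out)

    print(to_annotated_nfkc("E = mc\u00b2, \ufb01xed"))
    # -> E = mc<sup>2</sup>, fixed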


