From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 19:21:03 CDT
Asmus Freytag wrote:
> What is an invalid distinction is defined by your application. If you
> case-fold, case is an invalid distinction. If your goal is to be able to
> represent text faithfully, then the "K" series of normalizations has no
> place in your design (It's too haphazard - for example, also, 5¼ would
> be turned into 51/4, which is decidedly not the same thing).
This is only a problem if the converted text has to be plain text. If the
target of the project is to allow building rich-text documents from the
corpus, then conversion using NFKC becomes possible, for example with some
XML annotation: the conversion would be something like
"5<fraction>1/4</fraction>", which is in NFKC form (and so also in NFC form
by definition, as well as, here, in NFKD and NFD forms).
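
As a rough illustration of that idea, here is a minimal Python sketch using
only the standard unicodedata module. The helper name annotate_fractions is
hypothetical, the <fraction> element is just the one used in the example
above, and replacing U+2044 FRACTION SLASH with "/" is an assumption made
only to match the spelling used in this message. It wraps each character
that has a <fraction> compatibility decomposition instead of flattening it.

```python
import unicodedata


def annotate_fractions(text: str) -> str:
    """Wrap compatibility fractions in a <fraction> element (hypothetical markup)."""
    out = []
    for ch in text:
        # For U+00BC this returns "<fraction> 0031 2044 0034".
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<fraction>"):
            code_points = decomp.split()[1:]
            expanded = "".join(chr(int(cp, 16)) for cp in code_points)
            # Assumption: use an ASCII solidus, as in the example above,
            # rather than keeping U+2044 FRACTION SLASH.
            expanded = expanded.replace("\u2044", "/")
            out.append(f"<fraction>{expanded}</fraction>")
        else:
            out.append(ch)
    return "".join(out)


source = "5\u00bc"  # "5" followed by VULGAR FRACTION ONE QUARTER

# Plain NFKC loses the boundary between the integer and the fraction.
plain = unicodedata.normalize("NFKC", source)

# The annotated form keeps that boundary in the markup.
annotated = annotate_fractions(source)

# The annotated string is unchanged by all four normalization forms.
stable = all(
    unicodedata.normalize(form, annotated) == annotated
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
```

Here plain is the four characters "5", "1", U+2044, "4", with nothing left
to say where the integer ends, while annotated keeps that information in the
markup and is stable under further normalization.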
Such annotation changes the nature of the texts: they are no longer linear
sequences of characters, because a structure is mapped on top of them. Given
the expected usage, parsing the texts to map a structure on top of them will
probably be a bonus, as it will ease later reuse of the converted corpus in
different contexts.