RE: Normalization in panlingual application

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Sep 20 2007 - 17:50:54 CDT


    Mis-attribution, Asmus. I did not write, discuss, or quote what you are
    repeating here. You must have mixed up messages written by others....

    > -----Original Message-----
    > From: Asmus Freytag [mailto:asmusf@ix.netcom.com]
    > Sent: Thursday, September 20, 2007 14:07
    > To: verdy_p@wanadoo.fr
    > Cc: 'Jonathan Pool'; unicode@unicode.org
    > Subject: Re: Normalization in panlingual application
    >
    > On 9/19/2007 6:04 PM, Philippe Verdy wrote:
    > > Asmus Freytag [mailto:asmusf@ix.netcom.com] wrote:
    > >
    > >> You realize, also, that it is not (in the general case) possible to
    > >> apply normalization piece-meal. Because of that, breaking the text into
    > >> runs and then normalizing can give different results (in some cases),
    > >> which makes pre-processing a dicey option.
    > >>
    > >
    > > That's not my opinion.
    > The fact that, for many strings s and t, NFxx(s) + NFxx(t) != NFxx(s +
    > t), is not a matter of opinion. For these strings, you cannot normalize
    > them separately and then concatenate, and expect the result to be the
    > normalized form of the two strings. UAX #15 is rather clear about that.
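
    (A concrete illustration in Python, using the standard unicodedata
    module; the strings here are of course just one such pair:)

        import unicodedata

        s = "a"        # LATIN SMALL LETTER A
        t = "\u0301"   # COMBINING ACUTE ACCENT

        piecewise = unicodedata.normalize("NFC", s) + unicodedata.normalize("NFC", t)
        whole = unicodedata.normalize("NFC", s + t)

        assert piecewise == "a\u0301"  # each piece was already in NFC form
        assert whole == "\u00e1"       # LATIN SMALL LETTER A WITH ACUTE
        assert piecewise != whole      # NFC(s) + NFC(t) != NFC(s + t)
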
    > > At least the first step of the conversion (converting
    > > to NFC) is very safe and preserves differences, using standard programs
    > > (which are widely available, so this step represents no risk). Detecting
    > > compatibility characters and mapping them to annotated forms can be
    > > applied after this step in a very straightforward way.
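
    (For concreteness, the kind of post-NFC detection being described might
    look like this Python sketch; the {compat:...} annotation syntax is
    purely hypothetical:)

        import unicodedata

        def annotate_compat(text):
            out = []
            for ch in unicodedata.normalize("NFC", text):
                # A character is a compatibility character exactly when its
                # compatibility decomposition differs from its canonical one.
                if unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch):
                    out.append("{compat:" + ch + "}")  # hypothetical markup
                else:
                    out.append(ch)
            return "".join(out)

        # annotate_compat("x\u00b2 \ufb01le") == "x{compat:\u00b2} {compat:\ufb01}le"
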
    > I had written:
    > > > Since none of the common libraries that implement normalization forms
    > > > perform the necessary mappings to markup out of the box, anyone
    > > > contemplating such a scheme would be forced to implement either a
    > > > pre-processing step or their own normalization logic. This is a
    > > > downright scary suggestion, since such an approach would lose the
    > > > benefit of using a well-tested implementation. Normalization is
    > > > tricky enough that one should try not to implement it from scratch
    > > > if at all possible.
    > Your approach confirms what I suspected. By suggesting an approach like
    > this, you are advocating a de-novo implementation of the normalization
    > transformation. By the way, NFC would be a poor starting point for your
    > scheme, since all normalization forms start with an (implied) first step
    > of applying *de*composition. But you can't even start with NFD, since
    > the minute you decompose any compatibility characters in your following
    > step, you can in principle create sequences that denormalize the
    > existing NFD string around them. The work to handle these exceptions
    > amounts to a full implementation of normalization, logically speaking.
    > In other words, you've lost the benefit of your library.
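
    (To make that denormalization concrete, a small, admittedly contrived,
    Python illustration with the standard unicodedata module:)

        import unicodedata

        # U+00B4 ACUTE ACCENT carries the compatibility mapping <compat> 0020 0301.
        s = "\u00b4\u0316"  # ACUTE ACCENT + COMBINING GRAVE ACCENT BELOW (ccc=220)
        assert unicodedata.normalize("NFD", s) == s  # s is already in NFD

        # Splicing in the compatibility decomposition by hand denormalizes the text:
        t = "\u0020\u0301\u0316"
        # ccc(U+0301)=230 now precedes ccc(U+0316)=220, breaking canonical ordering;
        # restoring NFD takes real reordering logic, not mere substitution:
        assert unicodedata.normalize("NFD", t) == "\u0020\u0316\u0301" != t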



    This archive was generated by hypermail 2.1.5 : Thu Sep 20 2007 - 21:02:15 CDT