From: John Cowan (jcowan@reutershealth.com)
Date: Wed Oct 15 2003 - 11:00:17 CST
Jill Ramonsky scripsit:
> I had to write an API for my employer last year to handle some aspects
> of Unicode. We normalised everything to NFD, not NFC (but that's easier,
> not harder). Nonetheless, all the string handling routines were not
> allowed to /assume/ that the input was in NFD, but they had to guarantee
> that the output was. These routines, therefore, had to do a "convert to
> NFD" on every input, even if the input were already in NFD. This did
> have a significant performance hit, since we were handling (Unicode)
> strings throughout the app.
Indeed it would. However, checking for normalization is cheaper than
normalizing, and Unicode provides the Quick_Check character properties
(see UAX #15), which allow a streamlined but incomplete check that
returns "not normalized" or "maybe normalized".
So input can be handled as follows:
    if maybeNormalized(input)
        then if normalized(input)
                then doTheWork(input)
                else doTheWork(normalize(input))
             fi
        else doTheWork(normalize(input))
    fi
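In Python, for instance, the same flow collapses to a few lines (a
minimal sketch: doTheWork stands in for whatever the routine actually
does, and unicodedata.is_normalized needs Python 3.8+, where CPython
runs the Quick_Check fast path internally and falls back to a full
normalize-and-compare only on "maybe"):

    import unicodedata

    def handle(input: str) -> None:
        # Cheap check first; pay for normalization only when it fails.
        if unicodedata.is_normalized("NFD", input):
            doTheWork(input)
        else:
            doTheWork(unicodedata.normalize("NFD", input))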
The W3C recommends, however, that non-normalized input be rejected rather
than forcibly normalized, on the ground that the supplier of the input
is not meeting his contract.
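A minimal sketch of that reject-instead-of-repair policy, under the
same assumptions as above (require_nfd is an illustrative name, not
anything from the W3C documents):

    import unicodedata

    def require_nfd(input: str) -> str:
        # Refuse non-normalized input rather than silently fixing it.
        if not unicodedata.is_normalized("NFD", input):
            raise ValueError("input is not in NFD")
        return input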
> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool meaning
> "already normalised". This would definitely speed things up. Of course,
> for any strings coming in from "outside", I'd still have to assume they
> were not normalised, just in case.
W3C refers to this concept as "certified text". It's a good idea.
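A sketch of how such a pair might look in Python (CertifiedText and
its fields are illustrative names, not any standard API):

    from dataclasses import dataclass
    import unicodedata

    @dataclass(frozen=True)
    class CertifiedText:
        value: str
        certified: bool = False  # True: value is known to be in NFD

        def to_nfd(self) -> "CertifiedText":
            # Strings certified on an earlier pass skip the conversion;
            # strings from "outside" arrive with certified=False.
            if self.certified:
                return self
            return CertifiedText(
                unicodedata.normalize("NFD", self.value), True)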
> Jill
>
--
Verbogeny is one of the pleasurettes        John Cowan <jcowan@reutershealth.com>
of a creatific thinkerizer.                 http://www.reutershealth.com
    -- Peter da Silva                       http://www.ccil.org/~cowan