RE: Canonical equivalence in rendering: mandatory or recommended?

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Oct 15 2003 - 08:43:37 CST


Jill Ramonsky wrote:
> In my experience, there is a performance hit.
>
> I had to write an API for my employer last year to handle
> some aspects of Unicode. We normalised everything to NFD,
> not NFC (but that's easier, not harder). Nonetheless, all
> the string handling routines were not allowed to assume
> that the input was in NFD, but they had to guarantee that
> the output was. These routines, therefore, had to do a
> "convert to NFD" on every input, even if the input were
> already in NFD. This did have a significant performance
> hit, since we were handling (Unicode) strings throughout
> the app.
>
> I think that next time I write a similar API, I will deal
> with (string+bool) pairs, instead of plain strings, with
> the bool meaning "already normalised". This would
> definitely speed things up. Of course, for any strings
> coming in from "outside", I'd still have to assume they
> were not normalised, just in case.
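
For reference, the (string+bool) pair could be sketched roughly as below, in Python. The names NStr, from_outside and ensure_nfd are made up for illustration and are not from the API described above; the only library call assumed is the standard unicodedata.normalize.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class NStr:
    """A string plus a flag meaning "already known to be in NFD" (hypothetical type)."""
    text: str
    is_nfd: bool = False


def from_outside(raw: str) -> NStr:
    # Strings from "outside" are never trusted to be normalised.
    return NStr(raw, False)


def ensure_nfd(s: NStr) -> NStr:
    # Skip the expensive conversion when the flag says it is unnecessary.
    if s.is_nfd:
        return s
    return NStr(unicodedata.normalize("NFD", s.text), True)
```

With this, ensure_nfd costs a single flag test on every string the library has already produced itself.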

You could have split the NFD process in two separate steps:

1) Decomposition per se;

2) Reordering of combining classes.

You could have performed step 1 (which is presumably much heavier than step 2)
only on strings coming from "outside", and step 2 on every call.
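
A minimal sketch of the two steps as separate functions, in Python. The function names decompose and reorder are my own, and this is only an illustration built on the standard unicodedata module (with the algorithmic Hangul decomposition from the Unicode Standard added by hand), not a tuned implementation.

```python
import unicodedata

# Constants for algorithmic Hangul syllable decomposition (Unicode Standard).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
T_COUNT, N_COUNT, S_COUNT = 28, 588, 11172


def _decompose_char(ch: str) -> str:
    s_index = ord(ch) - S_BASE
    if 0 <= s_index < S_COUNT:
        # Hangul syllables decompose arithmetically into two or three jamo.
        lead = chr(L_BASE + s_index // N_COUNT)
        vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
        t_index = s_index % T_COUNT
        return lead + vowel + (chr(T_BASE + t_index) if t_index else "")
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # No mapping, or a compatibility mapping, which NFD ignores.
        return ch
    # Canonical mappings can nest, so expand recursively.
    return "".join(_decompose_char(chr(int(h, 16))) for h in mapping.split())


def decompose(s: str) -> str:
    """Step 1: full canonical decomposition, with no reordering."""
    return "".join(_decompose_char(ch) for ch in s)


def reorder(s: str) -> str:
    """Step 2: stable-sort each run of combining marks by combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)
```

Under these assumptions, reorder(decompose(s)) should give the same result as unicodedata.normalize("NFD", s), while reorder alone is a single pass that never consults the decomposition data.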

As a further enhancement, step 2 could be called only after operations that
can produce non-canonical order: e.g. after concatenating strings, but not
after trimming them.

To gain even more speed, you could implement an ad-hoc version of step 2
which only operates on out-of-order characters adjacent to a specified
location in the string (e.g., the joining point of a concatenation
operation).
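
A rough Python sketch of such a localised step 2 for concatenation. The name concat_nfd is hypothetical, and the function assumes that both arguments are already in NFD; it does not check this.

```python
import unicodedata


def concat_nfd(a: str, b: str) -> str:
    """Concatenate two strings that are ALREADY in NFD, keeping the result in NFD.

    Only the run of combining marks that straddles the joining point is examined.
    """
    ccc = unicodedata.combining
    if not a or not b:
        return a + b
    # Each side is internally ordered, so the join is out of order only if
    # b starts with a combining mark whose class is lower than that of a's
    # last character.
    if ccc(b[0]) == 0 or ccc(a[-1]) <= ccc(b[0]):
        return a + b
    # Find the extent of the combining run on both sides of the join.
    i = len(a)
    while i > 0 and ccc(a[i - 1]) != 0:
        i -= 1
    j = 0
    while j < len(b) and ccc(b[j]) != 0:
        j += 1
    # sorted() is stable, so marks with equal classes keep their relative order.
    run = sorted(a[i:] + b[:j], key=ccc)
    return a[:i] + "".join(run) + b[j:]
```

In the common case the cost is two combining-class lookups; the sort runs only when marks from both strings actually meet in the wrong order at the join.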

Just my 0.02 euros.

_ Marco


