From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 23 2009 - 13:28:19 CST
> The worst performance would be (in the 1M character example I've been
> using), something like a base character followed by a list of 999,999
> characters with CCC != 0, sorted by CCC in reverse order. I added a note to
> this effect.
Actually, the worst performance would be a base character followed
by a list of 999,999 *non-BMP* characters with CCC != 0, because
then you'd miss all the BMP optimizations as well (more time for
table lookups, more bytes to handle, surrogate-branch processing
if done in UTF-16, etc.).
That one *is* a potential heart-stopper for an implementation that
handles combining character reordering in a way optimized for
two- or three-character sequences.
One fairly cheap (and tunable) way to put a safety check into
a Unicode Normalization algorithm, if one is concerned about
this theoretical possibility (which would never occur in any
meaningful text data -- Mark's point -- but which *could*
in principle occur in malicious data whose only goal is to
mess with normalization implementations), would be to put a
governor count in the loop that searches back for a character
with a lower or equal combining class, once it is processing
a combining character sequence. Such a count would require an
extra increment and test on each pass through the loop, but
that is in the uncommonly executed code (for normal data)
anyway. Then you can set the governor limit to 1000 (or 100
or 24 or 12 or whatever you think is a reasonable limit
beyond which the data can be certified as malicious) and have
the loop exit with a "String Not Canonically Reorderable in
Reasonable Time" error or whatever, as in the sketch below.
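For concreteness, a minimal sketch in C of what such a governor
might look like, assuming the input is already decomposed into an
array of code points; get_ccc, REORDER_GOVERNOR_LIMIT, and
REORDER_MALICIOUS are illustrative names rather than any real
library's API, and the ccc lookup is stubbed for just a few code
points:

    #include <stddef.h>
    #include <stdint.h>

    #define REORDER_GOVERNOR_LIMIT 1000  /* tunable: 1000, 100, 24, 12, ... */

    enum { REORDER_OK = 0, REORDER_MALICIOUS = -1 };

    /* Stand-in for a real Unicode property lookup (normally a
     * generated table); covers only the code points used here. */
    static uint8_t get_ccc(uint32_t cp)
    {
        switch (cp) {
        case 0x0301:  return 230;  /* COMBINING ACUTE ACCENT */
        case 0x0316:  return 220;  /* COMBINING GRAVE ACCENT BELOW */
        case 0x1D165: return 216;  /* MUSICAL SYMBOL COMBINING STEM */
        case 0x1D167: return 1;    /* MUSICAL SYMBOL COMBINING TREMOLO-1 */
        default:      return 0;    /* treat everything else as a starter */
        }
    }

    /* Canonical reordering of a decomposed code point array: an
     * insertion sort that is cheap for the short combining sequences
     * in real text.  The governor counts iterations of the backward
     * search for a character with a lower or equal combining class
     * and bails out once the count exceeds the limit. */
    int canonical_reorder(uint32_t *cp, size_t len)
    {
        for (size_t i = 1; i < len; i++) {
            uint8_t ccc = get_ccc(cp[i]);
            if (ccc == 0 || get_ccc(cp[i - 1]) <= ccc)
                continue;              /* starter, or already in order */

            uint32_t c = cp[i];
            size_t j = i;
            unsigned governor = 0;     /* safety check for hostile input */
            while (j > 0 && get_ccc(cp[j - 1]) > ccc) {
                if (++governor > REORDER_GOVERNOR_LIMIT)
                    return REORDER_MALICIOUS;  /* give up: not reorderable
                                                  in reasonable time */
                cp[j] = cp[j - 1];     /* shift higher-class mark right */
                j--;
            }
            cp[j] = c;                 /* insert at canonical position */
        }
        return REORDER_OK;
    }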
Yes, such a governor would mean that, theoretically, your
implementation wouldn't be conformant to UCA for the
million-combining-character sequence, but realistically,
who would care?
--Ken