Re: NFC FAQ

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 23 2009 - 13:28:19 CST

  • Next message: Kenneth Whistler: "Re: NFC FAQ"

    > The worst performance would be (in the 1M character example I've been
    > using), something like a base character followed by a list of 999,999
    > characters with CCC != 0, sorted by CCC in reverse order. I added a note to
    > this effect.

    Actually, the worst performance would be a base character followed
    by a list of 999,999 *non-BMP* characters with CCC != 0, because
    then you'd miss all the BMP optimization, as well. (more time for
    table lookup, more bytes to handle, surrogate-branch processing if
    done in UTF-16, etc.)

    That one *is* a potential heart-stopper for an implementation that
    handles combining character reordering in a way optimized for
    two- or three-character sequences.

    One fairly cheap (and tunable) way to put a safety check in
    a Unicode Normalization algorithm, if one is concerned about
    this theoretical possibility (which would never occur in any
    meaningful text data -- Mark's point -- but which *could*
    in principle occur in malicious data if the only goal for it
    was to mess with normalization implementations), would be
    to put a governor count in the loop that does the search back
    for a character with a lower or equal combining class, once
    it is processing a combining character sequence. Having such
    a count would require an extra increment and test on each
    loop through, but that is in the uncommonly executed code
    (for normal data) anyway. Then you can set the governor
    limit to 1000 (or 100 or 24 or 12 or whatever you think
    is a reasonable limit beyond which data can be certified
    as malicious) and have the loop exit with a "String Not
    Canonically Reorderable in Reasonable Time" error or whatever.

    Yes, such a governor would mean that theoretically your
    implementation wouldn't be conformant to UCA for the
    million combining character sequence, but realistically,
    who would care?

    --Ken

    --Ken



    This archive was generated by hypermail 2.1.5 : Mon Feb 23 2009 - 13:30:18 CST