Re: Normalization Implementation Tricks

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Feb 12 2009 - 12:52:47 CST


    For ICU, the trie also contains information on the trailing characters -
    whether they can combine with any previous character. Only if that is the
    case does ICU do a linear search in the list of items specific to the
    starter. The list is in ascending order of trailing code points, so this
    favors starters with few composites, and composites whose trailing code
    points are low (such as Latin/Greek/Cyrillic, in that order).
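
    As a rough sketch of that lookup shape in C (the names, record layout, and
    the toy two-entry table standing in for the trie are all illustrative
    assumptions, not ICU's actual format):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-character record; ICU's real packing differs. */
    typedef struct {
        uint32_t cp;                /* code point this record describes */
        int      combines_back;    /* can it combine with a preceding starter? */
        int      pair_count;       /* composition pairs when used as a starter */
        const uint32_t (*pairs)[2]; /* (trailing, composite), ascending by trailing */
    } CharInfo;

    /* Compositions for U+0041 'A' as a starter (a tiny subset). */
    static const uint32_t A_pairs[][2] = {
        { 0x0300, 0x00C0 }, /* A + combining grave     -> A-grave */
        { 0x0301, 0x00C1 }, /* A + combining acute     -> A-acute */
        { 0x0308, 0x00C4 }, /* A + combining diaeresis -> A-diaeresis */
    };

    /* Toy stand-in for the trie; a real implementation would resolve a code
     * point to its record with a few array indexings, not a scan. */
    static const CharInfo table[] = {
        { 0x0041, 0, 3, A_pairs }, /* 'A': starter with compositions */
        { 0x0301, 1, 0, 0 },       /* combining acute: combines backward */
    };

    static const CharInfo *trie_lookup(uint32_t cp) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].cp == cp) return &table[i];
        return 0;
    }

    /* Return the composite for starter+trailing, or 0 if none exists. */
    static uint32_t compose(uint32_t starter, uint32_t trailing) {
        const CharInfo *t = trie_lookup(trailing);
        if (!t || !t->combines_back)
            return 0; /* cheap early exit for the vast majority of input */
        const CharInfo *s = trie_lookup(starter);
        if (!s)
            return 0;
        /* Linear scan of the starter's list; ascending order means frequent
         * low trailing code points are found after a few comparisons. */
        for (int i = 0; i < s->pair_count; i++) {
            if (s->pairs[i][0] == trailing) return s->pairs[i][1];
            if (s->pairs[i][0] > trailing)  break; /* past its slot; absent */
        }
        return 0;
    }

    int main(void) {
        /* Expect U+00C1 (A with acute). */
        printf("U+0041 + U+0301 -> U+%04X\n", (unsigned)compose(0x0041, 0x0301));
        return 0;
    }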

    The trie lookup is quite fast. Whether some other mechanism would perform
    better than the linear search would need careful testing. That is, when
    comparing the performance of any mechanism having to do with Unicode, the
    expected frequency of characters (and of the contexts they appear in)
    plays a huge role in how performance should be measured and tuned. It may
    be better to take an O(n) algorithm over an O(1) algorithm, depending on
    the constants involved.

    To take a simple example: for general-purpose software, it is worth
    decreasing performance for non-BMP characters by 10x if you can increase
    performance for BMP characters by 1%, because non-BMP characters are so
    extremely rare.
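
    To make that arithmetic concrete (assuming, purely for illustration, one
    non-BMP character per 10,000): the expected cost per character is
    0.9999 * c_bmp + 0.0001 * c_supp. Cutting c_bmp by 1% saves about
    0.01 * c_bmp, while making c_supp 10x worse adds only 0.0009 * c_supp, so
    the trade wins unless a supplementary lookup already cost more than about
    11x a BMP one.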

    Mark

    On Wed, Feb 11, 2009 at 18:33, Michael D. Adams <mdmkolbe@gmail.com> wrote:

    > How do people efficiently implement the (re-)composition table used by
    > the normalization algorithm for NFC and NFKC? (I am writing a
    > library for a project.)
    >
    > The most naive implementation would be a table indexed by a starter
    > character and a combining character. Of course that is completely
    > unreasonable as it would require 0x110000 * 0x110000 entries (a few
    > terabytes).
    >
    > If I understand right, the ICU library uses shared tries (as the Unicode
    > spec suggests) indexed by the starter character that point to lists of
    > (combining character, result) pairs (an association list in Lisp/Scheme
    > terminology). This should reduce the size requirements, but now there is
    > a list we have to scan through, which could increase run-time access
    > cost.
    >
    > Are there any other implementation methods that have a small memory
    > footprint (~10-20 KB) and quick access (~10-20 instructions)? Any
    > guidance in this regard would be appreciated.
    >
    > Michael D. Adams
    > mdmkolbe@gmail.com
    >
    >


