RE: Normalization Implementation Tricks

From: Richard Ishida (ishida@w3.org)
Date: Thu Feb 12 2009 - 09:12:09 CST


    Michael,

    You can look at my source code for normalization in PHP or JavaScript at
    http://rishida.net/blog/?p=222 , if that's any help.

    RI

    ============
    Richard Ishida
    Internationalization Lead
    W3C (World Wide Web Consortium)

    http://www.w3.org/International/
    http://rishida.net/

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > On Behalf Of Michael D. Adams
    > Sent: 12 February 2009 02:33
    > To: unicode@unicode.org
    > Subject: Normalization Implementation Tricks
    >
    > How do people efficiently implement the (re-)composition table used by
    > the normalization algorithm for NFC and NFKC? (I am writing a
    > library for a project.)
    >
    > The most naive implementation would be a table indexed by a starter
    > character and a combining character. Of course that is completely
    > unreasonable as it would require 0x110000 * 0x110000 entries (a few
    > terabytes).
    >
    > If I understand right, the ICU library uses shared tries (as the Unicode
    > spec suggests) indexed by the starter character that point to lists of
    > combining-character and result pairs (an association list in
    > Lisp/Scheme terminology). This should reduce the size requirements,
    > but now there is a list we have to scan through, which could increase
    > run-time access cost.
    >
    > Are there any other implementation methods that have a small memory
    > footprint (~10-20 KB) and quick access (~10-20 instructions)? Any
    > guidance in this regard would be appreciated.
    >
    > Michael D. Adams
    > mdmkolbe@gmail.com
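
    For illustration only, here is a minimal sketch of one alternative to
    the association-list layout described in the question: a flat table of
    (starter, combining) pairs sorted by a packed key and searched with
    binary search. This is not the method used by ICU or by the code linked
    above. The names build_pairs and compose_pair and the 21-bit key packing
    are assumptions made for this example, and the pairs are derived from
    whatever Unicode version Python's unicodedata module ships. Hangul
    syllables are not in the table; they are normally composed
    arithmetically.

```python
import bisect
import sys
import unicodedata

# Size of the naive table from the question: one entry per pair of code points.
NAIVE_ENTRIES = 0x110000 * 0x110000  # roughly 1.2 trillion entries


def build_pairs():
    """Collect (starter, combining) -> composite pairs from Python's Unicode data."""
    entries = []
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility-only mapping
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        first, second = (int(p, 16) for p in parts)
        # Keep the pair only if NFC really recomposes it; this filters out
        # composition exclusions without needing a separate data file.
        if unicodedata.normalize("NFC", chr(first) + chr(second)) != chr(cp):
            continue
        # Code points fit in 21 bits, so a pair packs into one integer key.
        entries.append(((first << 21) | second, cp))
    entries.sort()
    keys = [key for key, _ in entries]
    values = [value for _, value in entries]
    return keys, values


KEYS, VALUES = build_pairs()


def compose_pair(first, second):
    """Return the composite code point for (first, second), or None."""
    key = (first << 21) | second
    i = bisect.bisect_left(KEYS, key)
    if i < len(KEYS) and KEYS[i] == key:
        return VALUES[i]
    return None


if __name__ == "__main__":
    print("naive entries:", NAIVE_ENTRIES)
    print("sorted-table entries:", len(KEYS))
    print("A + U+0300 ->", compose_pair(0x41, 0x300))
```

    Because the table holds only the pairs that actually compose, its size
    depends on the number of composites rather than on the size of the code
    space, and a lookup costs a logarithmic number of comparisons. A hash
    table keyed on the same packed integer would be another option; which
    one is faster in practice would need measuring.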


