From: Michael D. Adams (mdmkolbe@gmail.com)
Date: Wed Feb 11 2009 - 20:33:20 CST
How do people efficiently implement the (re-)composition table used by
the normalization algorithm for NFC and NFKC? (I am writing a
library for a project.)
The most naive implementation would be a table indexed by a starter
character and a combining character. Of course that is completely
unreasonable as it would require 0x110000 * 0x110000 entries (a few
terabytes).
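The arithmetic behind that estimate can be checked directly (assuming, for illustration, 4 bytes per entry, enough to hold any code point):

```python
# Dense 2-D table indexed by (starter, combining) pairs.
# 4 bytes per entry is an assumption for the sake of the estimate.
code_points = 0x110000            # 1,114,112 possible code points
entry_bytes = 4
total = code_points * code_points * entry_bytes
print(total / 2**40)              # roughly 4.5 TiB
```

So even with small entries the dense table lands in the multi-terabyte range, as stated.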
If I understand right, the ICU library uses shared tries (as the Unicode
spec suggests), indexed by the starter character, that point to lists of
(combining character, result) pairs (an association list in
Lisp/Scheme terminology). This should reduce the size requirement,
but now there is a list to scan through, which could increase the
run-time access cost.
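A minimal sketch of that lookup scheme, in Python for clarity (the table here holds only a few real canonical compositions; a full table would be generated from the UCD, and the per-starter lists can be binary-searched rather than scanned):

```python
from bisect import bisect_left

# Sparse map: starter code point -> sorted list of (combining cp, composed cp).
# Only three real canonical compositions shown for illustration.
COMPOSITIONS = {
    0x0041: [(0x0300, 0x00C0), (0x0301, 0x00C1)],  # A + grave -> À, A + acute -> Á
    0x0065: [(0x0301, 0x00E9)],                    # e + acute -> é
}

def compose_pair(starter, combining):
    """Return the primary composite for (starter, combining), or None."""
    pairs = COMPOSITIONS.get(starter)
    if pairs is None:
        return None
    # Binary search on the combining code point (keys could be precomputed).
    keys = [c for c, _ in pairs]
    i = bisect_left(keys, combining)
    if i < len(keys) and keys[i] == combining:
        return pairs[i][1]
    return None
```

With sorted per-starter lists the scan becomes O(log n) per lookup, and since most starters compose with only a handful of combining marks, the lists stay short in practice.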
Are there any other implementation methods that have a small memory
footprint (~10-20 KB) and quick access (~10-20 instructions)? Any
guidance in this regard would be appreciated.
Michael D. Adams
mdmkolbe@gmail.com
This archive was generated by hypermail 2.1.5 : Wed Feb 11 2009 - 20:37:42 CST