Re: Normalization Implementation Tricks

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Feb 12 2009 - 12:52:47 CST


    For ICU, the trie also contains information on the trailing characters -
    whether they can combine with any previous character. Only if that is the
    case does ICU do a linear search in the list of items specific to the
    starter. The list is in ascending order of trailing code points, so this
    favors starters with few composites, and composites whose trailing code
    points are low (such as Latin/Greek/Cyrillic, in that order).
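
    As a rough sketch of that lookup shape in C (the names, record layout, and
    the toy two-entry table standing in for the trie are all illustrative
    assumptions, not ICU's actual format):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-character record; ICU's real packing differs. */
    typedef struct {
        uint32_t cp;                /* code point this record describes */
        int      combines_back;    /* can it combine with a preceding starter? */
        int      pair_count;       /* composition pairs when used as a starter */
        const uint32_t (*pairs)[2]; /* (trailing, composite), ascending by trailing */
    } CharInfo;

    /* Compositions for U+0041 'A' as a starter (a tiny subset). */
    static const uint32_t A_pairs[][2] = {
        { 0x0300, 0x00C0 }, /* A + combining grave     -> A-grave */
        { 0x0301, 0x00C1 }, /* A + combining acute     -> A-acute */
        { 0x0308, 0x00C4 }, /* A + combining diaeresis -> A-diaeresis */
    };

    /* Toy stand-in for the trie; a real implementation would resolve a code
     * point to its record with a few array indexings, not a scan. */
    static const CharInfo table[] = {
        { 0x0041, 0, 3, A_pairs }, /* 'A': starter with compositions */
        { 0x0301, 1, 0, 0 },       /* combining acute: combines backward */
    };

    static const CharInfo *trie_lookup(uint32_t cp) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].cp == cp) return &table[i];
        return 0;
    }

    /* Return the composite for starter+trailing, or 0 if none exists. */
    static uint32_t compose(uint32_t starter, uint32_t trailing) {
        const CharInfo *t = trie_lookup(trailing);
        if (!t || !t->combines_back)
            return 0; /* cheap early exit for the vast majority of input */
        const CharInfo *s = trie_lookup(starter);
        if (!s)
            return 0;
        /* Linear scan of the starter's list; ascending order means frequent
         * low trailing code points are found after a few comparisons. */
        for (int i = 0; i < s->pair_count; i++) {
            if (s->pairs[i][0] == trailing) return s->pairs[i][1];
            if (s->pairs[i][0] > trailing)  break; /* past its slot; absent */
        }
        return 0;
    }

    int main(void) {
        /* Expect U+00C1 (A with acute). */
        printf("U+0041 + U+0301 -> U+%04X\n", (unsigned)compose(0x0041, 0x0301));
        return 0;
    }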

    The trie lookup is quite fast. Whether some other mechanism would perform
    better than the linear search would need careful testing. That is, when
    comparing the performance of any mechanism having to do with Unicode, the
    expected frequency of characters (and of the contexts they appear in)
    plays a huge role in how performance should be measured and tuned. It may
    be better to take an O(n) algorithm over an O(1) algorithm, depending on
    the constants involved.

    To take a simple example: for general-purpose software, it is worth
    decreasing performance for non-BMP characters by 10x if you can increase
    performance for BMP characters by 1%, because non-BMP characters are so
    extremely rare.
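
    To make that arithmetic concrete (assuming, purely for illustration, one
    non-BMP character per 10,000): the expected cost per character is
    0.9999 * c_bmp + 0.0001 * c_supp. Cutting c_bmp by 1% saves about
    0.01 * c_bmp, while making c_supp 10x worse adds only 0.0009 * c_supp, so
    the trade wins unless a supplementary lookup already cost more than about
    11x a BMP one.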

    Mark

    On Wed, Feb 11, 2009 at 18:33, Michael D. Adams <mdmkolbe@gmail.com> wrote:

    > How do people efficiently implement the (re-)composition table used by
    > the normalization algorithm for NFC and NFKC? (I am writing a
    > library for a project.)
    >
    > The most naive implementation would be a table indexed by a starter
    > character and a combining character. Of course that is completely
    > unreasonable as it would require 0x110000 * 0x110000 entries (a few
    > terabytes).
    >
    > If I understand right, the ICU library uses shared tries (as the Unicode
    > spec suggests) indexed by the starter character that point to lists of
    > (combining character, result) pairs (an association list in Lisp/Scheme
    > terminology). This should reduce the size requirements, but now there is
    > a list we have to scan through, which could increase run-time access
    > cost.
    >
    > Are there any other implementation methods that have a small memory
    > footprint (~10-20 KB) and quick access (~10-20 instructions)? Any
    > guidance in this regard would be appreciated.
    >
    > Michael D. Adams
    > mdmkolbe@gmail.com
    >
    >


