From: Daniel Ehrenberg (microdan@gmail.com)
Date: Thu Mar 15 2007 - 20:49:16 CST
Hi,
I'm working on adding Unicode support (possibly eventually conformance)
to an obscure programming language called Factor, which is sort of a
cross between Forth and Lisp (see factorcode.org for more
information). One thing that I'm doing is that all strings will always
be kept in Normalization Form D (as defined in UAX #15: Normalization
Forms) for processing. That way all canonically equivalent strings
return true when tested for equality. It wasn't difficult to implement
NFD (or NFKD); I just needed to read the transformations from
UnicodeData.txt and apply them recursively to get a hash table of
characters to canonical/compatibility-decomposed strings. But for most
I/O purposes, I need to use NFC, re-composing all decomposed
characters. I have no idea how to do this efficiently. In many cases,
it's more complicated than just turning two adjacent characters into
one character.
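
For readers following along, the decomposition step described above might look roughly like this. It is a minimal sketch in Python rather than Factor, and it assumes Python's standard unicodedata module as a stand-in for parsing UnicodeData.txt directly. The names decompose_char, reorder_marks, and nfd are made up for illustration. Hangul syllables, which decompose algorithmically rather than through table entries, are not handled here.

```python
import unicodedata


def decompose_char(ch, compat=False):
    """Recursively expand one character's decomposition mapping.

    unicodedata.decomposition() returns the raw UnicodeData.txt field,
    e.g. '017F 0307' or '<compat> 0073', or '' when there is no mapping.
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return ch
    parts = raw.split()
    if parts[0].startswith("<"):
        # A tag such as <compat> marks a compatibility-only mapping.
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)


def reorder_marks(s):
    """Put each run of combining marks into canonical order.

    Marks are sorted by canonical combining class; list.sort is stable,
    so marks with equal classes keep their relative order.
    """
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            run.sort(key=unicodedata.combining)
            out.extend(run)
            run = []
            out.append(ch)
        else:
            run.append(ch)
    run.sort(key=unicodedata.combining)
    out.extend(run)
    return "".join(out)


def nfd(s, compat=False):
    """NFD (or NFKD when compat=True), ignoring Hangul syllables."""
    return reorder_marks("".join(decompose_char(c, compat) for c in s))
```

With this sketch, nfd("\u1e9b\u0323") gives U+017F U+0323 U+0307: the dot below (combining class 220) is ordered before the dot above (class 230).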
I looked at both the GLib source (which defines basic Unicode
operations) and the Normalizer demo that UAX 15 links to (which, btw,
only works properly for the BMP, which is bad). They both appear to
use generally the same strategy: perform as many pairwise compositions
on adjacent characters as possible. I wonder if I'm reading it wrong,
because if that's how it operates, then one of the examples in the UAX
wouldn't work properly: NFC(U+017F U+0323 U+0307) = U+1E9B U+0323.
This composes two non-adjacent characters. Is there any efficient way
to do this composition without messing up canonical ordering while
making sure to compose non-adjacent characters like this? It's an edge
case, I know, but I want my implementation to be correct.
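
One way the non-adjacent case could be handled is the "blocked" rule from the canonical composition description in UAX #15: a combining character may still compose with the last starter as long as no character in between has a combining class of 0 or a class greater than or equal to its own. The following is only a sketch of that idea in Python, not the GLib or demo code. The names build_composition_pairs, PAIRS, and compose are hypothetical. The table is built from Python's unicodedata module, using NFC(ch) == ch as a shortcut for reading CompositionExclusions.txt, and Hangul composition is left out. The input is assumed to be in NFD already.

```python
import unicodedata


def build_composition_pairs():
    """Map (first, second) -> primary composite for two-character
    canonical decompositions.

    Shortcut (an assumption of this sketch): a character counts as a
    primary composite when NFC leaves it unchanged, which stands in for
    parsing CompositionExclusions.txt.
    """
    pairs = {}
    for cp in range(0x110000):
        ch = chr(cp)
        raw = unicodedata.decomposition(ch)
        if not raw or raw.startswith("<"):
            continue  # no mapping, or compatibility-only mapping
        parts = raw.split()
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            first, second = (chr(int(p, 16)) for p in parts)
            pairs[(first, second)] = ch
    return pairs


PAIRS = build_composition_pairs()


def compose(decomposed):
    """Compose a string that is already in NFD (Hangul not handled)."""
    out = []
    starter_idx = None  # index in `out` of the most recent starter
    last_ccc = 0        # combining class of the last character kept
    for ch in decomposed:
        ccc = unicodedata.combining(ch)
        if starter_idx is not None:
            adjacent = len(out) - 1 == starter_idx
            # Not blocked: either directly after the starter, or every
            # mark kept since the starter has a lower combining class.
            if adjacent or last_ccc < ccc:
                composite = PAIRS.get((out[starter_idx], ch))
                if composite is not None:
                    out[starter_idx] = composite
                    continue  # ch was absorbed into the starter
        if ccc == 0:
            starter_idx = len(out)
        out.append(ch)
        last_ccc = ccc
    return "".join(out)
```

On the example from the UAX, compose("\u017f\u0323\u0307") yields U+1E9B U+0323: the dot below stays in place because it has no composite with U+017F, and the dot above is not blocked by it (220 < 230), so it still combines with the starter. The pass is a single left-to-right scan, and canonical order is preserved because uncomposed marks are never moved.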
In many places, the Unicode standard provides clues for
implementation, but I see none for NFC (or NFKC) and how to compose
characters. Can anyone help me?
Daniel Ehrenberg