Re: UAX #15 Unicode Normalization Forms, D4 Primary Combined

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 22 2008 - 20:51:22 CST


Russell Shaw asked:

> On P.1344 of Unicode 5.0, it has D4:
>
> A character X can be "primary combined" with a character Y if and only
> if there is a primary composite Z that is canonically equivalent to the
> sequence <X,Y>.
>
>
> Which tables in http://www.unicode.org/Public/UNIDATA/
> should i use to do this?

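Nearly all of the decomposition (and composition) data needed
is in the Decomposition_Mapping field of UnicodeData.txt. For
composition you also need the composition exclusions
(CompositionExclusions.txt), which identify characters that
have a canonical decomposition but are not primary composites.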

For example, to take a simple case, a-acute. The entry
in UnicodeData.txt is:

00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;...

That defines the canonical *decomposition* of U+00E1
as <0061, 0301>.
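
As a rough illustration of reading that field, here is a minimal Python
sketch. The function name canonical_decomposition is made up for this
example, and the input is the entry exactly as quoted above, truncated
after the sixth field. The cross-check uses unicodedata.decomposition()
from the Python standard library, which reflects whatever Unicode version
that Python build ships with.

```python
import unicodedata

# The UnicodeData.txt entry quoted above, truncated exactly as in the message.
line = "00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;..."

def canonical_decomposition(entry):
    """Return (code point, canonical decomposition or None) for one entry.

    Hypothetical helper for illustration only.
    """
    fields = entry.split(";")
    code_point = int(fields[0], 16)
    mapping = fields[5]  # Decomposition_Mapping is the sixth field
    # Compatibility mappings start with a tag such as <compat>; they are
    # not canonical decompositions, so skip them here.
    if not mapping or mapping.startswith("<"):
        return code_point, None
    return code_point, [int(cp, 16) for cp in mapping.split()]

cp, parts = canonical_decomposition(line)

# Cross-check against the copy of the data bundled with Python.
builtin = unicodedata.decomposition(chr(cp))
```

Here parts holds the two code points of the decomposition, and builtin
holds the same mapping as a space-separated hex string.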

Now to go the other way, there is a primary composite U+00E1
that is canonically equivalent to the sequence <0061, 0301>,
so if you are doing canonical composition, then you replace
<0061, 0301> by 00E1.
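
To make the reverse direction concrete, the sketch below builds a
pair-to-composite lookup table in Python. It is a shortcut under stated
assumptions, not a parser for the UCD files: it reads decompositions
through the standard unicodedata module, and it approximates the
composition exclusions by dropping any character that NFC does not leave
unchanged. The names build_composition_table and COMPOSE are invented for
the example.

```python
import sys
import unicodedata

def build_composition_table():
    """Map (first, second) character pairs to their primary composite.

    Hypothetical helper; relies on Python's bundled Unicode data.
    """
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters with no mapping or with a tagged (compatibility) one.
        if not mapping or mapping.startswith("<"):
            continue
        parts = [chr(int(p, 16)) for p in mapping.split()]
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        # Assumption: a character that NFC rewrites is excluded from
        # composition, so it is not a primary composite.
        if unicodedata.normalize("NFC", ch) != ch:
            continue
        table[(parts[0], parts[1])] = ch
    return table

COMPOSE = build_composition_table()
```

With this table, looking up the pair <0061, 0301> yields 00E1. Hangul
syllables do not appear in it, because their decompositions are not listed
in UnicodeData.txt.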

And so on. Of course, for certain combinations of combining
marks, things can get more complicated.
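
One way to see the complication is a base letter followed by two marks
with different combining classes. The Python sketch below uses the
standard unicodedata module; the sequence chosen (a, circumflex, dot
below) is just an illustrative case. The marks are first put into
canonical order, so the dot below composes with the base before the
circumflex does, even though the circumflex came first in the input.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

# Canonical_Combining_Class values decide the canonical order of marks.
ccc_circumflex = unicodedata.combining(CIRCUMFLEX)
ccc_dot_below = unicodedata.combining(DOT_BELOW)

text = "a" + CIRCUMFLEX + DOT_BELOW

# Decomposition reorders the marks by ascending combining class.
nfd = unicodedata.normalize("NFD", text)

# Composition then works on the reordered sequence, pair by pair.
nfc = unicodedata.normalize("NFC", text)

for label, s in (("input", text), ("NFD", nfd), ("NFC", nfc)):
    print(label, " ".join(f"{ord(c):04X}" for c in s))
```

So a table lookup on adjacent pairs of the original text is not enough;
the canonical reordering step has to happen first.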

See UCD.html for documentation of the individual fields of
UnicodeData.txt.

Actually, for Hangul and Jamo characters for Korean, you need
to augment the data in UnicodeData.txt with the algorithm
described in Section 3.12, Conjoining Jamo Behavior, of
the standard.
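
For orientation, the arithmetic for Hangul can be sketched in Python as
follows. The constants are the usual ones for the Hangul syllable block
and the conjoining jamo; the function names compose_hangul and
decompose_hangul are invented here, and this is only an outline of the
pairwise step, not a full normalizer.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # total precomposed syllables

def compose_hangul(first, second):
    """Compose an <L, V> or <LV, T> pair; return None if not composable."""
    a, b = ord(first), ord(second)
    l_index, v_index = a - L_BASE, b - V_BASE
    if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
    s_index, t_index = a - S_BASE, b - T_BASE
    # Only an LV syllable (no trailing consonant yet) can take a T.
    if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
            and 0 < t_index < T_COUNT):
        return chr(a + t_index)
    return None

def decompose_hangul(syllable):
    """Split a precomposed syllable into jamo; return None otherwise."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return None
    jamo = chr(L_BASE + s_index // N_COUNT)
    jamo += chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    if t_index:
        jamo += chr(T_BASE + t_index)
    return jamo
```

For instance, composing 1112 with 1161 and then the result with 11AB
should give D55C, and decomposing D55C should give back the three jamo.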

Implementing Unicode normalization from scratch is not for
the faint of heart. ;-) My advice, if possible, is simply to
rely on an established and debugged API provided for
Unicode normalization, out of an existing library such as
ICU:

http://www.icu-project.org/
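
To show what relying on an existing API looks like, here is a short
Python sketch. Note that it swaps in the unicodedata module from the
Python standard library rather than ICU itself, simply because it needs
no installation; the helper name canonically_equivalent is invented for
the example.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Compare two strings for canonical equivalence via NFD.

    Hypothetical helper; the library does all of the normalization work.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # a + COMBINING ACUTE ACCENT

same = canonically_equivalent(precomposed, decomposed)
composed = unicodedata.normalize("NFC", decomposed)
```

The library takes care of reordering, exclusions, and Hangul, which is
the point of not writing it yourself.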

--Ken


