Re: AbstractCharacter class

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Fri Aug 01 1997 - 14:49:01 EDT


On Mon, 28 Jul 1997, John Cowan wrote:

> My original design allowed for a method which would return a
> fully reduced representation of an AbstractCharacter into as
> few characters as possible, by taking advantage of Hangul
> syllable characters and precomposed characters.
>
> But taking a base character followed by arbitrarily many
> combining characters and finding the shortest possible sequence
> (involving a new base character and fewer combining characters)
> turns out to be hard, because of the Canonical Ordering Algorithm.
>
> Thus LATIN CAPITAL LETTER O plus COMBINING DOT BELOW plus
> COMBINING CIRCUMFLEX BELOW plus COMBINING CIRCUMFLEX (to make
> up an example) can be reduced to LATIN CAPITAL LETTER O WITH
> CIRCUMFLEX AND DOT BELOW (U+1ED8) plus COMBINING CIRCUMFLEX BELOW,
> but if DOT BELOW comes after CIRCUMFLEX BELOW, the shortest reduction
> is to LATIN CAPITAL LETTER O WITH CIRCUMFLEX plus COMBINING
> CIRCUMFLEX BELOW plus COMBINING DOT BELOW.
>
> After messing with many ugly and inefficient algorithms, I
> would like input on the following partial strategy:
>
> If a base+combining sequence is EXACTLY equivalent
> to ONE precomposed character, reduce it to the single
> character. Otherwise, do not reduce it at all.
>
> The argument for this is that reduction to compatibility
> characters will be done for interoperation with Level 1-type
> systems that can't cope with combining characters, not for
> mere compression. If there's no way to weed out all the
> combining characters, the recipient must be prepared to cope
> with some, and if some, why not all?
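
The partial strategy quoted above can be sketched in Python as follows.
This is only an illustration, not code from the original design. It
assumes Python's standard unicodedata module and its NFC normalization
form, both of which postdate this message, and the name reduce_exact is
made up for the example.

```python
import unicodedata


def reduce_exact(seq: str) -> str:
    """Reduce a base+combining sequence only if it is exactly equivalent
    to ONE precomposed character; otherwise return it untouched.

    Assumption: NFC stands in for "canonically equivalent precomposed
    character". NFC skips composition-excluded characters, so a few
    exact matches are left unreduced by this sketch.
    """
    composed = unicodedata.normalize("NFC", seq)
    return composed if len(composed) == 1 else seq


# O + COMBINING DOT BELOW + COMBINING CIRCUMFLEX: exact match, one character.
print(ascii(reduce_exact("O\u0323\u0302")))
# O + DOT BELOW + CIRCUMFLEX BELOW + CIRCUMFLEX: no single equivalent,
# so the sequence comes back unchanged.
print(ascii(reduce_exact("O\u0323\u032d\u0302")))
```

The all-or-nothing check keeps the function trivial: there is no search
over partial reductions, which is exactly the part that the Canonical
Ordering Algorithm makes hard.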

I made another proposal, for a somewhat different purpose, in

ftp://ftp.ifi.unizh.ch/outgoing/draft-duerst-i18n-norm-00.txt

In addition to exact matches, it also covers the cases where the
canonically ordered decomposed representation of a precomposed
character is an initial substring of the decomposed representation
of the character being reduced (if there is more than one such
precomposed character, the longest match is used).
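
A rough Python sketch of this longest-initial-substring rule, under
stated assumptions: it is not taken from the draft, it uses Python's
standard unicodedata module (whose data is far newer than this message),
it ignores composition exclusions, and the names PRECOMPOSED and
reduce_longest_prefix are hypothetical.

```python
import sys
import unicodedata


def build_table() -> dict:
    """Map each multi-character canonical decomposition (NFD) to its
    precomposed character. On a collision the lowest code point wins."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) > 1:
            table.setdefault(decomposed, ch)
    return table


PRECOMPOSED = build_table()


def reduce_longest_prefix(seq: str) -> str:
    """Replace the longest initial substring of the canonically ordered
    decomposition of seq that equals the decomposition of a precomposed
    character; the remaining combining characters are kept as they are.

    Assumption: seq is one base character plus its combining characters.
    """
    decomposed = unicodedata.normalize("NFD", seq)
    for end in range(len(decomposed), 1, -1):  # longest prefix first
        hit = PRECOMPOSED.get(decomposed[:end])
        if hit is not None:
            return hit + decomposed[end:]
    return decomposed


# Exact match: the whole decomposition is the prefix.
print(ascii(reduce_longest_prefix("O\u0323\u0302")))
# Partial match: only O + DOT BELOW forms a precomposed prefix.
print(ascii(reduce_longest_prefix("O\u0323\u032d\u0302")))
```

For O + DOT BELOW + CIRCUMFLEX BELOW + CIRCUMFLEX this sketch gives
U+1ECC followed by two combining characters, which is longer than the
shortest reduction in John's example (U+1ED8 plus one combining
character). The prefix rule trades minimal length for a single table
lookup per candidate prefix.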

I have also specified normalization for Hangul Jamo. Any
comments are welcome.

Regards, Martin.


