AbstractCharacter class

From: John Cowan (john_cowan@hotmail.com)
Date: Mon Jul 28 1997 - 17:35:51 EDT


Due to network problems, I can read mail at cowan@ccil.org, but
can't post/reply/send from there. Please direct all replies to
cowan@ccil.org, not the HotMail address. Thanks.

Some weeks ago I posted about my design for an AbstractCharacter
class for Java, which would allow processing Unicode data in
high-level chunks (based on the "Character Boundaries" algorithm
at p. 5-21f. of TUS2.0).

My original design allowed for a method which would reduce an
AbstractCharacter to as few characters as possible, by taking
advantage of Hangul syllable characters and precomposed
characters.

But taking a base character followed by arbitrarily many
combining characters and finding the shortest possible sequence
(involving a new base character and fewer combining characters)
turns out to be hard, because of the Canonical Ordering Algorithm.

Thus LATIN CAPITAL LETTER O plus COMBINING DOT BELOW plus
COMBINING CIRCUMFLEX BELOW plus COMBINING CIRCUMFLEX (to make
up an example) can be reduced to LATIN CAPITAL LETTER O WITH
CIRCUMFLEX AND DOT BELOW (U+1ED8) plus COMBINING CIRCUMFLEX BELOW,
but if DOT BELOW comes after CIRCUMFLEX BELOW, the shortest reduction
is to LATIN CAPITAL LETTER O WITH CIRCUMFLEX plus COMBINING
CIRCUMFLEX BELOW plus COMBINING DOT BELOW, since the two below-marks
share a combining class and so must keep their relative order.
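
The order dependence in this example can be tried out directly.
The following is only a sketch, and it rests on an assumption that
is not part of the design discussed here: it uses the
java.text.Normalizer class found in later Java releases (NFC
composition) as a stand-in for "reduction". The class name
OrderingDemo and its helper methods are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of the AbstractCharacter design.
final class OrderingDemo {
    private OrderingDemo() {}

    /** Formats each code point of s as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Composes s with NFC (an assumed stand-in for "reduction") and lists the result. */
    static String compose(String s) {
        return codePoints(Normalizer.normalize(s, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // O, DOT BELOW (U+0323), CIRCUMFLEX BELOW (U+032D), CIRCUMFLEX (U+0302)
        System.out.println(compose("O\u0323\u032D\u0302"));
        // The same marks with the two below-marks in the opposite order
        System.out.println(compose("O\u032D\u0323\u0302"));
    }
}
```

The two inputs differ only in the order of the two marks below the
letter, yet they compose to different precomposed base characters.
Marks of the same combining class are never reordered, so the
relative order of the two below-marks is carried through to the
output in both cases.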

After messing with many ugly and inefficient algorithms, I
would like input on the following partial strategy:

    If a base+combining sequence is EXACTLY equivalent
    to ONE precomposed character, reduce it to the single
    character. Otherwise, do not reduce it at all.
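
As a rough illustration of the strategy above, here is a minimal
sketch. It assumes the java.text.Normalizer class from later Java
releases and treats "exactly equivalent to one precomposed
character" as "NFC composition yields a single code point"; that
reading is an assumption, and the class and method names are
hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper; a sketch of the all-or-nothing strategy only.
final class ExactReduction {
    private ExactReduction() {}

    /**
     * Returns the single precomposed character when the whole sequence
     * composes (under NFC) to exactly one code point; otherwise returns
     * the sequence unchanged, with no partial reduction.
     */
    static String reduce(String sequence) {
        String composed = Normalizer.normalize(sequence, Normalizer.Form.NFC);
        boolean single = composed.codePointCount(0, composed.length()) == 1;
        return single ? composed : sequence;
    }

    public static void main(String[] args) {
        // O + DOT BELOW + CIRCUMFLEX: equivalent to one character, so reduced.
        System.out.println(reduce("O\u0323\u0302").length());
        // Adding CIRCUMFLEX BELOW leaves a combining mark over, so untouched.
        System.out.println(reduce("O\u0323\u032D\u0302").length());
    }
}
```

One caveat with this stand-in: NFC deliberately never produces
certain precomposed characters (the composition exclusions), so a
sequence equivalent to one of those would be left unreduced here.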

The argument for this is that reduction to compatibility
characters will be done for interoperation with Level 1-type
systems that can't cope with combining characters, not for
mere compression. If there's no way to weed out all the
combining characters, the recipient must be prepared to cope
with some, and if some, why not all?

What do the assembled Unicoders think of that idea?

-- 
John Cowan                       cowan@ccil.org
        Please do not use "Reply"
        e'osai ko sarji la lojban.


