RE: johab compound letters reference for Hangul? (3)

From: Philippe Verdy (
Date: Sun Dec 21 2003 - 18:03:06 EST


    Kent Karlsson wrote:
    > Philippe Verdy wrote:
    > ...
    > > Here is what I have (this is just the part related to Hangul
    > > jamos in the Johab set), presented in collation order:
    > > # add canonical de/recomposition of "Johab" compound leading consonnant jamos in Hangul
    > > # (there are 17 basic consonnants) in Hangul, IEUNG is used for
    > > #1100;HANGUL CHOSEONG KIYEOK;Lo;0;L;;;;;N;;G;;;
    > > 1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L;<johab> 1100 1100;;;;N;;GG;;;
    > ...
    > When possible, I've preferred the "left associative" reading, just
    > to make it easier for the recomposition. I don't think there is any
    > linguistic reason for preferring the "right associative" reading for
    > any of these. The current interpretation for doubled consonants is
    > a modern one; I think the historic reading is different (but not
    > quite sure exactly how).

    Here too I have no good hint as to which association is preferred,
    other than the normative name. Of course this is just an intermediate
    decomposition, and it can be expanded before actual use. (In fact
    there are cases where expanding directly to three letters is
    already needed because there is no corresponding pair, notably
    when compatibility clusters have to be mapped to johab clusters,
    so this pairwise view only serves to simplify editing the rules.)
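    To illustrate what "expandable before actual use" means, here is a
    minimal sketch (not my actual parser) that expands pairwise,
    left-associative rules recursively down to basic letters. The class
    and method names are hypothetical. The 1101 rule comes from the
    file, 1122 = 1121 1100 is the alternative Kent prefers, and
    1121 = 1107 1109 is assumed from the character name PIEUP-SIOS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JohabExpansion {
    // Pairwise (left-associative) rules: compound jamo -> {left part, last letter}.
    private static final Map<Integer, int[]> PAIRS = new HashMap<>();
    static {
        // SSANGKIYEOK = KIYEOK KIYEOK (rule taken from the file)
        PAIRS.put(0x1101, new int[] {0x1100, 0x1100});
        // PIEUP-SIOS = PIEUP SIOS (assumption derived from the character name)
        PAIRS.put(0x1121, new int[] {0x1107, 0x1109});
        // PIEUP-SIOS-KIYEOK = PIEUP-SIOS KIYEOK (the left-associative alternative)
        PAIRS.put(0x1122, new int[] {0x1121, 0x1100});
    }

    // Returns the full expansion of cp into letters that have no rule of their own.
    static List<Integer> expand(int cp) {
        List<Integer> out = new ArrayList<>();
        expandInto(cp, out);
        return out;
    }

    private static void expandInto(int cp, List<Integer> out) {
        int[] pair = PAIRS.get(cp);
        if (pair == null) {
            out.add(cp); // basic letter: nothing left to expand
            return;
        }
        for (int part : pair) {
            expandInto(part, out);
        }
    }

    public static void main(String[] args) {
        // Prints the three basic letters behind 1122: 1107 1109 1100
        for (int cp : expand(0x1122)) {
            System.out.printf("%04X ", cp);
        }
        System.out.println();
    }
}
```

    With left-associative pairs, the rule for a three-letter cluster
    only adds one letter to the rule for its two-letter prefix, which
    keeps the rules short and easy to compare.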

    > There are also some direct errors in your mappings (detailed below).
    > 111B;HANGUL CHOSEONG KAPYEOUNRIEUL;Lo;0;L;<johab> 1105 114C;;;;N;;RQ;;;
    > 111D;HANGUL CHOSEONG KAPYEOUNMIEUM;Lo;0;L;<johab> 1106 114C;;;;N;;MQ;;;
    > 114C;;;;N;;BBQ;;;
    > 112B;HANGUL CHOSEONG KAPYEOUNPIEUP;Lo;0;L;<johab> 1107 114C;;;;N;;BQ;;;
    > -----PLAIN WRONG, yesieung used instead of ieung

    Thanks for pointing out these 3 errors. I did not see them despite
    rereading the file many times and checking the generated trace file,
    which displays the actual characters and not just code points.
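    One possible way to make this kind of mix-up easier to spot is to
    trace character names next to the code points. This is only a
    sketch, not the trace my tool generates: the class and method names
    are hypothetical, and it assumes a Java runtime that provides
    Character.getName (Java 7 or later).

```java
final class JamoTrace {
    // Formats each code point as "XXXX NAME", joined with " + ".
    static String trace(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            if (sb.length() > 0) {
                sb.append(" + ");
            }
            // Character.getName returns the Unicode character name known to the runtime.
            sb.append(String.format("%04X %s", cp, Character.getName(cp)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The erroneous rule for 111B used 114C (YESIEUNG) as second element ...
        System.out.println(trace(0x1105, 0x114C));
        // ... where 110B (IEUNG) was intended.
        System.out.println(trace(0x1105, 0x110B));
    }
}
```

    The two glyphs look very similar, but the names YESIEUNG and IEUNG
    are hard to confuse when read side by side.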

    > 11E6;;;;N;;pq;;;
    > ------PLAIN WRONG, 11E6 instead of 11BC

    This one is an obvious copy/paste error when creating rules.

    For the other two alternatives, I'll try to make them consistent
    with the left-associative rule generally used in canonical
    decompositions:

    > 1122;HANGUL CHOSEONG PIEUP-SIOS-KIYEOK;Lo;0;L;<johab> 1107
    > 112D;;;;N;;BSG;;;
    > --- one of two alts, 1121 1100 preferable

    For example, this rule should effectively be a simple extension of
    the rule in the previous line, related to 1121. But thanks; these
    are not errors in themselves. I still have many tests to run
    on them, comparing the results of various plain-text
    search operations that should find or exclude matches.

    Also, the file I gave you was the last one I had verified, and I
    have another version that includes more characters (notably
    the <narrow> decompositions).

    In fact, it is your initial comment and the N1051 document
    that gave me the idea to reorder the rules in collation order for
    the Hangul script (before that they were in code point order, which
    was even more difficult to edit and verify manually). I have
    just adapted my parser to use a sorted map (a TreeMap in Java)
    instead of a Vector, simply to generate a sorted list on output.
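    A minimal sketch of that change, for illustration only: the class
    name, the method names and the integer "collation rank" key are all
    hypothetical, since the real sort key depends on how the parser
    derives the collation order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortedRules {
    // Keyed by a hypothetical collation rank; a TreeMap iterates in ascending key order.
    private final Map<Integer, String> rules = new TreeMap<>();

    // Rules may be added in any order (for example, file or code point order).
    void add(int collationRank, String ruleLine) {
        rules.put(collationRank, ruleLine);
    }

    // Returns the rule lines sorted by collation rank.
    List<String> inCollationOrder() {
        return new ArrayList<>(rules.values());
    }

    public static void main(String[] args) {
        SortedRules r = new SortedRules();
        r.add(2, "1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L;<johab> 1100 1100;;;;N;;GG;;;");
        r.add(1, "1100;HANGUL CHOSEONG KIYEOK;Lo;0;L;;;;;N;;G;;;");
        for (String line : r.inCollationOrder()) {
            System.out.println(line); // KIYEOK is printed before SSANGKIYEOK
        }
    }
}
```

    Compared with a Vector, no explicit sort step is needed before
    output, at the cost of requiring a unique key per rule.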

    Thanks a lot.


    (Oh! Your message came to the list, although I gave you my file
    privately, with permission to copy it, so I suppose I
    can reply publicly here to this one, no? If this was an error,
    admit that it's sometimes difficult to reply to the right
    place when there's no instruction and the initial thread was
    public...) ;-)


    This archive was generated by hypermail 2.1.5 : Sun Dec 21 2003 - 18:45:23 EST