Re: Biblical Hebrew (U+034F Combining Grapheme Joiner works)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 27 2003 - 22:10:40 EDT


    Peter countered:

    > > > Could this finally be the missing "killer ap" for the CGJ?
    > >
    > > It will be perfect to allow an application like XML to encode Hebrew
    > > text using Unicode 4.0 rules (and before).
    >
    > It is not perfect. CGJ is supposed to be significant (and kept in the
    > text) for a variety of processes, such as searching and sorting. To use
    > this for Biblical Hebrew, though, it should be ignored in such processes.

    Why? The point is that:

       <patah, CGJ, hiriq>
       
    is one thing, and

       <hiriq, CGJ, patah>
       
    is another. You *want* those sequences to be distinct, right? Even
    if the text has been normalized, right? That was the whole
    problem with:

       <patah, hiriq>
       <hiriq, patah>
       
    which are canonically equivalent, since they both normalize to:

       <hiriq, patah>
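
    As a rough illustration, the equivalence can be checked with Python's
    standard unicodedata module. This sketch assumes U+05B7 for patah and
    U+05B4 for hiriq, and picks lamed (U+05DC) as an arbitrary base letter;
    the variable names are illustrative only.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED, used here only as a base letter
PATAH = "\u05B7"   # HEBREW POINT PATAH
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ

patah_hiriq = LAMED + PATAH + HIRIQ
hiriq_patah = LAMED + HIRIQ + PATAH

# Canonical reordering sorts adjacent combining marks by combining class,
# so the mark with the lower class (hiriq) ends up first.
nfc_a = unicodedata.normalize("NFC", patah_hiriq)
nfc_b = unicodedata.normalize("NFC", hiriq_patah)

print(unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))  # 17 14
print(nfc_a == nfc_b)        # True: the two orders collapse together
print(nfc_a == hiriq_patah)  # True: both become <hiriq, patah>
```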
       
    So the CGJ *is* significant for searching (and sorting). If you
    want one sequence, you search for <patah, CGJ, hiriq>; if you
    want the other, you search for <hiriq, CGJ, patah>. If you
    don't care, and want to find either, *then* you strip out the
    CGJ and normalize before comparison.
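
    A minimal sketch of those two kinds of search, again in Python. The
    helper name fold_vowel_order is hypothetical, and stripping every CGJ
    from the text is a simplification that assumes CGJ is used for nothing
    else in the data being searched.

```python
import unicodedata

CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0
LAMED = "\u05DC"   # arbitrary base letter for the example
PATAH = "\u05B7"
HIRIQ = "\u05B4"

def nfc(s):
    return unicodedata.normalize("NFC", s)

def fold_vowel_order(s):
    """Strip CGJ, then normalize, so both vowel orders compare equal."""
    return nfc(s.replace(CGJ, ""))

# Because CGJ has combining class 0, normalization cannot reorder the
# marks on either side of it: the sequence stays put.
text = nfc(LAMED + PATAH + CGJ + HIRIQ)

# Exact search: only the matching order is found.
found_patah_first = (PATAH + CGJ + HIRIQ) in text
found_hiriq_first = (HIRIQ + CGJ + PATAH) in text

# "Don't care" search: fold both the pattern and the text first.
found_either = fold_vowel_order(HIRIQ + CGJ + PATAH) in fold_vowel_order(text)

print(found_patah_first, found_hiriq_first, found_either)  # True False True
```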

    This, by the way, is completely in keeping with the intended
    treatment of CGJ in other instances and falls out automatically
    from the definition of the UCA for collation. CGJ defaults
    to null weights in the UCA. You tailor combinations of
    characters in contractions with it to get special weights
    for sequences like <c, CGJ, h> if they have to contrast
    with <c, h>. But for Biblical Hebrew, you don't even have
    to do that, because to get the contrast between <patah, CGJ, hiriq>
    and <hiriq, CGJ, patah>, you simply have to have the weights
    for patah and for hiriq and then block the reordering. Voilà,
    it just works. Of course, you are going to have to tailor
    for Biblical Hebrew, anyway, since the points for Hebrew default to
    ignorable, so if you want to search and sort on them, you
    have to give them significant weight differences to start with.
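
    To make the collation point concrete, here is a toy, single-level sort
    key in Python. It is not the UCA and the weights are invented for the
    example; it only assumes what is said above: the points get tailored
    non-zero weights, CGJ has a null weight, and the input is normalized
    before weights are looked up.

```python
import unicodedata

CGJ, PATAH, HIRIQ = "\u034F", "\u05B7", "\u05B4"

# Hypothetical tailored weights; a real tailoring would supply its own.
TOY_WEIGHTS = {HIRIQ: 1, PATAH: 2}

def toy_sort_key(s):
    """Normalize, then keep only characters that carry a weight.

    Anything without an entry in TOY_WEIGHTS, including CGJ, is treated
    as having a null weight and contributes nothing to the key.
    """
    normalized = unicodedata.normalize("NFD", s)
    return tuple(TOY_WEIGHTS[ch] for ch in normalized if ch in TOY_WEIGHTS)

# With CGJ the reordering is blocked, so the two keys differ.
print(toy_sort_key(PATAH + CGJ + HIRIQ))  # (2, 1)
print(toy_sort_key(HIRIQ + CGJ + PATAH))  # (1, 2)

# Without CGJ both orders normalize to <hiriq, patah> and get one key.
print(toy_sort_key(PATAH + HIRIQ) == toy_sort_key(HIRIQ + PATAH))  # True
```

    The CGJ never appears in the key itself; its only effect is that the
    normalization step leaves the two vowels in their original order.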

    For a direct search on the binary string, you also don't have to do
    anything to get the appropriate distinction between the
    two representations. I thought this was the goal all along:
    we just have the two vowels, one after the other, and they
    should stay put, not reordering.
     
    > It's another hack.

    And cloning 14 Hebrew vowels and diacritic marks to give
    them new combining classes is not?

    It seems to me that the suggestion of this use of the CGJ
    is much more in keeping with its narrowed semantic as defined
    by the UTC. Remember, we used to think the CGJ was a
    "grapheme cluster constructor" and could be used to build
    targets for enclosing combining marks. For a variety of
    reasons we gave up on that. The text convention I am
    suggesting for Biblical Hebrew is much less of a stretch
    for CGJ than trying to make it serve as a "grapheme cluster
    constructor" was. Essentially it is a no-op. Given CGJ's
    current definition and set of properties, a CGJ introduced
    into the particular vowel contexts you are concerned about should
    result in all the effects you are asking for.

    It might take a while for Uniscribe and other implementations
    to catch up to that actual behavior, but as I read the
    standard, that is how they *should* behave.

    --Ken


