From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 27 2003 - 22:10:40 EDT
Peter countered:
> > > Could this finally be the missing "killer ap" for the CGJ?
> >
> > It will be perfect to allow an application like XML to encode Hebrew
> > text using Unicode 4.0 rules (and before).
>
> It is not perfect. CGJ is supposed to be significant (and kept in the
> text) for a variety of processes, such as searching and sorting. To use
> this for Biblical Hebrew, though, it should be ignored in such processes.
Why? The point is that:
<patah, CGJ, hiriq>
is one thing, and
<hiriq, CGJ, patah>
is another. You *want* those sequences to be distinct, right? Even
if the text has been normalized, right? That was the whole
problem with:
<patah, hiriq>
<hiriq, patah>
which are canonically equivalent, since they both normalize to:
<hiriq, patah>
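The equivalence can be checked with any normalizer. Here is a minimal sketch using Python's standard unicodedata module, assuming patah is U+05B7 and hiriq is U+05B4; the variable names are illustrative only:

```python
import unicodedata

PATAH = "\u05B7"  # HEBREW POINT PATAH
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ

# Canonical combining classes decide the order after normalization:
# the mark with the lower class is moved in front.
ccc_patah = unicodedata.combining(PATAH)
ccc_hiriq = unicodedata.combining(HIRIQ)

patah_first = unicodedata.normalize("NFD", PATAH + HIRIQ)
hiriq_first = unicodedata.normalize("NFD", HIRIQ + PATAH)

print(ccc_hiriq < ccc_patah)       # hiriq has the lower combining class
print(patah_first == hiriq_first)  # both orders collapse to one string
print([unicodedata.name(c) for c in patah_first])
```

Both inputs come out as hiriq followed by patah, so the original order is lost.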
So the CGJ *is* significant for searching (and sorting). If you
want one sequence, you search for <patah, CGJ, hiriq>; if you
want the other, you search for <hiriq, CGJ, patah>. If you
don't care, and want to find either, *then* you strip out the
CGJ and normalize before comparison.
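As a rough illustration of both kinds of matching, here is a sketch in Python, assuming CGJ is U+034F and using a hypothetical helper named fold for the "don't care" case:

```python
import unicodedata

PATAH = "\u05B7"  # HEBREW POINT PATAH
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def nfd(s):
    return unicodedata.normalize("NFD", s)


def fold(s):
    # Hypothetical helper: strip CGJ, then normalize, so that either
    # vowel order matches the other.
    return nfd(s.replace(CGJ, ""))


patah_cgj_hiriq = nfd(PATAH + CGJ + HIRIQ)
hiriq_cgj_patah = nfd(HIRIQ + CGJ + PATAH)

# The CGJ blocks canonical reordering, so the order survives normalization.
order_kept = patah_cgj_hiriq == PATAH + CGJ + HIRIQ
distinct_after_nfd = patah_cgj_hiriq != hiriq_cgj_patah

# Stripping the CGJ and normalizing makes the two sequences compare equal.
same_when_folded = fold(patah_cgj_hiriq) == fold(hiriq_cgj_patah)

print(order_kept, distinct_after_nfd, same_when_folded)
```

An exact search compares the normalized strings as they are; a loose search compares the folded forms.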
This, by the way, is completely in keeping with the intended
treatment of CGJ in other instances and falls out automatically
from the definition of the UCA for collation. CGJ defaults
to null weights in the UCA. You tailor combinations of
characters in contractions with it to get special weights
for sequences like <c, CGJ, h> if they have to contrast
with <c, h>. But for Biblical Hebrew, you don't even have
to do that, because to get the contrast between <patah, CGJ, hiriq>
and <hiriq, CGJ, patah>, you simply have to have the weights
for patah and for hiriq and then block the reordering. Voilà,
it just works. Of course, you are going to have to tailor
for Biblical Hebrew, anyway, since the points for Hebrew default to
ignorable, so if you want to search and sort on them, you
have to give them significant weight differences to start with.
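The collation argument can be sketched with a toy model. This is not the UCA algorithm and the weights are invented for illustration; a real implementation would take them from a tailored collation table. The only assumptions carried over from the text are that the string is normalized first, that CGJ has null weight, and that the points have been tailored to non-ignorable weights:

```python
import unicodedata

PATAH = "\u05B7"  # HEBREW POINT PATAH
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER

# Hypothetical tailored weights, made up for this sketch.
TOY_WEIGHTS = {HIRIQ: 1, PATAH: 2}


def toy_sort_key(s):
    # Normalize first, then collect weights. Characters without an entry
    # (including CGJ) contribute nothing, mimicking a null weight.
    normalized = unicodedata.normalize("NFD", s)
    return tuple(TOY_WEIGHTS[c] for c in normalized if c in TOY_WEIGHTS)


key_patah_cgj_hiriq = toy_sort_key(PATAH + CGJ + HIRIQ)  # (2, 1)
key_hiriq_cgj_patah = toy_sort_key(HIRIQ + CGJ + PATAH)  # (1, 2)
key_patah_hiriq = toy_sort_key(PATAH + HIRIQ)            # (1, 2), reordered

print(key_patah_cgj_hiriq, key_hiriq_cgj_patah, key_patah_hiriq)
```

The CGJ adds no weight of its own. It only keeps the normalizer from swapping the two points, and that alone makes the keys differ.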
For a direct search on the binary string, you also don't have to do
anything to get the appropriate distinction between the
two representations. I thought this was the goal all along:
we just have the two vowels, one after the other, and they
should stay put, not reordering.
> It's another hack.
And cloning 14 Hebrew vowels and diacritic marks to give
them new combining classes is not?
It seems to me that the suggestion of this use of the CGJ
is much more in keeping with its narrowed semantic as defined
by the UTC. Remember, we used to think the CGJ was a
"grapheme cluster constructor" and could be used to build
targets for enclosing combining marks. For a variety of
reasons we gave up on that. The text convention I am
suggesting for Biblical Hebrew is much less of a stretch
for CGJ than trying to make it serve as a "grapheme cluster
constructor" was. Essentially it is a no-op. Given CGJ's
current definition and set of properties, a CGJ introduced
into the particular vowel contexts you are concerned about should
result in all the effects you are asking for.
It might take a while for Uniscribe and other implementations
to catch up to that actual behavior, but as I read the
standard, that is how they *should* behave.
--Ken