Re: searching and ZWJ / ZWNJ

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Mar 01 2002 - 14:53:46 EST


Peter,

> I was asked this and wasn't entirely sure about the answer, or even sure
> if I knew of a doc in which it was discussed. (Note: this is being asked
> in relation to Devanagari.)
>
> <quote>
> 1. Are ZWJ and ZWNJ invisible in terms of searching, sorting, etc? In
> other words, the sequence <consonant> <virama> <consonant> and
> <consonant> <virama> <ZWJ|ZWNJ> <consonant> are semantically exactly
> equivalent. The ZWJ and ZWNJ are simply controlling the appearance on
> the screen. When searching for that cluster, I won't necessarily know
> whether the user inserted a ZWJ to keep them from combining or not, and
> I don't want to have to enter it into the search string. Does the
> Unicode standard say anything about this, or is it up to the
> application developer as to how his searching and sorting works?
> </quote>

The answer to this can be found in UTS #10, Unicode Collation Algorithm.

The default behavior that many people will be implementing for searching
and sorting will depend on the default collation table for the UCA.
And in that table, with few exceptions, format control characters
such as ZWJ and ZWNJ are given completely ignorable weight values.
In particular, in the current allkeys.txt, you find:

200C ; [.0000.0000.0000.0000] # [200C] ZERO WIDTH NON-JOINER
200D ; [.0000.0000.0000.0000] # [200D] ZERO WIDTH JOINER

The all-zero weights mean that these two characters would contribute
nothing to sort key weight generation, and thus would effectively
be ignored for all comparisons.
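
As a rough illustration of why all-zero weights drop out, here is a toy
Python sketch. It is not a UCA implementation: the table name
TOY_WEIGHTS, the function toy_sort_key, the three-level simplification,
and every weight other than the zeros for 200C and 200D are made up for
this example.

```python
# Toy illustration only: NOT a real UCA implementation.
# All non-zero weights below are hypothetical; only the all-zero
# entries for U+200C and U+200D mirror what allkeys.txt shows.
IGNORABLE = (0, 0, 0)

TOY_WEIGHTS = {
    "\u200C": IGNORABLE,            # ZERO WIDTH NON-JOINER
    "\u200D": IGNORABLE,            # ZERO WIDTH JOINER
    "\u0915": (0x1000, 0x20, 0x2),  # DEVANAGARI LETTER KA (made-up weights)
    "\u0937": (0x1001, 0x20, 0x2),  # DEVANAGARI LETTER SSA (made-up weights)
    "\u094D": (0x1002, 0x20, 0x2),  # DEVANAGARI SIGN VIRAMA (made-up weights)
}


def toy_sort_key(text):
    """Build a simplified sort key: per level, append the non-zero weights."""
    elements = [TOY_WEIGHTS[ch] for ch in text]
    key = []
    for level in range(3):
        # Zero weights are skipped, so ignorable characters add nothing.
        key.extend(w[level] for w in elements if w[level] != 0)
        key.append(0)  # level separator
    return tuple(key)


plain = "\u0915\u094D\u0937"            # KA, VIRAMA, SSA
with_zwj = "\u0915\u094D\u200D\u0937"   # KA, VIRAMA, ZWJ, SSA
with_zwnj = "\u0915\u094D\u200C\u0937"  # KA, VIRAMA, ZWNJ, SSA

same_key = toy_sort_key(plain) == toy_sort_key(with_zwj) == toy_sort_key(with_zwnj)
print(same_key)
```

Because the joiners contribute no weights at any level, all three
strings produce the same key in this sketch and so compare as equal.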

Of course, the values can be tailored to any value desired. This
may not always be apparent to end users, however, as it will depend
on how particular implementations of the UCA surface their tailoring
options. In many instances, people will depend on a number of
preset options for searching and sorting via a GUI, or
at best an API, and won't be tinkering down at the level of
individual character weight assignments.

On the other hand, not all searching and sorting of Unicode data
will be making use of the Unicode Collation Algorithm (or ISO 14651),
and in those instances, different behavior may occur. For example,
for simple, binary string comparisons, the presence or absence
of any character, including a ZWJ or ZWNJ, obviously *would* make
a difference in results.
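
To make the binary-comparison case concrete, here is a small Python
sketch. The helper names strip_joiners and contains_ignoring_joiners
are invented for this example, and removing the joiners before matching
is just one simple workaround an application might choose; it is not
something the Unicode Standard or the UCA prescribes.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def strip_joiners(text):
    """Return text with every ZWJ and ZWNJ removed (hypothetical helper)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


def contains_ignoring_joiners(haystack, needle):
    """Substring test that treats ZWJ and ZWNJ as if they were absent."""
    return strip_joiners(needle) in strip_joiners(haystack)


plain = "\u0915\u094D\u0937"           # KA, VIRAMA, SSA
with_zwj = "\u0915\u094D\u200D\u0937"  # KA, VIRAMA, ZWJ, SSA

# Plain code point comparison sees the extra character.
binary_equal = plain == with_zwj
binary_found = plain in with_zwj

# After stripping the joiners, the two sequences match.
loose_found = contains_ignoring_joiners(with_zwj, plain)

print(binary_equal, binary_found, loose_found)
```

Note that stripping loses the original character offsets, so a real
search feature would need to map match positions back to the
unstripped text.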

--Ken

>
> I didn't see anything like it mentioned in 5.17 of TUS3.0. ZW(N)J are
> mentioned in UTR18 in relation to the definition of grapheme clusters, so
> apparently it is assumed that a regular expression wildcard search can see
> these without distinguishing between them, but can they be ignored? I
> don't know.
>
> Anybody know what the answer is on this?
>
> TIA
> - Peter


