searching and ZWJ / ZWNJ

From: Peter_Constable@sil.org
Date: Fri Mar 01 2002 - 12:08:00 EST


I was asked this and wasn't entirely sure about the answer, or even sure
if I knew of a doc in which it was discussed. (Note: this is being asked
in relation to Devanagari.)

<quote>
1. Are ZWJ and ZWNJ invisible in terms of searching, sorting, etc? In other
words, the sequence <consonant> <virama> <consonant> and <consonant>
<virama> <ZWJ|ZWNJ> <consonant> are semantically exactly equivalent. The
ZWJ and ZWNJ are simply controlling the appearance on the screen. When
searching for that cluster, I won't necessarily know whether the user
inserted a ZWJ to keep them from combining or not, and I don't want to have
to enter it into the search string. Does the Unicode standard say anything
about this, or is it up to the application developer as to how his
searching and sorting works?
</quote>

I didn't see anything like it mentioned in 5.17 of TUS3.0. ZW(N)J are
mentioned in UTR18 in relation to the definition of grapheme clusters, so
apparently it is assumed that a regular expression wildcard search can see
these without distinguishing between them, but can they be ignored? I
don't know.
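
To make the question concrete, here is a minimal sketch of the behaviour
being asked for, assuming the application simply chooses to drop ZWJ
(U+200D) and ZWNJ (U+200C) from both the text and the search string before
matching. The helper names (strip_joiners, find_ignoring_joiners) are made
up for illustration, and nothing here is meant to say what the standard
requires; it only shows one way an application could do it.

```python
# Sketch only: treats ZWJ/ZWNJ as ignorable for matching. This is an
# application-level choice assumed for illustration, not a rule taken
# from the Unicode standard.

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# Translation table that deletes both joiner characters.
_JOINERS = {ord(ZWNJ): None, ord(ZWJ): None}


def strip_joiners(text: str) -> str:
    """Return text with every ZWJ and ZWNJ removed."""
    return text.translate(_JOINERS)


def find_ignoring_joiners(haystack: str, needle: str) -> int:
    """Find needle in haystack, ignoring ZWJ/ZWNJ on both sides.

    The returned index refers to the joiner-stripped haystack,
    or is -1 if there is no match.
    """
    return strip_joiners(haystack).find(strip_joiners(needle))


# Devanagari example: KA + VIRAMA + SSA, with and without a joiner.
KA = "\u0915"
VIRAMA = "\u094D"
SSA = "\u0937"

plain = KA + VIRAMA + SSA
with_zwj = KA + VIRAMA + ZWJ + SSA
with_zwnj = KA + VIRAMA + ZWNJ + SSA
```

With this, searching for the plain cluster finds both joiner variants,
whereas an ordinary code-point search for the plain cluster in with_zwj or
with_zwnj fails. Mapping a match position back to the original text would
need extra bookkeeping that the sketch leaves out.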

Anybody know what the answer is on this?

TIA
- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


