Doug Ewell wrote,
> Philipp Reichmuth <uzsv2k at uni-bonn dot de> wrote:
>
> > Is there a standard way to handle ZWJ/ZWNJ in sorting & searching?
> > I think in quite a lot of situations and/or scripts it would be
> > feasible just to ignore ZWJ (or give the user the choice to ignore
> > it). Especially in a Latin context.
>
> I would ignore ZWJ, ZWNJ, and any other formatting marks in searching
> and sorting.
>
Quoting from TUS 3.0 (The Unicode Standard, Version 3.0), page 317:
"ZERO WIDTH NON-JOINER or ZERO WIDTH JOINER are format control
characters. As with other such characters, they should be ignored by
processes that analyze text content. For example, a spelling-checker or
find/replace operation should filter them out. (See Section 2.7, Special
Character and Noncharacter Values, for a general discussion of format
control characters.)"
Philipp Reichmuth mentions offering the user a choice. It might not
be a bad idea for some apps to offer advanced features which would
allow the user to seek/display/process the format characters.
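As a rough sketch of what "filter them out, but let the user choose" might look like, here is a small Python example. The function names and the ignore_format switch are made up for illustration; the only assumption about Unicode itself is that ZWJ and ZWNJ carry the general category Cf (format), which is what the standard library's unicodedata module reports for them.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def strip_format_chars(text: str) -> str:
    """Drop every character whose general category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def find(haystack: str, needle: str, ignore_format: bool = True) -> int:
    """Hypothetical find helper.

    With ignore_format=True (the default), format characters are removed
    from both strings before searching, so the returned index refers to
    the stripped haystack. With ignore_format=False the search is literal,
    which is the "advanced" mode for users who want to see the characters.
    """
    if ignore_format:
        haystack = strip_format_chars(haystack)
        needle = strip_format_chars(needle)
    return haystack.find(needle)


text = "f" + ZWJ + "idelity"
default_hit = find(text, "fidelity")                       # format chars ignored
literal_hit = find(text, "fidelity", ignore_format=False)  # literal comparison
```

Note that this removes all Cf characters, not only the two joiners, which matches Doug Ewell's suggestion more closely than the narrower wording of the quote. An application that needs to report positions in the original text would have to keep an index map, which is left out here.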
Note that the quote above is plain text, which effectively conveys
the information in the book. With mark-up, the book's text could
be reproduced (more-or-less) as:
"<font face="Minion"><small caps>ZERO WIDTH NON-JOINER</small caps> or
<small caps>ZERO WIDTH JOINER</small caps> are format control characters. As with
<br>
other such characters, they should be ignored by processes that analyze text content. For
<br>
example, a spelling-checker or find/replace operation should filter them out. (See
<br>
<i>Section 2.7, Special Character and Noncharacter Values,</i> for a general discussion of format
<br>
control characters.)</font>"
This paragraph uses a ligature twice. In the marked-up version above,
ZWJ was inserted using SCUnipad, but that doesn't make the ligature
display here.
The small caps tag was made up for this example; I don't know if HTML
has such a tag. In HTML, the font face tag used above is deprecated.
Of the following...
fidelity (with ZWJ)
fidelity (without ZWJ)
fidelity (using the presentation form as UTF-8)
...Outlook Express' Edit/Find feature finds only the second example.
I would expect a default search in a Unicode-savvy application to find
all three instances.
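One way a default search could treat all three spellings as the same word is sketched below in Python. This is only an illustration under my own assumptions: it folds the presentation form U+FB01 (LATIN SMALL LIGATURE FI) through NFKC compatibility normalization and then drops format characters such as ZWJ. The name search_key is hypothetical, and NFKC also folds many other compatibility characters, which may be more aggressive than a given application wants.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical folding applied to both the query and the target text."""
    # NFKC maps compatibility characters such as U+FB01 to plain "f" + "i".
    folded = unicodedata.normalize("NFKC", text)
    # Remove format control characters (general category Cf), e.g. ZWJ.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


variants = [
    "f\u200didelity",  # with ZWJ between f and i
    "fidelity",        # without ZWJ
    "\ufb01delity",    # using the presentation form U+FB01
]

query = search_key("fidelity")
matches = [query in search_key(v) for v in variants]
```

With this folding, all three variants reduce to the same key, so a plain substring comparison on the keys finds each of them.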
Best regards,
James Kass.