Re: "Deterministic Sorting" (was Re: ZWNJ & Persian Collation)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Mar 13 2003 - 17:19:25 EST

    Well, maybe 3 things ;-)

    Mark
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Mark Davis" <mark.davis@jtcsv.com>
    To: "Markus Scherer" <markus.scherer@jtcsv.com>; "unicode"
    <unicode@unicode.org>
    Cc: <iranorus@online.ru>
    Sent: Thursday, March 13, 2003 13:04
    Subject: "Deterministic Sorting" (was Re: ZWNJ & Persian Collation)

    > I want to point out two things.
    >
    > 1. UCA provides a mechanism for producing a "deterministic" sort (there
    > called semi-stable). See step 3.10
    > (http://www.unicode.org/reports/tr10/#Step_3).
    >
    > 2. A "deterministic" sort is actually not needed very often; people
    > confuse it with a stable sort. See
    > http://www.unicode.org/reports/tr10/#Stability
    >
    > 3. If someone did customize the UCA for numeric sorting, the difference
    > between 002 and 2 could be a tertiary difference. So even without using
    > 3.10, they would be distinguished at level 3.
    >
    > Mark
    > ________
    > mark.davis@jtcsv.com
    > IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    > (408) 256-3148
    > fax: (408) 256-0799
    >
    > ----- Original Message -----
    > From: "Markus Scherer" <markus.scherer@jtcsv.com>
    > To: "unicode" <unicode@unicode.org>
    > Cc: <iranorus@online.ru>
    > Sent: Wednesday, March 12, 2003 08:48
    > Subject: Re: ZWNJ & Persian Collation
    >
    >
    > > Roozbeh Pournader wrote:
    > > > Well, anything that is completely ignored in collation creates
    > > > problems with deterministic sorting.
    > >
    > > I don't think you mean "deterministic". UCA is deterministic, it just
    > > sorts many strings as equal.
    > >
    > > > There are certain words in Persian, with completely different
    > > > meanings, that only differ in a ZWNJ[1]. Having ZWNJ ignored by
    > > > default means they may appear in this or that order, possibly
    > > > based on the original order of input. I guess this is not what we
    > > > want for deterministic collation.
    > > >
    > > > The desired behavior for ZWNJ is to be treated like punctuation:
    > > > ignored in the first levels, but considered at the end. (Personal
    > > > note: write something for UTC on this.)
    > >
    > > Possible. I assume that ZWNJ is ignored in UCA because that is the
    > > expected behavior for many other languages. Not ignoring ZWNJ is
    > > possible with a tailoring that gives it some non-zero weights.
    > >
    > > Note that many languages require tailorings for at least a couple of
    > > characters to follow national standards.
    > >
    > > markus
    > >
    > > --
    > > Opinions expressed here may not reflect my company's positions unless
    > > otherwise noted.
    > >
    > >
    > >
    >
    >
    >
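
The distinction drawn in the thread between a stable sort and a "deterministic" comparison can be illustrated with a small sketch. This is not the UCA algorithm and not the text of step 3.10. The names `ignoring_key` and `deterministic_key` are hypothetical, the strings are Latin placeholders rather than real Persian words, and the "collation key" is simply the string with ZWNJ removed. The only point is that when distinct strings compare equal, a stable sort leaves their relative order to the input, whereas a final tie-breaker on the raw code points makes the result independent of input order.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def ignoring_key(s):
    # Toy stand-in for a collation key in which ZWNJ is completely ignorable.
    return s.replace(ZWNJ, "")

def deterministic_key(s):
    # Same toy key, with the raw code points appended as a final tie-breaker.
    return (ignoring_key(s), s)

# The same three placeholder strings in two different input orders.
first = ["a" + ZWNJ + "b", "ab", "aa"]
second = ["ab", "a" + ZWNJ + "b", "aa"]

# sorted() is stable, so strings with equal keys keep their input order,
# and the two results therefore differ.
stable_first = sorted(first, key=ignoring_key)
stable_second = sorted(second, key=ignoring_key)

# With the tie-breaker, no two distinct strings compare equal, so the
# result no longer depends on the input order.
det_first = sorted(first, key=deterministic_key)
det_second = sorted(second, key=deterministic_key)

print(stable_first == stable_second)
print(det_first == det_second)
```

A tailoring that gives ZWNJ a non-zero weight at a late level, as suggested in the thread, would address the same problem inside the collation key itself instead of through a separate tie-breaker.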

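Point 3 of the quoted message, that a numeric tailoring could keep 002 and 2 equal at the higher levels yet distinct at level 3, can be sketched in the same toy style. The function `numeric_key` is hypothetical and handles digit-only strings; using the count of leading zeros as the lower-level weight, and ordering fewer zeros first, are assumptions made for illustration and are not taken from the UCA.

```python
def numeric_key(s):
    # Assumes s consists only of ASCII digits.
    value = int(s)                               # higher level: numeric value
    leading_zeros = len(s) - len(s.lstrip("0"))  # lower level: zero padding
    return (value, leading_zeros)

items = ["10", "002", "2", "02"]

# "2", "02" and "002" tie on the numeric value and are separated only by
# the lower-level component, so no extra tie-breaking step is needed.
result = sorted(items, key=numeric_key)
print(result)
```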

