Re: Case blind comparison

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Mon Aug 11 1997 - 15:07:33 EDT


Kenneth Whistler writes:

> While CD 14651 spells out the multilevel sorting model in great
> detail (based largely on the Canadian standard), the comparison
> operation API leaves much to be desired. In particular, CARABIN
> is modeled on strxfrm(), so that low levels of conformance can
> simply use strxfrm() and claim conformance with the standard.

> But CARABIN does not specify the binary structure required for
> higher levels of conformance. Basically it leaves that up to
> the implementer as an implementation detail. Essentially the
> binary string is a private structure only consumable by the
> corresponding implementation of COMPBIN. This is feasible, I suppose,
> but it does mean that the proposed international standard is not
> proposing a transmissible *data* standard, but instead an API
> with multiple levels of conformance and multiple options that
> will result in typical Unix implementations with subtle mismatches
> and incompatibilities of implementation.

Well, there is a reason for this: the binary
format is locale dependent, as it will vary with different
sorting specifications. The 14651 standard specifies 8 options
(for example, lower case or upper case first), and implementations
with national sorting orders, such as Danish or Spanish ordering,
will allocate different weights to a number of characters.
As the binary data is locale dependent, a standard cannot be prescribed
for the data format.

This is the same in C - and there the sorting routines have enjoyed
very wide acceptance.
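
To illustrate why the binary string cannot be fixed across locales, here is a
minimal sketch in Python. It is a toy, not the 14651 algorithm: the function
name sort_key, the single-level weighting, and the "default" letter order are
all assumptions made up for this example. Only the Danish alphabet order
(æ, ø, å after z) is taken as given.

```python
# Toy illustration only: weights are invented, not taken from 14651 or any locale file.

# Hypothetical template order that places å and æ near a, and ø near o.
DEFAULT_ORDER = "aåæbcdefghijklmnoøpqrstuvwxyz"
# Danish alphabet order: æ, ø, å come after z.
DANISH_ORDER = "abcdefghijklmnopqrstuvwxyzæøå"


def sort_key(text, order):
    """Build a binary key in the spirit of strxfrm(): one weight byte per letter."""
    weights = {ch: i + 1 for i, ch in enumerate(order)}
    return bytes(weights[ch] for ch in text.lower())


words = ["zebra", "ål", "øl", "abe"]

# The same strings yield different binary keys, and so a different order,
# depending on which weight table is in effect.
default_sorted = sorted(words, key=lambda w: sort_key(w, DEFAULT_ORDER))
danish_sorted = sorted(words, key=lambda w: sort_key(w, DANISH_ORDER))

print(default_sorted)
print(danish_sorted)
print(sort_key("ål", DEFAULT_ORDER) == sort_key("ål", DANISH_ORDER))
```

The key bytes for "ål" differ between the two tables, so a key produced under
one tailoring is meaningless to a comparison routine that assumes the other.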

> What about COMPCAR for loose comparison? The problem is that
> COMPCAR only introduces the *possibility* of loose comparison
> at level 5 conformance, the highest level with the most stringent
> requirements on implementation. In particular, the level "parameter
> is mandatory only for conformance level 5. When it is not present,
> the assumed value of this parameter is zero, which implies that the
> comparison is done up to the last available level." But comparison
> up to the last available level implies the full use of the
> multilevel algorithm, including ignorables, (as is appropriate
> for determinant sorting algorithms) and doesn't get you loose
> comparison. Loose comparison, as for Gary's "case blind comparison",
> requires specific omission of one or more levels in the generic
> algorithm, and in particular requires truly ignoring ignorables,
> instead of using them for tie-breaking. But ISO 14651 is going
> to be irrelevant for loose comparison except for those Posix-
> compliant platforms which choose to implement COMPCAR at the
> highest level of conformance.

The standard is not intended for just Unix/POSIX systems,
but for all systems and programming languages. The standard is
specified in a programming language independent way.

We are rewriting the conformance level clause, and conformance
will be simpler. I expect most implementations to conform at a level
where the case blind comparison will work. We actually had the
level without the "precision" requirement only so that C and C++ compliant
compilers could also claim conformance to 14651.
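
As a rough sketch of what a comparison with a level (precision) parameter
could look like, consider the following Python toy. The names level_keys and
compare, the three levels (base letters, accents, case), and the lower-case-first
tie-break are assumptions for this example, not the COMPCAR definition.
A level of zero means "compare up to the last available level", as in the
text quoted above.

```python
import unicodedata


def level_keys(text):
    """Split text into three toy levels: base letters, accents, case."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach a combining mark to the preceding base letter, if any.
            if accents:
                accents[-1] += ch
            continue
        base.append(ch.lower())
        accents.append("")
        case.append(ch.isupper())
    return [tuple(base), tuple(accents), tuple(case)]


def compare(a, b, level=0):
    """Return -1, 0 or 1; level=0 means every available level is used."""
    keys_a, keys_b = level_keys(a), level_keys(b)
    depth = len(keys_a) if level == 0 else level
    for ka, kb in zip(keys_a[:depth], keys_b[:depth]):
        if ka != kb:
            return -1 if ka < kb else 1
    return 0


print(compare("Resume", "resume"))           # full comparison: case breaks the tie
print(compare("Resume", "resume", level=2))  # case blind: stops before the case level
print(compare("résumé", "resume", level=1))  # base letters only
```

With the default level the two spellings differ; stopping at level 2 gives the
case blind result, and stopping at level 1 also disregards accents.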
>
> > and it has a template table for all of UCS.
>
> I must take issue with this statement, as well. The template table
> does make a serious effort to cover Latin, Greek, Cyrillic,
> Armenian, Hebrew, and Arabic. The basic problem is that it treats
> all other scripts as consisting of ignorables, which clearly
> produces incorrect results for a default collation. So the
> coverage can hardly claim to be for "all" of UCS, except in a
> very defective sense.

The list of scripts covered will be much improved in the next draft
of 14651, as far as I understand. The editor of the standard is Alain
LaBonté from Québec, Canada. He is perhaps the world's leading expert
in sorting. The intention is that only special characters will
be "ignored" in the first three levels, becoming significant on the
fourth level only.
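
The treatment of special characters described here can be sketched as follows.
This is a simplified Python toy under stated assumptions: the function name
keys is hypothetical, levels two and three (accents and case) are left out for
brevity, and "special character" is approximated as anything that is not a
letter or digit.

```python
def keys(text):
    """Return (letters, specials): specials carry no weight among the letters."""
    letters = []
    specials = []
    for ch in text:
        if ch.isalnum():
            letters.append(ch.lower())
        else:
            # Record the special character and its position for the last level.
            specials.append((len(letters), ch))
    return tuple(letters), tuple(specials)


words = ["cop", "co-op", "coop"]

# The hyphen does not affect the letter comparison; it only breaks the tie
# between "coop" and "co-op" at the final level.
print(sorted(words, key=keys))
print(keys("co-op")[0] == keys("coop")[0])
```

A comparison that stops before the last level would treat "co-op" and "coop"
as equal, while the full comparison still orders them deterministically.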

> Furthermore, there are inconsistencies
> in the treatment of accents between the scripts which are
> covered. Combining marks are not covered in any way which could
> be considered consistent with Unicode, which would result in
> erratic and inconsistent results if the comparison API's are
> applied to Unicode data which includes combining marks.

14651 is applicable to the repertoire of ISO 10646, and to the extent
that Unicode data are within this repertoire, 14651 will define a
template for ordering, with 8 options, each of which will define
a deterministic ordering of strings with characters of this repertoire.
As there are already 8 options in the standard, and the standard is also
intended to be used for building national orderings, the standard
is expected to be able to generate a number of orderings, but
each of these is consistent and deterministic.

> And
> the specification of the table template itself follows the Posix
> charset model, resulting in a table whose significance for ordering
> of characters cannot be determined by inspection outside the
> context of an actual implementation of the weighting scheme
> implied. These and other defects have been noted in the U.S.
> comments on the CD 14651 document.

This is not true, as we have explained a number of times to Unicode
people.

> So before heaving a sigh of relief that an answer is just
> around the corner, and it will be an ISO standard, to boot,
> you might want to cast a critical eye on the actual document
> that Keld is promoting.

I would hope that you would also be backing it, Ken, with your extensive
knowledge of characters and how to use them.

Keld


