Re: Unicode collation algorithm - interpretation]

From: Jim Melton (jim.melton@acm.org)
Date: Mon Feb 19 2001 - 20:33:03 EST

Next message: DougEwell2@cs.com: "Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)"
Previous message: Beth Kaseman: "Internationalization meeting"
In reply to: J M Sykes: "Re: Unicode collation algorithm - interpretation]"

Mike,

Thanks for your response. I find myself disappointed that there isn't more
participation in this discussion (from people other than you and me), but it
will undoubtedly come ;^)

At 05:05 PM 02/11/2001 +0000 Sunday, J M Sykes wrote:
>I think you misunderstand me. The "maximum level" I was referring to is that
>mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key for
>each string", para 2, which reads:
>
><quote>
>An implementation may allow the maximum level to be set to a smaller level
>than the available levels in the collation element array. For example, if
>the maximum level is set to 2, then level 3 and higher weights (including
>the normalized Unicode string) are not appended to the sort key. Thus any
>differences at levels 3 and higher will be ignored, leveling any such
>differences in string comparison.
></quote>

I couldn't have given you the exact quote or reference, but I was aware of
this fact. However, I interpreted it in a manner that suggested that the
operation "set to a smaller level" was not (necessarily?) dynamic at a
given invocation of a collation. I interpreted it to mean that a given
collation could be created from the array with one or more of the upper
levels/weights not appearing.
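
To make the quoted passage concrete, here is a minimal sketch in Python of
sort-key formation with a settable maximum level. The weights in TOY_TABLE
and the name sort_key are assumptions made up for illustration; they are not
taken from UTR#10 or from any real collation element table, and the sketch
ignores contractions, expansions, and the trailing normalized string.

```python
import unicodedata

# Hypothetical (primary, secondary, tertiary) weights, for illustration only.
TOY_TABLE = {
    "a": (10, 1, 1),
    "A": (10, 1, 2),
    "b": (11, 1, 1),
    "B": (11, 1, 2),
    "\u0301": (0, 2, 1),  # combining acute accent, ignorable at level 1
}

def sort_key(text, max_level=3):
    """Build a sort key, appending weights only up to max_level."""
    # Decompose first, so that U+00E1 becomes "a" followed by U+0301.
    ces = [TOY_TABLE[ch] for ch in unicodedata.normalize("NFD", text)]
    key = []
    for level in range(max_level):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        # Zero weights are skipped, as for ignorable elements.
        key.extend(ce[level] for ce in ces if ce[level] != 0)
    return tuple(key)

print(sort_key("ab", 2) == sort_key("AB", 2))        # True: case is leveled
print(sort_key("ab", 3) == sort_key("AB", 3))        # False: case counts
print(sort_key("ab", 1) == sort_key("\u00e1b", 1))   # True: accent is leveled
print(sort_key("ab", 2) == sort_key("\u00e1b", 2))   # False: accent counts
```

Whether max_level is a per-call argument, as here, or is fixed when the
collation is built is exactly the point of interpretation at issue.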

>We can safely assume that at least some users will require sometimes exact,
>sometimes inexact comparisons (at least for pseudo-equality, to a lesser
>extent for sorting).

No disagreement here! In fact, when I was at Digital, a hard-fought topic
was specifically the one you've been raising: case-[in]sensitive and
"accent"-[in]sensitive comparisons and ordering.

>We can also safely assume that users will wish to get the performance
>benefit of some preprocessing.
>
>It is clearly possible to preprocess as far as the end of step 2 of the
>Unicode collation algorithm without committing to a level. I understand you
>to say that several implementors have concluded that this level of
>preprocessing is not cost-effective, in comparison to going all the way to
>the sort key. I am in no position to dispute that conclusion.

Actually, that may not be what I meant. I say "may" because I'm still not
sure that we're talking about the same thing. What I meant was that I
believe that some implementations produce a code module that provides the
behaviors of the Unicode collation algorithm for a specific collation
element table (I believe this is the right term --- I mean the table that
indicates the weights applied to each text element for a particular
culture, script, language, etc.). The code that this module contains would
implement each step of the algorithm, but would have a preset, unchangeable
answer for "the maximum level in the collation element array" mentioned in
step 3.1 of the algorithm in UTR#10.

However, reading the collation algorithm *after* reading your recent
messages, I now see a different interpretation, one that I have no reason
to believe is prohibited or unintended: that the choice to "de-append" one
or more levels might be made dynamically.

Nonetheless, I believe that there may be implementations (conforming ones,
I think?) that do not support such dynamic selection of "maximum level".
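
As a rough illustration of the dynamic reading, the sketch below stores only
the collation elements for each string (the preprocessing up to the end of
step 2 mentioned in the quoted text) and takes the level as an argument at
comparison time. The table, its weights, and the names prepare and compare
are all hypothetical; a real implementation would also have to handle
contractions, expansions, and tailoring.

```python
import unicodedata

# Hypothetical (primary, secondary, tertiary) weights, for illustration only.
TOY_TABLE = {
    "a": (10, 1, 1),
    "A": (10, 1, 2),
    "b": (11, 1, 1),
    "B": (11, 1, 2),
    "\u0301": (0, 2, 1),  # combining acute accent, ignorable at level 1
}

def prepare(text):
    """Normalize and look up collation elements; no level is chosen yet."""
    return tuple(TOY_TABLE[ch] for ch in unicodedata.normalize("NFD", text))

def compare(ces_a, ces_b, max_level):
    """Return -1, 0 or 1, comparing level by level up to max_level."""
    for level in range(max_level):
        wa = [ce[level] for ce in ces_a if ce[level] != 0]
        wb = [ce[level] for ce in ces_b if ce[level] != 0]
        if wa != wb:
            return -1 if wa < wb else 1
    return 0  # no difference at any level up to max_level

# Preprocess once, then choose the level separately for each comparison.
stored = {s: prepare(s) for s in ("ab", "AB", "\u00e1b")}
print(compare(stored["ab"], stored["AB"], 2))  # 0: equal when case is ignored
print(compare(stored["ab"], stored["AB"], 3))  # -1: lower case sorts first here
```

The trade-off is that every comparison walks the stored elements again,
whereas a precomputed sort key commits to one level but compares cheaply.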

>I'm also unclear what an SQL-implementor is likely to supply as "a
>collation", though I imagine (only!) that it might be a part only of the
>CTT/CET appropriate to the script used by a particular culture, and with
>appropriate tailoring. But I have no reason to expect the executable
>("compiled"?) code that implements the algorithm to vary depending on the
>collation, or on the level (case-blind &c) specified by the user for a
>particular comparison.

As I stated above, I think there may be such implementations, but I would
be very happy to have this refuted (even if by a statement from the Unicode
people and by the ISO 14651 people that such an implementation would be
non-conforming). It is certainly very useful to Western cultures to
quickly and inexpensively provide case-varying and "accent"-varying
collations, even though these notions may be totally alien to many other
cultures.

> > Of course, if you really want to specify an SQL collation name that
> > somehow identifies 2 or 3 or 4 (or more) collations built in
> > conformance with ISO
> > 14651 and then use an additional parameter to choose between them, I guess
> > that's possible (but not, IMHO, desirable).
>
>Unless you mean for performance reasons, I'd be interested to know why not
>desirable.

Actually, I meant that I would find it undesirable to build a "wrapper"
around three, four, or more collation routines that merely accepts the
additional parameter and selects among the "nested" collation
routines. That seems unnecessarily awkward and clumsy.
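
For concreteness, the kind of wrapper I have in mind might look like the
following Python sketch. Everything in it is hypothetical: the weights, the
names build_fixed_collation, NESTED, and wrapper_key, and the idea that each
nested routine has its maximum level baked in when it is built.

```python
import unicodedata

# Hypothetical (primary, secondary, tertiary) weights, for illustration only.
TOY_TABLE = {
    "a": (10, 1, 1),
    "A": (10, 1, 2),
    "b": (11, 1, 1),
    "B": (11, 1, 2),
    "\u0301": (0, 2, 1),  # combining acute accent, ignorable at level 1
}

def build_fixed_collation(max_level):
    """Return a key function whose maximum level is fixed at build time."""
    def key(text):
        ces = [TOY_TABLE[ch] for ch in unicodedata.normalize("NFD", text)]
        out = []
        for level in range(max_level):
            if level > 0:
                out.append(0)  # level separator
            out.extend(ce[level] for ce in ces if ce[level] != 0)
        return tuple(out)
    return key

# Three separately built routines, one per maximum level.
NESTED = {level: build_fixed_collation(level) for level in (1, 2, 3)}

def wrapper_key(text, level):
    """The wrapper: takes the extra parameter and picks a nested routine."""
    return NESTED[level](text)

words = ["b", "\u00e1", "A", "a"]
print(sorted(words, key=lambda w: wrapper_key(w, 3)))  # a, A, a-acute, b
```

The wrapper adds nothing beyond dispatch; if the level can be chosen
dynamically, a single routine taking the level as a parameter does the same
job without the nesting.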

I think I see that we've been talking at cross purposes (or at least with
different assumptions and understandings), which is not uncommon in such
discussions. I hope I've sorted things out now, at least for myself. If
you, or somebody, is able to convince me that all (conforming)
implementations of the Unicode collation algorithm must be able to select
the maximum level dynamically, then I think we are going to be in firm
agreement about approaching this. If not, then I will probably remain a
bit skeptical ;^)

I note that Tex Texin sent out a message that certainly seems to support
your interpretation that the levels can be set dynamically. In fact, Tex's
explanations were enormously helpful to me in understanding the
implementation approaches that are likely to be taken (thanks, Tex!). If
Tex's explanation is authoritative, then we're probably done and I am both
happy and in agreement with this approach. However, I had not previously
heard the interpretations that I got from Tex's note so I am obviously
still learning...

Thanks!
Jim
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation          Oracle Email: mailto:jim.melton@oracle.com
1930 Viscounti Drive     Standards email: mailto:jim.melton@acm.org
Sandy, UT 84093-1063      Personal email: mailto:jim.melton@acm.org
USA                                  Fax: +1.801.942.3345
========================================================================
= Facts are facts. However, any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =
========================================================================

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT