From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 14 2004 - 15:56:59 CDT
Dean,
> >> > One normalization script could be used any number of times. Clip,
> >> >normalize, sort - repeat as necessary.
> >>
> >> Multiply that times the number of independent researchers and separate
> >> projects...
> >
> >... and you get a thousand different requirements, each of which
> >should be addressed with appropriate levels of programming tools.
>
> ... that are solved now by a single default process requiring no end user
> fiddling.
No, they are *not* "solved now by a single default process" -- you
don't get a thousand different sort orders out of a single
default process.
> >What gives you the slightest hope that *every* researcher's
> >particular needs for searching and sorting can be baked into
> >some *default* collation element weighting table? The whole point
> >is to provide a mechanism for people to *tailor* it as they choose
> >to meet *different* requirements.
>
> No, that is not the whole point -
Yes it *is* the whole point -- of the Unicode Collation Algorithm.
Read the document. It is set up the way it is for a reason, and
it is to provide a mechanism for people to *tailor* the default
table to meet different requirements.
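To make the "default table plus tailoring" idea concrete, here is a toy
sketch in Python. It is NOT the Unicode Collation Algorithm: real
collation elements carry multiple weight levels, and the names
DEFAULT_TABLE and make_key, the integer weights, and the sample words
are all made up for illustration. It only shows how one default table
can yield different sort orders once each user overlays a tailoring.

```python
# Toy model only: one integer weight per character. The real UCA uses
# multi-level collation elements; these names and numbers are hypothetical.
DEFAULT_TABLE = {ch: (i + 1) * 10
                 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}


def make_key(tailoring=None):
    """Return a sort key built from the default table plus overrides."""
    table = {**DEFAULT_TABLE, **(tailoring or {})}

    def key(word):
        # Characters missing from the table sort after everything else.
        return [table.get(ch, 100_000 + ord(ch)) for ch in word]

    return key


words = ["zebra", "\u00e4rm", "arm", "bar"]  # "\u00e4" is a-umlaut

# Tailoring A: a-umlaut sorts immediately after "a".
after_a = sorted(words, key=make_key({"\u00e4": 15}))

# Tailoring B: a-umlaut sorts after "z".
after_z = sorted(words, key=make_key({"\u00e4": 270}))
```

The same default table gives "arm, \u00e4rm, bar, zebra" under the first
tailoring and "arm, bar, zebra, \u00e4rm" under the second.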
> there is also the point that 90% of our
> work, which is done now by simple, default processes, would, all of a
> sudden, require custom tailoring.
If sorting your data in binary order by code point is sufficient
for your work -- since that is what the "simple, default processes"
actually do -- then more power to you. Transliterate all your
data into Hebrew, using Unicode or ISO 8859-8 or Windows CP 1255
or MacHebrew -- it won't matter, since they all use the same
alphabetic order for the 22 letters, anyway. Then sort binary
and you're done.
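A small Python sketch of that claim, for unpointed text only. The four
sample words are my own choice, and MacHebrew is left out because I am
not assuming a codec for it is available. Note that the encodings also
contain the five final forms, each placed just before its regular
letter, so the alphabetic order is still preserved.

```python
# Unpointed Hebrew words, written as escapes to avoid bidi display issues.
words = [
    "\u05e9\u05dc\u05d5\u05dd",  # shalom: shin, lamed, vav, final mem
    "\u05d0\u05d1",              # av: alef, bet
    "\u05de\u05dc\u05da",        # melekh: mem, lamed, final kaf
    "\u05d1\u05d9\u05ea",        # bayit: bet, yod, tav
]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Binary sort on the encoded bytes in two legacy encodings.
by_iso_8859_8 = sorted(words, key=lambda w: w.encode("iso8859_8"))
by_cp1255 = sorted(words, key=lambda w: w.encode("cp1255"))

# All three orders agree: av, bayit, melekh, shalom.
same_order = by_code_point == by_iso_8859_8 == by_cp1255
```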
If you want to do anything *sophisticated* with your data, then
you are going to get involved with normalization and custom
tailoring of collations. You're also going to get involved with
*other* kinds of manipulations of the data, including lemmatizing
and transliterations, in order to get like to sort with like.
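As a rough illustration of what "normalization and custom tailoring"
can mean in practice, here is a minimal Python sketch. It is one
possible hand-rolled sort key, not the UCA and not anyone's actual
research pipeline: the function names, the choice to ignore points at
the first level, and the folding of final forms are all assumptions
made for the example.

```python
import unicodedata

# Assumed tailoring: treat each final form as its regular letter.
FINAL_TO_BASE = str.maketrans({
    "\u05da": "\u05db",  # final kaf   -> kaf
    "\u05dd": "\u05de",  # final mem   -> mem
    "\u05df": "\u05e0",  # final nun   -> nun
    "\u05e3": "\u05e4",  # final pe    -> pe
    "\u05e5": "\u05e6",  # final tsadi -> tsadi
})


def primary_key(text):
    """Normalize to NFD, drop combining marks, fold final forms."""
    decomposed = unicodedata.normalize("NFD", text)
    letters = "".join(ch for ch in decomposed
                      if not unicodedata.combining(ch))
    return letters.translate(FINAL_TO_BASE)


def tailored_key(text):
    """Compare letters first; use the pointed form only to break ties."""
    return (primary_key(text), unicodedata.normalize("NFD", text))


pointed_melekh = "\u05de\u05b6\u05dc\u05b6\u05da\u05b0"  # with points
plain_mayim = "\u05de\u05d9\u05dd"                       # unpointed

binary_order = sorted([plain_mayim, pointed_melekh])
tailored_order = sorted([plain_mayim, pointed_melekh], key=tailored_key)
```

Binary order puts the pointed word first, because the point after the
mem has a lower code point than any letter. The tailored key compares
yod with lamed instead and puts the unpointed word first.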
> >Nobody plans to take away your rights and ability to continue
> >doing what you now do, if it works very well for you. Please,
> >sir, continue doing what you are doing with your current data. :-)
>
> It's incredible to me that you and others keep repeating this mantra,
> ignoring the fact (repeated for the nth time) that we will all be forced,
> in our separate research projects, to deal with MULTIPLE, COMPETING encodings.
You will not be "forced" to do anything other than what you are
doing currently. I keep repeating it because it apparently
bears repeating.
--Ken