RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Peter_Constable@sil.org
Date: Sun Jun 03 2001 - 20:46:47 EDT


One more thought on this topic: the issue has to do with comparing the
results of sorting two data sources. It would seem to me that there's
another issue that has to be taken into consideration here: normalisation.
You can't just do a simple sort using raw binary comparison; you have to
normalise strings before you compare them, even if the comparison is a
binary compare. Why can they not, in the same process, also normalise the
order in which strings would binary sort? Various people (on unicoRe) have
already presented efficient algorithms for doing this that would not add
significant overhead to the normalisation process.
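
To make the idea concrete, here is a minimal Python sketch of doing both
steps in one pass. It is not necessarily one of the algorithms presented on
the list; it uses a commonly described code-unit adjustment, and the helper
names (utf16_units, normalised_key) are made up for illustration. It assumes
well-formed input with no unpaired surrogates.

```python
import struct
import unicodedata

def utf16_units(s: str) -> tuple:
    """Return the UTF-16 code units of s as a tuple of ints."""
    data = s.encode("utf-16-be")
    return struct.unpack(">%dH" % (len(data) // 2), data)

def normalised_key(s: str) -> tuple:
    """Hypothetical helper: NFC-normalise, then adjust the UTF-16 code
    units so that comparing the tuples follows code point order."""
    fixed = []
    for unit in utf16_units(unicodedata.normalize("NFC", s)):
        if unit >= 0xE000:
            unit -= 0x0800   # move U+E000..U+FFFF below the surrogates
        elif unit >= 0xD800:
            unit += 0x2000   # move surrogates to the top of the range
        fixed.append(unit)
    return tuple(fixed)

# Canonically equivalent strings compare equal only after normalisation.
composed, decomposed = "\u00e9", "e\u0301"
same_after_nfc = normalised_key(composed) == normalised_key(decomposed)

# Sorting by the adjusted key agrees with sorting by code point.
words = ["\U00010000", "\uff5e", "e\u0301", "z"]
by_key = sorted(words, key=normalised_key)
by_code_point = sorted(unicodedata.normalize("NFC", w) for w in words)
```

The adjustment is two comparisons and one addition per code unit, which is
why it should be cheap next to the normalisation pass it rides along with.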

If the response is that the particular Oracle clients requesting this have
already ensured that the data sources are in (say) normalization
form C, then that is one more indication that this is, in fact, a
proprietary solution. If it is to be documented as a UTR (which in practice
must make it an officially approved Unicode encoding form), then the UTR
should also discuss the motivation, which has to do with comparing the sort
results of two data sources, and should point out the need to normalise
those data sources -- if the whole point is to make sure people know that
there are issues involved in making their comparisons valid, then all of
the issues should be pointed out, not just some. I think, though, that
putting the two together will really beg the question.

And remember, if it isn't just a proprietary solution, we *still* need to
deal with the case of two data sources where one is UTF-16 and the other is
UTF-8 or UTF-32 (not UTF-8s or UTF-32s). I still haven't heard from the
advocates of this proposal how they reconcile that issue.
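
The mismatch between the encoding forms can be seen with two characters, one
in the BMP above the surrogate range and one supplementary. A small Python
sketch (the two sample characters are my own choice, picked only to show the
effect):

```python
a = "\uff5e"       # U+FF5E, a BMP character above the surrogate range
b = "\U00010000"   # U+10000, the first supplementary character

# Sort the same two strings by the raw bytes of each encoding form.
utf8_order = sorted([a, b], key=lambda s: s.encode("utf-8"))
utf16_order = sorted([a, b], key=lambda s: s.encode("utf-16-be"))
utf32_order = sorted([a, b], key=lambda s: s.encode("utf-32-be"))

print(utf8_order == utf32_order)   # True: both follow code point order
print(utf8_order == utf16_order)   # False: UTF-16 puts U+10000 first
```

In UTF-16 the supplementary character starts with the code unit 0xD800,
which is less than 0xFF5E, so it sorts first; in UTF-8 and UTF-32 it sorts
last.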

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


