Re: UTF-8S score keeping

From: Tex Texin (texin@progress.com)
Date: Fri Jun 15 2001 - 02:31:55 EDT


Antoine,

I pretty much agree. I copied most of the message, since yours
might have bounced to Unicore.

On the second point you added: although 16-bit users may technically
have come first, I think UTF-8 users dominated early, so I wouldn't
necessarily favor utf-16 users. But it's a sad state of affairs when
we have to consider favoring one or the other.

Yes, I meant utf-16x rather than utf-16s.

In the world of application integration, it is not always possible
to control all the applications. Some may belong to third parties.
Therefore, if we buy that we need to match the sort order between
two applications, one based on utf-8 and one utf-16, then it
could be that the one not under your control is utf-8-based. In that
case you may want to use utf-16x in the one you control, so you
provide data in the order expected by the utf-8 app.
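
To make the mismatch concrete, here is a minimal sketch in Python. It
is not the utf-16x proposal itself (I am not assuming its exact
definition here), only the general idea of adjusting UTF-16 code units
before comparing so that the result follows code point order, which is
also the binary order of UTF-8 and UTF-32. The function names and the
sample strings are made up for illustration.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def codepoint_order_key(s):
    """Hypothetical sort key: UTF-16 code units adjusted to code point order.

    Surrogates (D800..DFFF) are moved above the rest of the BMP, and
    E000..FFFF is shifted down to fill the gap they leave.
    """
    key = []
    for unit in utf16_units(s):
        if unit >= 0xE000:
            unit -= 0x800      # E000..FFFF -> D800..F7FF
        elif unit >= 0xD800:
            unit += 0x2000     # D800..DFFF -> F800..FFFF
        key.append(unit)
    return key


# Illustrative sample: one ASCII letter, two BMP characters at or above
# U+E000, and one supplementary character.
words = ["\uFF5E", "\U00010000", "\uE000", "A"]

by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))   # UTF-8 byte order
by_utf16 = sorted(words, key=utf16_units)                  # UTF-16 unit order
by_fixup = sorted(words, key=codepoint_order_key)          # adjusted UTF-16

print([hex(ord(w)) for w in by_utf8])
print([hex(ord(w)) for w in by_utf16])
print(by_fixup == by_utf8)
```

With plain UTF-16 code unit comparison the supplementary character
sorts before U+E000; with the adjusted key the UTF-16 side delivers
data in the order the utf-8 app expects.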

thanks for the comments.
tex

Antoine Leca wrote:
> I welcome the initiative. However, I have a couple of minor points
> I feel uncomfortable with.
> >
> > Advantage utf-8s
> > ===================
> > sorts like utf-16, saving 1% CPU
> >
> > allows binary compare
>
> Sorry? What is the point? Where does "plain" utf-8 fail to compare "binary"?
>
> In fact, to hold, your point would have to be rewritten as saying that
> UTF-8 does not allow binary compare, which is another way of saying that
> the only "correct" comparisons are those with the order implicitly set
> by UTF-16. I believe it would be fair to make this point explicit.
>
> > Only meaningful where queries do not specify an "order by" clause
>
> Hardly an advantage, IMHO, rather a point that greatly restricts the
> impacted area (which can be interpreted as a contra).
>
> Please add to your list something along the lines of:
>
> - no need to increase the code with cases to handle surrogates, they
> come for almost free with the present state of affairs (assuming
> no complex implementation of UTF-8)
>
> - gives an advantage to the users of 16-bit based code (who historically
> were the first ones) over the users of both 8-bit based UTF-8
> and the 32-bit based (mainly Unix) folks, who are really followers
> in this technology
>
> - (according to the proponents), already in wide use, in unlabeled form.
>
>
> > Disadvantage utf-8s
> > =====================
> > Potentially there would be a utf-16s and utf-32s as well.
>
> I do not see a need for utf-16s. However, if the need for binary
> order does persist, there will be a need for utf-16*, read "not
> sorted as UTF-16 but rather as UTF-8/UTF-32". Some proposals
> have arrived, named utf-16x and utf-16f, with exactly that intent.
>
> utf-32s does have a substantial performance hit too. Since
> utf-32[s] are to be used primarily when performance is a requirement,
> this does matter.
>
> > Utf-8s requires more space (6 bytes vs 4)
>
> Hardly matters, given the forecasted use of surrogate characters.
>
>
> > Hardware improves cpu performance 1% in a week.
> >
> > utf-8s requires different counting methods for API and hence a new
> > API
> > a) since supplementary characters now count as two code units
> > b) since number of bytes per code unit is different
>
> Only if UTF-16 is not used internally as the basic encoding.
>
>
> > Table lookups of associated property tables require additional
> > step to calculate table offset (combining surrogates to get
> > index value) or an alternative approach to table format.
> > i.e. requires a mix of data lookup approaches (both UTF-8 and
> > UTF-16 style lookups) instead of just one or the other.
>
> Only if UTF-16 is not used internally as the basic encoding.
>
>
> > Data is most likely re-sorted linguistically for presentation to
> > user anyway
>
> etc., the rest I agree with without any further comment.
>
> Again, thanks for taking the trouble to summarize.
>
> Antoine
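
For the space and table-lookup items above, a rough sketch in Python
may help. It assumes, as the list implies, that utf-8s encodes each
UTF-16 surrogate as its own three-byte sequence; encode_utf8s and the
helper names are hypothetical and not taken from any proposal text.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high, low) surrogates."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


def from_surrogates(hi, lo):
    """Combine a surrogate pair into a code point, e.g. for a table index."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def encode_utf8s(s):
    """Hypothetical utf-8s-style encoder (an assumption, see above).

    Each UTF-16 code unit, surrogates included, is encoded as a separate
    UTF-8-style sequence.
    """
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp >= 0x10000:
            for unit in to_surrogates(cp):
                # "surrogatepass" lets Python emit bytes for a lone surrogate.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"                 # an arbitrary supplementary character
standard = ch.encode("utf-8")     # ordinary UTF-8
as_utf8s = encode_utf8s(ch)       # surrogate-pair form

print(len(standard), len(as_utf8s))
print(hex(from_surrogates(*to_surrogates(0x10400))))
# The surrogate-pair form sorts below U+E000, as UTF-16 does.
print(encode_utf8s("\U00010000") < encode_utf8s("\uE000"))
```

The combining step in from_surrogates is the extra work the table
lookup item refers to when the data arrives as surrogate pairs.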

-- 
-------------------------------------------------------------------
Tex Texin                      Director, General Product Manager
mailto:Texin@Progress.com      +1-781-280-4271  Fax:+1-781-280-4655
the Progress Company           14 Oak Park, Bedford, MA 01730
-------------------------------------------------------------------


