Re: UTF-8S score keeping

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Thu Jun 14 2001 - 11:40:35 EDT


I guess I should be bounced at unicoRe. I hope the interested people
will monitor unicoDe.

Tex Texin wrote:
>
> I am losing track of the discussion, so I decided to create my
> own score sheet.

I welcome the initiative. However, I have a couple of minor points
I feel uncomfortable with.

> So far I have:
>
> Advantage utf-8s
> ===================
> sorts like utf-16, saving 1% CPU
>
> allows binary compare

Sorry? What is the point? Where does "plain" UTF-8 fail to compare "binary"?

In fact, to hold, your point would have to be rewritten as saying that
UTF-8 does not allow binary compare, which is another way of saying that
the only "correct" comparisons are those with the order implicitly set
by UTF-16. I believe it would be fair to make this point explicit.
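
To make the point concrete, here is a small sketch (mine, in Python,
not taken from any of the proposals) comparing one BMP character above
the surrogate range with one supplementary character. The helper name
utf16_units is made up for the illustration.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

a = "\uFF5E"       # a BMP character above the surrogate range
b = "\U00010000"   # the first supplementary character

# Byte-wise comparison of plain UTF-8 follows code point order.
utf8_order = a.encode("utf-8") < b.encode("utf-8")

# Unit-wise comparison of UTF-16 does not: the lead surrogate 0xD800
# is numerically smaller than 0xFF5E.
utf16_order = utf16_units(a) < utf16_units(b)

codepoint_order = ord(a) < ord(b)
```

Plain UTF-8 compares "binary" perfectly well here; it simply agrees
with the code point order rather than with the UTF-16 one.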

 
> Only meaningful where queries do not specify an "order by" clause

Hardly an advantage, IMHO; rather a point that greatly restricts the
impacted area (which can be interpreted as a contra).

Please add to your list something along the lines of:

- no need to grow the code with cases to handle surrogates; they
come almost for free with the present state of affairs (assuming
no complex implementation of UTF-8)

- gives an advantage to the users of 16-bit based code (who historically
have been the first ones) over both the users of 8-bit based UTF-8
and the 32-bit based (mainly Unix) folks, who are really followers
on this technology

- (according to the proponents), already in wide use, in unlabeled form.

 
> Disadvantage utf-8s
> =====================
> Potentially there would be a utf-16s and utf-32s as well.

I do not see a need for utf-16s. However, if the need for binary
order does persist, there will be a need for utf-16*, read "not
sorted as UTF-16 but rather as UTF-8/UTF-32". Some proposals
did arrive, named utf-16x and utf-16f, with exactly that intent.
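
For illustration only: one known way to get UTF-8/UTF-32 order out of
UTF-16 data without changing the encoding is to remap code units at
comparison time. I am not claiming this is what utf-16x or utf-16f
specify; the function names below are invented for this sketch, and it
assumes well-formed UTF-16.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def fixup(unit):
    """Remap a code unit so that surrogates sort above U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800    # U+E000..U+FFFF move down to 0xD800..0xF7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates move up to 0xF800..0xFFFF
    return unit

def codepoint_order_key(s):
    """Sort key giving code point order from UTF-16 code units."""
    return [fixup(u) for u in utf16_units(s)]

words = ["\U00010000", "\uFF5E", "a", "\uE000", "\U0010FFFF"]
by_fixup = sorted(words, key=codepoint_order_key)
by_raw_utf16 = sorted(words, key=utf16_units)
```

With the remapping, the supplementary characters sort last, as they do
in UTF-8 and UTF-32; without it they sort before U+E000.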

utf-32s does have a substantial performance hit too. Since
utf-32[s] is to be used primarily when performance is a requirement,
this does matter.

> Utf-8s requires more space (6 bytes vs 4)

Hardly matters, given the forecast use of surrogate characters.
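
For readers who want to see where the 6 versus 4 comes from, a rough
sketch, assuming utf-8s simply writes each UTF-16 code unit as its own
3-byte UTF-8-style sequence (utf8s_encode is a name I made up, not an
existing codec):

```python
def utf8s_encode(s):
    """Encode s by writing every UTF-16 code unit as a separate
    UTF-8-style sequence (assumed behaviour of utf-8s)."""
    data = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

supp = "\U00010000"
plain = supp.encode("utf-8")    # one 4-byte sequence
doubled = utf8s_encode(supp)    # two 3-byte sequences, one per surrogate
```

Text that stays within the BMP comes out byte-for-byte identical in
both forms, so the extra two bytes are paid only per supplementary
character.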

 
> Hardware improves cpu performance 1% in a week.
>
> utf-8s requires different counting methods for API and hence a new
> API
> a) since supplementary characters now count as two code units
> b) since number of bytes per code unit is different

Only if UTF-16 is not used internally as the basic encoding.

 
> Table lookups of associated property tables require additional
> step to calculate table offset (combining surrogates to get
> index value) or an alternative approach to table format.
> i.e. requires a mix of data lookup approaches (both UTF-8 and
> UTF-16 style lookups) instead of just one or the other.

Only if UTF-16 is not used internally as the basic encoding.
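
The "additional step" in question is the usual surrogate arithmetic,
needed when the property table is indexed by code point rather than by
UTF-16 code unit. A minimal sketch (the function name is mine):

```python
def combine_surrogates(high, low):
    """Combine a lead/trail surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

first_supplementary = combine_surrogates(0xD800, 0xDC00)
last_code_point = combine_surrogates(0xDBFF, 0xDFFF)
```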

 
> Data is most likely re-sorted linguistically for presentation to
> user anyway

etc., the rest I agree with without any further comment.

Again, thanks for taking the trouble to summarize.

Antoine


