I guess I should be bounced at unicoRe. I hope the interested people
will monitor unicoDe.
Tex Texin wrote:
>
> I am losing track of the discussion, so I decided to create my
> own score sheet.
I welcome the initiative. However, I have a couple of minor points
I feel uncomfortable with.
> So far I have:
>
> Advantage utf-8s
> ===================
> sorts like utf-16, saving 1% CPU
>
> allows binary compare
Sorry? What is the point? Where does "plain" UTF-8 fail to compare
"binary"? In fact, to hold, your point would have to be rewritten as
saying that UTF-8 does not allow binary compare, which is another way
of saying that the only "correct" comparisons are those that follow
the order implicitly set by UTF-16. I believe it would be fair to make
this point explicit.
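
To make the ordering difference concrete, here is a minimal sketch in
Python (the helper name utf16_units and the two sample characters are
chosen for illustration only). It compares one BMP character above the
surrogate range with one supplementary character under each encoding:

```python
# Two code points whose relative order differs between encodings:
# U+FF5E (BMP, above the surrogate range) and U+10000 (supplementary).
a = "\uff5e"
b = "\U00010000"

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Binary compare of plain UTF-8 bytes follows code point order.
utf8_order = a.encode("utf-8") < b.encode("utf-8")
# So does UTF-32 (big-endian, so byte order matches numeric order).
utf32_order = a.encode("utf-32-be") < b.encode("utf-32-be")
# UTF-16 code unit order puts the surrogate pair (0xD800, 0xDC00) first.
utf16_order = utf16_units(a) < utf16_units(b)
```

Here utf8_order and utf32_order are both True while utf16_order is
False: plain UTF-8 compares "binary" perfectly well, it just does not
agree with UTF-16.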
> Only meaningful where queries do not specify an "order by" clause
Hardly an advantage, IMHO, rather a point that greatly restricts the
impacted area (which can be interpreted as a contra).
Please add to your list something along the lines of:
- no need to add code paths to handle surrogates; they come almost for
free with the present state of affairs (assuming no complex
implementation of UTF-8)
- gives an advantage to the users of 16-bit based code (who
historically were the first ones) over both the users of 8-bit based
UTF-8 and the 32-bit based (mainly Unix) folks, who are really
followers on this technology
- (according to the proponents), already in wide use, in unlabeled form.
> Disadvantage utf-8s
> =====================
> Potentially there would be a utf-16s and utf-32s as well.
I do not see a need for utf-16s. However, if the need for binary
order does persist, there will be a need for utf-16*, read "not
sorted as UTF-16 but rather as UTF-8/UTF-32". Some proposals
did arrive, named utf-16x and utf-16f, with exactly that intent.
utf-32s does carry a substantial performance hit too. Since
utf-32[s] is to be used primarily when performance is a requirement,
this does matter.
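
For illustration only, one known way to get UTF-8/UTF-32 order out of
UTF-16 data without re-encoding it is to remap code units at compare
time. The sketch below is NOT taken from the utf-16x or utf-16f
proposals; the function names are hypothetical and it only shows that
such a reordering is cheap to express:

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def fixup(unit):
    """Remap one code unit so surrogates sort above U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800      # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000     # D800..DFFF -> F800..FFFF
    return unit                  # below the surrogate range: unchanged

def code_point_key(s):
    """Sort key giving code point order from UTF-16 code units."""
    return [fixup(u) for u in utf16_units(s)]

words = ["\U00010000", "\uff5e", "A", "\ud7ff"]
by_code_point = sorted(words, key=code_point_key)
by_code_unit = sorted(words, key=utf16_units)
```

by_code_point matches plain code point order, whereas by_code_unit
places U+10000 before U+FF5E.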
> Utf-8s requires more space (6 bytes vs 4)
Hardly matters, given the forecast use of surrogate characters.
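
The "6 bytes vs 4" figure can be checked with a small Python sketch.
It assumes, as I read the discussion, that utf-8s encodes each UTF-16
surrogate as its own 3-byte sequence; the function name utf8s_encode
is made up for this example:

```python
def utf8s_encode(s):
    """Encode s by converting each UTF-16 code unit to UTF-8 separately."""
    raw = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        # "surrogatepass" lets Python encode a lone surrogate as 3 bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U00010400"            # a supplementary character
plain = ch.encode("utf-8")   # standard UTF-8: one 4-byte sequence
wide = utf8s_encode(ch)      # assumed utf-8s: two 3-byte sequences
```

len(plain) is 4 and len(wide) is 6; for BMP-only text the two
encodings produce identical bytes, so the extra space is paid only on
supplementary characters.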
> Hardware improves cpu performance 1% in a week.
>
> utf-8s requires different counting methods for API and hence a new
> API
> a) since supplementary characters now count as two code units
> b) since number of bytes per code unit is different
Only if UTF-16 is not used internally as the basic encoding.
> Table lookups of associated property tables require additional
> step to calculate table offset (combining surrogates to get
> index value) or an alternative approach to table format.
> i.e. requires a mix of data lookup approaches (both UTF-8 and
> UTF-16 style lookups) instead of just one or the other.
Only if UTF-16 is not used internally as the basic encoding.
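
The "combining surrogates to get index value" step quoted above is the
standard UTF-16 arithmetic. A minimal sketch in Python (the function
name is hypothetical):

```python
def combine_surrogates(hi, lo):
    """Combine a high and a low surrogate into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

index = combine_surrogates(0xD801, 0xDC00)   # code point U+10400
```

A table indexed by code point needs this step whenever the data
arrives as surrogates, which is the extra cost the quoted item refers
to.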
> Data is most likely re-sorted linguistically for presentation to
> user anyway
etc., the rest I agree with without any further comment.
Again, thanks for taking the trouble to summarize.
Antoine