RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jun 05 2001 - 07:45:44 EDT


Mark Davis wrote:
> The scenario is:
>
> Client software uses UTF-16.
>
> Database software uses UTF-8s.
>
> Client wants to have string A < string B if and only if
> Database has A < B.
> (where A and B are in the respective client/database encodings).

This would be a sound principle, and there would not be much to say against
it: the database server determines the order of the records in a result set
(whatever the server chooses it to be: binary UTF-16, binary UTF-8, binary
shift-JIS, randomly shuffled, whatever), and the client simply accepts the
records in that order.

A client has no reason to expect data in any particular order, unless it
explicitly requested an "order by" clause (but this is another story,
because this is when any internal binary order is ruled out, and more
sophisticated and costly sorting techniques pop in).

But the real principle behind UTF-8s seems another. The proposal rather
implies the opposite rule:

        Database must supply string A < string B if and only if it would be
A < B for the client.
        (where A and B are in the respective client/database encodings).

This is the can of worms: the client ("client" in all senses: SAP, right?)
performs on its own a job, sorting, that it should not be entitled to do.
And it does it wrong. And the server should be hacked to match broken
expectations by a broken client software.

But a database engine is entitled to used whatever text format and sorting
techniques internally, as far as these formats and techniques are efficient
and yield an ordering that allows finding data fast. There is no need to
publish these formats and techniques, and even less need to standardize
them.

When data is sent outside the engine, it should be converted to a standard
encoding. In the case of Unicode, one of UTF-8, UTF-16 or UTF-32 should be
used. The order of records in a set of returned data should be
*undetermined*, by default.

In the case that a particular sorting order was requested by the client,
that ordering should be implemented without compromises: numerical order for
numerical keys, chronological order for date/time keys, localized lexical
order for text fields.

This "proper" ordering is certainly costly in terms of performance, but it
must be stressed that it is (a) optional and (b) only needed on the *final*
result set, not for all the intermediate lists, that existed for just a few
milliseconds during the execution of the query.

I am not an expert of the internals of database engines, so my summary of
the working of a client/server database is probably very naive. But, unless
my description is total fantasy, so far there is no place for UTF-8s in
this.

So why is Oracle pushing for it? (And, by the way, why don't they explain it
clearly, rather than letting people guessing it?)

Anyway, I will try imagining a different scenario.

IMHO, the "need" for UTF-8s is not primarily in client/server communication,
but rather in server/server communication.

As I said above, the temporary lists needed during the execution of a query
can be in any arbitrary binary order, and this order does not need to be
public.

But this only holds true as long the whole query is executed inside *one*
database engine. The matter becomes more complicated in a distributed system
with more than one database engines of different makes.

When a join query is executed against database A (say, Oracle in UTF-8) and
a database B (say, MS SQL Server in UTF-16), also some of the temporary
intermediate lists must go out from the database engines.

E.g., imagine that I want the names of all adult customers stored in *both*
databases:

        select distinct a.Name
        from dbA.Customers a, dbB.Customers b
        where a.Name = b.Name
        and a.Age >= 18
        and b.Age >= 18
        order by a.Name

I guess that a query of this kind is executed according to this rough plan:

1) dbA prepares its list lA of Customers having Age >= 18 and sends it to
the controlling task.

2) dbB prepares its list lB of Customers having Age >= 18 and sends it to
the controlling task.

3) [This step is unused]

4) The controlling task (wherever its running: dbA, dbB, the client)
traverses lists lA and lB, copying records found in both to a result list
lX.

5) The controlling task sorts lX according to some cute cultural-expected
collation.

The problem is that the lists sent out by the two databases in steps 1 and 2
are, theoretically, in random order. In practice, they are certainly in some
internal binary order but, if dbA and dbB are not running identical engines,
there is no warranty that these orders are the same.

However, in order for step 4 to take place in an efficient way, the two
lists must be in the same sorting order, so that the task building lX can
traverse them in a linear way, with no need of searching one of them.

In an ideal word, step 3 should read: "The controlling task re-sorts lists
lA and/or lB according to the binary order of its choice". But we know that
this would have too big an impact on performance of the overall query.

So, in a way or another, dbA and dbB should agree on emitting the *same*
sorting order, at least in this case.

As UTF-16 databases were implemented first and/or because they are more
widespread, Oracle decided to be proactive and adapt *their* UTF-8
implementation to emit the same order as UTF-16. The only way to do this
efficiently is to convert characters >= 0x10000 to surrogate pairs *also* in
UTF-8.

Thanks to this hack, step 4 may be executed fast and, thus, Oracle is now
able to put in place a cute multi-platform distributed database solution,
and sell it to their customers.

But, then, Oracle noticed (or their customers made them notice): "Hey now we
can grant that the output of *our* database (dbA) is in the same order as
the output of our competitor's database (dbB)."

"But who grants to *us* that our competitor *will* maintain that order in
the future? And, if they decide to change, what will be of the cute
multi-platform systems that we have already promised and/or delivered to our
customers?"

"And considering that, after all, our competitors' binary order is a *bug*
(0x1000 coming before 0xE000: how would you call it!?), aren't we offering
to them our head on a silver plate? With a single action, they can fix their
sorting order *and* put us in serious trouble with our customer."

It was probably after this brainstorming, that came the brilliant move of
proposing UTF-8S to Unicode: "if our hack to match their bug gets
incorporated in an international standard, Microsoft will not dare to change
it, will they?"

But, dear Oracle, I am afraid that this is a loosing move! I can foresee the
chess-mate:

a) The UTC approves UTF-8s? Microsoft thanks Oracle for having worked so
hard to sanctify their criticizable sorting order.

b) The UTC rejects UTF-8s? Microsoft shake their heads and say to Oracle:
"Of course. We didn't want to tell you, but it was such a silly idea".

In any case, how does the mere fact that UTF-8s exists force any other
database producer to use it? Or to use another encoding that yields the same
sorting order? If Microsoft want to implement *their* UTF-8 or UTF-32 (or,
why not, "UTF-24") database with straight sorting order, they will do it in
spite of this.

Do you want my humble idea of a stronger move? :-)

Withdraw your proposal of UTF-8s. Keep it in Oracle as a *proprietary*
connectivity extension and clearly *deprecate* it.

Explain in the documentation that this *deprecated* option is only to be
used if one cannot avoid to connect to 3rd party databases emitting
"pre-Unicode 2.0 UCS-2-like order".

Then let your customers ask their engineers: "Really we *cannot* avoid to
connect our Oracle to that thing with such a horrible name?" -- "You mean
the 'pre-Unicode 2.0 UCS-2-like order'? That's SQL Server, but I heard that
they are running to fix their sorting order in next release".

_ Marco



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT