RE: UTF-8 Syntax

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Jun 11 2001 - 09:12:32 EDT


Toby Phipps wrote:
> This is incredibly inefficient, not only
> because significant amounts of temporary space needs to be
> allocated and
> freed, but also because the entire result set of the query has to be
> processed and sorted before the first row is returned. With
> result sets
> involving several million rows, this is a very significant overhead,
> especially if the typical user only looks at the first couple
> of hundred.

As a second thought, I would examine these statement by a more pragmatic
point of view.

I have no expertise in the internals of databases, so I am sure that Toby
will correct this haphazarded statement: converting UTF-16 (& UTF-8s) binary
order to UTF-8 (and UTF-32) binary order you don't need a complex and
resource-consuming algorithm as you would need for lexical collation.

I dare to say that such an algorithm just needs 2 pointers, 2 binary
searches, and a reorganization of the list.

Suppose that you have a query against a UTF-16 database, and you execute
using your usual super-efficient binary order. At the end, it yields this
result (binary sorted in UTF-16 code units):

        1: 0x0068 0x0069 (U+0068 U+0069)
        2: 0x0068 0x006F (U+0068 U+0069)
        3: 0xD800 0xDF07 0xD800 0xDF09 (U-010307 U-010309)
        4: 0xD800 0xDF07 0xD800 0xDF0F (U-010307 U-01030F)
        5: 0xFF48 0xFF49 (U+FF48 U+FF49)
        6: 0xFF48 0xFF4F (U+FF48 U+FF49)

To deliver this in code-point order (i.e. in UTF-8/UTF-32 order), you just
need this algorithm (coded in an imaginary programming language):

        declare FirstSurrog as row index
        FirstSurrog = SEARCH_FIRST_ROW(
                                where first code unit >= 0xD800
                                and first code unit < 0xE000)
        if found then
                declare FirstPostSurrog as row index
                FirstPostSurrog = SEARCH_FIRST_ROW(
                                                where first code unit >=
0xE000)
                if found then
                        SWAP_ROW_BLOCKS(
                                FirstSurrog .. FirstPostSurrog - 1,
                                FirstPostSurrog .. last)
                end if
         end if

In the case above, the result set would become:

        1: 0x0068 0x0069 (U+0068 U+0069)
        2: 0x0068 0x006F (U+0068 U+0069)
        3: 0xFF48 0xFF49 (U+FF48 U+FF49)
        4: 0xFF48 0xFF4F (U+FF48 U+FF49)
        5: 0xD800 0xDF07 0xD800 0xDF09 (U-010307 U-010309)
        6: 0xD800 0xDF07 0xD800 0xDF0F (U-010307 U-01030F)

Is it correct to assume that the operation which I called SEARCH_FIRST_ROW
above should take a maximum of 16 comparisons on a b-tree?

And is it correct to assume that the operation which I called
SWAP_ROW_BLOCKS can be implemented virtually (i.e. by traversing the tree in
2 passes, visiting different branches), rather than by moving actual data in
memory?

So, why couldn't this be efficient enough also for a list with millions
rows?

_ Marco



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT