RE: FW: UTF-8S ??? UTF-16F !!!

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Jun 13 2001 - 13:13:03 EDT


Doug Ewell wrote:
> By coding the transformation inline, and reordering things trivially so that
> the test for (u < 0xe000) -- by far the most common case -- appears first,
> the transformation will degenerate in most cases to:
>
> if (u < 0xe000)
> ;

That doesn't work, because the surrogates (0xD800..0xDFFF) would then be left
unchanged as well; I think you meant "u < 0xD800", which yields:
 
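        unsigned short utf16_to_utf16f(unsigned short u)
        {
                /* Below the surrogates: unchanged (the most common case). */
                if (u < 0xD800)
                        return u;
                /* 0xE000..0xFFFF: shift down to 0xD800..0xF7FF. */
                if (u >= 0xE000)
                        return u - 0x800;
                /* Surrogates 0xD800..0xDFFF: move up to 0xF800..0xFFFF. */
                return u + 0x2000;
        }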

However, assuming an optimizing compiler, I don't think that moving
instructions around like this has much effect on the resulting code. Of course,
the matter is different if the instructions have to be fine-tuned in assembly.
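
To make the purpose of the fix-up concrete, here is a minimal sketch of a
comparison routine built on it. The name utf16_cmp_codepoint_order and its
signature are made up for illustration (they are not part of Markus' code),
and unsigned short is assumed to be 16 bits wide:

```c
#include <stddef.h>

/* Fix-up as above: surrogate code units are moved above the rest of the BMP. */
static unsigned short utf16_to_utf16f(unsigned short u)
{
    if (u < 0xD800)
        return u;
    if (u >= 0xE000)
        return (unsigned short)(u - 0x800);
    return (unsigned short)(u + 0x2000);
}

/* Hypothetical helper: compares two UTF-16 buffers (lengths in code units)
   by their fixed-up code units. Assumes well-formed input, i.e. no unpaired
   surrogates. Returns -1, 0 or 1. */
static int utf16_cmp_codepoint_order(const unsigned short *a, size_t na,
                                     const unsigned short *b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned short x = utf16_to_utf16f(a[i]);
        unsigned short y = utf16_to_utf16f(b[i]);
        if (x != y)
            return x < y ? -1 : 1;
    }
    if (na == nb)
        return 0;
    return na < nb ? -1 : 1;   /* the shorter string sorts first */
}
```

With this, U+FF5E (0xFF5E) compares less than U+10000 (0xD800 0xDC00), whereas
comparing the raw code units puts them the other way round.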

> If you remove the assert(u <= 0xffff) statement, then the same logic can be
> used for data in either UTF-8 or UTF-16, provided that no unpaired surrogates
> appear in your data (a reasonable constraint).

I don't get your point here.

The assert() is simply useless, because (in practice, if not in theory) an
unsigned short cannot be greater than 0xFFFF. Removing it changes nothing.
In any case, it is normally assumed that all assert() calls are compiled out
(by defining NDEBUG) after the testing phase.

Apart from this, how can you use the same code for UTF-8? And to do what? The
purpose of Markus' code is to define a pseudo-UTF-16 that sorts like UTF-8.
UTF-8 already sorts like UTF-8...
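
As a rough check of that last point, the sketch below encodes two code points
as UTF-8 and compares the resulting bytes. The functions utf8_encode and
utf8_order are illustrative names of my own, not taken from anyone's code, and
the input is assumed to be a valid scalar value no greater than 0x10FFFF:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative encoder: writes the UTF-8 form of code point c into out
   (room for 4 bytes required) and returns the number of bytes written. */
static size_t utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

/* Orders two code points by the bytes of their UTF-8 encodings.
   Returns -1, 0 or 1. */
static int utf8_order(unsigned long a, unsigned long b)
{
    unsigned char x[4], y[4];
    size_t nx = utf8_encode(a, x);
    size_t ny = utf8_encode(b, y);
    int r = memcmp(x, y, nx < ny ? nx : ny);

    if (r != 0)
        return r < 0 ? -1 : 1;
    if (nx == ny)
        return 0;
    return nx < ny ? -1 : 1;
}
```

Here utf8_order(0xFF5E, 0x10000) is negative (EF BD 9E sorts before
F0 90 80 80), which agrees with code point order without any fix-up.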

_ Marco


