Re: FW: UTF-8S ??? UTF-16F !!!

From: Pierpaolo BERNARDI (bernardp@cli.di.unipi.it)
Date: Wed Jun 13 2001 - 15:30:03 EDT


> Doug Ewell wrote:
> > By coding the transformation inline, and reordering things trivially so
> > that the test for (u < 0xe000) -- by far the most common case -- appears
> > first, the transformation will degenerate in most cases to:
> >
> >     if (u < 0xe000)
> >         ;
>
> That doesn't work; I think you meant "u < 0xD800", which yields:
>
> unsigned short utf16_to_utf16f(unsigned short u)
> {
>     if (u < 0xD800)
>         return u;
>     if (u >= 0xE000)
>         return u - 0x800;
>     return u + 0x2000;
> }
>
> However, assuming an optimizing compiler, I don't think that moving these
> instructions around has much effect on the resulting code. Of course, the
> matter is different if the instructions have to be fine-tuned in assembly.
>
> > If you remove the assert(u <= 0xffff) statement, then the same logic
> > can be used for data in either UTF-8 or UTF-16, provided that no
> > unpaired surrogates appear in your data (a reasonable constraint).
>
> I don't get your point here.
>
> The assert() is simply useless, because (in practice, if not in theory) an
> unsigned short cannot be greater than 0xFFFF. Removing it changes nothing.
> In any case, it is normally assumed that all assert() calls are compiled
> out (by defining NDEBUG) after the testing phase.
>
> Apart from this, how can you use the same code for UTF-8? And to do what?
> The purpose of Markus' code is to define a pseudo-UTF-16 that sorts like
> UTF-8. UTF-8 already sorts like UTF-8...
>
> _ Marco

foo.
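
For what it's worth, here is a rough sketch of why the quoted mapping gives
UTF-8 (code point) order. The first function is Marco's, unchanged apart
from explicit casts. The inverse utf16f_to_utf16() and the helper
compare_utf16f() are my own hypothetical additions, not something posted in
this thread, and the helper only compares single code units rather than
whole strings.

```c
/* Marco's transformation as quoted above: BMP units at or above 0xE000
   move down by 0x800, and surrogates move up to the top of the range. */
static unsigned short utf16_to_utf16f(unsigned short u)
{
    if (u < 0xD800)
        return u;                              /* unchanged */
    if (u >= 0xE000)
        return (unsigned short)(u - 0x800);    /* E000..FFFF -> D800..F7FF */
    return (unsigned short)(u + 0x2000);       /* D800..DFFF -> F800..FFFF */
}

/* Hypothetical inverse (an assumption, not from the thread). */
static unsigned short utf16f_to_utf16(unsigned short f)
{
    if (f < 0xD800)
        return f;                              /* unchanged */
    if (f >= 0xF800)
        return (unsigned short)(f - 0x2000);   /* back to D800..DFFF */
    return (unsigned short)(f + 0x800);        /* back to E000..FFFF */
}

/* Hypothetical helper: compare two single code units in UTF-16F order.
   Returns a negative, zero, or positive value, like strcmp(). */
static int compare_utf16f(unsigned short a, unsigned short b)
{
    unsigned short fa = utf16_to_utf16f(a);
    unsigned short fb = utf16_to_utf16f(b);
    return (fa > fb) - (fa < fb);
}
```

With raw UTF-16, the code unit 0xFFFD (U+FFFD) compares greater than 0xD800
(the lead surrogate that starts U+10000), which is the opposite of code
point order. After the transformation they become 0xF7FD and 0xF800, so
compare_utf16f(0xFFFD, 0xD800) is negative, as it would be for the
corresponding UTF-8 byte sequences.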


