From: Sam Mason (sam@samason.me.uk)
Date: Mon Apr 27 2009 - 09:51:07 CDT
On Mon, Apr 27, 2009 at 04:28:00PM +0200, Bjoern Hoehrmann wrote:
> * Sam Mason wrote:
> >On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
> >> I'd disagree about that. It is certainly simpler to always process as code
> >> points, but where performance is required, you need to design your
> >> algorithms with the encoding of the core string representation in mind,
> >> typically UTF-8 or UTF-16. You can get huge speedups in that way.
> >
> >Are there any pointers to literature about that? I'd be interested
> >to see how this sort of scheme would hang together; there would seem
> >to be quite a trade-off between instruction cache pressure, branch
> >prediction and most probably other effects I can't think of at the
> >moment.
>
> To offer one data point, Python's UTF-8 decoder does this and I've timed
> it against a couple of others. I note that the decoder accepts surrogate
> code points so the comparison is not entirely fair.
The point I was responding to was about doing regex's directly on the
data (i.e. UTF-8 when the source is UTF-8, UTF-16 when the source is
UTF-16, etc.) and not attempting to convert to a standard internal
encoding first.
> The times are at:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance
>
> Transcoding my name from UTF-8 to UTF-16 is substantially faster due to
> this optimization. That is largely because my processor can predict the
> additional branches almost perfectly (16 byte string, the pattern for
> the check is simply alternating no yes no yes). For the other data sets
> Python's decoder is slower than my own without any such optimization.
>
> I've also added a similar optimization to my own version, using the
> aligned 4 byte read and aligned 16 byte reads using SSE2 instructions.
> The results were similar (that is, for the more complex data sets,
> perfomance got worse).
Have you seen work such as u8u16[1]? It optimises code specifically
for newer instruction sets (MMX, SSE and AltiVec) resulting in higher
performance (about five times faster than your code, and ~20 times
faster than iconv, if I'm reading things correctly). They've patented
it for commercial use and left it free for FOSS use which is OK, but a
tad annoying.
It would be nice if their Parallel Bit Streams paper[2] went into more
details, I've only given it a quick scan and didn't see the sort of
details that interested me anyway!
-- Sam http://samason.me.uk/ [1] http://u8u16.costar.sfu.ca/ [2] http://u8u16.costar.sfu.ca/attachment/wiki/WikiStart/ppopp074-cameron.pdf
This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 09:53:58 CDT