From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Apr 13 2009 - 00:14:15 CDT
Hi,
I've written a simple UTF-8 decoding function that processes a single
byte at a time while the caller maintains the decoder state. This makes
it much easier to use correctly in many situations. The approach is
feasible because the function consists of only about a dozen
instructions, so it is easily inlined. For the work-in-progress
implementation and documentation, see:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Essentially it uses a specially constructed table-driven DFA for state
transitions, so the decoding function just does table lookups and the
usual bit magic. To verify that this approach is sound, I have timed a
simple transcoder against some popular UTF-8 to UTF-16 transcoders.
Results are somewhat compiler- and, I imagine, architecture-specific,
but my implementation appears to compare well given its simplicity.
See the web page for results on my system.
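
For readers who want a feel for the byte-at-a-time interface, here is a
rough sketch of the general idea in C. It is not the code from the web
page: the names utf8_class, utf8_next, utf8_step and utf8_count are
made up for illustration, and the byte classes are computed with range
checks instead of the packed lookup table the real implementation uses.
The caller owns `state` and `codep` and feeds one byte per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UTF8_ACCEPT = 0, UTF8_REJECT = 1 };

/* Map a byte to one of 12 classes (hypothetical numbering). */
static unsigned utf8_class(uint8_t b) {
    if (b <= 0x7F) return 0;   /* ASCII */
    if (b <= 0x8F) return 1;   /* continuation 80..8F */
    if (b <= 0x9F) return 2;   /* continuation 90..9F */
    if (b <= 0xBF) return 3;   /* continuation A0..BF */
    if (b <= 0xC1) return 11;  /* C0, C1: only overlong forms */
    if (b <= 0xDF) return 4;   /* lead byte, 2-byte sequence */
    if (b == 0xE0) return 5;   /* 3-byte lead, second byte A0..BF */
    if (b == 0xED) return 7;   /* 3-byte lead, second byte 80..9F */
    if (b <= 0xEF) return 6;   /* E1..EC, EE..EF */
    if (b == 0xF0) return 8;   /* 4-byte lead, second byte 90..BF */
    if (b <= 0xF3) return 9;   /* F1..F3 */
    if (b == 0xF4) return 10;  /* 4-byte lead, second byte 80..8F */
    return 11;                 /* F5..FF: never valid */
}

/* utf8_next[state][class]: 0 = accept, 1 = reject, 2..8 = mid-sequence. */
static const uint8_t utf8_next[9][12] = {
    /* 0 accept        */ {0, 1, 1, 1, 2, 4, 3, 5, 6, 7, 8, 1},
    /* 1 reject        */ {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 2 one more      */ {1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 3 two more      */ {1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 4 after E0      */ {1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 5 after ED      */ {1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 6 after F0      */ {1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 7 three more    */ {1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 8 after F4      */ {1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1},
};

/* Payload bits kept from a lead byte, indexed by byte class. */
static const uint8_t utf8_lead_mask[12] = {
    0x7F, 0, 0, 0, 0x1F, 0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07, 0
};

/* Feed one byte; *state must be a value previously returned (or ACCEPT).
 * When the result is UTF8_ACCEPT, *codep holds a complete code point. */
static uint32_t utf8_step(uint32_t *state, uint32_t *codep, uint8_t byte) {
    unsigned cls = utf8_class(byte);
    if (*state == UTF8_ACCEPT)
        *codep = byte & utf8_lead_mask[cls];      /* start of a sequence */
    else
        *codep = (*codep << 6) | (byte & 0x3Fu);  /* append 6 payload bits */
    *state = utf8_next[*state][cls];
    return *state;
}

/* Example caller: count code points, or return (size_t)-1 if malformed. */
static size_t utf8_count(const uint8_t *s, size_t len) {
    uint32_t state = UTF8_ACCEPT, codep = 0;
    size_t n = 0;
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint32_t st = utf8_step(&state, &codep, s[i]);
        if (st == UTF8_ACCEPT) n++;
        else if (st == UTF8_REJECT) return (size_t)-1;
    }
    return state == UTF8_ACCEPT ? n : (size_t)-1;
}
```

Because the state lives with the caller, input can arrive in arbitrary
chunks: a sequence split across two buffers simply resumes on the next
call, and a state other than UTF8_ACCEPT at end of input signals a
truncated sequence.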
regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/