From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Sun Apr 26 2009 - 00:01:56 CDT
Hello,
A while ago I published the Perl module Unicode::SetAutomata on CPAN.
Consider you have a UTF-8 encoded string and wish to know whether it is
a letter followed by a digit. You could use a finite automaton to check
for that, it would go like this:
+---------+ +---------+ +---------+
| | Letter | | Digit | |
| Start | ----------> | | ---------> | Final |
| | | | | |
+---------+ +---------+ +---------+
If you just worry about bytes, this would be easily implemented. You
could just make a simple array that encodes for each byte whether it
is a letter, a digit, or something else. With the number of allocated
code points in Unicode, this would require too much space, alternate
encodings would be needed. Consider a corresponding regular expression:
/[:UnicodeLetter:][:UnicodeDigit:]/
Which would expand so something like
/(A|B|...|a|b|...|À|...)(...)/
Now, if we replace each character by its UTF-8 encoding, we would ob-
tain a regular expression and corresponding automata that match the
same language, but would operate directly on bytes:
/(A|B|...|a|b|...|\xC3\x80|...)(...)/
This is the first problem that the module solves: you can pass it a
character class and it would give you a short regular expression that
matches UTF-8 encoded strings that encode one of the characters in the
class. The second problem is a related one. It gives you an automaton
like this:
+---------+
| |
+--- ... --> | Letter |
/ | |
/ +---------+
+---------+ +---------+
| | | |
| Start | ---- ... ---> | Digit |
| | | |
+---------+ +---------+
\ ###########
\ # #
+--- ... --> # Other #
# #
###########
That is, you can pass it a couple of character classes, and it gives
you a deterministic finite automaton with a distinct accepting states
for each of the (disjoint) character classes. The first automaton
could then be implemented in a two stage process: pass bytes to the
character class determining automaton until a character has been read,
then advance the class-based automaton using that character class.
There are a number of existing text processing tools that currently
lack usable Unicode support; the first feature offers a simple way to
use them with UTF-8 encoded data. The second feature was implemented
mainly because it solves the first problem for many character classes
at once, and to study whether such an approach as feasible.
Currently I do not have the time to actually study that, but a rule
of thumb seems to be that the 4-tuples returned by the module take
up about as much space as inversion lists would, assuming that most
automata would have less than 256 states so each 4-tuple can be en-
coded as 32 bit integer.
Generously commented code and some documentation are available at:
http://search.cpan.org/dist/Unicode-SetAutomaton/
regards,
-- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 00:07:09 CDT