Finite State Automata in Unicode, Take 2
Thomas Emerson - Basis Technology
Corporation
Intended Audience: |
Software Engineers |
Session Level: |
Advanced |
The literature on finite state automata generally assumes a
relatively small alphabet, often 128 or fewer characters. Small
alphabets allow one to implement an FSA efficiently and in a
straight-forward manner. Large alphabets (of which Unicode is a
prime example) can make the efficient and compact implementation of
automata difficult. This talk is a continuation of one given at IUC
21 in Washington DC, describing my latest work on space and time
methods for building and dictionaries for large languages (100,000+
words) over large alphabets (5000+ symbols). I present an
implementation of a recent incremental ADFSA algorithm due to
Daciuk et al. that scales very well to large languages and is not
impacted by alphabet size. I also present a method for representing
the built automata in a compact way. |