ADFSAs in Unicode, Take 2
Thomas Emerson - Basis Technology
Corporation
Intended Audience: |
Software Engineers |
Session Level: |
Advanced |
The literature on finite state automata generally assumes a
relatively small alphabet, often 128 or fewer characters. Small
alphabets allow one to implement an FSA efficiently and in a
straight-forward manner. Large alphabets (of which Unicode is a prime example) can make
the efficient and compact implementation of automata difficult.
This talk is a continuation of one given at IUC 21 in Washington
DC, describing my latest work on space and time methods for
building and dictionaries for large languages (100,000+ words) over
large alphabets (5000+ symbols). I present an implementation of a
recent incremental ADFSA algorithm due to Daciuk et al. that scales
very well to large languages and is not impacted by alphabet size.
I also present a method for representing the built automata in a
compact way. |