ADFSAs in Unicode, Take 2Thomas Emerson - Basis Technology Corporation
The literature on finite state automata generally assumes a relatively small alphabet, often 128 or fewer characters. Small alphabets allow one to implement an FSA efficiently and in a straight-forward manner. Large alphabets (of which Unicode is a prime example) can make the efficient and compact implementation of automata difficult. This talk is a continuation of one given at IUC 21 in Washington DC, describing my latest work on space and time methods for building and dictionaries for large languages (100,000+ words) over large alphabets (5000+ symbols). I present an implementation of a recent incremental ADFSA algorithm due to Daciuk et al. that scales very well to large languages and is not impacted by alphabet size. I also present a method for representing the built automata in a compact way. |
When the world wants to talk, it speaks Unicode |
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
GMS is pleased to be able to offer the International Unicode Conferences under an exclusive
license granted by the Unicode Consortium. All responsibility for conference finances and
operations is borne by GMS. The independent conference board serves solely at the pleasure
of GMS and is composed of volunteers active in Unicode and in international software
development. All inquiries regarding International Unicode Conferences should be addressed
to info@global-conference.com.
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. 12 December 2002, Webmaster |