24th Internationalization and Unicode Conference

Finite State Automata in Unicode, Take 2

Thomas Emerson - Basis Technology Corporation

Intended Audience:	Software Engineers
Session Level:	Advanced

The literature on finite state automata generally assumes a relatively small alphabet, often 128 or fewer characters. Small alphabets allow one to implement an FSA efficiently and in a straight-forward manner. Large alphabets (of which Unicode is a prime example) can make the efficient and compact implementation of automata difficult. This talk is a continuation of one given at IUC 21 in Washington DC, describing my latest work on space and time methods for building and dictionaries for large languages (100,000+ words) over large alphabets (5000+ symbols). I present an implementation of a recent incremental ADFSA algorithm due to Daciuk et al. that scales very well to large languages and is not impacted by alphabet size. I also present a method for representing the built automata in a compact way.

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

30 May 2003, Webmaster