Twenty-first International Unicode Conference

Unicode in Natural Language Processing Applications

Thomas Emerson - Basis Technology Corporation

Intended Audience:	Managers, Software Engineers, Marketers, Content Developers
Session Level:	Intermediate

Traditionally, natural language processing (NLP) applications have been written to solve a single problem in a single language. In the last several years, however, it has become more common to see NLP frameworks developed for applications in several languages. Nevertheless, these applications are often limited to languages that share a common script (e.g., Western European languages alone) or a common encoding scheme (e.g., ISO 8859-n).

This talk outlines the benefits of Unicode for natural language processing applications that must target multiple languages. This is of particular interest to the members of the European Union: with the potential doubling in size of the EU over the next year, they will have more than 20 official languages in multiple scripts and encodings.

Standardizing on a single character representation, especially one with the extended character semantics defined in Unicode, makes implementing NLP applications significantly easier. This talk will show how Basis Technology was able to extend linguistic technology developed for Chinese and Japanese to several European languages.
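
As a rough illustration of what character semantics buy a multilingual application, the sketch below uses the Unicode general category to pick out runs of letters in mixed-script text with one code path. It is not taken from the talk: the function names coarse_class, split_runs, and words are hypothetical, and the approach is a deliberately naive stand-in for real linguistic analysis.

```python
import unicodedata


def coarse_class(ch):
    """Map one character to a coarse class via its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat[0] in ("L", "M"):   # letters and combining marks, in any script
        return "letter"
    if cat[0] == "N":          # decimal digits and other numeric characters
        return "number"
    if cat[0] == "Z" or ch.isspace():
        return "space"
    return "other"             # punctuation, symbols, controls, and so on


def split_runs(text):
    """Group consecutive characters of the same coarse class into runs."""
    runs = []
    for ch in text:
        cls = coarse_class(ch)
        if runs and runs[-1][0] == cls:
            runs[-1] = (cls, runs[-1][1] + ch)
        else:
            runs.append((cls, ch))
    return runs


def words(text):
    """Return only the letter runs (hypothetical helper for illustration)."""
    return [run for cls, run in split_runs(text) if cls == "letter"]


if __name__ == "__main__":
    # Greek, Japanese, English, and French handled by the same code path.
    print(words("Ελληνικά, 日本語 and français!"))
```

Because the classification depends only on properties that Unicode defines for every character, no per-encoding tables are needed. Note that the Japanese run comes back as a single unit; splitting it into words requires dictionary-based or statistical analysis, which this sketch does not attempt.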

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

21 February 2002, Webmaster