UnicodeIUC21
Program Showcase Registration Accommodation Travel Sponsors
Unicode Standard Conference Board Conference CD Last Conference Past Conferences Next Conference
Abstract

Use Of ICU BreakIterator In Lexical Analysis Of Multiple Languages

Sean Callanan - IBM Ireland

Intended Audience: Software Engineers, Content Developers
Session Level: Beginner, Intermediate

The IBM Dictionary and Linguistic Tools Group produces linguistic analysis tools that support over 30 languages. This presentation describes the use of the ICU BreakIterator in lexical analysis: how ICU is used, and how we build on this technology to solve lexical analysis problems such as identifying:

  • Multi-word expressions
  • Unknown abbreviations
  • Initial abbreviations
  • Numeric bullets
  • End-of-sentence boundaries
  • Email addresses and URLs
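
As a rough illustration of how a lexical analyzer might build on a break iterator for two of the items above (abbreviations and end-of-sentence detection), the sketch below re-joins sentence segments that end in a known abbreviation. It is not the method presented in this session. It assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class, and the class name, the helper method and the small abbreviation list are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SentenceSketch {
    // Hypothetical abbreviation list; a real tool would draw on a dictionary.
    static final Set<String> ABBREVIATIONS = Set.of("dr.", "mr.", "mrs.", "e.g.", "i.e.");

    // Splits text into sentences, merging any segment that ends in a known
    // abbreviation with the segment that follows it.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pending.append(text, start, end);
            if (!endsWithAbbreviation(pending)) {
                String sentence = pending.toString().trim();
                if (!sentence.isEmpty()) {
                    out.add(sentence);
                }
                pending.setLength(0);
            }
        }
        // Flush whatever is left if the text ended on an abbreviation.
        String rest = pending.toString().trim();
        if (!rest.isEmpty()) {
            out.add(rest);
        }
        return out;
    }

    // True if the last whitespace-delimited token is a listed abbreviation.
    static boolean endsWithAbbreviation(CharSequence segment) {
        String s = segment.toString().trim();
        String lastToken = s.substring(s.lastIndexOf(' ') + 1).toLowerCase(Locale.ROOT);
        return ABBREVIATIONS.contains(lastToken);
    }

    public static void main(String[] args) {
        System.out.println(sentences("Dr. Smith arrived. He sat down.", Locale.ENGLISH));
    }
}
```

The merge step makes the result independent of whether the underlying iterator reports a boundary after "Dr."; unknown abbreviations would need a separate heuristic.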

International Components for Unicode (ICU) is a C and C++ library that provides robust, full-featured Unicode support on a wide variety of platforms.

The ICU BreakIterator maintains a current position and scans over text, returning the index of each character at which a boundary occurs. Word boundary analysis is used by search-and-replace functions, as well as in text editing applications that let the user select a word with a double click.
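
The scanning behaviour described above can be sketched in a few lines. This example assumes the JDK's java.text.BreakIterator, whose first()/next() interface closely mirrors the ICU class, as a stand-in so that it runs without the ICU library; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {
    // Walks the word boundaries of the text and returns the segments between
    // consecutive boundaries that contain at least one letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Spaces and punctuation also form segments; skip those.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Select a word with a double click.", Locale.ENGLISH));
    }
}
```

The iterator only reports boundary indices; deciding which segments count as words, as the letter-or-digit filter does here, is left to the caller.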



International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

21 February 2002, Webmaster