Use Of ICU BreakIterator In Lexical Analysis Of Multiple Languages
Intended Audience: Software Engineers, Content Developers
Session Level: Beginner, Intermediate
The IBM Dictionary and Linguistic Tools Group produces linguistic analysis
tools that support over 30 different languages. This presentation describes
the use of ICU BreakIterator in lexical analysis, covering how ICU is used
and how we build on this technology to solve lexical analysis issues such
as identifying:
- Multi-word expressions
- Unknown abbreviations
- Initial abbreviations
- Numeric bullets
- End-of-sentence boundaries
- Email addresses and URLs
The International Components for Unicode (ICU) is a C and C++ library that
provides robust and full-featured Unicode support on a wide variety of
platforms.
The ICU BreakIterator maintains a current position and scans over text,
returning the index of each character at which a boundary occurs.
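The scanning behaviour described above can be sketched in a few lines. This
is an illustration only, not the group's code: it assumes the JDK class
java.text.BreakIterator as a stand-in so that it runs without ICU installed.
The JDK class follows the same first()/next()/DONE pattern as ICU's
BreakIterator, but its rules may differ from ICU's in detail. The class name
BoundaryScan and the method wordBoundaries are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryScan {
    /**
     * Collects every boundary offset that the word iterator reports.
     * Assumption: uses the JDK BreakIterator, not ICU itself.
     */
    static List<Integer> wordBoundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        // first() moves to the start of the text; next() advances the
        // current position to the following boundary until DONE.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(wordBoundaries("Hello world", Locale.ENGLISH));
    }
}
```

Note that the iterator reports boundaries, not words: the space between the
two words is delimited by its own pair of offsets, so a lexical analyser
still has to decide which spans count as tokens.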
Word boundary analysis is used by search and replace functions, as well as
within text editing applications that allow the user to select words with a
double click.
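As a rough sketch of the double-click case, the word under a given character
offset can be found by asking the iterator for the boundary after the offset
and then stepping back one boundary. As before, this assumes the JDK's
java.text.BreakIterator as a stand-in for ICU, and the names WordSelect and
wordAt are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class WordSelect {
    /**
     * Returns the span (word, whitespace run or punctuation) that contains
     * the character at the given offset, or "" if no boundary follows it.
     * Assumption: uses the JDK BreakIterator, not ICU itself.
     */
    static String wordAt(String text, int offset, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        // First boundary strictly after the clicked offset.
        int end = it.following(offset);
        if (end == BreakIterator.DONE) {
            return "";
        }
        // Step back one boundary from 'end' to find where the span starts.
        int start = it.previous();
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(wordAt("select this word", 8, Locale.ENGLISH));
    }
}
```

A text editor would typically highlight the range from start to end rather
than copy the substring, but the boundary lookup is the same.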