Problems Finding Word Boundaries in Different Languages
Sophy Carroll - IBM Software Group, Ireland
The first step in most Natural Language Processing (NLP) applications is breaking the source text into words (sometimes referred to as tokenisation). While we are all familiar with the concept of a word, its exact definition is not universally accepted. This makes it difficult to find the breaks between words because, depending upon which definition you use, you will choose different break points. In addition, some of the word definitions are difficult to implement in a fast tokenisation algorithm. This paper will look at the definition of a word and examine some examples in different languages where the word boundary is difficult to find. Topics covered will include:
The talk will also briefly discuss how these issues are handled in IBM's Dictionary and Linguistic Tools.
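The abstract's central point, that the choice of word definition determines where the breaks fall, can be illustrated with a minimal sketch. The two definitions and the function names below (`whitespace_words`, `alnum_words`) are assumptions chosen for illustration only; they are not taken from the paper and do not represent how IBM's Dictionary and Linguistic Tools work.

```python
import re


def whitespace_words(text):
    """Hypothetical definition A: a word is a run of non-whitespace characters."""
    return text.split()


def alnum_words(text):
    """Hypothetical definition B: a word is a maximal run of letters or digits."""
    # [^\W_] matches any Unicode word character except the underscore.
    return re.findall(r"[^\W_]+", text)


english = "Don't e-mail the co-worker."
print(whitespace_words(english))  # keeps "Don't", "e-mail" and "co-worker." whole
print(alnum_words(english))       # splits at the apostrophe and the hyphens

# Japanese text is written without spaces between words, so neither
# definition finds any internal break points here.
japanese = "私は学生です"
print(whitespace_words(japanese))
print(alnum_words(japanese))
```

Under definition A the trailing full stop stays attached to "co-worker.", while definition B breaks "Don't" into two pieces. Neither simple rule copes with text written without spaces, which is one reason practical tokenisers tend to rely on dictionaries or language-specific rules.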
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. 12 December 2002