Problems Finding Word Boundaries in Different Languages
Sophy Carroll - IBM Software Group,
Ireland
Intended Audience: Managers, Software Engineers, Technical Writers
Session Level: Beginner, Intermediate
The first step in most Natural Language Processing (NLP)
applications is breaking the source text into words (sometimes
referred to as tokenisation). While we are all familiar with the
concept of words, the exact definition is not universally accepted.
This makes it difficult to find the breaks between words, because
different definitions lead to different break points. In addition,
some definitions of a word are difficult to implement in a fast
tokenisation algorithm. This paper will look at the definition of a
word and examine examples from different languages where the word
boundary is difficult to find. Topics covered will include:
- Multi-word expressions
- The varied uses of the apostrophe
- Spontaneously generated compound words (commonly found in
Germanic languages)
- Highly inflected languages, such as Finnish or Turkish, which can
express almost an entire English sentence in a single word
- Languages whose words are written without spaces between them
(e.g. Thai)
- Languages where a large proportion of the words are
single-character words (e.g. Chinese)
- Ambiguity in how to interpret the text
The talk will also briefly discuss how these issues are handled
in IBM's Dictionary and Linguistic Tools.
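As a minimal sketch of the spacing problem mentioned above (not part of the original abstract, and not necessarily how IBM's Dictionary and Linguistic Tools work), the example below contrasts naive whitespace splitting with ICU4J's dictionary-based BreakIterator on a short Thai string. The Thai sample text and the choice of ICU4J are illustrative assumptions.

```java
import com.ibm.icu.text.BreakIterator;
import java.util.Locale;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        // A short Thai greeting, written (as Thai normally is) without spaces.
        String thai = "สวัสดีครับ";

        // Naive whitespace splitting finds no boundaries at all:
        // the whole string comes back as a single "word".
        System.out.println("Whitespace tokens: " + thai.split("\\s+").length);

        // ICU4J's word BreakIterator uses a dictionary to propose
        // plausible word boundaries for scripts without spaces.
        BreakIterator bi = BreakIterator.getWordInstance(new Locale("th"));
        bi.setText(thai);

        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            // Prints each segment the dictionary finds
            // (e.g. the greeting and the polite particle).
            System.out.println(thai.substring(start, end));
        }
    }
}
```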