Problems Finding Word Boundaries in Different Languages
Sophy Carroll - IBM Software Group,
Ireland
Intended Audience: Managers, Software Engineers, Technical Writers
Session Level: Beginner, Intermediate
The first step in most Natural Language Processing (NLP)
applications is breaking the source text into words (sometimes
referred to as tokenisation). While we are all familiar with the
concept of words, the exact definition is not universally accepted.
This makes it difficult to find the breaks between words, because
different definitions lead to different break points. In addition,
some definitions of a word are difficult to implement in a fast
tokenisation algorithm. This paper will look at the definition of a
word and examine examples from different languages where the word
boundary is difficult to find. Topics covered will include:
- Multi-word expressions
- The varied uses of the apostrophe
- Spontaneously generated compound words (commonly found in
Germanic languages)
- Highly inflected languages, such as Finnish or Turkish, which can
express almost an entire English sentence in a single word
- Languages whose words are written without spaces between them
(e.g. Thai)
- Languages where a large proportion of the words are
single-character words (e.g. Chinese)
- Ambiguity in how to interpret the text
The talk will also briefly discuss how these issues are handled
in IBM's Dictionary and Linguistic Tools.
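As a minimal sketch of the spacing problem mentioned above (not part of the original abstract, and not necessarily how IBM's Dictionary and Linguistic Tools work), the example below contrasts naive whitespace splitting with ICU4J's dictionary-based BreakIterator on a short Thai string. The Thai sample text and the choice of ICU4J are illustrative assumptions.

```java
import com.ibm.icu.text.BreakIterator;
import java.util.Locale;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        // A short Thai greeting, written (as Thai normally is) without spaces.
        String thai = "สวัสดีครับ";

        // Naive whitespace splitting finds no boundaries at all:
        // the whole string comes back as a single "word".
        System.out.println("Whitespace tokens: " + thai.split("\\s+").length);

        // ICU4J's word BreakIterator uses a dictionary to propose
        // plausible word boundaries for scripts without spaces.
        BreakIterator bi = BreakIterator.getWordInstance(new Locale("th"));
        bi.setText(thai);

        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            // Prints each segment the dictionary finds
            // (e.g. the greeting and the polite particle).
            System.out.println(thai.substring(start, end));
        }
    }
}
```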