Segmenting Chinese in Unicode
Tom Emerson - Basis Technology Corporation
Intended Audience: Manager, Software Engineer
Session Level: Intermediate
The automatic segmentation of Chinese text is an ongoing problem in information
retrieval and computational linguistics. Because Chinese is written without spaces
between words, applications that operate on words (e.g., search engines) must
determine the word boundaries algorithmically.
This presentation illustrates the Unicode-based approach taken in Basis
Technology's simplified and traditional Chinese text segmentation system, the Chinese
Morphological Analyzer. Segmentation is based on a very large dictionary of Chinese
words with part-of-speech information, combined with Chinese morphological knowledge.
The presentation covers how unknown words are handled using Chinese word formation
rules and grammatical information. Common segmentation problems and their solutions
are also discussed.
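
To make the idea of dictionary-based segmentation concrete, here is a minimal sketch of greedy forward maximum matching, a common textbook baseline. It is not the algorithm used by the Chinese Morphological Analyzer, which the abstract does not specify. The names `TOY_DICTIONARY` and `segment` and the `max_word_len` parameter are hypothetical, and the nine-word dictionary stands in for a real lexicon.

```python
# Illustrative only: a toy dictionary. A production lexicon would be very
# large and would also carry part-of-speech information.
TOY_DICTIONARY = {
    "中文", "分词", "是", "一个", "问题",
    "研究", "研究生", "生命", "起源",
}


def segment(text, dictionary=TOY_DICTIONARY, max_word_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    A character that starts no dictionary word becomes a one-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        # Try the longest candidate first and shrink until something matches.
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


print(segment("中文分词是一个问题"))
print(segment("研究生命起源"))
```

The second call shows a classic ambiguity. The greedy matcher yields 研究生 / 命 / 起源 ("graduate student / fate / origin"), whereas the intended reading is 研究 / 生命 / 起源 ("research / life / origin"). Resolving cases like this is where part-of-speech and grammatical information become useful.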
The engine uses Unicode throughout, allowing it to seamlessly handle traditional and
simplified Chinese text, including ideographs used only in certain Chinese locales such
as Hong Kong. Diverse applications of the segmentation engine will also be covered:
Chinese-to-Chinese script conversion, keyword extraction for information retrieval, and
content filtering.
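
One reason segmentation helps script conversion is that some simplified characters correspond to more than one traditional character, so the right choice depends on the word. The sketch below assumes the text has already been segmented into tokens. The tables `CHAR_MAP` and `WORD_MAP` and the function `to_traditional` are hypothetical toy examples, not the product's data or API.

```python
# Illustrative only: tiny hand-written tables.
# 发 maps to 發 ("emit, develop") in most words but to 髮 ("hair") in others,
# so the character table holds a default and the word table holds exceptions.
CHAR_MAP = {"头": "頭", "发": "發", "现": "現"}
WORD_MAP = {"头发": "頭髮"}


def to_traditional(tokens):
    """Convert segmented simplified Chinese tokens to traditional Chinese.

    Whole-word entries take priority. Any other token is converted one
    character at a time, and unmapped characters pass through unchanged.
    """
    converted = []
    for token in tokens:
        if token in WORD_MAP:
            converted.append(WORD_MAP[token])
        else:
            converted.append("".join(CHAR_MAP.get(ch, ch) for ch in token))
    return "".join(converted)


print(to_traditional(["发现"]))  # default character mapping applies
print(to_traditional(["头发"]))  # word-level entry overrides the default
```

Without word boundaries, a purely character-by-character converter would have to pick a single target for 发 and would get one of these two words wrong.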