The Unicode Retrieval System Architecture (URSA) is a fully
Unicode-based retrieval engine for UNIX systems. Unicode is
instrumental in tokenizing multilingual texts for retrieval purposes,
and serves as the common intermediate representation for queries and
documents in URSA. In this presentation, I will illustrate the role
of Unicode in the text processing pipeline involved in parsing,
tokenizing and indexing documents. I will show how language-dependent
issues (Chinese segmentation, Korean morphology) interact with
character-set issues (whitespace determination) in a high-performance
indexing and retrieval engine. The engine indexes more than 400 MB of
text per hour and produces indexes of around 20% of the size of the
original text collection. I will also show two live
demonstrations of the URSA engine in use. The first demonstration
will show a visualization system for examining the results of a
retrieval that departs significantly from standard summary-based
approaches to ranked results. The second demonstration will show how
an interactive cross-language or "translingual" retrieval system can
take advantage of Unicode support to help non-bilingual personnel
effectively navigate, query and retrieve documents in foreign
languages. In conclusion, I will describe several related projects
that use URSA libraries to develop multilingual text-processing
applications. These projects extend the architecture beyond simple
document retrieval and demonstrate the versatility of full Unicode
support in a retrieval architecture.
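
As a rough illustration of how whitespace determination and Chinese
segmentation can interact during tokenization, the Python sketch below
splits text on any character Unicode treats as whitespace (including
U+3000 IDEOGRAPHIC SPACE) and emits each Han ideograph as its own
token. This is not URSA's implementation: the names `is_han` and
`tokenize` are hypothetical, punctuation is not handled, and
per-character splitting is only a naive stand-in for a real Chinese
segmenter.

```python
import unicodedata


def is_han(ch):
    """Return True for CJK unified ideographs (hypothetical helper)."""
    # unicodedata.name() raises ValueError for unnamed code points,
    # so pass a default of "" instead.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def tokenize(text):
    """Split text into tokens using Unicode-aware rules (illustrative only)."""
    tokens = []
    current = []

    def flush():
        # Emit the token accumulated so far, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    # Normalize first so composed and decomposed forms index identically.
    for ch in unicodedata.normalize("NFC", text):
        if ch.isspace():
            # str.isspace() covers Unicode whitespace, not just ASCII.
            flush()
        elif is_han(ch):
            # Chinese has no word delimiters; fall back to one token per
            # ideograph (an assumption, not a real segmentation method).
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens


if __name__ == "__main__":
    print(tokenize("Unicode\u3000retrieval 中文 검색"))
```

Korean words are left whole here; handling Korean morphology would
require a language-specific analyzer applied after this step.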