Practical Experiences Using Unicode in Linguistic Databases
Intended Audience: Software Engineers, Systems Analysts, Content Developers
Session Level: Beginner
The IBM Dictionary and Linguistic Tools Group produces linguistic
analysis tools which support over 30 different languages. The previous
version of this product supported a wide variety of code pages for
each of the languages. This presentation will describe the capabilities
and value of the linguistic tools, and then focus on how the dictionaries
were ported from a mixed code page architecture to one that uses Unicode
for all dictionaries.
It was a large project, but it was nevertheless very successful and
relatively easy. The presentation will discuss some of the technical
problems we encountered when moving our dictionaries to Unicode and
what we did to overcome them. For example:
- To avoid an explosion in the size of our dictionaries, we use dynamic
transition table maps (these will be explained in the presentation).
- We simplified our text analysis routines by using International Components
for Unicode (ICU) to find character properties and possible word breaking
points.
- Customers need to support legacy code pages for backward compatibility. We
encourage them to do this via the ICU conversion utilities, so the core
engine knows no character encoding scheme other than UTF-16.
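
As a rough illustration of the last two points, the sketch below converts legacy code-page bytes to Unicode at the boundary and then works only from Unicode character properties. It is not our engine's code and it does not use ICU: Python's built-in codecs and the `unicodedata` module stand in for the ICU converters and property lookups, and the helper names (`decode_legacy`, `to_utf16_units`, `letter_runs`) are hypothetical. The tokenizer is deliberately naive and is not ICU's word-break algorithm.

```python
import unicodedata
from typing import List


def decode_legacy(data: bytes, code_page: str) -> str:
    """Boundary step: turn code-page bytes into Unicode text.

    Hypothetical helper; Python's codecs stand in for an ICU converter.
    """
    return data.decode(code_page)


def to_utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units a UTF-16-only core would receive."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def letter_runs(text: str) -> List[str]:
    """Group consecutive letters and marks using Unicode general categories.

    A naive stand-in for real word breaking: the same code handles Latin
    and Cyrillic because it asks about properties, not about code pages.
    """
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


# Two legacy inputs in different code pages become ordinary Unicode text.
french = decode_legacy(b"caf\xe9", "cp1252")
russian = decode_legacy(b"\xea\xe0\xf4\xe5", "cp1251")
words = letter_runs(french + ", " + russian + "!")
```

Once the text has crossed the boundary, nothing downstream needs to know which code page it came from, which is the property the core engine relies on.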
I will also discuss some of the benefits we gain from using Unicode, for example:
- Dictionary build and maintenance is possible in various locales (i.e. the
build machine no longer has to be in the same locale as the dictionary).
- It is possible to analyse many varied languages with much less
language-specific code.
- Our architecture for dealing with multilingual documents is much simpler.