Twenty-first International Unicode Conference

Practical Experiences Using Unicode in Linguistic Databases

Brian O'Donovan - IBM Corporation

Intended Audience:	Software Engineers, Systems Analysts, Content Developers
Session Level:	Beginner

The IBM Dictionary and Linguistic Tools Group produces linguistic analysis tools which support over 30 different languages. The previous version of this product supported a wide variety of code pages for each of the languages. This presentation will describe the capabilities and value of the linguistic tools, and then focus on how the dictionaries were ported from a mixed code page architecture to one that uses Unicode for all dictionaries.

Although this was a large project, it was very successful and relatively easy. The presentation will discuss some of the technical problems we encountered when moving our dictionaries to Unicode and how we overcame them. For example:

To avoid an explosion in the size of our dictionaries, we use dynamic transition table maps (this will be explained in the presentation).
We simplified our text analysis routines by using International Components for Unicode (ICU) to find character properties and possible word breaking points.
Customers need to support legacy code pages for backward compatibility. We encourage them to do this via the ICU conversion utilities, so the core engine knows no character encoding scheme other than UTF-16.
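
The last two points can be illustrated with a minimal sketch. It is not the IBM engine and it does not use ICU: Python's standard codecs and the unicodedata module stand in for the ICU converters and character property lookups, and the function names to_core and split_words are hypothetical. The word splitter is deliberately naive (a run of letters and combining marks counts as a word) and is far simpler than real ICU word-break analysis.

```python
import unicodedata
from typing import List


def to_core(data: bytes, legacy_codepage: str) -> bytes:
    """Convert legacy-code-page bytes to UTF-16 at the boundary.

    Stand-in for an ICU converter: the caller names the code page,
    and everything past this point sees only UTF-16.
    """
    text = data.decode(legacy_codepage)
    return text.encode("utf-16-le")


def split_words(utf16: bytes) -> List[str]:
    """Naive word splitter driven only by Unicode character properties.

    No per-language or per-code-page tables are needed: the general
    category of each character decides whether it continues a word.
    """
    text = utf16.decode("utf-16-le")
    words: List[str] = []
    current: List[str] = []
    for ch in text:
        # Categories starting with L (letter) or M (mark) continue a word.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Greek and Russian input arrive in different legacy code pages,
# but the same analysis code handles both once they are UTF-16.
greek = "Καλημέρα κόσμε".encode("cp1253")
russian = "Привет, мир".encode("cp1251")
print(split_words(to_core(greek, "cp1253")))
print(split_words(to_core(russian, "cp1251")))
```

The design point is that the code page name appears only in the call to to_core; the analysis routine itself never needs to know which encoding the customer started with.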

I will also discuss some of the benefits we gain from using Unicode, for example:

Dictionary build and maintenance is possible in various locales (i.e. the build machine no longer has to be in the same locale as the dictionary).
It is possible to analyse many varied languages with much less language-specific code.
Our architecture for dealing with multi-lingual documents is much simpler.

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

11 January 2002, Webmaster