Unicode in Natural Language Processing Applications
Thomas Emerson - Basis Technology Corporation
Intended Audience: Managers, Software Engineers, Marketers, Content Developers
Session Level: Intermediate
Traditionally, natural language processing (NLP) applications have been
written to solve a single problem in a single language. In the last
several years, however, it has become more common to see NLP frameworks
developed for applications in several languages. Nevertheless, these
applications are often limited to handling languages that share a
common script (e.g., Western European languages alone) or a common
encoding scheme (e.g., ISO 8859-n).
This talk outlines the benefits of Unicode when writing natural
language processing applications that must target multiple
languages. This is of particular interest to the members of the
European Union: with the potential doubling in size of the EU over the
next year, the Union will have over 20 official languages in
multiple scripts and encodings.
Unifying on a single character representation, especially one with
the extended character semantics defined in Unicode, makes implementing
NLP applications significantly easier. This talk will show how
Basis Technology was able to extend linguistic technology developed
for Chinese and Japanese to several European languages.
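
As a rough illustration of what Unicode character semantics offer, the sketch below splits text into runs of letters using only Unicode general categories, so the same code handles Latin, Cyrillic, and CJK input without per-encoding tables. It is a minimal example using Python's standard unicodedata module, not a description of Basis Technology's implementation; the function name letter_runs is hypothetical.

```python
import unicodedata
from typing import List


def letter_runs(text: str) -> List[str]:
    """Split text into maximal runs of letters and combining marks.

    Hypothetical helper for illustration: it relies only on Unicode
    general categories, so no script-specific tables are needed.
    """
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        # Categories beginning with "L" are letters; "M" are combining marks.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


print(letter_runs("Grüße, мир! 日本語"))
```

Note that a run of letters is not necessarily a word: Chinese and Japanese are written without spaces, so real word segmentation for those languages needs linguistic analysis beyond character properties.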