Internationalization

Digitally Disadvantaged

Beyond NFC with Nisaba library

on Tue, 15:05 in Atomic Clock for 40 min

Brahmic scripts, which are widely used in South and Southeast Asia, allow many strings to be encoded in more than one way, and these alternative encodings are not treated as equivalent by any formal mechanism in the Unicode Standard, such as NFC. As a result, two strings in the same script that appear identical to end users may have different, non-equivalent encodings, which causes a variety of downstream issues.
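As a small illustration of the problem, the sketch below uses only Python's standard `unicodedata` module (not Nisaba). It compares two encodings of the Devanagari vowel आ: the single letter U+0906, and the sequence U+0905 + U+093E, which the Unicode Standard advises against and which may render identically depending on the font. The variable names are chosen for this example only. A Latin pair that NFC does unify is included for contrast.

```python
import unicodedata

def nfc(text):
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)

# Latin contrast case: 'e' + combining acute is canonically
# equivalent to the precomposed letter, so NFC unifies the two.
latin_decomposed = "e\u0301"
latin_precomposed = "\u00e9"
latin_unified = nfc(latin_decomposed) == nfc(latin_precomposed)

# Devanagari case: LETTER AA (U+0906) versus
# LETTER A (U+0905) + VOWEL SIGN AA (U+093E).
# U+0906 has no canonical decomposition, so NFC leaves
# the two encodings distinct.
deva_letter = "\u0906"
deva_sequence = "\u0905\u093e"
deva_unified = nfc(deva_letter) == nfc(deva_sequence)

print("Latin pair equal after NFC:", latin_unified)
print("Devanagari pair equal after NFC:", deva_unified)
```

Any comparison, search, or deduplication step that relies on NFC alone would therefore treat the two Devanagari strings as different.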

Nisaba is a collection of open-source libraries in Python and C++, based on finite-state transducers (FSTs), for normalizing such visually equivalent strings in various scripts from South Asia and beyond. It also provides APIs for well-formedness checking and fully reversible romanization.
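To give a rough idea of what normalizing visually equivalent strings means, here is a toy sketch. It is not Nisaba's API and does not use FSTs: the rewrite table `VISUAL_REWRITES` and the function `toy_visual_normalize` are hypothetical names invented for this example, and the table contains a single Devanagari rule. It applies NFC first and then rewrites a non-preferred sequence to its preferred form.

```python
import unicodedata

# Hypothetical rewrite table for illustration only:
# (non-preferred sequence, preferred form).
VISUAL_REWRITES = [
    ("\u0905\u093e", "\u0906"),  # LETTER A + VOWEL SIGN AA -> LETTER AA
]

def toy_visual_normalize(text):
    """Apply NFC, then rewrite each listed sequence to its preferred form."""
    text = unicodedata.normalize("NFC", text)
    for source, target in VISUAL_REWRITES:
        text = text.replace(source, target)
    return text

# Two encodings of the same word, differing only in the first vowel.
variant_a = "\u0906\u092e"         # LETTER AA + LETTER MA
variant_b = "\u0905\u093e\u092e"   # LETTER A + VOWEL SIGN AA + LETTER MA

same_after_toy_norm = (
    toy_visual_normalize(variant_a) == toy_visual_normalize(variant_b)
)
print("Equal after toy normalization:", same_after_toy_norm)
```

A real library has to handle many such rules per script, along with their interactions and context, which is one reason a finite-state formulation is attractive.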

The main objective of this presentation is to raise awareness of the Nisaba library, which can be a valuable resource for i18n and NLP projects dealing with languages from South Asia.
