UTW 2023

Internationalization
Digitally Disadvantaged

Beyond NFC with Nisaba library

Cibu Johny

on Tue, 15:05 in Atomic Clock for 40min

Brahmic scripts, which are widely used in South and Southeast Asia, contain a large number of strings that can be encoded in more than one way and that no formal mechanism in the Unicode standard, such as NFC, treats as equivalent. This leads to a variety of downstream issues, because two strings in the same script that appear identical to end users may in fact have different, non-equivalent encoded representations.
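The gap can be illustrated with a short Python sketch that uses only the standard `unicodedata` module. The helper name `nfc_equal` and the three sample pairs are chosen here for illustration and are not taken from the talk. Whether a given pair actually renders identically depends on the font and the shaping engine.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Devanagari QA: precomposed letter vs. KA + NUKTA.
# These are canonically equivalent, so NFC maps both to the same string.
qa_precomposed = "\u0958"
qa_sequence = "\u0915\u093C"

# Devanagari AA: independent vowel vs. letter A + vowel sign AA.
# They typically look alike, but Unicode defines no equivalence between them.
aa_letter = "\u0906"
aa_sequence = "\u0905\u093E"

# Malayalam chillu N: atomic character vs. NA + VIRAMA + ZERO WIDTH JOINER.
# Again visually alike in many fonts, and again not unified by NFC.
chillu_atomic = "\u0D7B"
chillu_sequence = "\u0D28\u0D4D\u200D"

for label, left, right in [
    ("Devanagari QA", qa_precomposed, qa_sequence),
    ("Devanagari AA", aa_letter, aa_sequence),
    ("Malayalam chillu N", chillu_atomic, chillu_sequence),
]:
    print(f"{label}: equal after NFC = {nfc_equal(left, right)}")
```

Only the first pair compares equal after NFC. The other two remain distinct code point sequences, which is the kind of case that a comparison, search, or deduplication step relying on NFC alone would miss.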

Nisaba is a collection of open-source libraries in Python and C++, based on finite-state transducers (FSTs), for normalizing such visually equivalent strings in various scripts from South Asia and beyond. It also provides APIs for well-formedness checking and fully reversible romanization.
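To make the idea of visual normalization concrete, here is a toy sketch in plain Python. It is not Nisaba's API and does not use FSTs: the table `VISUAL_REWRITES`, the function `visual_normalize`, and the choice of target forms are all hypothetical, and a real grammar would cover far more sequences and would condition rewrites on context.

```python
import unicodedata

# Hypothetical rewrite table, for illustration only (not Nisaba's data).
# Each key is a sequence that typically looks like its value when rendered.
VISUAL_REWRITES = {
    "\u0905\u093E": "\u0906",        # Devanagari A + vowel sign AA -> letter AA
    "\u0D28\u0D4D\u200D": "\u0D7B",  # Malayalam NA + VIRAMA + ZWJ -> chillu N
}


def visual_normalize(text: str) -> str:
    """Apply NFC, then collapse the listed look-alike sequences."""
    text = unicodedata.normalize("NFC", text)
    # Replace longer sequences first so a shorter key cannot split a longer one.
    for source in sorted(VISUAL_REWRITES, key=len, reverse=True):
        text = text.replace(source, VISUAL_REWRITES[source])
    return text


# Two encodings of the same visible Devanagari word now compare equal.
first = "\u0905\u093E\u092E"   # A + vowel sign AA + MA
second = "\u0906\u092E"        # AA + MA
print(visual_normalize(first) == visual_normalize(second))
```

A plain string replacement like this ignores context, so it can only stand in for the simplest rewrites. Expressing the rules as transducers allows them to be composed and restricted to the contexts where the two forms really are visually equivalent.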

The main objective of this presentation is to raise awareness of the Nisaba library, which can be a valuable resource for i18n and NLP projects dealing with languages from South Asia.
