Beyond NFC with Nisaba library
Brahmic scripts, which are widely used in South and Southeast Asia, allow many strings to be encoded in more than one way, and the Unicode Standard provides no formal mechanism, such as NFC, that treats these alternative encodings as equivalent. This leads to a variety of downstream issues: two strings in the same script that appear identical to end users may in fact have different, non-equivalent encoded representations.
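The gap can be illustrated with a short Python sketch that uses only the standard unicodedata module, not Nisaba itself. It assumes one well-known Devanagari case: the letter AA (U+0906) versus the sequence letter A plus vowel sign AA (U+0905 U+093E), which typically render alike but have no canonical equivalence. For contrast it also checks the letter NNNA (U+0929), which is canonically equivalent to NA plus nukta (U+0928 U+093C). The helper name nfc_equal is hypothetical and chosen only for this illustration.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two encodings that typically look the same on screen but are NOT
# canonically equivalent, so NFC leaves them distinct.
aa_single = "\u0906"          # DEVANAGARI LETTER AA
aa_sequence = "\u0905\u093E"  # DEVANAGARI LETTER A + DEVANAGARI VOWEL SIGN AA

# Two encodings that ARE canonically equivalent, so NFC unifies them.
nnna_single = "\u0929"          # DEVANAGARI LETTER NNNA
nnna_sequence = "\u0928\u093C"  # DEVANAGARI LETTER NA + DEVANAGARI SIGN NUKTA

if __name__ == "__main__":
    print("AA variants equal under NFC:  ", nfc_equal(aa_single, aa_sequence))
    print("NNNA variants equal under NFC:", nfc_equal(nnna_single, nnna_sequence))
```

Cases like the first pair are the ones that require normalization beyond NFC.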
Nisaba is a collection of open-source, finite-state transducer (FST) based libraries in Python and C++ for normalizing such visually equivalent strings in various scripts from South Asia and beyond. It also provides APIs for well-formedness checking and fully reversible romanization.
The main objective of this presentation is to raise awareness of the Nisaba library, which can be a valuable resource for i18n and NLP projects dealing with languages from South Asia.