Algorithms for Unicode script detection


From: Simon Cozens via Unicode <unicode_at_unicode.org>
Date: Thu, 6 Jul 2017 09:43:29 +1000

I want to segment Unicode text into runs by script. I've looked
through UAX #24 hoping to find a standard algorithm for this, but none
is specified. The implementation section gives some good pointers on
what to be careful with (paired punctuation, etc.), but I can't find a
step-by-step algorithm comparable to the bidi algorithm or the
collation algorithm.

Equally, I don't see anything in ICU that segments text into
script-based runs. You can get each character's script property, but
that doesn't help you resolve Common characters in the context of a
run.
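
To make the problem concrete, here is a minimal Python sketch of the kind
of segmentation I mean. It is not a standard algorithm and is not taken
from UAX #24 or ICU. Python's unicodedata module has no Script property,
so rough_script() is a hypothetical stand-in that guesses a script from
the first word of the character name; a real implementation would read
Scripts.txt from the UCD instead. Common and Inherited characters simply
join the run that is already open, and paired punctuation gets no special
treatment.

```python
import unicodedata


def rough_script(ch):
    """Crude stand-in for the Script property (assumption, not UCD data)."""
    cat = unicodedata.category(ch)
    if cat.startswith("M"):
        return "Inherited"  # combining marks take the script of their base
    if not cat.startswith("L"):
        return "Common"  # digits, punctuation, spaces, symbols
    name = unicodedata.name(ch, "")
    # Heuristic: "GREEK SMALL LETTER ALPHA" -> "Greek"
    return name.split(" ")[0].capitalize() if name else "Common"


def script_runs(text):
    """Split text into (script, substring) runs.

    Common/Inherited characters never start a new run: they attach to the
    run that is open, and leading ones attach to the first real script.
    """
    runs = []
    current = None  # script of the open run; None until a real script is seen
    start = 0
    for i, ch in enumerate(text):
        script = rough_script(ch)
        if script in ("Common", "Inherited"):
            continue
        if current is None:
            current = script
        elif script != current:
            runs.append((current, text[start:i]))
            current, start = script, i
    if text:
        runs.append((current or "Common", text[start:]))
    return runs


print(script_runs("abc, αβγ!"))
```

Note that the comma and space end up in the Latin run only because of
the "attach to the preceding run" rule. Whether that is the right
resolution is exactly the part I can't find specified anywhere.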

Does anyone know of an open-source algorithm for doing this?
Received on Wed Jul 05 2017 - 18:43:59 CDT
