Mixed-Script confusables in prog.languages
reini at cpanel.net
Sun Dec 4 05:09:36 CST 2016
I’m working on adding mixed-script confusable protection to a programming language,
cperl (a perl5 fork), for security reasons. The protection applies to its identifiers,
i.e. variable names, package names, function names, literals.
This is a bit different from the typical use cases of libidna, in email or browsers.
Is anybody aware of any other language implementation that does confusable or mixed-script protection?
I think R has something, because it has this header:
but I found nothing else, which is quite annoying.
My approach is as follows:
* Normalize identifiers (NFC) and only store the normalized variants. This should catch bidi spoofs, combining characters and such.
* Check each Unicode code point for its Script property. Besides Latin, Common and Inherited,
only the first script encountered is allowed; any other mixed script is an error. Additional scripts need to be declared,
in Perl like this:
use utf8 'Greek', 'Cyrillic';
utf8 is the pragma that allows Unicode identifiers, not strings, to be added to the symbol table.
Obviously declaring additional scripts has risks when reviewing a codebase, and confusable identifiers might even bypass test suites.
The check is fast enough, and has no measurable cost in the parser.
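
The two steps above can be sketched roughly as follows. This is Python rather than cperl's actual C parser code, and it is only an illustration under loud assumptions: `script_of` guesses the script from the first word of the character name, which is a crude stand-in for the real Unicode Script property, and the check is done per identifier, whereas a real implementation might track the first script per file or compilation unit. The names `check_identifier`, `script_of` and `declared` are hypothetical.

```python
import unicodedata

# ASSUMPTION: crude stand-in for the Unicode Script property, based on the
# first word of the character name. A real implementation would use the
# Scripts.txt data from the Unicode Character Database.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "GREEK": "Greek",
    "CYRILLIC": "Cyrillic",
}

# Scripts that never count as "mixing".
ALWAYS_ALLOWED = {"Latin", "Common", "Inherited"}


def script_of(ch):
    """Approximate the script of a single character (illustration only)."""
    name = unicodedata.name(ch, "")
    if name.startswith("COMBINING"):
        return "Inherited"
    prefix = name.split(" ", 1)[0]
    # Anything not in the small table is treated as Common here.
    return NAME_PREFIX_TO_SCRIPT.get(prefix, "Common")


def check_identifier(ident, declared=()):
    """Return the NFC form of ident; raise ValueError on an undeclared script mix."""
    ident = unicodedata.normalize("NFC", ident)
    allowed = ALWAYS_ALLOWED | set(declared)
    first_script = None
    for ch in ident:
        script = script_of(ch)
        if script in allowed:
            continue
        if first_script is None:
            # The first other script seen is accepted implicitly.
            first_script = script
        elif script != first_script:
            raise ValueError(
                f"mixed scripts {first_script} and {script} in identifier {ident!r}"
            )
    return ident
```

With this sketch, an identifier mixing Greek and Cyrillic letters is rejected unless one of the two is passed in `declared`, which plays the role of the script list given to the pragma.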
Unicode has a nice security/confusables.txt table which could be used for more fine-grained checks.
But I fear this is too much overhead for the generic parser, and I think that avoiding the
problem by forbidding mixed scripts, or requiring them to be declared, is much easier and more declarative.
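
For comparison, a fine-grained check based on the confusables table might look roughly like the sketch below. It follows the general "skeleton" idea from UTS #39 (decompose, map each character to its prototype, decompose again) but is not a complete implementation. The tiny `SAMPLE_CONFUSABLES` table is a hand-picked assumption standing in for the full data file, and the function names are hypothetical.

```python
import unicodedata

# ASSUMPTION: a handful of sample mappings standing in for the full
# confusables table, which a real implementation would parse from the file.
SAMPLE_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(ident, table=SAMPLE_CONFUSABLES):
    """Map an identifier to a simplified skeleton for comparison."""
    decomposed = unicodedata.normalize("NFD", ident)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def is_confusable(a, b):
    """Two identifiers are treated as confusable if their skeletons match."""
    return skeleton(a) == skeleton(b)
```

Every new identifier would have to be compared by skeleton against the existing ones, with the full table loaded, which is the kind of overhead the script-based check avoids.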
Of course there are several languages which require more than one script, like
Japanese = Hiragana + Katakana and maybe Han,
Korean = Hangul + Han, …
or African languages, as some have other than Latin roots, e.g. Ethiopian from Semitic.
Indian languages also sound problematic, as do all the Old_<script> scripts.
For these I just add aliases to allow multiple scripts.
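
Such aliases could be sketched as a simple expansion table. The alias names and the `expand_declared` helper below are assumptions for illustration (Python, not cperl's actual implementation); only the script groupings for Japanese and Korean come from the list above.

```python
# ASSUMPTION: alias names are illustrative; the groupings follow the
# Japanese and Korean examples given above.
SCRIPT_ALIASES = {
    "Japanese": {"Hiragana", "Katakana", "Han"},
    "Korean": {"Hangul", "Han"},
}


def expand_declared(names):
    """Expand declared names into the set of individual scripts they allow."""
    expanded = set()
    for name in names:
        # A name without an alias entry is taken to be a plain script name.
        expanded |= SCRIPT_ALIASES.get(name, {name})
    return expanded
```

Declaring "Japanese" would then allow Hiragana, Katakana and Han together, while a plain script name such as "Greek" passes through unchanged.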
rurban at cpanel.net