Re: Mixed-Script confusables in prog.languages from Richard Wordingham on 2016-12-04 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 4 Dec 2016 22:45:58 +0000

On Sun, 4 Dec 2016 12:09:36 +0100
Reini Urban <reini_at_cpanel.net> wrote:

> * normalize identifiers (NFC) and only store normalized variants.
> this should catch bidi spoofs, combining characters and such.

That doesn't catch bidi spoofs. NFC leaves bidi control characters
such as U+202E RIGHT-TO-LEFT OVERRIDE untouched.
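
A minimal sketch of the point, in Python with only the standard
unicodedata module. The sample strings and the helper name
has_bidi_control are made up for illustration; the set below lists the
characters with the Bidi_Control property as I understand it. NFC
composes the combining sequence but passes the override through, so a
separate check would be needed.

```python
import unicodedata

# Characters with the Bidi_Control property (marks, embeddings,
# overrides, isolates). Hypothetical helper data for this example.
BIDI_CONTROLS = {
    "\u061c",                                          # ALM
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def has_bidi_control(identifier: str) -> bool:
    """Return True if the identifier contains any bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in identifier)

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
spoof = "ab\u202ecd"     # contains RIGHT-TO-LEFT OVERRIDE

nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_spoof = unicodedata.normalize("NFC", spoof)

# NFC folds the combining sequence into one code point ...
print(len(decomposed), len(nfc_decomposed))
# ... but the override character is still present after normalization.
print(has_bidi_control(nfc_spoof))
```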

> * check each unicode code point for its Script property and besides
> Latin, Common and Inherited only allow the first script, but error on
> any other mixed script. Additional scripts need to be declared.
> https://github.com/perl11/cperl/issues/229
>
> in perl like this:
> use utf8 'Greek', 'Cyrillic';

Your rule isn't clear. Would an identifier like ψ_S be automatically
allowed?
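
To make the question concrete, here is a sketch of one possible
literal reading of the quoted rule, in Python. It is not cperl's
implementation. Python's standard library does not expose the Script
property, so script_of below is a rough stand-in that guesses the
script from the first word of the character name and only knows
Latin, Greek and Cyrillic; the function names and the declared
parameter are hypothetical. Under this reading ψ_S passes, because ψ
becomes the "first script" and S is Latin, which is always allowed.

```python
import unicodedata

KNOWN_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")

def script_of(ch: str) -> str:
    """Crude approximation of the Script property (assumption, not UCD data)."""
    if unicodedata.category(ch).startswith("M"):
        return "Inherited"          # treat all combining marks as Inherited
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word.capitalize() if first_word in KNOWN_SCRIPTS else "Common"

def allowed_by_first_script_rule(identifier: str, declared=frozenset()) -> bool:
    """Latin, Common, Inherited and declared scripts are always allowed;
    beyond those, only the first other script seen may appear."""
    always_allowed = {"Latin", "Common", "Inherited"} | set(declared)
    first_script = None
    for ch in identifier:
        script = script_of(ch)
        if script in always_allowed:
            continue
        if first_script is None:
            first_script = script   # remember the first extra script
        elif script != first_script:
            return False            # a second undeclared script: reject
    return True

print(allowed_by_first_script_rule("\u03c8_S"))            # psi, low line, S
print(allowed_by_first_script_rule("\u03c8_\u042f"))       # psi, low line, Cyrillic Ya
print(allowed_by_first_script_rule("\u03c8_\u042f", {"Cyrillic"}))
```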

I presume you're handling the spoofing of the SMALL PHI characters by
other means.

For multilingual support, you would want rules more like

'After script X, allow script Y'.
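
Such a rule could be sketched as a table of permitted script
transitions. The Python below is only an illustration under stated
assumptions: script_of is a crude stand-in for the real Script
property (it guesses from the character name and only knows Latin,
Greek and Cyrillic), "after" is taken to mean the immediately
preceding non-Common script, and the names follows_script_order and
RULES are hypothetical.

```python
import unicodedata

KNOWN_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")

def script_of(ch: str) -> str:
    """Crude approximation of the Script property (assumption, not UCD data)."""
    if unicodedata.category(ch).startswith("M"):
        return "Inherited"
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word.capitalize() if first_word in KNOWN_SCRIPTS else "Common"

def follows_script_order(identifier: str, allowed_after) -> bool:
    """allowed_after is a set of (X, Y) pairs: after script X, script Y may follow.
    Common and Inherited characters never trigger a transition."""
    current = None
    for ch in identifier:
        script = script_of(ch)
        if script in ("Common", "Inherited"):
            continue
        if current is not None and script != current:
            if (current, script) not in allowed_after:
                return False        # this script change was not permitted
        current = script
    return True

# Example policy: Latin may follow Greek, but not the other way round.
RULES = {("Greek", "Latin")}

print(follows_script_order("\u03c8_S", RULES))   # Greek then Latin
print(follows_script_order("S_\u03c8", RULES))   # Latin then Greek
```

With a table like this, the direction of the mix matters, which a
plain "first script plus Latin" rule does not express.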

> Of course there exist several languages which require more than one
> script,
<snip>
> or African languages as some have other than Latin roots, e.g.
> Ethiopian from Semitic.

I don't see your problem here. What problem do you see with Amharic?

> Indian languages also sound problematic,

Is this the ZWJ/ZWNJ issue? That surely is a problem within a script.

> and
> all the Old_<script>

Now I am confused. What problem do you see that you don't have in the
Latin script?

Richard.
Received on Sun Dec 04 2016 - 16:46:32 CST
