Mixed-Script confusables in prog.languages
Richard Wordingham
richard.wordingham at ntlworld.com
Thu Dec 15 14:29:26 CST 2016
On Wed, 14 Dec 2016 18:44:39 +0100
Reini Urban <reini at cpanel.net> wrote:
> On Dec 5, 2016, at 3:31 PM, Richard Wordingham
> <richard.wordingham at ntlworld.com> wrote:
> > The choice with PHI includes:
> >
> > U+0278 LATIN SMALL LETTER PHI
> > U+03C6 GREEK SMALL LETTER PHI
> >
> > a Greek (!) script character with compatibility decomposition to
> > U+03C6
> >
> > U+03D5 GREEK PHI SYMBOL
> >
> > and a whole host of common script characters with compatibility
> > decomposition to U+03C6:
> >
> > U+1D6D7 MATHEMATICAL BOLD SMALL PHI
> > U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
> > U+1D711 MATHEMATICAL ITALIC SMALL PHI
> > U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
> > U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
> > U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
> > U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
> > U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
> > U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
> > U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL
> >
> > They are all ID_Start.
>
> Oh my. Dragons beware. So I need to add some trie tables to add
> warnings with those rules also. I don’t want to error on some obscure
> confusables rule only yet. perl doesn’t even ship the security
> tables, so people are not aware of it.
Another solution would be to treat two identifiers as the same if they
have the same NFKC normalisation.
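
A minimal sketch of that comparison in Python, using the standard
unicodedata module. The helper name same_identifier and the constant
names are made up for illustration, and the results depend on the
Unicode version the interpreter ships with:

```python
import unicodedata

def same_identifier(a: str, b: str) -> bool:
    """Treat two identifiers as equal when their NFKC forms match."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# A few of the PHI characters listed above (constant names are illustrative).
GREEK_SMALL_PHI = "\u03c6"
GREEK_PHI_SYMBOL = "\u03d5"
MATH_BOLD_SMALL_PHI = "\U0001d6d7"
MATH_BOLD_PHI_SYMBOL = "\U0001d6df"
LATIN_SMALL_PHI = "\u0278"

# Characters with a compatibility decomposition to U+03C6 fold together.
print(same_identifier(GREEK_PHI_SYMBOL, GREEK_SMALL_PHI))
print(same_identifier(MATH_BOLD_SMALL_PHI, GREEK_SMALL_PHI))
# U+0278 has no decomposition, so NFKC alone does not catch it.
print(same_identifier(LATIN_SMALL_PHI, GREEK_SMALL_PHI))
```

Note that this only merges the characters that decompose to U+03C6.
U+0278 LATIN SMALL LETTER PHI has no decomposition, so it would still
need a confusables check.
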
> > You didn't mention the inherited script. Is that automatically
> > allowed, e.g. φ̈ᵣ <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
> > SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)? I
> > encountered that variable name in a radar specification last week.
>
> Inherited is allowed with ID_Continue, yes. Not in ID_Start position.
> Combiners are normalized to NFC.
<U+03C6, U+0308, U+1D63> is unchanged under normalisation to NFC and
NFD. NFKC and NFKD change only the last character, replacing U+1D63 by
U+0072 LATIN SMALL LETTER R.
> > There might be issues - it's possible that क̐ <U+0915 DEVANAGARI
> > LETTER KA, U+0310 COMBINING CANDRABINDU> might spoof कँ <U+0915,
> > U+0901 DEVANAGARI SIGN CANDRABINDU>.
> \x{915}\x{310} is legal Devanagari normalized to one char.
I don't know what you mean by this statement. <U+0915, U+0310> is
unchanged under all 4 normalisations.
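
This can be checked mechanically. A small Python sketch, assuming the
interpreter's unicodedata tables are reasonably current (is_stable and
the constant names are hypothetical, chosen for illustration):

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_stable(s: str) -> bool:
    """Return True if s is left unchanged by all four normalisation forms."""
    return all(unicodedata.normalize(form, s) == s for form in FORMS)

# KA followed by U+0310 COMBINING CANDRABINDU
KA_COMBINING_CANDRABINDU = "\u0915\u0310"
# KA followed by U+0901 DEVANAGARI SIGN CANDRABINDU
KA_DEVANAGARI_CANDRABINDU = "\u0915\u0901"

print(is_stable(KA_COMBINING_CANDRABINDU))
print(is_stable(KA_DEVANAGARI_CANDRABINDU))
```

If both sequences are stable, no normalisation form maps one onto the
other, so comparing normalised identifiers will not detect this spoof.
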
> \x{915}\x{901} are two legal Devanagari characters.
> but they are confusables. This would need special confusable rules.
Additionally, U+0310 can be confused quite readily with the sequence
<U+0306 COMBINING BREVE, U+0307 COMBINING DOT ABOVE>.
Richard.