Re: Mixed-Script confusables in prog.languages from Reini Urban on 2016-12-14 (Unicode Mail List Archive)

From: Reini Urban <reini_at_cpanel.net>
Date: Wed, 14 Dec 2016 18:44:39 +0100

> On Dec 5, 2016, at 3:31 PM, Richard Wordingham <richard.wordingham_at_ntlworld.com> wrote:
>
> On Mon, 5 Dec 2016 09:31:11 +0100
> Reini Urban <reini_at_cpanel.net> wrote:
>
>>> On Dec 4, 2016, at 11:45 PM, Richard Wordingham
>>> <richard.wordingham_at_ntlworld.com> wrote:
>>>
>>> On Sun, 4 Dec 2016 12:09:36 +0100
>>> Reini Urban <reini_at_cpanel.net> wrote:
>>>
>>>> * normalize identifiers (NFC) and only store normalized variants.
>>>> this should catch bidi spoofs, combining characters and such.
>>>
>>> That doesn't catch bidi spoofs.
>>
>> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
>>
>> i.e. ‮goog‬le <U+202E (right-to-left override), g, o, o, g, U+202C
>> (pop directional formatting), l, e> is already caught as illegal.
>>
>> Mixing RTL scripts, such as Arabic with Latin is not caught with the
>> mixed-script rule per se.
>>
>>>> * check each unicode code point for its Script property and besides
>>>> Latin, Common and Inherited only allow the first script, but error
>>>> on any other mixed script. Additional scripts need to be declared.
>>>> https://github.com/perl11/cperl/issues/229
>>>>
>>>> in perl like this:
>>>> use utf8 'Greek', 'Cyrillic';
>>>
>>> Your rule isn't clear. Would an identifier like ψ_S be
>>> automatically allowed?
>>
>> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common
>> are always allowed, the only new script is Greek. The first
>> non-default script is automatically and silently allowed, only a mix
>> with another non-default script, such as Cyrillic would error or need
>> an explicit declaration.
>>
>> So ψ_S alone is fine, if everything else is Greek.
>> But mixing with the Cyrillic version would lead to an error.
>>
>>> I presume you're handling the spoofing of the SMALL PHI characters
>>> by other means.
>>
>> The spoof attempt would be ѱ_S with Cyrillic U+0471, Common, Latin.
>> That mixes two scripts, which is illegal if undeclared.
>> Same with PHI, which exists as Greek or Cyrillic. Most Greek
>> characters have confusable Cyrillic counterparts; that’s why a
>> declaration of use utf8 'Greek', 'Cyrillic'; i.e. mixing those two,
>> sounds highly dangerous. With the UCD confusables table this would be
>> an error. Under my rule it is not, since the user declared those two
>> scripts to be mixed.
>
> The choice with PHI includes:
>
> U+0278 LATIN SMALL LETTER PHI
> U+03C6 GREEK SMALL LETTER PHI
>
> a Greek (!) script character with compatibility decomposition to U+03C6
>
> U+03D5 GREEK PHI SYMBOL
>
> and a whole host of common script characters with compatibility
> decomposition to U+03C6:
>
> U+1D6D7 MATHEMATICAL BOLD SMALL PHI
> U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
> U+1D711 MATHEMATICAL ITALIC SMALL PHI
> U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
> U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
> U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
> U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
> U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
> U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
> U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL
>
> They are all ID_Start.

Oh my. Dragons beware. So I also need to add some trie tables to warn about those cases.
I don’t want to raise an error based only on some obscure confusables rule yet.
perl doesn’t even ship the security tables, so people are not aware of them.
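
To make the PHI list above concrete, here is a small sketch in Python (not perl; chosen only because its standard unicodedata module is handy) that shows what NFC and NFKC do to a few of the listed characters. The helper names fold and phi_report are made up for this illustration, and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

# A few of the PHI look-alikes from the list quoted above.
PHI_VARIANTS = [
    "\u0278",      # LATIN SMALL LETTER PHI
    "\u03c6",      # GREEK SMALL LETTER PHI
    "\u03d5",      # GREEK PHI SYMBOL
    "\U0001d6d7",  # MATHEMATICAL BOLD SMALL PHI
    "\U0001d6df",  # MATHEMATICAL BOLD PHI SYMBOL
]

def fold(ch, form):
    """Normalize ch with the given form and return U+XXXX strings."""
    return ["U+%04X" % ord(c) for c in unicodedata.normalize(form, ch)]

def phi_report():
    """Map each variant to its (NFC, NFKC) normalization results."""
    return {"U+%04X" % ord(ch): (fold(ch, "NFC"), fold(ch, "NFKC"))
            for ch in PHI_VARIANTS}

for cp, (nfc, nfkc) in phi_report().items():
    print(cp, "NFC:", nfc, "NFKC:", nfkc)
```

NFC leaves every variant untouched, while NFKC folds the Greek symbol and the mathematical forms onto U+03C6. U+0278 has no decomposition, so neither form touches it; that one can only be handled through a confusables table.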

> You didn't mention the inherited script. Is that automatically
> allowed, e.g. φ̈ᵣ <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
> SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)? I
> encountered that variable name in a radar specification last week.

Inherited is allowed with ID_Continue, yes. Not in ID_Start position.
Combiners are normalized to NFC.
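
The mixed-script rule discussed in this thread could be sketched roughly as follows. This is Python for illustration only, not the cperl implementation. Python's standard library has no Script property, so guess_script below is a crude stand-in that guesses from the character name; a real implementation would use the UCD Scripts data. ScriptChecker, guess_script and accepts are hypothetical names, str.isidentifier() (XID_Start/XID_Continue) stands in for the IDStart/IDContinue rule, and how the silently allowed first script interacts with explicit declarations is my guess.

```python
import unicodedata

# Scripts recognised by the name-based guess; everything else is Common.
KNOWN_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC", "DEVANAGARI", "HANGUL",
                 "HIRAGANA", "KATAKANA", "ETHIOPIC", "ARABIC", "HEBREW"}
ALWAYS_ALLOWED = {"LATIN", "COMMON", "INHERITED"}

def guess_script(ch):
    """Approximate the Script property from the character name (assumption)."""
    name = unicodedata.name(ch, "")
    first = name.split(" ", 1)[0]
    if first in KNOWN_SCRIPTS:
        return first
    if first == "COMBINING":
        return "INHERITED"
    return "COMMON"

class ScriptChecker:
    """Tracks scripts seen so far in one compilation unit."""

    def __init__(self, declared=()):
        self.allowed = set(ALWAYS_ALLOWED) | {s.upper() for s in declared}
        self.first_script = None  # first non-default script, allowed silently

    def check(self, ident):
        ident = unicodedata.normalize("NFC", ident)
        if not ident.isidentifier():
            raise ValueError("not a valid identifier")
        for ch in ident:
            script = guess_script(ch)
            if script in self.allowed:
                continue
            if self.first_script is None:
                # The first non-default script is accepted without declaration.
                self.first_script = script
                self.allowed.add(script)
                continue
            raise ValueError("undeclared mixed script: " + script.title())
        return ident

def accepts(checker, ident):
    """Convenience wrapper: True if check() raises no error."""
    try:
        checker.check(ident)
        return True
    except ValueError:
        return False
```

With a fresh checker, accepts(c, "\u03c8_S") (Greek psi) succeeds and a later accepts(c, "\u0471_S") (Cyrillic psi) fails; with ScriptChecker(declared=("Greek", "Cyrillic")) both pass. The bidi example "\u202egoog\u202cle" and an identifier starting with a combining mark are rejected by the identifier check before any script is looked at.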

> There might be issues - it's possible that क̐ <U+0915 DEVANAGARI LETTER
> KA, U+0310 COMBINING CANDRABINDU> might spoof कँ <U+0915, U+0901
> DEVANAGARI SIGN CANDRABINDU>.

Good test case:

\x{915}\x{310} is legal: a Devanagari letter plus an Inherited combining mark (NFC keeps it as two code points).
\x{915}\x{901} are two legal Devanagari characters.
But they are confusables. This would need special confusable rules.
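
A quick way to see that normalization alone does not help with this test case, sketched in Python with the standard unicodedata module (same_after is a made-up helper name):

```python
import unicodedata

ka_combining = "\u0915\u0310"   # KA + COMBINING CANDRABINDU
ka_devanagari = "\u0915\u0901"  # KA + DEVANAGARI SIGN CANDRABINDU

def same_after(form, a, b):
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Neither NFC nor NFKC maps one sequence onto the other.
still_distinct = (not same_after("NFC", ka_combining, ka_devanagari)
                  and not same_after("NFKC", ka_combining, ka_devanagari))
print("distinct after normalization:", still_distinct)
```

Both sequences survive normalization as different strings, so only a confusables lookup would flag the pair.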

>
>>> For multilingual support, you would want rules more like
>>>
>>> 'After script X, allow script Y'.
>>
>> Can you expand on that with an example? I’m no expert on this.
>>
>> Like after Hangul, allow Han? After Hiragana, allow Katakana?
>
> It allows one to mix Japanese and Korean variables without being able
> to mix kana and Hangul.
>
> Some of the Semitic abjads are sometimes used with vowel symbols
> normally associated with a different Semitic script. One could use
> such a construct to limit the mixing. However, for such cases a rule
> such as 'allow script Y marks on script X bases' would be much better.
>
>>> I don't see your problem here. What problem do you see with
>>> Amharic?
>
>> Amharic is not defined as a UCD script property value. Its alphabet is
>> called Ge’ez, which we call Ethiopic in the UCD. But that’s all I
>> know. I’m not a domain expert. Does Ethiopic use other Semitic
>> scripts in its alphabet or is it complete? I learned some CJK
>> languages, where you historically allow mixed scripts. But for other
>> scripts I’m clueless. The examples I got mix it with Runic. Valid or
>> nonsense?
>
> I would say nonsense - or graphic design. The use of Chinese
> ideographs alongside sinoform scripts is the primary example.
> However, 'symbols' as opposed to letters may leak from one script to
> another, and that may be an issue for variable names. For example,
> English can use Arabic numerals, Roman numerals or Roman letters for
> numbering in lists, and I've known people to resort to Greek letters.
> Accent marks can also move, though these are usually encoded
> separately. I've already used the example of candrabindu being
> borrowed from the Devanagari script to the Latin script - it was
> borrowed for use in Sanskrit.
>
>> How about the many Indian scripts? Do they mix?
>
> Microsoft mostly won't let long-supported *Indian* scripts mix within
> syllables.
>
> I would say they mixed in much the same way as the Latin and Cyrillic
> scripts mix. In many ways they act as font variants of one another, so
> features and rare letters may move between them. This is most
> noticeable where large chunks of the Brahmi character set are missing,
> such as Tamil and Lao. For Tamil, the gaps may be filled by 'Grantha'
> letters. For Lao, subscript consonants bear a very strong resemblance
> to the Tai Tham subscript forms. On the other hand, the unencoded
> characters added to the Lao script to support Pali have been
> well harmonised to the Lao script, and using characters from other
> scripts for them would definitely be wrong. (There's mostly a
> consensus as to what the right bogus coding for them within the Lao
> block is. Unfortunately, I don't have good enough evidence for an
> encoding proposal.)

I see. This would be a good case for requiring a declaration of those mixed scripts,
similar to the East Asian ones.

>> I have no idea whether those Old_<script> alphabets are still in use
>> and worth creating aliases for.
>
> They'll still be in use. We had a guy at work (computer department)
> who kept notes on his whiteboard in runes. Someone analysing cuneiform
> texts might very well want to create variable names that are a mix of
> Latin for function (as 'n_' = "number of") and cuneiform for the form
> being counted or whatever.
>
>> Such as this perl test t/mro/isa_c3_utf8.t
>>
>> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
>> Hiragana );
>>
>> ...
>> package 캎oẃ;
>> package urḲḵｋ;
>> @urḲḵｋ::ISA = 'kഌoんḰ';
>> package к;
>> @urḲḵｋ::ISA = ('kഌoんḰ', '캎oẃ');
>> package ṭ화ckэ;
>> ...
>>
>> These identifiers are unreadable, because I don’t assume that anybody
>> will be able to understand Hangul Cyrillic Ethiopic
>> Canadian_Aboriginal Malayalam and Hiragana at once. I understand a
>> bit of Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal
>> to me.
>
> There's no law against it! More to the point, it was just a test.

In my language I made it law: no undeclared mixed scripts :)

> However, allowing Cyrillic or Greek immediately makes every apparent
> 'o' (or 'A') a potential spoof. Remember, "Letter 'O' Considered
> Harmful".

Yes, this should be warned about.
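
Such a warning could work roughly like this sketch, again in Python for illustration. The LOOKALIKES table is a hand-picked, hypothetical four-entry subset; a real implementation would load the UCD confusables data, and the skeleton function here is a simplification of that idea (no normalization step).

```python
# Hand-picked look-alikes for Latin 'o' and 'A' (illustrative subset only).
LOOKALIKES = {
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
}

def skeleton(ident):
    """Replace each known look-alike with its Latin prototype."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in ident)

def confusable_pairs(idents):
    """Return (earlier, later) pairs of distinct identifiers that share
    a skeleton, in the order they are encountered."""
    seen = {}
    pairs = []
    for ident in idents:
        key = skeleton(ident)
        if key in seen and seen[key] != ident:
            pairs.append((seen[key], ident))
        else:
            seen.setdefault(key, ident)
    return pairs
```

For example, confusable_pairs(["foo", "f\u043e\u043e", "bar"]) reports the Latin "foo" together with its Cyrillic-o twin, which is the kind of pair a compiler could warn about instead of rejecting outright.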
Received on Wed Dec 14 2016 - 11:45:12 CST
