Re: Swift from Philippe Verdy on 2014-06-10 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 10 Jun 2014 09:03:35 +0200

Variation selectors are within the subset of characters that should never
be permitted in programming identifiers; they could cause surprising
results, such as new APIs or backdoors that would go undetected by
code reviewers looking at the code.
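
To make the concern concrete, here is a minimal sketch in Python (chosen
only for illustration; the function name and the range table are
assumptions of this sketch, not part of any existing tool). It reports
variation selectors hidden inside an identifier that would otherwise
look identical to a plain one:

```python
# Code point ranges of the variation selectors: VS1..VS16, VS17..VS256,
# and the Mongolian free variation selectors.
VARIATION_SELECTOR_RANGES = (
    (0xFE00, 0xFE0F),
    (0xE0100, 0xE01EF),
    (0x180B, 0x180D),
    (0x180F, 0x180F),
)


def variation_selectors_in(identifier):
    """Return (index, 'U+XXXX') for each variation selector found."""
    hits = []
    for index, ch in enumerate(identifier):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in VARIATION_SELECTOR_RANGES):
            hits.append((index, f"U+{cp:04X}"))
    return hits


plain = "s"
disguised = "s\uFE00"  # "s" followed by VARIATION SELECTOR-1

print(plain == disguised)                 # False: two distinct identifiers
print(variation_selectors_in(plain))      # []
print(variation_selectors_in(disguised))  # [(1, 'U+FE00')]
```

A reviewer reading the two names on screen would most likely see the
same letter twice; only a tool of this kind shows the difference.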

But if you allow them in the language, the first thing you'll need to
integrate in your project is a source code scanner that detects unsafe
characters (including checking the list of confusables) and enforces the
normalization of that source code before compiling it, as text editors may
break these normalizations unexpectedly. Such a tool should run routinely,
just like the tools that reindent code and enforce some common conventions
for its presentation, in order to ease exploration/searches in the source
code and review, and to facilitate the use of regexps in editors as well.
There are various tools that will also inspect how well the code is
documented, or whether documentation is missing for publicly exposed
variables and APIs, and try to infer dependencies. Such tools fall in the
same category as the old "lint" tool for C (almost deprecated in its
original form, since most of its rules are now integrated in the language
itself and enforced by compilers to ensure type safety).
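
A scanner of this kind could start out as small as the following Python
sketch. The function name, the choice of NFC, and the decision to flag
every format character (general category Cf) are assumptions made for
illustration; a real tool would also need the confusables data, which is
not in Python's standard library and is left out here.

```python
import unicodedata


def scan_source(text, form="NFC"):
    """Return (line_number, message) findings for a piece of source text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # A line that changes under normalization was not normalized.
        if unicodedata.normalize(form, line) != line:
            findings.append((line_number, f"not in {form}"))
        for ch in line:
            # Cf = format characters (zero-width spaces, bidi controls, ...)
            if unicodedata.category(ch) == "Cf":
                findings.append(
                    (line_number, f"format character U+{ord(ch):04X}")
                )
    return findings


sample = (
    "cafe\u0301 = 1\n"   # "e" + COMBINING ACUTE ACCENT: not NFC
    "na\u200Bme = 2\n"   # ZERO WIDTH SPACE inside a name
    "plain = 3\n"
)
for finding in scan_source(sample):
    print(finding)
# (1, 'not in NFC')
# (2, 'format character U+200B')
```

Run as a pre-commit or build step, such a check plays the same role as a
reindentation or lint pass.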

The risk however is higher in untyped or weakly typed languages like
Javascript/ECMAScript, where all objects can be overridden freely, so that
confusing identifiers could create unseen security risks.

Note that identifiers are not just within programming languages; they exist
as well in other types of APIs (notably web APIs, within protocols
transmitting data such as encoded web forms, even if these identifiers will
be used and exposed in isolation in a true language such as JSON or HTML
or XML or CSS, possibly also with some escaping mechanisms).

Also "identifiers" should be interpreted broadly to include symbolic
names (e.g. operators) if the language or API allows their extension,
overloading or derivation. (Unicode identifiers, or identifiers used in
classic languages like HTML, XML, C/C++, Java, Cobol, Fortran, Ada, PHP,
Python... or assembly languages, are more restricted in their allowed
repertoire, and all other extensions require explicit escaping whose
decoding should not weaken the security.)

Identifiers for data may be very liberal (e.g. if we want to allow toponyms
or people names or trademarks) as they will frequently need to use
significant punctuation or symbols as well as spacing or word separation.
This is even more critical for work names/titles, page names and filenames
in an open collection: these identifiers or names should resolve
unambiguously to the document or data intended (and generally this implies
developing naming conventions and some required classification system to
get an accurate inventory of available data, and making it possible to
inspect this inventory and detect undesirable/conflicting items). I am
therefore convinced that for such open inventories or collections,
normalization should never be an option: it should be enforced and
automated as early as possible, even if we admit input in non-normalized
forms.
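
One way to picture "enforced as early as possible" is an inventory that
normalizes every name at the point of entry, so that conflicts surface
immediately. The Python sketch below assumes NFC as the chosen form; the
NameRegistry class is hypothetical and only illustrates the idea.

```python
import unicodedata


class NameRegistry:
    """Hypothetical inventory keyed by the NFC form of each name."""

    def __init__(self):
        self._items = {}

    def register(self, name, item):
        # Input may arrive in any form; it is normalized before storage.
        key = unicodedata.normalize("NFC", name)
        if key in self._items:
            raise ValueError(f"name conflicts with an existing entry: {key!r}")
        self._items[key] = item
        return key

    def lookup(self, name):
        return self._items[unicodedata.normalize("NFC", name)]


registry = NameRegistry()
registry.register("Se\u0301oul", "page 1")  # decomposed: e + combining acute
print(registry.lookup("S\u00E9oul"))         # precomposed form finds: page 1

try:
    registry.register("S\u00E9oul", "page 2")
except ValueError as error:
    print("rejected:", error)
```

Without the normalization step, both spellings would be stored as
separate entries that look identical to anyone browsing the inventory.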

Any programming language or protocol that considers using a large
repertoire from the UCS should seriously look at the specification of
security in the Unicode standard and its annexes, and consider what has
been done and discussed for maintaining the security of the worldwide DNS
within the IDNA.

The risks coming from instability of normalizations if you allow unassigned
codepoints are real, because they can easily be exploited by automated
tools, and human reviewers will not detect these attacks easily without
using tools to check these normalizations. Code checkers should immediately
raise an alarm about usage of codepoints not assigned in a known version of
the UCS; and if they upgrade that version, they should make sure that all
other tools in that chain check the same version.
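
As a rough illustration of such a check, the Python sketch below flags
characters that the local Unicode database lists as unassigned (general
category Cn, which also covers noncharacters), and prints the database
version so that it can be compared with the rest of the tool chain. The
function name is hypothetical.

```python
import unicodedata


def unassigned_code_points(text):
    """Return 'U+XXXX' for each character the local database lists as Cn."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cn"
    ]


# Every tool in the chain should report the same version here.
print("Unicode database version:", unicodedata.unidata_version)

print(unassigned_code_points("abc"))           # []
print(unassigned_code_points("a\U000EFFFDb"))  # ['U+EFFFD']
```

U+EFFFD is used as the example because it lies inside one of the ranges
permitted by the quoted Swift grammar, yet no character is assigned to
it.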

But some identifiers are not found in source code: they are generated
at runtime using dynamic language features. Dynamic binder libraries or
reflection APIs should also perform their own checks, and will then need to
integrate a minimum database of known assigned code points for that UCS
version. This could cause some complications for maintaining
compatibility, notably the need for a version negotiation mechanism and
for integrating the version property of assigned codepoints.
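
Python happens to show the gap well: its parser normalizes identifiers
in source code to NFKC, but setattr() stores whatever string it is
given. The sketch below wraps setattr() with the checks the compiler
would have applied; checked_setattr and Config are hypothetical names
used only for this illustration.

```python
import unicodedata


def checked_setattr(obj, name, value):
    """setattr() wrapper that validates the name before binding it."""
    if not name.isidentifier():
        raise ValueError(f"not a valid identifier: {name!r}")
    if unicodedata.normalize("NFKC", name) != name:
        raise ValueError(f"identifier is not NFKC-normalized: {name!r}")
    setattr(obj, name, value)


class Config:
    pass


cfg = Config()

# Plain setattr() accepts a name spelled with the "fi" ligature U+FB01.
setattr(cfg, "\uFB01le", "unchecked")
print(hasattr(cfg, "file"))  # False: the ligature name was stored as-is

try:
    checked_setattr(cfg, "\uFB01le", "checked")
except ValueError as error:
    print("rejected:", error)

checked_setattr(cfg, "file", "checked")
print(cfg.file)  # checked
```

The attribute created by the unchecked call cannot be reached by writing
cfg.file in source code, which is exactly the kind of hidden binding a
reviewer would miss.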

2014-06-09 4:46 GMT+02:00 Norbert Lindenberg <unicode_at_norbertlindenberg.com>:

> It does allow some usage that may surprise code reviewers – for example,
> this is a valid Swift program:
>
> let s = "😄"
> let s︀ = "😞"
> let ︀ = "😉"
> let all = s + s︀ + ︀
>
> The value of the constant “all” is "😄😞😉". Or at least it is as long as
> mail software doesn’t harm the variation selectors…
>
> Norbert
>
>
> On Jun 5, 2014, at 9:06 , Mark Davis ☕ <mark_at_macchiato.com> wrote:
>
> > I haven't done any analysis, but on first glance it looks like it is
> > based on
> >
> > http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax
> >
> >
> > Mark
> >
> > -- Il meglio è l’inimico del bene --
> >
> >
> > On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn <senn_at_maya.com> wrote:
> > Has anyone figured out whether character sequences that are
> > non-canonical (de)compositions but could be recomposed to the same result
> > are the same identifier or not?
> >
> > That is: are identifiers merely sequences of characters or intended to
> > be comparable as “Unicode strings” (under some sort of compatibility rule)?
> >
> > On Jun 5, 2014, at 11:27 AM, Martin v. Löwis <martin_at_v.loewis.de> wrote:
> >
> > > Am 04.06.14 11:28, schrieb Andre Schappo:
> > >> The restrictions seem a little like IDNA2008. Anyone have links to
> > >> info giving a detailed explanation/tabulation of allowed and non
> > >> allowed Unicode chars for Swift Variable and Constant names?
> > >
> > > The language reference is at
> > >
> > >
> > > https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html
> > >
> > > For reference, the definition of identifier-character is (read each
> > > line as an alternative)
> > >
> > > identifier-character → Digit 0 through 9
> > > identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or
> > > U+FE20–U+FE2F
> > > identifier-character → identifier-head
> > >
> > > where identifier-head is
> > >
> > > identifier-head → Upper- or lowercase letter A through Z
> > > identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or
> > > U+00B7–U+00BA
> > > identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or
> > > U+00F8–U+00FF
> > > identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or
> > > U+180F–U+1DBF
> > > identifier-head → U+1E00–U+1FFF
> > > identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054,
> > > or U+2060–U+206F
> > > identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or
> > > U+2776–U+2793
> > > identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
> > > identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or
> > > U+3040–U+D7FF
> > > identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or
> > > U+FE30–U+FE44
> > > identifier-head → U+FE47–U+FFFD
> > > identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or
> > > U+40000–U+4FFFD
> > > identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or
> > > U+80000–U+8FFFD
> > > identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or
> > > U+C0000–U+CFFFD
> > > identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD
> > >
> > > As the construction principle for this list, they say
> > >
> > > "Identifiers begin with an upper case or lower case letter A through Z,
> > > an underscore (_), a noncombining alphanumeric Unicode character in the
> > > Basic Multilingual Plane, or a character outside the Basic Multilingual
> > > Plane that isn’t in a Private Use Area. After the first character,
> > > digits and combining Unicode characters are also allowed."
> > >
> > > Regards,
> > > Martin
> > > _______________________________________________
> > > Unicode mailing list
> > > Unicode_at_unicode.org
> > > http://unicode.org/mailman/listinfo/unicode

Received on Tue Jun 10 2014 - 02:05:25 CDT

This archive was generated by hypermail 2.2.0 : Tue Jun 10 2014 - 02:05:26 CDT