Program Structure



Technology Basics & Python Best Practices

Python has become the most important programming language in the cheminformatics and machine learning communities. It is easy to get started with, but to improve efficiency and reproducibility and to enable collaboration, a few aspects need to be considered at an early stage of code development.
In this first session we introduce and explore the use of an IDE (Integrated Development Environment), software designed to make it easier to write, debug, and test code in a single environment. We show how to create and activate a Python virtual environment and how to install packages into it. In a subsequent step, we give an overview of key features of GitHub and practice how to create a project and use it for version control and collaborative coding.
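
Virtual environments are usually created from the command line, but Python's standard-library `venv` module offers the same functionality programmatically. The following is a minimal sketch, not the procedure used in the session: the directory name `demo-env` and the helper `create_environment` are hypothetical, and pip bootstrapping is skipped to keep the example fast.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(parent: Path, name: str = "demo-env") -> Path:
    """Create an isolated Python environment under `parent` and return its path."""
    env_dir = parent / name
    # with_pip=False skips bootstrapping pip; pass with_pip=True if packages
    # should be installable with pip inside the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Demonstration in a throwaway directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_environment(Path(tmp))
    # pyvenv.cfg is the file that marks a directory as a virtual environment.
    print("environment created:", (env_dir / "pyvenv.cfg").is_file())
```

Activating the environment is a shell-level step and depends on the operating system, so it is not shown here.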

Chemical Data Science

In this session we will introduce the most important concepts for working with chemical data using Python. We will look at some of the different file formats in order to understand what they are good (and not so good) for, and then move on to look at some common use cases when working with sets of molecules:

- cleaning up a set of molecules in order to get it ready for further analysis
- drawing molecules
- removing duplicates from a set of molecules
- doing substructure searches
- generating fingerprints

We will use the RDKit and Jupyter notebooks for the hands-on part of the session.
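
As an illustration of one of the use cases listed above, removing duplicates comes down to keeping one record per canonical identifier. The sketch below uses only the standard library and assumes that canonical SMILES have already been computed (in practice, for example, with RDKit's `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`). The record layout, the key name `canonical_smiles`, and the helper `remove_duplicates` are hypothetical.

```python
from typing import Iterable


def remove_duplicates(records: Iterable[dict], key: str = "canonical_smiles") -> list[dict]:
    """Keep the first record for each canonical identifier, preserving input order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        identifier = record[key]
        if identifier not in seen:
            seen.add(identifier)
            unique.append(record)
    return unique


# Hand-written example records; the canonical SMILES are assumed to be precomputed.
molecules = [
    {"name": "ethanol", "canonical_smiles": "CCO"},
    {"name": "ethyl alcohol", "canonical_smiles": "CCO"},
    {"name": "benzene", "canonical_smiles": "c1ccccc1"},
]

unique_molecules = remove_duplicates(molecules)
print([m["name"] for m in unique_molecules])  # ['ethanol', 'benzene']
```

Deduplicating on a canonical representation rather than on the raw input string matters because the same molecule can be written as many different SMILES strings.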

Machine Learning Basics

In this session we will cover the fundamentals of machine learning (ML) and its application to chemical problems. The goal is to teach you how to avoid some of the common mistakes people make when applying ML techniques and to give you a foundation you can build on if you want to learn more later. Some of the topics we'll discuss include:

- use cases for machine learning and types of models
- molecular representations: fingerprints and descriptors
- metrics for model quality
- building and testing models
- intro to some key ML algorithms

During the hands-on part of the session we will work with a real dataset using the RDKit, sklearn, and Jupyter notebooks.
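
To make the topic "metrics for model quality" concrete, here is a minimal sketch of a few binary classification metrics computed from confusion-matrix counts. It is written in plain Python for illustration; in the hands-on part one would more likely use the equivalents in `sklearn.metrics` (such as `accuracy_score` and `matthews_corrcoef`). The labels are made up (1 = active, 0 = inactive), and the function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 1 (active) and 0 (inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and Matthews correlation coefficient."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    # Undefined ratios (zero denominators) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


# Made-up test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced datasets, which are common for activity data, accuracy alone can look good even for a poor model, which is one reason to report several metrics side by side.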

Pipelines for ligand- and structure-based design

In the previous sessions, you have become familiar with programmatic work on molecular data and have acquired some basic knowledge of machine learning. Next, we will explore how we can use this knowledge for computer-aided drug design (CADD).
We will introduce several concepts in the area of ligand-based drug design: retrieving compound data from databases (e.g. ChEMBL or PDB), comparing molecules, filtering them (e.g. by ADMET criteria), detecting meaningful substructures, and ligand-based virtual screening, i.e., building your own models for compound activity prediction against a target of choice.
In addition, concepts for structure-based pipelines will be introduced, including workflows for docking studies, protein-ligand interaction determination, and molecular dynamics simulations.
The whole session will build on the material of our TeachOpenCADD platform, in which each topic, covering both theory and code, is provided in an easy-to-follow Jupyter notebook (called a talktorial). We invite you to also have a look at the talktorials before the meeting.
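
As a small illustration of "comparing molecules", the sketch below ranks a compound library by Tanimoto similarity to a query. It is not taken from the talktorials: fingerprints are represented as sets of on-bit indices, the bit values and compound names are invented, and in practice the fingerprints would be generated with a cheminformatics toolkit such as the RDKit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits in either fingerprint."""
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query: frozenset, library: dict) -> list:
    """Return (name, similarity) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented fingerprints, given as sets of on-bit indices.
query = frozenset({1, 4, 7, 9})
library = {
    "cmpd_a": frozenset({1, 4, 7, 9}),
    "cmpd_b": frozenset({1, 4, 8}),
    "cmpd_c": frozenset({2, 3}),
}

ranked = rank_by_similarity(query, library)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Ranking by similarity to a known active is one of the simplest forms of ligand-based virtual screening; model-based activity prediction builds on the same molecular representations.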

Beyond software: CADD behind the scenes

In contrast to the other sessions, where you learned about different concepts and technologies and how to apply them in a CADD setting, we will now focus on discussing the results you obtain from those tools in a concrete, project-like setting. Computational tools have evolved to the point that their application is often just a button click away, and by keeping the "standard settings" almost anybody can obtain some "computational results/predictions" for their target. But that is only where the work of a CADD scientist starts: we will have different examples ready to analyse those initial results, to discuss which aspects could be considered in more depth, and to show how to derive working hypotheses from them.

Reaction/Language Models

Natural language processing (NLP) models in organic chemistry have emerged as one of the most effective, scalable approaches for capturing human knowledge and modelling chemical processes. Their use in machine learning tasks has demonstrated high quality and ease of use in problems such as predicting chemical reactions [1-2] and retrosynthetic routes [3], digitizing chemical literature [4], predicting detailed experimental procedures [5], designing new fingerprints [6], and predicting yields [7]. In this session, we will cover the impact of language models in chemistry by highlighting the critical role of NLP architectures in a wide range of digital chemistry tasks, and we will put them into practice with a few selected examples.