PhD defence by Marloes Elisabeth Arts

Title

Generative Models for Proteins: Ensembles, Dynamics and Evolution

Abstract

The work presented in this Ph.D. thesis focuses on the intersection of machine learning and protein modelling. This field is thriving due to the emergence of more powerful models and growing amounts of data, with AlphaFold2 as a recent success story in static protein structure prediction that even made its way into the global news. Nonetheless, this static view does not represent the complete picture, since proteins are dynamic biomolecules. The papers presented in this thesis zoom in on three dynamic aspects of proteins: structure ensembles, protein dynamics, and protein sequence evolution. All proposed methods are based on generative machine learning models, so that new data points can be sampled from approximately the same distribution as the training data.

The main contributions of this dissertation are threefold. Firstly, we propose a general method to simultaneously impose local and global constraints in protein structure ensemble modelling. As a proof of principle, this method is incorporated into a simple variational autoencoder (VAE), and we demonstrate that the generated samples are of high quality, both locally and globally. The second contribution is a method based on a denoising diffusion model, trained on reduced representations of protein structures from molecular dynamics simulations. Not only can this model produce new samples in a one-shot manner, but a force field can also be extracted cheaply to perform new simulations. The final contribution is a closer investigation of the use of VAEs to model protein family sequence data. Specifically, we examine the strengths and weaknesses of Bayesian decoders in this context, and show the potential of hierarchical VAEs to alleviate the mismatch between the commonly used standard Gaussian prior over latent space and the "star-shaped" aggregated posterior for protein family data.

Assessment Committee

Associate Professor Thomas Hamelryck, DIKU & BIO
Professor Søren Hauberg, DTU Compute
Assistant Professor Simon Olsson, Chalmers, Sweden

Moderator of defence: Thomas Hamelryck

Supervisors

Principal Supervisor Wouter Krogh Boomsma

For an electronic copy of the thesis, please visit the PhD Programme page