PhD defence: Peter Mørch Groth

Abstract of the PhD

This thesis explores the application of protein representation learning for navigating fitness landscapes in the context of protein engineering. The aim of protein engineering is to develop useful proteins to solve specific industrial, therapeutic, and environmental problems, often achieved by altering existing proteins to enhance certain properties and abilities. How well-suited a protein is for a given task can collectively be described as its level of fitness. Identifying suitable proteins with high fitness can thus be interpreted as navigating a fitness landscape, where the aim is to reach some task-dependent optimum. Representation learning is a powerful paradigm for leveraging patterns and relations captured from large collections of data to aid downstream modeling, making it particularly suitable for data-scarce protein engineering, where the acquisition of novel experimental data can be prohibitively expensive. The work presented in this thesis examines how protein representation learning can be leveraged for modeling and navigating fitness landscapes in two important settings: globally, where sequences can be highly diverse; and locally, through variant effect prediction, where the impacts of local mutations to a reference protein are modeled.

The scientific contributions presented in this thesis are fourfold:

We establish a benchmark for comparing protein representation paradigms for downstream property prediction, showing no discernible performance differences in a challenging, data-scarce setting.
We introduce a Gaussian process with a novel composite kernel that, through transfer learning and biological priors, achieves high predictive accuracy for variant effect prediction while yielding comparatively well-calibrated uncertainties.
We propose an end-to-end retrieval-augmented protein sequence modeling framework that uses vector similarity search in embedding space as a rapid alternative to alignment-based search methods without sacrificing downstream performance.
Lastly, we take a step back and characterize developments in protein sequence modeling from a foundation model perspective, tracing developments through data modalities.

In summary, the research presented in this thesis provides several contributions to the field of machine learning for protein engineering, collectively demonstrating that leveraging protein representation learning is a powerful means of navigating both local and global fitness landscapes in the pursuit of protein optimization, discovery, and understanding.

Assessment Committee

Professor Ole Winther, Dept of Biology, University of Copenhagen, Denmark (chair)
Associate Professor Jes Frellsen, DTU Compute, Denmark
Professor Elodie Laine, Sorbonne Université, France

Supervisor:

Wouter Boomsma

Department of Computer Science