Learning meaningful representations of protein sequences

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Documents

Full text
Publisher's published version, 6.11 MB, PDF document

Nicki Skafte Detlefsen
Søren Hauberg
Wouter Boomsma

How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.

Original language	English
Article number	1914
Journal	Nature Communications
Volume	13
Issue number	1
Pages (from-to)	1-12
ISSN	2041-1723
DOI	https://doi.org/10.1038/s41467-022-29443-w
Status	Published - 2022

Bibliographical note

Funding Information:
This work was funded in part by the Novo Nordisk Foundation through the MLLS Center (Basic Machine Learning Research in Life Science, NNF20OC0062606). It also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (757360). NSD and SH were supported in part by a research grant (15334) from VILLUM FONDEN. WB was supported by a project grant from the Novo Nordisk Foundation (NNF18OC0052719). We thank Ole Winther, Jesper Ferkinghoff-Borg, and Jesper Salomon for feedback on earlier versions of this manuscript. Finally, we gratefully acknowledge the support of NVIDIA Corporation with the donation of GPU hardware used for this research.

Publisher Copyright:
© 2022, The Author(s).

ID: 307741382

Department of Computer Science (Datalogisk Institut)