Learning meaningful representations of protein sequences

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Standard

Learning meaningful representations of protein sequences. / Detlefsen, Nicki Skafte; Hauberg, Søren; Boomsma, Wouter.

In: Nature Communications, Vol. 13, No. 1, 1914, 2022, pp. 1-12.


Harvard

Detlefsen, NS, Hauberg, S & Boomsma, W 2022, 'Learning meaningful representations of protein sequences', Nature Communications, vol. 13, no. 1, 1914, pp. 1-12. https://doi.org/10.1038/s41467-022-29443-w

APA

Detlefsen, N. S., Hauberg, S., & Boomsma, W. (2022). Learning meaningful representations of protein sequences. Nature Communications, 13(1), 1-12. [1914]. https://doi.org/10.1038/s41467-022-29443-w

Vancouver

Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nature Communications. 2022;13(1):1-12. 1914. https://doi.org/10.1038/s41467-022-29443-w

Author

Detlefsen, Nicki Skafte ; Hauberg, Søren ; Boomsma, Wouter. / Learning meaningful representations of protein sequences. In: Nature Communications. 2022 ; Vol. 13, No. 1. pp. 1-12.

BibTeX

@article{22be369080b94e2c8d97dfa5ddc40b17,
title = "Learning meaningful representations of protein sequences",
abstract = "How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.",
author = "Detlefsen, {Nicki Skafte} and S{\o}ren Hauberg and Wouter Boomsma",
note = "Publisher Copyright: {\textcopyright} 2022, The Author(s).",
year = "2022",
doi = "10.1038/s41467-022-29443-w",
language = "English",
volume = "13",
pages = "1--12",
journal = "Nature Communications",
issn = "2041-1723",
publisher = "nature publishing group",
number = "1",

}

RIS

TY - JOUR

T1 - Learning meaningful representations of protein sequences

AU - Detlefsen, Nicki Skafte

AU - Hauberg, Søren

AU - Boomsma, Wouter

N1 - Publisher Copyright: © 2022, The Author(s).

PY - 2022

Y1 - 2022

N2 - How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.

AB - How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.

UR - http://www.scopus.com/inward/record.url?scp=85127918379&partnerID=8YFLogxK

U2 - 10.1038/s41467-022-29443-w

DO - 10.1038/s41467-022-29443-w

M3 - Journal article

C2 - 35395843

AN - SCOPUS:85127918379

VL - 13

SP - 1

EP - 12

JO - Nature Communications

JF - Nature Communications

SN - 2041-1723

IS - 1

M1 - 1914

ER -
