A systematic analysis of regression models for protein engineering
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
A systematic analysis of regression models for protein engineering. / Michael, Richard; Kæstel-Hansen, Jacob; Groth, Peter Mørch; Bartels, Simon; Salomon, Jesper; Tian, Pengfei; Hatzakis, Nikos S.; Boomsma, Wouter.
In: PLOS Computational Biology, Vol. 20, No. 5 May, e1012061, 2024.
RIS
TY - JOUR
T1 - A systematic analysis of regression models for protein engineering
AU - Michael, Richard
AU - Kæstel-Hansen, Jacob
AU - Groth, Peter Mørch
AU - Bartels, Simon
AU - Salomon, Jesper
AU - Tian, Pengfei
AU - Hatzakis, Nikos S.
AU - Boomsma, Wouter
N1 - Publisher Copyright: © 2024 Michael et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2024
Y1 - 2024
AB - To optimize proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine Learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric, and how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.
U2 - 10.1371/journal.pcbi.1012061
DO - 10.1371/journal.pcbi.1012061
M3 - Journal article
C2 - 38701099
AN - SCOPUS:85192312471
VL - 20
JO - PLOS Computational Biology (Online)
JF - PLOS Computational Biology (Online)
SN - 1553-734X
IS - 5 May
M1 - e1012061
ER -