Context dependent prediction in DNA sequence using neural networks

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Context dependent prediction in DNA sequence using neural networks. / Grønbæk, Christian; Liang, Yuhu; Elliott, Desmond; Krogh, Anders.

In: PeerJ, Vol. 10, e13666, 2022.

Harvard

Grønbæk, C, Liang, Y, Elliott, D & Krogh, A 2022, 'Context dependent prediction in DNA sequence using neural networks', PeerJ, vol. 10, e13666. https://doi.org/10.7717/peerj.13666

APA

Grønbæk, C., Liang, Y., Elliott, D., & Krogh, A. (2022). Context dependent prediction in DNA sequence using neural networks. PeerJ, 10, Article e13666. https://doi.org/10.7717/peerj.13666

Vancouver

Grønbæk C, Liang Y, Elliott D, Krogh A. Context dependent prediction in DNA sequence using neural networks. PeerJ. 2022;10:e13666. https://doi.org/10.7717/peerj.13666

Author

Grønbæk, Christian ; Liang, Yuhu ; Elliott, Desmond ; Krogh, Anders. / Context dependent prediction in DNA sequence using neural networks. In: PeerJ. 2022 ; Vol. 10.

Bibtex

@article{c36121e7225141b493d6987c0c00047c,
title = "Context dependent prediction in {DNA} sequence using neural networks",
abstract = "One way to better understand the structure in DNA is by learning to predict the sequence. Here, we trained a model to predict the missing base at any given position, given its left and right flanking contexts. Our best-performing model was a neural network that obtained an accuracy close to 54% on the human genome, which is 2 percentage points better than modelling the data using a Markov model. In likelihood-ratio tests, the neural network performed significantly better than any of the alternative models by a large margin. We report on where the accuracy was obtained, first observing that the performance appeared to be uniform over the chromosomes. The models performed best in repetitive sequences, as expected, although their performance was far from random in the more difficult coding sections, the proportions being ~70:40%. We further explored the sources of the accuracy; Fourier transforming the predictions revealed weak but clear periodic signals. In the human genome the characteristic periods hinted at connections to nucleosome positioning. We found similar periodic signals in GC/AT content in the human genome, which to the best of our knowledge have not been reported before. On other large genomes similarly high accuracy was found, while lower predictive accuracy was observed on smaller genomes. Only in the mouse genome did we see periodic signals in the same range as in the human genome, though weaker and of a different type. This indicates that the sources of these signals are other or more than nucleosome arrangement. Interestingly, applying a model trained on the mouse genome to the human genome resulted in a performance far below that of the human model, except in the difficult coding regions. Despite the clear outcomes of the likelihood-ratio tests, there is currently a limited superiority of the neural network methods over the Markov model. 
We expect, however, that there is great potential for better modelling DNA using different neural network architectures.",
keywords = "DNA, Neural networks, Patterns, Predictability, Signals of periodicity",
author = "Christian Gr{\o}nb{\ae}k and Yuhu Liang and Desmond Elliott and Anders Krogh",
note = "Publisher Copyright: {\textcopyright} Copyright 2022 Gr{\o}nb{\ae}k et al.",
year = "2022",
doi = "10.7717/peerj.13666",
language = "English",
volume = "10",
journal = "PeerJ",
issn = "2167-8359",
publisher = "PeerJ",
pages = "e13666",
}

RIS

TY - JOUR

T1 - Context dependent prediction in DNA sequence using neural networks

AU - Grønbæk, Christian

AU - Liang, Yuhu

AU - Elliott, Desmond

AU - Krogh, Anders

N1 - Publisher Copyright: © Copyright 2022 Grønbæk et al.

PY - 2022

Y1 - 2022

N2 - One way to better understand the structure in DNA is by learning to predict the sequence. Here, we trained a model to predict the missing base at any given position, given its left and right flanking contexts. Our best-performing model was a neural network that obtained an accuracy close to 54% on the human genome, which is 2 percentage points better than modelling the data using a Markov model. In likelihood-ratio tests, the neural network performed significantly better than any of the alternative models by a large margin. We report on where the accuracy was obtained, first observing that the performance appeared to be uniform over the chromosomes. The models performed best in repetitive sequences, as expected, although their performance was far from random in the more difficult coding sections, the proportions being ~70:40%. We further explored the sources of the accuracy; Fourier transforming the predictions revealed weak but clear periodic signals. In the human genome the characteristic periods hinted at connections to nucleosome positioning. We found similar periodic signals in GC/AT content in the human genome, which to the best of our knowledge have not been reported before. On other large genomes similarly high accuracy was found, while lower predictive accuracy was observed on smaller genomes. Only in the mouse genome did we see periodic signals in the same range as in the human genome, though weaker and of a different type. This indicates that the sources of these signals are other or more than nucleosome arrangement. Interestingly, applying a model trained on the mouse genome to the human genome resulted in a performance far below that of the human model, except in the difficult coding regions. Despite the clear outcomes of the likelihood-ratio tests, there is currently a limited superiority of the neural network methods over the Markov model. We expect, however, that there is great potential for better modelling DNA using different neural network architectures.

AB - One way to better understand the structure in DNA is by learning to predict the sequence. Here, we trained a model to predict the missing base at any given position, given its left and right flanking contexts. Our best-performing model was a neural network that obtained an accuracy close to 54% on the human genome, which is 2 percentage points better than modelling the data using a Markov model. In likelihood-ratio tests, the neural network performed significantly better than any of the alternative models by a large margin. We report on where the accuracy was obtained, first observing that the performance appeared to be uniform over the chromosomes. The models performed best in repetitive sequences, as expected, although their performance was far from random in the more difficult coding sections, the proportions being ~70:40%. We further explored the sources of the accuracy; Fourier transforming the predictions revealed weak but clear periodic signals. In the human genome the characteristic periods hinted at connections to nucleosome positioning. We found similar periodic signals in GC/AT content in the human genome, which to the best of our knowledge have not been reported before. On other large genomes similarly high accuracy was found, while lower predictive accuracy was observed on smaller genomes. Only in the mouse genome did we see periodic signals in the same range as in the human genome, though weaker and of a different type. This indicates that the sources of these signals are other or more than nucleosome arrangement. Interestingly, applying a model trained on the mouse genome to the human genome resulted in a performance far below that of the human model, except in the difficult coding regions. Despite the clear outcomes of the likelihood-ratio tests, there is currently a limited superiority of the neural network methods over the Markov model. We expect, however, that there is great potential for better modelling DNA using different neural network architectures.

KW - DNA

KW - Neural networks

KW - Patterns

KW - Predictability

KW - Signals of periodicity

U2 - 10.7717/peerj.13666

DO - 10.7717/peerj.13666

M3 - Journal article

C2 - 36157058

AN - SCOPUS:85138444528

VL - 10

JO - PeerJ

JF - PeerJ

SN - 2167-8359

M1 - e13666

ER -