Large-scale prokaryotic gene prediction and comparison to genome annotation

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Large-scale prokaryotic gene prediction and comparison to genome annotation. / Nielsen, Pernille; Krogh, Anders Stærmose.

In: Bioinformatics, Vol. 21, No. 24, 2005, p. 4322-9.

Harvard

Nielsen, P & Krogh, AS 2005, 'Large-scale prokaryotic gene prediction and comparison to genome annotation', Bioinformatics, vol. 21, no. 24, pp. 4322-9. https://doi.org/10.1093/bioinformatics/bti701

APA

Nielsen, P., & Krogh, A. S. (2005). Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics, 21(24), 4322-9. https://doi.org/10.1093/bioinformatics/bti701

Vancouver

Nielsen P, Krogh AS. Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005;21(24):4322-9. https://doi.org/10.1093/bioinformatics/bti701

Author

Nielsen, Pernille; Krogh, Anders Stærmose. / Large-scale prokaryotic gene prediction and comparison to genome annotation. In: Bioinformatics. 2005; Vol. 21, No. 24. pp. 4322-9.

Bibtex

@article{d8609f9074c211dbbee902004c4f4f50,
title = "Large-scale prokaryotic gene prediction and comparison to genome annotation",
abstract = "Motivation: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. Results: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation. ",
author = "Pernille Nielsen and Krogh, {Anders St{\ae}rmose}",
year = "2005",
doi = "10.1093/bioinformatics/bti701",
language = "English",
volume = "21",
pages = "4322--9",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "24",

}

RIS

TY - JOUR

T1 - Large-scale prokaryotic gene prediction and comparison to genome annotation

AU - Nielsen, Pernille

AU - Krogh, Anders Stærmose

PY - 2005

Y1 - 2005

AB - Motivation: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. Results: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation.

U2 - 10.1093/bioinformatics/bti701

DO - 10.1093/bioinformatics/bti701

M3 - Journal article

C2 - 16249266

VL - 21

SP - 4322

EP - 4329

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 24

ER -

ID: 83481