# Computational Biology

Contact person: Pawel Winter pawel@diku.dk

Bioinformatics and Computational Biology are interdisciplinary fields that apply techniques of computer science, applied mathematics, statistics, and biochemistry to solve biological problems. Following the definitions of National Institute of Health (USA), bioinformatics is ''research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.'' Computational biology is concerned with ''development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biologial, behavioral, and social systems.''

[Protein Folding] [Computational Geometry and Proteins] [Evolutionary Trees]

## Protein Folding and Structure Prediction

The protein folding (PF) has mesmerised researchers all over the world for decades. Proteins are relatively large molecules made of linear sequences of amino acids (small molecules, 20 different kinds). Proteins typically consist of 100-500 amino acids. Upon creation, linear sequences of (known) amino acids fold into 3d structures with their surfaces determining the functions. Normally, identical sequences of amino acids fold to the same 3d structures. Folding is extremely fast (within milliseconds to seconds). Even though a considerable knowledge of molecular dynamics governing the folding of amino acid sequences ia available, researchers have not been able to completely understand the folding process. Complex energy interrelations between amino acids (including amino acids far apart in the linear sequences) also make it very difficult to simulate folding for all but very short proteins.

Instead of investigating PF, many groups are currently working on protein structure prediction (PSP). Here, the research focus is on predicting final folded 3d structures of proteins while disregarding folding processes. The vast number of degrees of freedom in linear sequences and in amino acids, makes also this problem very difficult. The reason for this is of course that the number of ways a chain of amino acids can twist and turn around each other is so vast that finding the correct 3d structure is indeed like looking for a needle in a hay stack. Reducing the number of degrees of freedom (by simplifying the linear sequences, by limiting the positions of amino acids to prespecified apropriately chosen grid positions, and by simplifying the objective functions), various optimization techniques have been applied.

### Recent Theses

Glennie Helles
Searching for the native structures of proteins
Ph.D. Thesis, Dept. of Computer Science, Univ. of Copenhagen (2010)

Rasmus Fonseca
Ab initio protein structure prediction using Bezier curve representation
M.Sc. Thesis, Dept. of Computer Science, Univ. of Copenhagen (2009)
Best Danish Computer Science M.Sc. Thesis Award (2009)

Martin Paluszewski
Algorithms for protein structure prediction
Ph.D. Thesis, Dept. of Computer Science, Univ. of Copenhagen (2008)

### Ranking Beta Sheet Topologies of Proteins

 Abstract: One of the main reasons why ab initio protein structure predictors do not perform very well is their inability to reliably identify long-range interactions between amino acids. To achieve reliable long-range interactions, all potential pairings of beta-strands (beta-topologies) of a given protein are enumerated. The beta-topologies are enumerated such that it is certain that the native beta-topology is among the enumerated. [full abstract]  Rasmus Fonseca, Glennie Helles and Pawel Winter Ranking beta sheet topologies of proteins Proc. of the Int. Conf. on Computational Biology ICCB'10, San Francisco, USA, October 2010.

### Protein Structure Prediction Using Bee Colony Optimization Heuristic

 Predicting the native structure of proteins is one of the most challenging problems in molecular biology. The goal is to determine the three-dimensional structure from the one-dimensional amino acid sequence. De novo prediction algorithms seek to do this by developing a representation of the proteins structure, an energy potential and some optimization algorithm that finds the structure with minimal energy. Bee Colony Optimization (BCO) is a relatively new approach to solving optimization problems based on the foraging behaviour of bees. Several variants of BCO have been suggested in the literature. We have devised a new variant that unifies the existing and is much more flexible with respect to replacing the various elements of the BCO. In particular, this applies to the choice of the local search as well as the method for generating scout locations and performing the waggle dance. We apply our BCO method to generate good solutions to the protein structure prediction problem. The results show that BCO generally finds better solutions than simulated annealing which so far has been the metaheuristic of choice for this problem. Rasmus Fonseca, Martin Paluszewski and Pawel Winter Protein structure prediction using bee colony optimization metaheuristic Journal of Mathematical Modelling and Algorithms 9 (2010) 181-194 [pdf]

### Improving Search for Low Energy Protein Structures with an Iterative Niche Genetic Algorithm

 In attempts to predict the tertiary structure of proteins we use almost exclusively metaheuristics. However, despite known differences in performance of metaheuristics for different problems, the effect of the choice of metaheuristic has received precious little attention in this field. Particularly parallel implementations have been demonstrated to generally outperform their sequential counterparts, but they are nevertheless used to a much lesser extent for protein structure prediction. In this work we focus strictly on parallel algorithms for protein structure prediction and propose a parallel algorithm, which adds an iterative layer to the traditional niche genetic algorithm. We implement both the traditional niche genetic algorithm and the parallel tempering algorithm in a fashion that allows us to compare the algorithms and look at how they differ in performance. The results show that the iterative niche algorithm converges much faster at lower energy structures than both the traditional niche genetic algorithm and the parallel tempering algorithm. Glennie Helles Improving search for low energy protein structures with an iterative niche genetic algorithm presented at BIOINFORMATICS Conference, Valencia, Spain (2010)

### Predicting Dihedral Angle Probability Distributions

 BACKGROUND: Predicting the three-dimensional structure of a protein from its amino acid sequence is currently one of the most challenging problems in bioinformatics. The internal structure of helices and sheets is highly recurrent and help reduce the search space significantly. However, random coil segments make up nearly 40% of proteins and they do not have any apparent recurrent patterns, which complicates overall prediction accuracy of protein structure prediction methods. Luckily, previous work has indicated that coil segments are in fact not completely random in structure and flanking residues do seem to have a significant influence on the dihedral angles adopted by the individual amino acids in coil segments. In this work we attempt to predict a probability distribution of these dihedral angles based on the flanking residues. While attempts to predict dihedral angles of coil segments have been done previously, none have, to our knowledge, presented comparable results for the probability distribution of dihedral angles. RESULTS: In this paper we develop an artificial neural network that uses an input-window of amino acids to predict a dihedral angle probability distribution for the middle residue in the input-window. The trained neural network shows a significant improvement (4-68%) in predicting the most probable bin (covering a 30 degrees x 30 degrees area of the dihedral angle space) for all amino acids in the data set compared to baseline statistics. An accuracy comparable to that of secondary structure prediction ( approximately 80%) is achieved by observing the 20 bins with highest output values. CONCLUSION: Many different protein structure prediction methods exist and each uses different tools and auxiliary predictions to help determine the native structure. In this work the sequence is used to predict local context dependent dihedral angle propensities in coil-regions. This predicted distribution can potentially improve tertiary structure prediction methods that are based on sampling the backbone dihedral angles of individual amino acids. The predicted distribution may also help predict local structure fragments used in fragment assembly methods. Glennie Helles and Rasmus Fonseca Predicting dihedral angle probability distributions for protein coil residues from primary sequence using neural networks BMC Bioinformatics 10, No. 338 (2009) 13 p.

### A Comparative Study of the Reported Performance of Ab Initio Protein Structure Prediction Algorithms

 Protein structure prediction is one of the major challenges in bioinformatics today. Throughout the past five decades, many different algorithmic approaches have been attempted, and although progress has been made the problem remains unsolvable even for many small proteins. While the general objective is to predict the three-dimensional structure from primary sequence, our current knowledge and computational power are simply insufficient to solve a problem of such high complexity. Some prediction algorithms do, however, appear to perform better than others, although it is not always obvious which ones they are and it is perhaps even less obvious why that is. In this review, the reported performance results from 18 different recently published prediction algorithms are compared. Furthermore, the general algorithmic settings most likely responsible for the difference in the reported performance are identified, and the specific settings of each of the 18 prediction algorithms are also compared. The average normalized r.m.s.d. scores reported range from 11.17 to 3.48. With a performance measure including both r.m.s.d. scores and CPU time, the currently best-performing prediction algorithm is identified to be the I-TASSER algorithm. Two of the algorithmic settings--protein representation and fragment assembly--were found to have definite positive influence on the running time and the predicted structures, respectively. There thus appears to be a clear benefit from incorporating this knowledge in the design of new prediction algorithms. Glennie Helles A comparative study of the reported performance of ab initio protein structure prediction algorithms J.R.Soc.Interface 5 (2008) 387-396. Glennie Helles Exploring parallel metaheuristics for protein structure prediction Technical Report (2009).

### Protein Structure Prediction Using Tabu Search and Half-Sphere Exposure Measure

 Here, we study the reconstruction of a protein's C_alpha trace solely from a CN vector or a pair of up/down vectors (HSE vector). This problem could become important for de novo structure prediction when the vectors are predicted. We define potential functions based on the HSE- or CN-vectors and minimize them using two metaheuristics: Monte Carlo simulation (MCS) and tabu search (TS). MCS has been widely used for protein structure prediction, and TS has been applied with great success to many optimization problems, but has rarely been used for protein structure prediction. Martin Paluszewski, Thomas Hamelryck and Pawel Winter Reconstructing protein structure from solvent exposure using tabu search Algorithms for Molecular Biology (2006), 1:20 [html, pdf]

### Branch and Bound Algorithm for Protein Structure Prediction Using Efficient Bounding

 We propose a new discrete model which makes use of secondary structure information and propose a novel branch and bound algorithm for finding global minimum structures. The energy function is very simple while structures obtained are of very high quality compared to the native structure of the protein. The model only depends on the position of C_alpha-atoms and is based on predictable contact and half-sphere-exposure numbers. The success of the branch and bound algorithm comes from the simplicity of the energy function which allows for efficient lower bound calculations of the energy. Martin Paluszewski and Pawel Winter Protein decoy generation using branch and bound with efficient bounding Proc. of 8-th Int. Workshop on Algorithms in Bioinformatics, WABI (2008), Karlsruhe, Germany Lecture Notes in Computer Science 5251 (2008) 382-393 [pdf] [doi]

## Computational Geometry and Proteins

 A modified version of chain trees that adjust themselves to the changing conformations of folding proteins is introduced. Computational results indicate that the number of intersection checks of bounding volumes is substantially reduced both in connection with the clash tests and the free energy calculations. Pawel Winter and Rasmus Fonseca Adjustable Chain Trees for Proteins to appear in the Journal of Computational Biology

### Protein Packing Quality Using Delaunay Complexes

Rasmus Fonseca, Pawel Winter and Kevin Karplus
Protein Packing Quality Using Delaunay Complexes
Proc. of 8-th Int. Symp. on Voronoi Diagrams in Science and Engineering, ISVD'11, Qingdao, China (2011)

Alpha Shapes and Proteins

 Abstract: We provide a unified description of (weighted) alpha-shapes, beta-shapes and the corresponding simplicial complexes. We discuss their applicability to various protein-related problems. We also discuss filtrations of alpha-shapes and touch upon related persistence issues. We claim that the full potential of alpha-shapes and related geometrical constructs in protein-related problems yet remains to be realized and verified. We suggest parallel algorithms for (weighted) alpha-shapes, and we argue that future use of filtrations and kinetic variants for larger proteins will need such implementations. Pawel Winter, Henrik Sterner and Peter Sterner Alpha shapes and proteins Proc. of 6-th Int. Symp. on Voronoi Diagrams in Science and Engineering, ISVD'09, Lyngby, Denmark (2009) 217-224 Henrik Sterner and Peter Sterner Parallel alpha-forms in theory and practice (in Danish) Master Thesis, Dept. of Computer Science, Univ. of Copenhagen (2008)
[top]

## Evolutionary Trees

Evolutionary trees, also called phylogenetic trees show the evolutionary interrelationships between various species. In recent years, due to the explosion of available data, species are typically represented by appropriate DNA- or protein sequences.

There is surprising number of diiferent methods that have been used to determine evolutionary trees. They can be subdivided into four main groups. Parsimony methods are based on the Occam's razor principle: Whem given the choice between two explanations, choose the simpler one. In evolutionary trees, this usually amounts to looking for a tree with fewest number of evolutionary changes between interrelated species. Maximum likelihood methods are based on probabilistic models of evolution. Given such a model, the likelihood that a tree with given topology is the sought evolutionary tree, can be calculated. The tree with the maximum likelihood is of course selected as the evolutionary tree. Consensus methods attempt to find evolutionary trees for all species using trees for (usually small) subsets of species (which are much easier to determine for biologists)). Finally, distance methods are based on appropriately defined distances between species (for example appropriate edit distances between pairs of DNA- or protein sequences). Once distances are given, the problem reduces to that of determining a minimum cost tree satisfying some side constraints.

Research on evolutionary trees at DIKU focused so far on distance methods. In particular, our group suggested a novel distance-based approach. Species are represented by points in an appropriately chosen high-dimensional space such that the distances beetween points correspond to distances between DNA- or protein sequences. Multidimensional scaling is used to determine the locations of points. Once located, the problems reduces to that of finding minimal Steiner tree for the points (possibly with appropriate side constraints). Initial work suggests that the approach has some potential. Exploiting our in-depth knowledge of the Steiner tree problem here at DIKU may prove essential for making this approach a useful alternative to such distance methods as neighbor joining. This ongoing research has been carried out in cooperation with Dorin Thomas and Marcus Brazil from the Dept. of Engineering, Univ. of Melbourne.

M. Brazil, D.A. Thomas, B.K. Nielsen, P. Winter, C. Wulff-Nilsen and M. Zachariasen
A novel approach to phylogenetic tress: d-dimensional geometric Steiner trees
Networks 53 (2009) 104-111 [doi] [pdf]

[top]