A Probabilistic Approach to the Protein Folding Problem: Using Stein-based Variational Inference

Research output: Book/Report › Ph.D. thesis › Research

Particle-based variational inference (ParVI) methods are a powerful class of Bayesian inference algorithms due to their flexible and accurate approximations. ParVI can interpolate between Markov chain Monte Carlo (MCMC) and variational inference (VI) methods, and, as such, ParVI is a suitable candidate for a universal black-box inference engine. Bayesian statistical inference is a cornerstone of the empirical sciences. However, correctly implementing the algorithms for Bayesian inference is notoriously tricky and is best left to experts. Universal black-box inference engines (and their associated probabilistic programming languages (PPLs)) enable the separation of model specification from statistical inference (conditioning on data). This separation allows scientists to formulate statistical hypotheses (probabilistic graphical models) as stochastic (probabilistic) computer programs without implementing the inference algorithm themselves. The black-box nature of the inference algorithm ensures the PPL can automatically condition the probabilistic program on experimental observations. Modern PPLs are generally designed for either MCMC or VI inference, so an unbiased comparison requires translating a model from one PPL to another. Translation introduces unnecessary overhead and the potential for code drift, both of which a generalizing framework like ParVI avoids entirely. An auspicious ParVI method is Stein variational gradient descent (SVGD) (Liu and Wang, 2016a), due to its direct connections with Stein's method, gradient flows, and reproducing kernels. These connections make SVGD versatile and well suited for theoretical analysis, yielding significant results on SVGD's convergence, convergence rates, and kernel choice. Nevertheless, adoption of SVGD by practitioners remains limited, partly due to insufficient tooling and the lack of a mature set of best practices for choosing hyper-parameters.
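The SVGD update underlying these methods fits in a few lines: each particle follows a kernel-weighted average of the score (the attractive term) plus the gradient of the kernel (the repulsive term that prevents mode collapse). The following self-contained NumPy sketch uses an RBF kernel with a fixed bandwidth and targets a standard normal; all names are illustrative and this is not the EinStein API.

```python
import numpy as np

def svgd_step(particles, grad_log_p, bandwidth=1.0, step_size=0.1):
    # particles: (n, d); grad_log_p maps (n, d) -> (n, d) score evaluations.
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[i, j] = x_i - x_j
    sq_dists = np.sum(diffs ** 2, axis=-1)
    k = np.exp(-sq_dists / (2.0 * bandwidth ** 2))         # RBF kernel matrix
    scores = grad_log_p(particles)
    # Attractive term: kernel-weighted scores pull particles toward high density.
    attract = k @ scores
    # Repulsive term: the kernel gradient spreads particles apart.
    repulse = np.sum(k[:, :, None] * diffs, axis=1) / bandwidth ** 2
    return particles + step_size * (attract + repulse) / n

rng = np.random.default_rng(0)
particles = rng.normal(loc=3.0, scale=0.5, size=(50, 1))   # start far from target
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)         # score of N(0, 1)
```

After the loop, the particle cloud has drifted from its initialization around 3 toward the target's bulk, with the repulsive term keeping the particles spread out rather than collapsed onto the mode.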
In the first part of this thesis, we extend the Stein mixture (SM) method (a variant of SVGD) to a whole class of approximate inference algorithms indexed by a scalar. We recommend the best choice of indexing scalar and justify it by analyzing the gradient noise. We also present a ready-to-use library for inference with SM as an extension to the NumPyro PPL. We call the library EinStein; it includes the black-box SM inference engine, automatic guide generation, many well-studied kernels, and copiable examples of Bayesian neural networks (BNNs) (Neal, 2012) and deep Markov models (DMMs) (Wu et al., 2018). In the second part of the thesis, we study the protein structure prediction problem as a showcase for applying PPLs in the natural sciences. The problem is to predict the (ensemble of) conformations a particular protein may adopt given its sequence of amino acids (and potentially known protein homologs). A high-fidelity solution could have a massive impact on treatments for misfolding diseases such as cancer, Alzheimer's, Huntington's, and Parkinson's. A canonical representation of a protein conformation is its internal (toroidal) coordinates, which allow efficient updates to the protein's three-dimensional structure without violating physicochemical constraints. To infer statistical models over internal-coordinate representations, we introduce a variant of the bivariate von Mises distribution (a distribution on the 2-torus) in the (Num)Pyro PPLs. The distribution (known as the sine distribution) enables us to specify a hierarchical model over the two high-variance backbone torsion angles. Our model captures probable angle pairs for each amino acid an order of magnitude faster than preexisting methods. Finally, we present our preliminary results on inferring a distribution over protein folding forcefields. Current technologies for protein structure prediction excel at single-structure prediction.
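For concreteness, the sine variant of the bivariate von Mises distribution places the unnormalized density f(φ, ψ) ∝ exp{κ₁cos(φ − μ) + κ₂cos(ψ − ν) + λ sin(φ − μ) sin(ψ − ν)} on the 2-torus. The NumPy sketch below (illustrative parameter values, not the (Num)Pyro implementation) evaluates this density on a grid and checks that a positive λ couples the two angles, which is what makes the sine model suitable for correlated backbone torsion-angle pairs.

```python
import numpy as np

def sine_log_density(phi, psi, mu=0.0, nu=0.0, k1=4.0, k2=4.0, lam=3.0):
    # Unnormalized log-density of the sine bivariate von Mises model.
    return (k1 * np.cos(phi - mu) + k2 * np.cos(psi - nu)
            + lam * np.sin(phi - mu) * np.sin(psi - nu))

# Evaluate on a grid over the 2-torus and normalize numerically.
grid = np.linspace(-np.pi, np.pi, 200, endpoint=False)
phi, psi = np.meshgrid(grid, grid, indexing="ij")
p = np.exp(sine_log_density(phi, psi))
p /= p.sum()
# With lam > 0 the angles are positively coupled (circular covariance > 0) ...
circ_cov = np.sum(p * np.sin(phi) * np.sin(psi))
# ... while lam = 0 factorizes into two independent von Mises marginals.
p0 = np.exp(sine_log_density(phi, psi, lam=0.0))
p0 /= p0.sum()
circ_cov0 = np.sum(p0 * np.sin(phi) * np.sin(psi))
```

Since λ² ≤ κ₁κ₂ holds for these values, the density is unimodal; larger λ would split it into two modes, which is one reason inference over this family benefits from careful parameterization.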
However, these methods are black-box deep models and yield no insight into physicochemical properties, sometimes even violating them. Our formulation of the folding forcefield as a probabilistic program allows us to automate the otherwise tedious process of tuning protein folding forcefields using our SM inference engine, and we can incorporate existing (known) hyperparameters through the choice of prior. SM's ability to capture rich correlations in the parameter space makes it a suitable statistical inference algorithm for these forcefields, which tend to be highly sensitive to their parameterization. Presuming our forcefield can fold small- to medium-sized proteins, the inferred parameter distributions will yield insights into the importance (and sensitivity) of the different (potential) energy terms of the forcefield, along with a high-resolution view of the aggregate folding trajectory of the protein.
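As an illustration of the kind of formulation meant here, a forcefield prior and likelihood can be composed into a single unnormalized log-posterior over the term weights. The sketch below is a deliberately simplified toy (a linear forcefield, a Gaussian prior, and a Boltzmann-style likelihood that favors weights assigning the native conformation the lowest energy among a candidate set); it is not the forcefield or model developed in the thesis, and all names and feature values are hypothetical.

```python
import numpy as np

def log_prior(theta, scale=1.0):
    # Gaussian prior over the forcefield's term weights.
    return -0.5 * np.sum((theta / scale) ** 2)

def energy(features, theta):
    # Toy linear forcefield: E(x; theta) = sum_k theta_k * f_k(x).
    return features @ theta

def log_posterior(theta, native_feats, decoy_feats):
    # Boltzmann-style likelihood: the native conformation should have the
    # lowest energy among the candidates (softmax over negative energies).
    candidates = np.vstack([native_feats[None, :], decoy_feats])
    logits = -energy(candidates, theta)
    log_lik = logits[0] - np.logaddexp.reduce(logits)
    return log_lik + log_prior(theta)

# Hypothetical features: the native structure scores low on both energy terms.
native = np.array([0.0, 0.0])
decoys = np.array([[1.0, 1.0], [2.0, 0.5]])
good = log_posterior(np.array([1.0, 1.0]), native, decoys)   # favors the native
bad = log_posterior(np.array([-1.0, -1.0]), native, decoys)  # favors the decoys
```

A ParVI engine such as SM would then approximate the posterior over `theta` with a set of particles, so that correlations between energy terms show up directly in the particle cloud.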
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Number of pages: 129
Publication status: Published - 2023

ID: 370664411