Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf


Protein Structure
Probabilistic Modeling and Simulation

By

Wouter Boomsma

A dissertation submitted to the University of Copenhagen in partial fulfillment of the requirements for the degree of Ph.D. at the Faculty of Science, University of Copenhagen.

July 2008

Supervisors:
Assoc. Prof. Thomas Hamelryck
Prof. Anders Krogh

DEPARTMENT OF BIOLOGY
UNIVERSITY OF COPENHAGEN


Cover:
Front: A Ramachandran plot on the torus, generated using TorusDBN.
Back: Fragments sampled using TorusDBN with an increasing amount of input information.


Preface

The work presented in this thesis is the result of a three-year Ph.D. at the Department of Biology of the University of Copenhagen. During this time, I was financially supported by the Lundbeck Foundation and the Danish graduate school in Biostatistics, to whom I hereby extend my gratitude.

The dissertation is divided into three main parts. First, a general introduction is given to the topics of the thesis, including a broad overview of the literature in the field. The next part is composed of three first-author papers, containing the core of the scientific work of my Ph.D. Finally, the last part contains a summary of my contributions, and outlines possible directions for future work.

Wouter Boomsma
July 2008


Contents

Synopsis

General Introduction
    Probabilistic Models – A Short Overview
    Directional Statistics
    Protein Structure

Overview

Research Papers

Chapter 1
    Boomsma W, Hamelryck T (2005): Full cyclic coordinate descent: solving the protein loop closure problem in Cα space. BMC Bioinformatics 6:159.

Chapter 2
    Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T (2008): A generative, probabilistic model of local protein structure. Proc Natl Acad Sci USA 105:8932–8937.

Chapter 3
    Boomsma W, Bottaro S, Hamelryck T, Ferkinghoff-Borg J: Monte Carlo sampling of proteins: local moves constrained by a native-oriented structural prior. Unsubmitted manuscript.

Appendix A: Parameter Estimation and Sampling for the Torus Distribution

Appendix B: TorusDBN MCMC Sampling Strategies

Appendix C: TorusDBN Network Topology

Conclusion

Concluding Remarks

Acknowledgments

Curriculum Vitae


General Introduction

Protein structure prediction is a classic problem in molecular biology, chemistry, physics and bioinformatics. Like many other great problems, it is relatively easy to formulate, but has proven surprisingly difficult to solve. A great number of methods and ideas have been proposed over the years, and it is clear that a solution to the problem is slowly coming closer. In particular, the last decade has led to substantial improvements of the state of the art. A large part of the progress comes from the vast amounts of experimental data that have become available. Most of the new techniques use experimentally determined structures from the databases in one way or another in their prediction methods. This can for instance take the form of parameter estimation of a manually designed energy function, or even directly using small fragments from solved structures as building blocks to construct new ones.

In this thesis, I will present a number of new methods for the protein structure prediction problem. I will focus on a new, rigorous way to incorporate knowledge of experimentally solved structures into simulation procedures. A probabilistic model is developed to capture the local structure of proteins, and it is demonstrated that this probabilistic approach has significant advantages over previously presented methods. The motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

The developed methods involve subject matter from a range of different fields. While the main part of the introduction will focus on methods for protein structure prediction, it will be necessary to start with a few concepts of a more methodological character. This involves a short introduction to probabilistic modeling, and a presentation of selected results from the field of directional statistics.

Probabilistic Models – A Short Overview

Modern experimental techniques in biology and molecular biology are generating experimental data at an unprecedented rate. It is becoming increasingly difficult for humans to adequately determine the important aspects and patterns in the data, and thereby extract all potential knowledge to be gained from the conducted experiments. In many problem domains, much can be gained by constructing a probabilistic model that captures the important aspects of the underlying data. This facilitates further analysis of the data and can be used to predict future outcomes or simulate artificial realizations of the data. In a probabilistic model the data is represented by a set of stochastic variables and associated parameterized probability distributions. These parameters are estimated from experimental data using, for instance, a maximum likelihood (ML) approach.


The structure and behavior of a probabilistic model are governed by the two fundamental rules of Bayesian probability theory: the sum and the product rule. For random variables A, B and C¹

Sum rule:      P(a) = ∑_b P(a, b)

Product rule:  P(a, b|c) = P(a|b, c) P(b|c)

where P(a) denotes the probability that A takes the value a, P(a, b) denotes the joint probability of {A = a} and {B = b}, and P(a|b) denotes the conditional probability of {A = a} given {B = b}. The product rule gives rise to the well-known Bayes' theorem. Since

P(a, b) = P(a|b) P(b) = P(b|a) P(a)

we can immediately write Bayes' formula as

P(a|b) = P(b|a) P(a) / P(b).

In the framework of Bayesian probability theory, where a probability is understood as representing a degree of belief, Bayes' theorem has a specific interpretation. If b describes some observed quantity of data, P(a|b) is referred to as the posterior probability of a, that is, the probability of a after having observed b. Correspondingly, P(a) is referred to as the prior probability distribution, describing the knowledge we have of a before observing b. Finally, P(b|a) is normally called the likelihood. Bayes' theorem can thus be seen as a recipe for updating your belief when confronted with new information. This interpretation is very convenient when working with probabilistic models.
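As a small worked illustration of this updating recipe (my own example, with invented numbers, not taken from the thesis), consider a diagnostic test applied to a rare condition:

```python
# Hypothetical numbers, chosen for illustration only.
# Prior: P(disease) = 0.01. Likelihoods: P(positive | disease) = 0.95,
# P(positive | healthy) = 0.10.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.95, "healthy": 0.10}  # P(positive | a)

# Bayes' theorem: P(a | positive) = P(positive | a) P(a) / P(positive),
# where the sum rule gives P(positive) = sum over a of P(positive | a) P(a).
evidence = sum(likelihood[a] * prior[a] for a in prior)
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}

print(posterior["disease"])  # the prior 0.01 is updated to roughly 0.088
```

Even a fairly accurate test moves the belief only modestly, because the prior probability of the disease is so low.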

The design and use of probabilistic models basically involves repeated applications of the sum and product rule. However, for larger models, the algebra can be rather elaborate and it can become difficult to keep track of the dependencies in a model. A more intuitive understanding of a probabilistic model can be obtained by formulating it as a graphical model. The basic idea is to illustrate the conditional dependencies between a set of random variables using the nodes and edges in a graph. I will confine myself to a subclass of graphical models called Bayesian networks. All models of this type can be represented using directed acyclic graphs (see box for details). In essence, a Bayesian network is a graphical illustration of a particular factorization of the joint probability distribution of a set of random variables, or equivalently, a specification of the variables assumed to be conditionally independent. Figure 1 contains an example of

¹ In this introduction, capital letters will be used to represent stochastic variables and the corresponding lower case letters their realizations (values). For ease of notation, depending on the context, P(x) will denote either the probability distribution of X, or the probability that X takes the value x. The graphical models will be annotated using lower case letters for consistency with the derivations in the text, although the upper case equivalents would technically be more correct.


a Bayesian network for a model with variables A, B, and C. This particular example corresponds to the joint probability factorization

P(a, b, c) = P(b|a) P(c|a) P(a)

where b and c are said to be conditionally independent given a. Note that this conditional independence is represented by the lack of an arrow between these two variables in the graph.

Box: A directed acyclic graph is a graph consisting of nodes (circles) and directed edges (arrows) that contains no directed circular paths, also known as cycles. Two nodes x and y that are directly connected by an arrow (x → y) are called parent node and child node, respectively. Correspondingly, child nodes related more distantly through directed paths are called descendants.

For graphical models in general, a property called d-separation is necessary for two variables to be considered conditionally independent. There are a variety of ways to explain this property. Here, I will follow the lead of Bishop [1]: Two sets of variables X and Y are said to be d-separated by a third set of variables Z, if all undirected paths connecting a node in X with a node in Y are blocked given Z. A path is blocked if either

1. On the path, there exists a node at which the arrows meet head-to-tail or tail-to-tail, and this node is in Z.

2. On the path, there exists a node at which the arrows meet head-to-head, and neither this node nor any of its descendants is in Z.

Figure 1: A simple Bayesian network.

Two examples of this property are given in Figure 2. For the models used in this thesis, head-to-head edge collisions never occur, and the conditional dependencies are determined simply by the first rule.

Figure 2: Two Bayesian networks. The shaded node x is observed. (a) Conditional on x, the nodes u, v and w are independent. (b) Conditional on x, v and w are independent, and u and w are independent. However, u and v are not, since x is a common effect of both (head-to-head arrows).

Dynamic Bayesian networks (DBNs) are a subclass of Bayesian networks that are used to model sequences of variables. They are often applied in time series analysis, but are also a natural choice for modeling biological sequences [2]. A Markov model is an example of a very simple DBN. Each variable is only directly dependent on the variable preceding it (the Markov property), and the variables correspond directly to the sequential data being modeled. As a simple example, consider a naive model of DNA sequences, where each variable represents a discrete probability distribution over the four nucleotides A, C, G, and T. In this case, the parameters of the model would be the 4 × 4 transition matrix describing the probabilities of observing a particular nucleotide at a given position depending on the value of the preceding one.

In a hidden Markov model the sequential dependencies are separated from the observed data by hidden variables that are inherently unobserved (see Figure 3). The observed variables in such a model are sometimes referred to as emissions. Our example of DNA sequences could now, for instance, be modeled by a two-state hidden Markov model. Each of these states would correspond to a distinct emission probability distribution over the four nucleotides. In this case, we would only have a 2 × 2 transition matrix, reflecting the probability of moving from one hidden state to the next. Note that the hidden states do not necessarily have any interpretation in the context of the problem domain; they can simply be a convenient way to parameterize the problem. The hidden Markov model example of DNA sequences has fewer parameters² (2 × (2 − 1) + 2 × 3 = 8) than the Markov model (4 × 3 = 12). The relative performance of the two models will, however, depend on how naturally the nucleotides can be divided into two groups (the hidden states).
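The Markov model of this DNA example can be sketched in a few lines of code. The transition probabilities below are invented for illustration; only the free-parameter counts (12 versus 8) come from the text:

```python
import random

random.seed(0)
nucleotides = "ACGT"

# Markov model of DNA: a 4 x 4 transition matrix (each row sums to one).
# The numbers are invented placeholders; real values would be estimated
# from sequence data.
transition = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}

def sample_markov(length, start="A"):
    """Sample a DNA sequence by repeatedly drawing the next nucleotide
    conditioned only on the previous one (the Markov property)."""
    seq = [start]
    for _ in range(length - 1):
        probs = transition[seq[-1]]
        weights = [probs[n] for n in nucleotides]
        seq.append(random.choices(nucleotides, weights=weights)[0])
    return "".join(seq)

print(sample_markov(20))

# Free-parameter counts from the text: a discrete distribution over
# k outcomes has k - 1 free parameters.
markov_params = 4 * (4 - 1)             # 4 rows of 3 free parameters = 12
hmm_params = 2 * (2 - 1) + 2 * (4 - 1)  # 2 transition + 6 emission = 8
print(markov_params, hmm_params)
```

The two-state HMM variant would replace the 4 × 4 matrix by a 2 × 2 hidden-state transition matrix plus two emission distributions over the four nucleotides.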

Figure 3: A standard Markov model (a) and a hidden Markov model (b).

Although hidden Markov models are naturally expressed in the graphical model framework, much of the original HMM literature uses a different graphical notation for these

² These are the numbers of free parameters. Each discrete probability distribution over n states has only n − 1 free parameters because the probabilities must sum to one.


models, which sometimes leads to confusion. HMMs are often depicted by state diagrams, where each node corresponds to a state of the hidden variable, and edges reflect the probability of moving from one state to another. In contrast, as we saw, the nodes in a graphical model represent stochastic variables and the edges reflect their dependencies, hiding the details of the underlying distributions of the variables. In a state diagram, the sequential nature of an HMM is not explicitly illustrated, while this is at the core of the graphical representation of a dynamic Bayesian network. Sometimes, state diagrams are unrolled to include the sequential aspect, resulting in so-called trellis diagrams. An example of all three representations is given in Figure 4. Much of the HMM literature is devoted to designing clever model structures that take advantage of prior information about the problem domain. This corresponds to removing edges in the fully connected networks of Figure 4(a) and Figure 4(b), but in the framework of a DBN, this simply corresponds to inserting zeros in the transition matrix connecting the hidden state variables. Loosely speaking, a specific structure of the state diagram of an HMM thus corresponds to a strong prior on the transition probability matrix.

Figure 4: Three graphical representations of the same hidden Markov model: a state diagram (a), a trellis diagram (b), and the graphical model representation (c).

The model that is presented in Chapter 2 can in principle be described as a multi-track HMM (each hidden state has multiple emission variables). However, the model is trained with a fully connected transition matrix, and in order to avoid the potential confusion in notation, I refer to it in the more general terms of a dynamic Bayesian network – hence the name TorusDBN.

Probability Evaluation

Evaluating the probability of a given observation is an important application of a probabilistic model. In the case of a hidden Markov model, this evaluation requires summing over all possible hidden node sequences. A naive implementation of this calculation, where each possible sequence of hidden node states is considered explicitly, has a complexity that grows exponentially with the length of the model, making it infeasible for realistic applications. Fortunately, the problem can be solved much more efficiently using a dynamic programming technique called the forward algorithm [3, 4]. Like all dynamic programming techniques, the method is based on the insight that the problem can be viewed as consisting of smaller subproblems, of which the solutions can be


directly combined to form the solution to the main problem. For a sequence of observations x of length n, we can write the probability as a sum over the hidden node values hn at the last position

P(x1, …, xn) = ∑_{hn} P(x1, …, xn, hn).    (1)

Using the Markov property, the expression under the sum on the right hand side can be written recursively as

P(x1, …, xi, hi) = P(xi|hi) ∑_{hi−1} P(x1, …, xi−1, hi−1) P(hi|hi−1)    (2)

which gives us a direct recipe for an efficient calculation: create an m × n matrix F, where m is the number of hidden node states. Each entry Fi,j = P(x1, …, xi, hj) corresponds to the joint probability of hidden state j at position i in the sequence, summing over all possible ways of getting there. The matrix can be filled from left to right, calculating each entry using equation (2). The first column F0,j is a special case, which is normally handled using either a specific start-distribution or the stationary distribution of the Markov chain. When the matrix is filled out, equation (1) is used to obtain the probability of the complete sequence.
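The recipe above can be sketched as follows, for a hypothetical 2-state, 2-symbol HMM with invented parameters; the naive exponential-time sum is included only to check the result:

```python
import itertools

# A tiny HMM with invented parameters: 2 hidden states, 2 symbols.
start = [0.6, 0.4]                  # start-distribution over hidden states
trans = [[0.7, 0.3], [0.4, 0.6]]    # trans[i][j] = P(h' = j | h = i)
emit = [[0.9, 0.1], [0.2, 0.8]]     # emit[i][x] = P(x | h = i)

def forward(obs):
    """Return P(x1, ..., xn) by filling the forward matrix one column at
    a time, following equations (1) and (2)."""
    f = [start[j] * emit[j][obs[0]] for j in range(2)]  # first column
    for x in obs[1:]:
        f = [emit[j][x] * sum(f[i] * trans[i][j] for i in range(2))
             for j in range(2)]
    return sum(f)  # equation (1): sum over the last hidden state

def brute_force(obs):
    """Explicitly sum over every hidden sequence (exponential cost)."""
    total = 0.0
    for hs in itertools.product(range(2), repeat=len(obs)):
        p = start[hs[0]] * emit[hs[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= trans[hs[k - 1]][hs[k]] * emit[hs[k]][obs[k]]
        total += p
    return total

obs = [0, 1, 1, 0, 1]
print(forward(obs), brute_force(obs))  # the two values agree
```

The forward pass costs O(n m²) instead of O(m^n) for the naive enumeration.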

There is a closely related method called the backward algorithm, which uses a different recursion to calculate the likelihood in the reverse direction

P(x1, …, xn) = ∑_{h1} P(x1|h1) P(x2, …, xn|h1) P(h1)    (3)

P(xi+1, …, xn|hi) = ∑_{hi+1} P(xi+2, …, xn|hi+1) P(xi+1|hi+1) P(hi+1|hi).    (4)

Correspondingly, the backward matrix B is filled out from right to left, using (4) at each step and (3) in the first column to calculate the likelihood. The forward and backward algorithms can be combined to calculate the probability of observing a certain state at an arbitrary position in a sequence, given all observed data. This is sometimes referred to as the posterior state probability

P(hi|x1, …, xn) ∝ P(x1, …, xn|hi) P(hi)
               = P(x1, …, xi|hi) P(xi+1, …, xn|hi) P(hi)
               = P(x1, …, xi, hi) P(xi+1, …, xn|hi).

It is clear that this probability can be calculated simply as the product of the forward and backward factors (up to a normalization constant). This combination of the two methods is called the forward-backward algorithm.
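A minimal sketch of this forward-backward combination, again for an invented 2-state HMM:

```python
# Posterior state probabilities for a tiny 2-state HMM (parameters are
# invented placeholders), computed as forward * backward, normalized.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

def forward_matrix(obs):
    """F[i][j] = P(x1, ..., x_{i+1}, h_{i+1} = j), filled left to right."""
    F = [[start[j] * emit[j][obs[0]] for j in range(2)]]
    for x in obs[1:]:
        F.append([emit[j][x] * sum(F[-1][i] * trans[i][j] for i in range(2))
                  for j in range(2)])
    return F

def backward_matrix(obs):
    """B[i][j] = P(x_{i+2}, ..., x_n | h_{i+1} = j), filled right to left."""
    n = len(obs)
    B = [[1.0, 1.0]]  # nothing remains to be emitted after the last position
    for k in range(n - 2, -1, -1):
        B.insert(0, [sum(trans[i][j] * emit[j][obs[k + 1]] * B[0][j]
                         for j in range(2)) for i in range(2)])
    return B

def posterior(obs, pos):
    """P(h_pos | x1..xn): the product of the forward and backward factors,
    normalized over the hidden states."""
    F, B = forward_matrix(obs), backward_matrix(obs)
    w = [F[pos][j] * B[pos][j] for j in range(2)]
    z = sum(w)
    return [v / z for v in w]

print(posterior([0, 1, 1, 0], 1))
```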


Sampling

Simulating data, or sampling, is another important application of a probabilistic model. Sampling refers to the task of generating data points from the probability distribution that the model represents. In the case of a hidden Markov model, this problem can be efficiently solved using another dynamic programming technique called the forward-backtrack algorithm, not to be confused with the forward-backward algorithm described above.

The forward-backtrack algorithm has not been described very frequently in the literature, where the majority of applications of hidden Markov models seem to focus on prediction rather than sampling. To the best of my knowledge, it was first described in bioinformatics by Cawley and Pachter in 2003 [5]. The algorithm is described formally in ref. [6] and in the supplementary material of Chapter 2, but for completeness I include an intuitive description here.

To keep things simple, I will demonstrate only how to sample a hidden node state sequence h given an observed emission sequence x. In actual applications, sampling hidden node state sequences is perhaps not very interesting in its own right. However, in a multi-track HMM it is possible to have several emission variables for each hidden node variable. Some might be observed, while others are output values, which one would like to resample. For such models, the hidden node sequence is first resampled based on the observed emission variables, after which values for the unobserved emission variables can be sampled given the resampled hidden node sequence. We only treat the first problem here; since the emission variables are independent given the hidden node sequence, the second step is simply a matter of sampling from the corresponding emission distributions of the model.

Let us assume that we have an initial hidden node sequence h, and that we wish to resample the values from index s to t, keeping the rest of the hidden sequence at its original values. Given the observed values xs, …, xt, and the hidden values hs−1 and ht+1 at the boundaries, the probability distribution for the last hidden value can be written as

P(ht|xs, …, xt, hs−1, ht+1) = P(ht, xs, …, xt, hs−1, ht+1) / P(xs, …, xt, hs−1, ht+1)
                            ∝ P(xs, …, xt, hs−1, ht) P(ht+1|ht)    (5)

The first factor can be efficiently calculated using the forward algorithm³. The second is simply given by the transition matrix of the model. Equation (5) thus represents a discrete distribution over Ht, from which a value can be sampled directly (after normalizing). The key insight is that the situation for Ht−1 is equivalent, this time conditioned on Ht = ht at the boundary. For Ht−1, the calculation will involve the factor P(xs, …, xt−1, ht−1), which is available from the same forward matrix as before. The entire sampling procedure can thus be reduced to a single forward pass from s to t, followed by a backtrack phase from index t to s, sampling values based on (5).

³ This requires that the probability of hs−1 is included, by taking it into consideration when filling in the first column of the forward matrix (position s).
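The following sketch applies this procedure to the simplest case, resampling an entire hidden sequence (s = 1, t = n, so no fixed boundary states are involved); the HMM parameters are invented for illustration:

```python
import random

random.seed(1)

# A tiny 2-state HMM with invented parameters.
start = [0.5, 0.5]
trans = [[0.8, 0.2], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8]]

def sample_hidden(obs):
    """Draw a hidden sequence from P(h | x) by a single forward pass
    followed by a backtrack phase."""
    n = len(obs)
    # Forward pass: F[i][j] = P(x1, ..., x_{i+1}, h_{i+1} = j).
    F = [[start[j] * emit[j][obs[0]] for j in range(2)]]
    for x in obs[1:]:
        F.append([emit[j][x] * sum(F[-1][i] * trans[i][j] for i in range(2))
                  for j in range(2)])
    # Backtrack phase: sample the last hidden state in proportion to the
    # last forward column, then walk backwards, weighting each column by
    # the transition into the state already sampled to its right (the
    # analogue of equation (5)).
    h = [random.choices(range(2), weights=F[-1])[0]]
    for i in range(n - 2, -1, -1):
        w = [F[i][j] * trans[j][h[0]] for j in range(2)]
        h.insert(0, random.choices(range(2), weights=w)[0])
    return h

obs = [0, 0, 1, 1, 0]
print(sample_hidden(obs))
```

Resampling only a window s..t works the same way, except that the first forward column incorporates the transition from hs−1 and the last sampling step is additionally weighted by the transition into ht+1.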


Parameter Estimation

I will not cover the training aspects of probabilistic models in great detail, but confine myself to the simple example of a hidden Markov model with discrete emission variables. An example of parameter estimation using a continuous probability distribution is given in Chapter 2, and more information can be found in refs. [1] and [7].

Although Bayesian methods are becoming more common, the standard training procedure for HMMs is maximum likelihood (ML) estimation. When a model is fully observed (for instance if, in Figure 4(c), for each observation of xi we also knew the corresponding value of hi), the maximum likelihood estimates of the transition and emission probabilities would simply be the observed frequencies in the training data. If we denote the transition probability from state i to j as τij, and the emission probability of observing state l when in hidden state k as εkl, we have

τij = nij / ∑_{j′} nij′        εkl = mkl / ∑_{l′} mkl′    (6)

where nij and mkl are the corresponding frequencies in the data set. However, by definition, we do not know the values of the hidden states. The solution to this problem is found in the expectation maximization (EM) algorithm [8], also called the Baum-Welch algorithm in the case of HMMs [9]. This is an iterative algorithm that consists of a repeated cycle of two steps. For cycle t:

1. Calculate the expected values of the counts Nij and Mkl based on the most recent estimates of τij and εkl, and the data D:

   n̄ij = E(Nij | D, τ[t−1], ε[t−1])
   m̄kl = E(Mkl | D, τ[t−1], ε[t−1]).

2. Update the estimates τij and εkl by (6), but using the expected counts (n̄ij, m̄kl) instead of (nij, mkl).

The two steps are repeated until the algorithm converges. The expected count values can be calculated using the forward-backward algorithm described above. The algorithm is guaranteed to produce estimates of ever increasing likelihood, but occasionally gets trapped in a local optimum, failing to find the global likelihood maximum. Several variants of the EM algorithm exist. In cases where large amounts of data are available, the stochastic EM algorithm can be an appealing alternative known to avoid convergence to local optima [10]. The structure of the algorithm is the same as for EM. However, in each cycle the hidden variables H are filled in with values h sampled given the current parameters of the model. The model can then be considered completely observed and equation (6) can be used directly to estimate the parameters. More formally, in each cycle t:


Figure 5: A von Mises distribution with µ = 3/4π and κ = 3.

1. Complete the data by drawing a sample from

P (h|D, τ [t−1], ε[t−1])

2. Directly use equation (6) to obtain new parameter estimates τ[t] and ε[t].

The sampling in step 1 could be implemented using the forward-backtrack algorithm described previously. However, it turns out that a less ambitious sampling strategy may be sufficient. For the training of the TorusDBN presented in Chapter 2, we used a single iteration of Gibbs sampling to fill in the hidden node values. Since this approach avoids the full dynamic programming calculation, it greatly speeds up the individual cycles of our stochastic EM algorithm, and it converges consistently.
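The count-based update of equation (6), which is the core of each stochastic EM cycle, can be sketched as follows. The data and the hidden-value completion are invented placeholders (a real implementation would sample h from the model, e.g. by forward-backtrack or Gibbs sampling, rather than uniformly):

```python
import random

random.seed(2)

def count_estimates(h, x, n_states=2, n_symbols=2):
    """Equation (6): with the hidden sequence filled in, the ML estimates
    are just normalized transition and emission counts."""
    n = [[0] * n_states for _ in range(n_states)]   # transition counts nij
    m = [[0] * n_symbols for _ in range(n_states)]  # emission counts mkl
    for t in range(len(h)):
        m[h[t]][x[t]] += 1
        if t > 0:
            n[h[t - 1]][h[t]] += 1
    # Normalize each row; max(..., 1) avoids division by zero for states
    # that never occur in the completed data.
    tau = [[nij / max(sum(row), 1) for nij in row] for row in n]
    eps = [[mkl / max(sum(row), 1) for mkl in row] for row in m]
    return tau, eps

x = [0, 0, 1, 1, 1, 0, 0, 1]  # invented observation sequence

# Step 1 (completion): fill in the hidden variables with a sample. Here a
# uniform random draw stands in for sampling from P(h | D, tau, eps),
# purely to keep the sketch short.
h = [random.randrange(2) for _ in x]

# Step 2: plug the completed data directly into equation (6).
tau, eps = count_estimates(h, x)
print(tau, eps)
```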

Directional Statistics

Directional statistics is a term used to describe statistical concepts and methods designed for modeling directions (unit vectors), axes (unit vectors with arbitrary sign) and rotations. The most common application is circular or periodic data, which arise in virtually any scientific field. Simple examples include modeling wind directions or the time of day. An intuitive motivation for the existence of these statistical methods is the observation that for circular data, the standard estimator for the mean value does not give meaningful results. For instance, the arithmetic mean of the two angles 10° and 350° is 180°, quite different from the mean value 0° that one would expect from a simple inspection of the unit circle.
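The 10°/350° example can be checked directly; the circular mean is obtained by averaging the angles as unit vectors and taking the direction of the resultant:

```python
import math

# The example from the text: the arithmetic mean of 10 and 350 degrees
# is 180, while the circular mean is (approximately) 0.
angles_deg = [10.0, 350.0]
arithmetic = sum(angles_deg) / len(angles_deg)

# Circular mean: sum the angles as unit vectors, then take atan2 of the
# resultant's components.
s = sum(math.sin(math.radians(a)) for a in angles_deg)
c = sum(math.cos(math.radians(a)) for a in angles_deg)
circular = math.degrees(math.atan2(s, c)) % 360.0

print(arithmetic, circular)  # 180.0 versus approximately 0
```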

The field of directional statistics has produced numerous distributions on more or less exotic manifolds over the last decades. For a comprehensive overview, see ref. [11]. Here, I will mention a few distributions for angular data, which have applications in protein structure prediction.

The best known angular distribution is the von Mises distribution, originally introduced by von Mises in 1918 [12], with density function


Figure 6: Three examples of Kent/FB5 distributions with different mean and concentration parameters. Figure by Thomas Hamelryck [6].

fvm(θ) = exp(κ cos(θ − µ)) / (2π I0(κ))    (7)

where µ is the mean, κ is a concentration parameter, and I0 is the modified Bessel function of the first kind of order 0. The von Mises distribution is the circular equivalent of the Gaussian distribution (Figure 5). This can be illustrated by considering the distribution at high concentrations. Using the small angle approximation cos(x) ≈ 1 − (1/2)x², the distribution becomes

fvm(θ) ≈ exp(κ) exp(−(κ/2)(θ − µ)²) / (2π I0(κ))    (8)

which is clearly similar to the functional form of a standard Gaussian.
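This approximation is easy to verify numerically. Since the normalization 2π I0(κ) appears in both (7) and (8), it cancels when both densities are expressed relative to their value at the mode:

```python
import math

# Check of the small-angle approximation (8): at high concentration, the
# von Mises density is close to a Gaussian. Comparing densities relative
# to the mode removes the common 2*pi*I0(kappa) normalization.
mu, kappa = 0.0, 50.0

def vm_ratio(theta):
    """f_vm(theta) / f_vm(mu), from equation (7)."""
    return math.exp(kappa * (math.cos(theta - mu) - 1.0))

def gauss_ratio(theta):
    """The Gaussian approximation of the same ratio, from equation (8)."""
    return math.exp(-0.5 * kappa * (theta - mu) ** 2)

for d in (0.05, 0.1, 0.2):
    print(d, vm_ratio(mu + d), gauss_ratio(mu + d))  # nearly identical
```

At κ = 50 the two ratios agree to within a fraction of a percent for deviations up to 0.2 radians.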

As we will see in a later section, the structure of a protein can be represented using a sequence of angles. The von Mises distribution is therefore a powerful tool in the probabilistic modeling of protein structure. However, often the angles will occur in dependent pairs, which can be captured more accurately using bivariate angular distributions. As an example, consider angle pairs (θ, τ) where θ ∈ [0, π] and τ ∈ [−π, π]. Any such angle pair corresponds to a point on a sphere. Data of this type can be efficiently modeled using the Kent distribution [13], also known as Fisher-Bingham-5 (FB5). This is illustrated in Figure 6. Recently, this distribution was used to design a probabilistic model for a simplified representation of proteins [6].

For a more detailed description of the protein geometry, angle pairs of two full angles (both ranging from −π to π) are required. Such angle pairs correspond to points on a torus, and can be modeled using a bivariate von Mises distribution.


The Bivariate von Mises Distribution

A general form of the bivariate von Mises distribution was presented in 1975 [14]

fm(φ, ψ) = Cm exp(κ1 cos(φ − φ0) + κ2 cos(ψ − ψ0) + [cos(φ − φ0), sin(φ − φ0)] A [cos(ψ − ψ0), sin(ψ − ψ0)]ᵀ)    (9)

where Cm is a normalization constant, A is a 2 × 2 matrix, and (φ0, ψ0) are the mean values corresponding to φ and ψ, respectively. A sub-model with fewer parameters was later proposed by Rivest [15]

fr(φ, ψ) = Cr exp(κ1 cos(φ − φ0) + κ2 cos(ψ − ψ0) + α cos(φ − φ0) cos(ψ − ψ0) + β sin(φ − φ0) sin(ψ − ψ0)).    (10)

Two particular choices of α and β have been studied in some detail. For α = 0 and β = λ we have

fs(φ, ψ) = Cs exp(κ1 cos(φ − φ0) + κ2 cos(ψ − ψ0) + λ sin(φ − φ0) sin(ψ − ψ0))    (11)

which has been referred to as the sine model, described by Singh, Hnizdo and Demchuk [16]. For α = β = −κ3, we have the cosine model [17]

fc(φ, ψ) = Cc exp(κ1 cos(φ − φ0) + κ2 cos(ψ − ψ0) − κ3 cos((φ − φ0) − (ψ − ψ0)))    (12)

Both the sine and cosine models are designed as circular equivalents of the bivariateGaussian distribution. They have have the same number of parameters as the bivariateGaussian, and in the case of small deviations from the mean, we have

sin(x) ≈ x

cos(x) ≈ 1 − (1/2)x²

Inserting into (11) and (12), we get

fs(φ, ψ) ≈ Cs exp(κ1 + κ2) exp(−(1/2)(κ1(φ − φ0)² + κ2(ψ − ψ0)² − 2λ(φ − φ0)(ψ − ψ0)))

and

fc(φ, ψ) ≈ Cc exp(κ1 + κ2 − κ3) exp(−(1/2)(κ1(φ − φ0)² + κ2(ψ − ψ0)² − κ3((φ − φ0) − (ψ − ψ0))²))

= Cc exp(κ1 + κ2 − κ3) exp(−(1/2)((κ1 − κ3)(φ − φ0)² + (κ2 − κ3)(ψ − ψ0)² + 2κ3(φ − φ0)(ψ − ψ0)))



clearly demonstrating that both distributions can be conveniently approximated by bivariate Gaussians at high concentrations.
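This approximation is easy to check numerically. The sketch below compares the unnormalized sine-model density (Eq. 11) with its Gaussian approximation near the mean; the function names and concentration values are invented for this example, and NumPy is assumed:

```python
import numpy as np

def sine_unnorm(phi, psi, k1, k2, lam):
    """Unnormalized sine-model density (Eq. 11), with means at zero."""
    return np.exp(k1 * np.cos(phi) + k2 * np.cos(psi)
                  + lam * np.sin(phi) * np.sin(psi))

def gauss_approx(phi, psi, k1, k2, lam):
    """High-concentration Gaussian approximation, up to the constant Cs."""
    return np.exp(k1 + k2) * np.exp(
        -0.5 * (k1 * phi**2 + k2 * psi**2 - 2 * lam * phi * psi))

# Near the mean and at high concentration, the ratio is close to 1:
k1, k2, lam = 100.0, 100.0, 30.0
ratio = sine_unnorm(0.05, -0.03, k1, k2, lam) / gauss_approx(0.05, -0.03, k1, k2, lam)
```

Far from the mean, or at low concentration, the two densities diverge, which is why the approximation is only stated for small deviations.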

For our probabilistic model of local protein structure, presented in Chapter 2, a distribution was required for which efficient parameter estimation and sampling were possible. We embarked on a collaboration with Prof. Mardia from the University of Leeds to find the optimal distribution for this purpose. Mardia and coworkers completed a rigorous comparison of the cosine and sine distributions, and revealed a number of useful properties [17]. Their study demonstrated that the cosine model was more efficient in modeling correlations between angles, and this distribution was therefore chosen for our model. In the remainder of this text, I will refer to this distribution as the torus distribution.

During the implementation of our model, an extensive amount of work was put into finding optimal procedures for parameter estimation and sampling for the torus distribution. These details are presented in Appendix A.

Protein Structure

Proteins are vital to virtually every process in a biological organism, from enzymes catalyzing biochemical reactions, to structural or mechanical building blocks (e.g. cell structure, muscle contraction), to the regulation of the transcription of genes. It was postulated by Anfinsen and coworkers in the 1950s [18], and is now generally accepted, that many proteins have a unique three-dimensional structure, corresponding to a minimum of free energy. Proteins fold consistently to this structure, which is therefore often referred to as the native structure. To a large extent, it is the three-dimensional structure of a protein that determines its function, and consequently, much time and effort is spent on determining these structures.

A protein is a chain of amino acid residues, connected by peptide bonds to form a linear polymer. There are twenty amino acids that are encoded by codons in genes, and these are therefore the standard components of naturally occurring proteins. Each of the 20 amino acids has a Roman letter associated with it, so that the amino acid composition of a protein can easily be specified as a sequence of letters. Such a sequence is often referred to as the primary structure of a protein.

The amino acids have different chemical properties. Some are hydrophobic, being unable to form hydrogen bonds with water molecules, while others are hydrophilic and thereby tend to interact with surrounding water. When a protein is brought into contact with an aqueous environment, it tends to minimize its energy by arranging its three-dimensional structure in such a way that the hydrophobic residues are buried. This is illustrated in Figure 7.

The amino acids also interact with each other, forming hydrogen bonds. In 1951, before the first experimental determination of a complete protein structure, Corey and Pauling predicted that certain typical local structural motifs would arise from specific hydrogen-bond patterns [19]. These motifs, referred to as α-helices and β-sheets, were later confirmed, and are now known to exist in almost all proteins. The classification of structure into α-helices, β-sheets and coil (the rest) is called the secondary structure of a protein, while the tertiary structure refers to its three-dimensional structure (see Figure 8).

Figure 7: Schematic of the hydrophobic collapse, where hydrophobic residues (black) are buried inside the core of the protein, while hydrophilic residues (white) are typically found at the surface. Figure by Thomas Hamelryck.

Protein Structure Prediction

In recent years, a number of large-scale genomic sequencing projects have greatly increased the amount of protein sequence data available in online databases. However, the corresponding structures are not easily determined. Experimental techniques such as X-ray crystallography and NMR spectroscopy can accurately determine the structure of a protein, but these techniques are expensive and time-consuming. The experimental determination of a protein structure can easily take years in the laboratory. Much could therefore be gained if methods were available that could simulate the folding process on a computer, directly predicting the structure of a protein from its sequence of amino acids. This challenge is commonly referred to as the protein folding problem, or protein structure prediction⁴.

The protein structure prediction problem has been the subject of substantial amounts of research in the last decades, and as methodology and computing power constantly improve, a solution to the problem is slowly coming closer. In particular, homology modeling techniques are becoming quite powerful [20]. For a given target protein, these methods search the structure database to locate evolutionarily related proteins, and use their structures as templates for determining the new structure. Clearly, given the increasing number of experimentally solved structures, these methods have an even greater potential in the future.

⁴Strictly speaking, the terms protein folding and protein structure prediction are not completely equivalent. The latter is slightly less ambitious in that it focuses on finding the native structure, and not on the process that brings it there. The two terms are, however, often used interchangeably.

MTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
(a) Primary structure

CEEEEEECCCCCCEEEEEECCCHHHHHHHHHHHHHCCCCCCEEEEECCCCEEEEEC
(b) Secondary structure

(c) Tertiary structure

Figure 8: Three levels of protein structure illustrated for Protein G (2gb1). (a) The sequence of amino acids. (b) The helix (H), strand/sheet (E), coil (C) secondary structure classification. (c) The three-dimensional structure, annotated with secondary structure: α-helix (red), β-sheet/strand (yellow), coil (green).

The term ab initio is used to characterize prediction methods that do not use homology modeling, and instead attempt to solve the problem from basic principles. Despite the increasing success of homology modeling methods, ab initio methods will remain essential if we are to gain a detailed understanding of the folding process and the relationship between protein sequence and structure. There are at least two distinct challenges involved in the ab initio structure prediction problem. One is the design of a good energy function; the other is to devise an efficient sampling or search strategy. In this dissertation, I focus entirely on the second problem, which has been the main topic of my Ph.D.

Simulation and Optimization

There are two distinct ways of thinking about the protein structure prediction problem. It can be considered strictly as an optimization problem, where the goal is simply to find the structure with the lowest energy according to the given energy function. Alternatively, it can be approached as a simulation task, simulating the physical folding process at some specified level of detail. This could mean modeling the detailed dynamics of a protein using molecular dynamics (MD), or running Monte Carlo simulations, which can probe the dynamics of a system at equilibrium at larger time scales (thermodynamics).

In MD simulations, the aim is to model the actual dynamics of a protein system by iteratively calculating the forces on the individual atoms (based on the energy function) and solving the equations of motion that result from the corresponding accelerations. This type of simulation is often used to assist experimental protein structure determination, and to study the dynamics of proteins. However, due to the high level of detail, the time scales covered in such simulations are quite limited, typically in the nanosecond to microsecond range [21, 22]. The time scale of a complete folding process of a protein is known to range from nanoseconds to minutes [23].

Markov Chain Monte Carlo (MCMC) potentially spans a much wider range of time scales, and is therefore generally considered a better candidate for ab initio protein structure prediction. Rather than modeling the dynamics of a system, the goal of an MCMC simulation is to capture statistical (thermodynamic) properties of a system. The idea is to sample directly from the equilibrium distribution of the system, which in the case of protein simulations is the Boltzmann distribution. While the types of moves in an MD simulation are strictly dictated by Newton's laws of physics, there is no such restriction on the moves in an MCMC simulation. The only requirement is that the simulation is not biased, which can be ensured by enforcing detailed balance and ergodicity (see box).

Detailed balance is the requirement that, at equilibrium, any process in the system occurs at the same rate as the reverse process. For stationary probability distribution π, transition probability P, and states x and x′, this can be written as

π(x) P(x→x′) = π(x′) P(x′→x)

Ergodicity Intuitively, this requirement states that it is possible to move from any state to any other state in a finite number of moves.

Of all approaches to protein structure prediction, pure energy minimization (optimization) is the most liberal. There are no restrictions on the types of moves that can be employed. The only requirement is that they lead to an efficient conformational exploration. However, as long as energy functions remain imperfect, it does not seem ideal to seek only the global energy minimum, which could easily be an artifact of the energy function. For this reason, MCMC simulations are often preferred over energy minimization techniques, because they make it possible to explore not only the native state, but the entire equilibrium distribution, and thereby potentially the stability associated with any obtained conformation.
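The Metropolis acceptance rule is the standard way to enforce detailed balance in practice. The following minimal sketch samples from a Boltzmann distribution for a one-dimensional toy system; the function names and the harmonic energy are invented purely for illustration, not taken from any method described here:

```python
import math
import random

def metropolis(energy, propose, x0, n_steps, beta=1.0, seed=0):
    """Minimal Metropolis sampler. The acceptance rule enforces detailed
    balance with respect to pi(x) proportional to exp(-beta * E(x)),
    provided the proposal distribution is symmetric."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(n_steps):
        x_new = propose(x, rng)
        e_new = energy(x_new)
        # Accept with probability min(1, exp(-beta * (E_new - E))).
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            x, e = x_new, e_new
        samples.append(x)
    return samples

# Toy example: harmonic energy E(x) = x^2/2, so the equilibrium
# (Boltzmann) distribution at beta = 1 is a standard normal.
samples = metropolis(lambda x: 0.5 * x * x,
                     lambda x, rng: x + rng.uniform(-1.0, 1.0),
                     x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples)
```

The sample mean and variance approach 0 and 1, illustrating that the chain explores the full equilibrium distribution rather than a single minimum.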

Representation

The term backbone refers to those atoms in a protein that are connected to form the main linear chain. These atoms are the same for all amino acid types. The amino acids differ only in the structure of their side chain, for which the number of atoms varies greatly between the different amino acid types. A schematic representation of the protein chain is given in Figure 9. Note that in some cases, the term backbone also includes the O and H atoms.

For simulation purposes, a choice has to be made regarding the level of detail at which to represent a protein. Traditionally, MD studies represent all atoms in a protein explicitly. Sometimes water molecules interacting with the surface of the protein are also included (explicit solvent models). More recently, attempts to speed up MD simulations have led to the development of coarse-graining techniques, in which similar atoms are merged into pseudo-atoms.

Figure 9: Representation of a protein. The grey delimiters denote the individual residues, while the R-boxes correspond to side chains, which are different for each type of amino acid. The N-Cα-C-N-Cα-C-... chain is called the backbone of a protein (sometimes including the O and H atoms).

Markov Chain Monte Carlo methods often use more simplified representations. A common choice is to include only the heavy atoms of the protein backbone (N-Cα-C). Since the backbone bond lengths and bond angles are known to display very little flexibility, they can be set to fixed values [24]. Likewise, the ω dihedral angle around the peptide bond (connecting N and C) is normally considered to be in one of two states: either a trans conformation (ω = 180°) or the more uncommon cis conformation (ω = 0°). This leaves the two dihedral angles φ and ψ as the only degrees of freedom at each residue (Figure 10(a)). The (φ, ψ) angle pairs are sometimes plotted against each other in so-called Ramachandran plots [25], to assess the nature of the local structure at a given residue, or to detect outliers in a structure determination study. In some cases, the representation in Figure 10(a) is extended slightly to incorporate the first atom of the side chain (Cβ), and the O and H atoms to support backbone hydrogen bonding (see Figure 11). The positions of the additional atoms can be approximately determined from the backbone conformation, and the extended representation therefore does not introduce additional degrees of freedom.

Figure 10: Two different coarse-grained representations of protein structure. (a) Heavy-atom-only backbone representation, with dihedral angles φi, ψi at each residue. (b) Cα-only representation, with pseudo-bond angles θi and pseudo-dihedral angles τi.

Figure 11: The full backbone representation of a protein.

An even more coarse-grained approach is the Cα-only representation [26, 27]. Here, the entire protein is represented simply as a chain of Cα atoms, with the pseudo-bond angle θ and pseudo-dihedral angle τ associated with each Cα as the only degrees of freedom (Figure 10(b)). A protein can be reproduced surprisingly accurately from its sequence of (θ, τ) angles, and the Cα representation is therefore very appealing. However, for this representation, it appears to be difficult to design good energy functions. Specifically, it is not trivial to formulate a good hydrogen-bonding term (necessary to form β-sheets) using only the positions of the Cα atoms. This problem was one of the main motivations for developing the full-backbone probabilistic model of local structure, as described in Chapter 2.
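The dihedral angles used in these representations can be computed directly from atom coordinates. The sketch below is an illustrative implementation (not code from this work), assuming NumPy:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in radians (range [-pi, pi]) defined by four points,
    e.g. C(i-1), N, Ca, C for phi, or N, Ca, C, N(i+1) for psi."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))

# A planar zig-zag (substituents on opposite sides of the central bond)
# corresponds to a trans conformation, i.e. a dihedral of ±180°:
pts = [np.array(p, dtype=float)
       for p in [(0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)]]
angle = dihedral(*pts)
```

Applying this function along the backbone yields exactly the (φ, ψ) pairs of a Ramachandran plot, or the (θ, τ) pseudo-angles when fed consecutive Cα positions.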

Another type of coarse-graining that has been used frequently in protein folding is lattice simulation [28]. In such simulations, atoms are only allowed to move on a predefined lattice (grid), thereby discretizing three-dimensional space. This can significantly speed up simulations, but at the price of a considerable reduction in accuracy. With the increased computational power of recent years, more and more studies seem to be moving toward off-lattice simulations of proteins.

Local Structure

The set of recurring structural motifs found by Corey and Pauling has been extended substantially in the last decades. In addition to the α-helix and β-sheet, the set of known motifs now includes β-turns, β-hairpins, β-bulges, and motifs specific to the ends of helices (N-caps, C-caps). These motifs have been studied extensively, both experimentally and by knowledge-based approaches, revealing their amino acid preferences and structural properties (see ref. [29] and references therein).

It has been suggested that the recurring structural motifs may form early in the folding process, folding independently of the rest of the chain, and thus represent folding initiation sites [30, 31]. This idea is supported by several experimental studies that have demonstrated that these motifs form even in small peptides, which do not normally display well-defined structure [30, 32–38]. By treating regions with strong local structure separately during folding simulations, the conformational search space may be significantly reduced. In this section, I will introduce different methods of local structure classification, prediction and modeling. Although the standard helix-sheet-coil secondary structure classification is useful in many situations, it is not sufficiently detailed for specifying three-dimensional protein structure. I will therefore focus strictly on more detailed descriptions of the local structural properties.

Starting in the late 1980s, several attempts were made to automate the detection of local structure motifs in proteins, using the increasing amount of publicly available structural data. Three of these early studies are worth highlighting. In 1989, Unger et al. introduced the concept of a building block consisting of 6 residues [39]. It was proposed that all proteins could be reconstructed from a limited set of such blocks, by assembling them in different conformations. In their study, the building blocks were obtained in an automated fashion by clustering all 6-residue fragments in a data set of four proteins, so that the RMSD between any two fragments within a cluster was smaller than 1 Å (see box). One structure was chosen as a representative for each cluster, thus constituting the actual structural building block. It was demonstrated that out of all 6-residue fragments in the entire Brookhaven Protein Data Bank (PDB) (354 proteins at the time), 92% deviated by less than 1.25 Å from one of the building blocks. Although the results from reconstructing entire protein structures were not as convincing, the study clearly verified that local structural motifs are highly recurring in proteins and could be identified with an automated procedure.
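This style of clustering can be caricatured in a few lines of code. The sketch below is a simplified greedy variant, not the authors' actual algorithm; the names are invented, plain coordinate RMSD (without superposition) is used as the distance, and NumPy is assumed:

```python
import numpy as np

def rmsd(p, q):
    """Plain coordinate RMSD (no superposition) between two fragments."""
    return np.sqrt(np.mean(np.sum((p - q) ** 2, axis=1)))

def leader_cluster(frags, threshold):
    """Greedy sketch of building-block clustering: each fragment joins the
    first representative closer than `threshold`, or founds a new cluster."""
    reps, labels = [], []
    for f in frags:
        for i, r in enumerate(reps):
            if rmsd(f, r) < threshold:
                labels.append(i)
                break
        else:
            reps.append(f)
            labels.append(len(reps) - 1)
    return reps, labels

# Two nearly identical 6-residue fragments plus one distant fragment:
t = np.zeros((6, 3))
frags = [t, t + 0.1, np.full((6, 3), 5.0)]
reps, labels = leader_cluster(frags, threshold=1.0)
```

Here the first two fragments fall in one cluster and the third founds its own, mirroring the idea of collecting representatives whose members stay within a fixed RMSD radius.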

RMSD The root mean square deviation is a measure commonly used to compare two protein structures to each other. The method requires that both structures have the same length. Letting p and q denote the position vectors of the two proteins, it is calculated as

RMSD = √( (1/n) Σᵢ₌₁ⁿ ||pᵢ − qᵢ||² )

The structures are normally superimposed onto each other first [40], so that the RMSD value represents the minimal deviation between them.
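The formula in the box translates directly into code. A minimal sketch, assuming NumPy and deliberately omitting the superposition step:

```python
import numpy as np

def rmsd(p, q):
    """RMSD between two (n, 3) coordinate arrays of equal length. The
    structures should be optimally superimposed first (e.g. with the
    Kabsch algorithm [40]); this measures the deviation as given."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.mean(np.sum((p - q) ** 2, axis=1)))

# Three Ca positions and a rigidly shifted copy (1 Å along y):
ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
shifted = ca + np.array([0.0, 1.0, 0.0])
value = rmsd(ca, shifted)  # → 1.0
```

The rigid shift gives an RMSD of exactly 1 Å; after superposition the same pair would give 0, which is why superposition matters when comparing structures.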

Wodak and coworkers used a similar approach, but with variable-length fragments (4–7 residues). For efficiency, the distance between two fragments was calculated as the deviation of their inter-Cα distances [41]. They used a hierarchical clustering scheme, resulting in a tree relating the fragments to each other. By placing cuts in the tree at different heights, the method could produce clusterings of the data at different resolutions. They presented a final set of clusters, called structural families, which each contained 50 fragments on average, but were divided further if large backbone angle deviations were detected among the members. Using this method, many of the well-known structural motifs (for instance, various types of β-hairpins) were recovered.

Neither of the two described methods had a convincing way to determine the number of cluster components supported by the data. This was noted by Hunter and States, who instead proposed a heuristic Bayesian classifier approach that automatically determines the optimal number of clusters [42]. For the data set used, an optimum of 27 clusters was reported. Following the clustering procedure, the amino acid and secondary structure preferences of the clusters were analyzed, and well-known patterns were discovered. Finally, the authors calculated the Markov transition statistics between the clusters, effectively turning their method into a hidden Markov model (although they did not identify it as such). The study by Hunter and States seems to have been overlooked in more recent studies⁵, although this 1992 paper already contains many of the basic ingredients for building a fully probabilistic model of local structure.

At the time, the low number of solved structures available in online databases severely limited the accuracy of local structure classification schemes. However, the sequence databases contained much more information and grew at a faster rate. This fact motivated an alternative approach to local structure classification. Instead of clustering known structures and analyzing their amino acid preferences, the idea was to find patterns in sequence space first, and only then consider the corresponding structural motifs. Han and Baker proposed a method for finding sequence motifs in multiple sequence alignments from the HSSP database [43]. Each column in a multiple sequence alignment corresponds to a distribution over amino acids, and classes were defined by clustering individual columns (single-site motifs), or ranges of columns (length 2–15), based on the similarity of the corresponding distributions. For the single-site patterns, the results were as expected, with hydrophobic and polar residues grouped in separate clusters. For the longer fragments, more informative patterns emerged: clear amphipathicity signals (alternating hydrophobic and hydrophilic regions) were observed in some clusters, while other clusters displayed highly conserved amino acids at particular positions. By design, the HSSP database contains at least one experimentally determined protein structure for each sequence alignment, and each cluster was therefore associated with a number of structures. This made it possible to annotate the clusters with secondary structure, and thereby to relate them to well-known local structure motifs, with convincing results [44, 45].

In a later study, Bystroff and Baker extended the method to a clustering approach that simultaneously optimized both sequence and structure signals. This method also starts by clustering segments of multiple sequence alignments, and again, the design of HSSP ensures that each alignment has a database structure associated with it. For each cluster, one of the corresponding structures was chosen as the paradigm, a structural representative of the cluster. The algorithm proceeded with a two-step iterative refinement of the clusters. First, members were excluded from a cluster if they were structurally different from the paradigm. Second, new sequence profiles were created for each cluster, and alignments were then reassigned to the best-matching cluster. This was repeated until convergence. The procedure resulted in the I-sites library, containing 82 clusters, each consisting of a sequence profile and a paradigm structure. Most of the well-known local structure motifs were found to be among them, and several new motifs were identified [31].

While earlier studies focused primarily on the classification of local structure, an increasing number of methods were starting to take advantage of local structural motifs directly in the prediction of protein structure. Most techniques were based on assembling fragments of local structure to form complete structures. This technique is called fragment assembly, and was proposed as early as 1986 by Jones and Thirup as a method to construct models from density maps [46].

In 1991, Rooman et al. proposed a method for local structure prediction, based on a 7-state model, each state characterized by a specific value of the φ, ψ, ω dihedral angles [47]. Later, Scheraga and coworkers used a fragment-based technique, where small structural fragments were individually energy-minimized before being merged into a global structure [48, 49]. While both studies demonstrated reasonable results for local structure prediction, they unfortunately did not perform well on entire proteins. However, already in 1993, Bowie and Eisenberg presented the first complete fragment-assembly method for ab initio protein structure prediction, in the form of a genetic algorithm, and demonstrated remarkable results for small helical proteins [50]. The LINUS method, presented by Srinivasan and Rose in 1995, enforced local structure by updating 3-residue segments at a time, for each segment randomly choosing among an α-helix, β-sheet, β-turn and coil state, and updating the dihedral angles of the segment to the ideal values for that state [51]. This approach was tested with reasonable success on a number of protein fragments.

⁵This includes my own paper (Chapter 2). I only very recently became aware of this paper's existence.

CASP The Critical Assessment of Techniques for Protein Structure Prediction (CASP) is a biennial event, where structure prediction methods are tested and compared to evaluate the current state of the art. The targets used for prediction are exclusively unpublished protein structures, to ensure blind predictions, and thereby a fair comparison of the methods [52].

In 1997, Jones presented a fragment-assembly approach based on fragments with manually selected supersecondary structure, and demonstrated a correct prediction of a complete protein target from the second CASP competition [53] (see box). Finally, in the same year, Baker and coworkers presented the first version of the Rosetta protein structure prediction method, with a knowledge-based energy function and using multiple sequence alignments to select relevant fragments [54]. This study represented a significant step forward in the field, and the Rosetta method has since consistently been among the top-performing participants in CASP competitions.

In the last decade, numerous new fragment-assembly methods have been devised. A complete treatment is outside the scope of this introduction. However, we note that the principles introduced in these first years are still heavily used. In particular, fragment-assembly approaches remain at the core of most successful ab initio prediction methods today [55].

Probabilistic Modeling

As the success of fragment-assembly methods grew, there was an increasing interest in putting the problem of local structure on a more sound statistical footing through the design of probabilistic models. I will describe a few representative examples of this work here.

In 1999, Camproux et al. presented a hidden Markov model of local protein structure [56]. Much along the lines of the work by Hunter and States [42], their model represented a protein chain as a number of overlapping fragments. They used fragments of length four, where the internal structure of a fragment was captured through the distances between the Cα atoms (Figure 12). The sequential dependencies along the chain were modeled by a Markov chain. More precisely, each state in the HMM corresponded to specific parameters for a four-dimensional Gaussian distribution, modeling the descriptors (d1, d2, d3, d4). Given the data, the optimal number of states was found to be 12. Later, this number was extended to 27 on a larger data set [57]. The model was trained without using any prior knowledge about local structure, but still convincingly reproduced meaningful transitions between secondary structure regions.

Figure 12: The representation used by Camproux et al. Three overlapping fragments are shown. Each fragment is represented by three distances d1, d2, d3, plus an additional descriptor d4 (not shown) that measures the distance from the fourth Cα to the plane defined by the first three. Note that the d3 distance of any fragment corresponds to d1 in the fragment following it.

A similar approach was later presented by de Brevern, Etchebest and Hazout [58]. The authors proposed a protein representation based on overlapping 5-residue fragments, using the 8 internal dihedral angles as degrees of freedom. Again, an HMM was used to model transitions between fragments. The states of the HMM were found by clustering all 8-dihedral vectors in a data set of structures, and an optimum of 16 states was reported. The prediction aspect of the model was investigated in some detail. In particular, a complex scheme was presented to allow for prediction based on amino acid sequence, despite the fact that this information was not directly included in the model. The scheme worked as follows: after the model had been trained, the entire training set was annotated with the best-matching HMM state at each position. Based on this annotation, each HMM state was then associated with an amino acid frequency table. This table reflected the amino acid preferences in the sequence region surrounding the state wherever it occurred in the annotated training set. This allowed the authors, for a given input sequence, to predict the best protein block for each position. They reported reasonably good accuracy when reconstructing protein structures based on these blocks. In a later paper, the authors presented further analyses of the original model [59], and recently, a new HMM was proposed that modeled 11-residue fragments composed of 7 of the original fragments, in order to capture longer-range dependencies [60].

Both of the above approaches are problematic in the way they represent the protein backbone. The four-distance representation by Camproux et al. has the problem that the distances are not necessarily consistent within a fragment. There are combinations of (d1, d2, d3, d4) values that simply do not correspond to a three-dimensional conformation of atoms. This problem is avoided by the dihedral fragment representation of de Brevern et al., but there is another consistency problem: for both these models, the representation at one position partially overlaps the representation of the next. In the first method, the Cα-Cα distance d3 at one position is actually identical to the distance d1 of the next (see Figure 12), while the second method shares a number of angular degrees of freedom between each position. This is problematic for sampling and prediction purposes, since sampled values at different positions will generally not be consistent, and some method is required to assemble the fragments. Both methods focus on prediction of local structure and solve this problem either by averaging over the predicted values at each position or by superimposing overlapping fragments. However, it is not clear how to design a formally correct sampling strategy using these models.

Several years earlier, Dowe et al. had proposed a cleaner representation of local structure. The authors essentially extended the work of Hunter and States, but represented protein local structure using the (φ, ψ) dihedral angles of each residue. These angles were modeled using the von Mises distribution, which correctly handles the inherent periodicity of angular data. While their original method was primarily a clustering algorithm in (φ, ψ) space, the approach was extended to an HMM in a later study [61]. The authors identified some problems associated with modeling the two angles independently, which they attempted to solve using transformations of the angular data, but otherwise, their model was remarkably simple. Although the model does not incorporate amino acid sequence information, and is therefore not directly applicable for simulation or prediction purposes, this work demonstrated how angular distributions can be used to model local protein structure in an elegant way.
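The idea of an HMM emitting dihedral angles from von Mises distributions can be sketched in a few lines. The two states, transition matrix and angle means below are invented purely for illustration (they are not the parameters of any model discussed here), and NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-state illustration: a hidden Markov chain over
# local-structure states, each state emitting a (phi, psi) pair from
# its own pair of independent von Mises distributions.
trans = np.array([[0.9, 0.1],      # row s: P(next state | current = s)
                  [0.2, 0.8]])
means = np.array([[-1.05, -0.79],  # roughly helical (phi0, psi0), radians
                  [-2.36,  2.36]]) # roughly extended (phi0, psi0)
kappa = 10.0

def sample_chain(n):
    """Sample n residues of hidden states and emitted (phi, psi) angles."""
    states, angles = [], []
    s = 0
    for _ in range(n):
        states.append(s)
        angles.append((rng.vonmises(means[s, 0], kappa),
                       rng.vonmises(means[s, 1], kappa)))
        s = rng.choice(2, p=trans[s])
    return states, angles

states, angles = sample_chain(500)
```

Because the sequential dependency lives entirely in the hidden chain, the emitted angles at different positions are independent given the hidden states, which is exactly the property that makes sampling from such models straightforward.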

The methods described above are all fundamentally geometrical, in that they do not take amino acid or secondary structure information into account directly in the design of the model. For simulation and prediction purposes, it is important that models can be conditioned on any such available input information. The first model to rigorously solve this problem was the HMMSTR method by Bystroff, Thorsson and Baker, from 2000 [62]. It was a multi-track HMM that simultaneously modeled sequence, secondary structure, supersecondary structure and dihedral angle information. The model can be viewed as a probabilistic version of the I-sites fragment library described previously [31]. To train the model, all fragments from the I-sites library were initially converted into simple linear chains of states, each chain consisting of a single state for each position. These simple linear models were then merged into one large state diagram based on similarity between states. This provided an initial estimate of both transition and emission probabilities for the model, which were then further optimized using the EM algorithm. Finally, several procedures were implemented to reduce the complexity of the model. Three different types of models were trained, optimized for either secondary structure, supersecondary structure or the prediction of backbone angles. Even though HMMSTR is based on a fragment library, it avoids the representation issues mentioned for some of the earlier methods: the sequential dependency in HMMSTR is handled exclusively by the hidden Markov chain, and emitted symbols at different positions are independent given the hidden sequence. The authors identified a wide range of possible applications for the model, including gene finding, secondary structure prediction, protein design, sequence comparison and dihedral angle prediction, and presented impressive results for several of these applications. Unfortunately, for the purpose of protein simulation or prediction, HMMSTR had one significant drawback: the (φ, ψ) dihedral angle output was discretized into a total of 11 bins, a significant limitation on the structural resolution of the model.

Recently, Hamelryck, Kent and Krogh presented a probabilistic model of local structure designed for the Cα representation of proteins [6]. As a continuation of this work, I developed a corresponding model for the full backbone representation of proteins. Both models can be seen as a logical extension of the different methods described above. They are sequential probabilistic models that include amino acid and secondary structure information, and model the angular degrees of freedom of the protein's backbone in continuous space. They allow for the direct sampling of backbone structures using the forward-backtrack algorithm, and can therefore be used as proposal distributions in MCMC simulations. In the full backbone model, the (φ, ψ) angular degrees of freedom are modeled using the bivariate von Mises distribution, which correctly captures the correlation between φ and ψ and avoids the dependency problems mentioned by Dowe et al. The model (TorusDBN) is presented in Chapter 2.
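To make the angular coupling concrete: one common bivariate von Mises variant is the cosine model, whose unnormalized log-density can be written down directly. The sketch below is illustrative only (the sign convention on the interaction term follows the cosine model; the parameters are hypothetical, not fitted TorusDBN parameters) and shows how a single κ₃ term couples φ and ψ:

```python
import numpy as np

def bvm_cosine_logdensity(phi, psi, mu, nu, k1, k2, k3):
    """Unnormalized log-density of a bivariate von Mises 'cosine' model:
    f(phi, psi) ∝ exp(k1*cos(phi-mu) + k2*cos(psi-nu) - k3*cos(phi-mu-psi+nu)).
    The k3 term couples the two angles; with k3 = 0 the model factorizes
    into two independent univariate von Mises distributions."""
    return (k1 * np.cos(phi - mu) + k2 * np.cos(psi - nu)
            - k3 * np.cos((phi - mu) - (psi - nu)))
```

With k3 = 0 the log-density is additively separable, which is exactly the independence assumption that caused problems for Dowe et al.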

Figure 13: A pivot move, corresponding to a change in the dihedral angle pair at position 37 (marked in yellow) in the native structure of protein G. While the extent of the change of angles is reasonable viewed from a local perspective, the move will be rejected because it causes a clash further down the chain. This effect is even more pronounced for larger proteins.

Local Moves

One of the benefits when moving from molecular dynamics to Markov Chain Monte Carlo simulations is an increased flexibility in the choice of moves. As long as detailed balance and ergodicity can be proven, any move is allowed. This fact has motivated a large variety of moves proposed in the literature over the last decades. One of the simplest moves is the pivot move, which involves the modification of one or a few dihedral angles at an arbitrary position in the chain. Despite its crudeness, this move is quite effective, and moves of this type are frequently included in simulation methods. However, around the densely packed native structure, a random modification of an angle often leads to clashes further down the chain (see Figure 13). Many attempted pivot moves will therefore be rejected, leading to an inefficient sampling of this part of the conformational space. Increased simulation efficiency can be obtained by introducing an additional move that modifies a number of angles in a small region of the chain, while keeping all positions outside the region fixed. Such moves are called local moves.

Figure 14: Crank-shaft move.

In this introduction, I will make a distinction between local moves used in MCMC simulations and local moves used for optimization (heuristic moves). While the design of the first type is constrained by considerations of detailed balance, the latter has no such requirement, and can therefore be considered a less complex problem.

MCMC Moves

One of the oldest and simplest local moves for MCMC simulations is the crank-shaft move, illustrated in Figure 14. The idea is to use the vector connecting two atoms i and j as a rotation axis around which all atoms i < k < j are revolved. It was first introduced for lattices by Verdier and Stockmayer in 1962 [63], and was later generalized to off-lattice (continuous) simulations by Kumar et al. [64]. For completely flexible polymer chains, the crank-shaft move is very efficient. However, for proteins, it is more problematic. The rotation involved generally modifies both the dihedral angle and bond angle at the end point atoms. In protein simulations, bond angles are typically constrained rather strictly around their ideal values, and the flexibility of the crank-shaft move is therefore much reduced⁶.
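The geometry of the move is easy to state in code. The following sketch (a toy Cα trace with NumPy, not a published implementation) revolves the interior atoms about the i–j axis using Rodrigues' rotation formula; note that it leaves the bond angles at the end points unconstrained, which is exactly the problem described above:

```python
import numpy as np

def crankshaft(coords, i, j, angle):
    """Rotate the atoms strictly between i and j about the axis through
    atoms i and j (Rodrigues' rotation formula). Atoms outside the
    (i, j) segment are untouched, so the move is local by construction."""
    coords = coords.copy()
    axis = coords[j] - coords[i]
    axis /= np.linalg.norm(axis)
    c, s = np.cos(angle), np.sin(angle)
    for k in range(i + 1, j):
        v = coords[k] - coords[i]
        coords[k] = coords[i] + (v * c + np.cross(axis, v) * s
                                 + axis * np.dot(axis, v) * (1 - c))
    return coords
```

Because the interior atoms undergo one rigid rotation about a fixed axis, all distances within the segment (and to the fixed end points) are preserved; only the angular geometry at atoms i and j changes.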

The more recent research in local MCMC moves has roughly progressed along two different paths, which I will refer to as the concerted rotation approach and the configurational bias approach.

Concerted Rotation The concerted rotation type moves are built around the ideas presented in the classic paper by Go and Scheraga from 1970 [65]. The paper contained a rigorous analysis of the geometrical problem behind local moves: given a segment of fixed length at an arbitrary position in the chain, how can the dihedral angles within this segment be modified so that the positions outside the segment remain fixed? It was shown that the constraints necessary to fix the end points of the segment involve six degrees of freedom, implying that the segment should span at least 6 dihedral angles. The authors demonstrated that the value of these dihedral angles could be determined by numerically solving a single equation of one variable.

⁶This is actually not true for the Cα-only representation of proteins, where the pseudo-bond angle θ is not restricted around a particular value.

Figure 15: Schematic of the concerted rotation method by Theodorou and coworkers. The initial driver dihedral angle is modified freely, after which a solution is found for the remaining 6 dihedral angles so that only the positions of atoms 1–4 are modified.

The first fully functional concerted rotation type move was provided by Theodorou and coworkers in 1993 [66]. They used a 7 dihedral angle region, modifying the first angle freely (the driver angle), and updating the remaining 6 by solving Go and Scheraga's equation. These two steps are commonly referred to as the prerotation and the postrotation, respectively. For each move, this approach modified a total of four atom positions (see Figure 15). In contrast to the study by Go and Scheraga, Theodorou and coworkers were concerned with the implementation issues of the move, and demonstrated several interesting properties of the solution procedure. For instance, they presented strategies for correctly handling the multiple solutions of the equation in order to maintain detailed balance. Most importantly, however, this study was the first to identify that a bias was induced by the constraints on the end point, which must be compensated for by an appropriate Jacobian determinant. This had been neglected in previous implementations of Go and Scheraga's ideas [67–69]. In later papers, a symmetric version of the algorithm was presented, using two driver angles [70, 71], referred to as the intramolecular rebridging move (Figure 16). It was similar to the original method, but some improvements were made to the numerical solution procedure, making it possible to exclude certain regions of the search space when solving the postrotation problem. In addition, a simplified version of the algorithm was proposed, which only required finding a single solution to the equation [70].
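The role of the Jacobian can be made explicit in the Metropolis-Hastings acceptance step. The sketch below is schematic (the function name is hypothetical, the J_new/J_old convention follows the form commonly used in the concerted rotation literature, and the energies and Jacobian determinants are assumed to be supplied by the caller):

```python
import math
import random

def accept_concerted_rotation(dE, beta, jac_old, jac_new):
    """Metropolis-Hastings acceptance test for a constrained local move.
    The chain-closure constraint distorts the effective proposal density,
    which is corrected by the ratio of Jacobian determinants; omitting
    it (as in early implementations) biases the sampled ensemble.
    Acceptance probability: min(1, (J_new / J_old) * exp(-beta * dE))."""
    ratio = (jac_new / jac_old) * math.exp(-beta * dE)
    return random.random() < min(1.0, ratio)
```

Setting both Jacobians to 1 recovers the plain Metropolis criterion, i.e. exactly the biased scheme that Theodorou and coworkers corrected.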

Figure 16: The intramolecular rebridging or double driven concerted rotation move. The first and last dihedral angles are modified stochastically, and new positions for atoms 1 and 5 are computed. Subsequently, the chain is rebridged with a new atom 2, 3, 4 trimer, after which the remaining 6 dihedral angles are updated. Apart from the additional driver angle, it is similar to the original concerted rotation move.

While Theodorou and coworkers were concerned with general polymers, Hoffmann and Knapp developed similar techniques specifically for proteins. In particular, in their applications, the ω dihedral angle around the C−N bond was kept fixed at 180°, corresponding to a trans state. Together with Irgens-Defregger, Knapp was actually among the first to implement a method based on the ideas of Go and Scheraga, but originally did not include the necessary Jacobians to obtain unbiased sampling [68, 69]. This was corrected in the later work together with Hoffmann [72, 73]. Their window move distinguishes itself from the methods of Theodorou's group by not specifying the prerotation angles in advance, and by formulating the method as a general procedure for windows of any size. Three Cα positions within a window are chosen randomly. The central atom marks the position of a chain break, while the other two mark the left and right joints, corresponding to the dihedral angles that will be modified during postrotation. All remaining (φ, ψ) angles within the window are modified during prerotation. See Figure 17 for a few examples of such moves.


Figure 17: Three examples of the window move by Hoffmann and Knapp. Within a window, three random Cα atoms are chosen, corresponding to the postrotational degrees of freedom (grey). Any other Cα atoms within the window are modified during prerotation (white).

The early concerted rotation type moves described above are clearly highly similar, mainly differing in the number of prerotated angles and in the distribution of prerotational and postrotational angles within the local move window. More recently, a few extensions to this basic formula have been presented.

In 1999, Wedemeyer and Scheraga revisited the original geometrical problem, recasting it to a new form for which solutions can be found as roots of a polynomial of degree 16 [74]. The new form facilitates the solution process and gives a better understanding of the underlying problem, but is otherwise equivalent to the original description. The new formalization was later generalized by Dill and coworkers, allowing the postrotational dihedrals to be chosen at arbitrary positions in the chain, rather than consecutively [75].

All the methods above basically assume the dihedral angles of the chain to be the only degrees of freedom, taking bond angles and bond lengths to be fixed. Recently, Ulmschneider and Jorgensen presented a concerted rotation move method called CRA, where bond angles were included as full degrees of freedom [76]. With this approach, there are five angular degrees of freedom per residue, rather than the two involved in the window moves by Hoffmann and Knapp. This greatly reduces the minimum range of the move, since the postrotation can now be handled within two neighboring residues (see Figure 18). To maintain the range and efficiency of previous methods, the authors increased the number of prerotated angles. For this approach to be feasible, it was necessary to devise a strategy ensuring that the chain break induced during prerotation did not become excessively large, which would create problems for the postrotation step. Instead of sampling the prerotation angles uniformly, the authors used a distribution with a bias towards small gaps at the chain break. This idea was originally developed as a semi-local move by Favrin, Irbäck and Sjunnesson [77], but also proved ideal for this application. Ulmschneider and Jorgensen demonstrated that their method considerably outperforms the classic concerted rotation method.


Figure 18: Ulmschneider and Jorgensen's CRA move, where bond angles are included as degrees of freedom. Note the large number of prerotational degrees of freedom, which are sampled from a distribution with a bias towards small deviations at the break point.

Configurational Bias The configurational bias methods have their roots in the 1955 paper of Rosenbluth and Rosenbluth, which contains a description of a method for generating random self-avoiding polymer chains on a lattice [78]. The method constructs a chain one atom at a time in an iterative fashion, employing a weighting scheme to ensure unbiased sampling. These weights are now often referred to as Rosenbluth weights.

Siepmann and Frenkel built the configurational bias (CB) MCMC move around these ideas, and demonstrated how Boltzmann factors could be incorporated into the Rosenbluth weighting scheme in order to obtain a more efficient sampling [79]. Several subsequent studies generalized this scheme to continuous space [80, 81]. It should be noted that all these moves resampled the chain from a randomly selected point to one of the end points, and were therefore not local moves.
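The weighting scheme is easiest to see in the original lattice setting. The sketch below is an athermal 2-D version (every Boltzmann factor equal to 1, so only the counting weights remain; function names are illustrative) that grows a self-avoiding walk and accumulates its Rosenbluth weight:

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def grow_saw(n, rng=random):
    """Grow an n-step self-avoiding walk one site at a time.
    At each step, enumerate the empty neighbouring sites; choosing
    uniformly among the k_i free sites contributes a factor k_i to the
    Rosenbluth weight W = prod(k_i), which corrects the bias towards
    compact conformations when averaging observables.
    Returns (sites, W); W = 0 signals a dead end (attrition)."""
    sites = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n):
        x, y = sites[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return sites, 0.0          # trapped: discard this chain
        weight *= len(free)
        nxt = rng.choice(free)
        sites.append(nxt)
        occupied.add(nxt)
    return sites, weight
```

The CB move of Siepmann and Frenkel multiplies each step's weight by Boltzmann factors as well, steering the regrowth towards low energy conformations while keeping the sampling unbiased.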

Escobedo and de Pablo were the first to turn these ideas into a local move [82], called extended continuum configurational bias (ECCB). In their approach, the chain segment is rebuilt, one atom at a time, in a self-avoiding manner, but using a simple geometric constraint on the possible positions: each atom can only be placed at positions that are compatible with the requirement that the chain can still be closed using the remaining atoms. As illustrated in Figure 19, the first atoms will be placed rather freely, but as the gap becomes smaller, the range of allowed angles becomes increasingly narrow, and the final atom is confined to the circle defined by the bond lengths to the previous and next atom (i.e. a crank-shaft move).
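The geometric constraint amounts to a simple reachability test. The version below (a freely jointed chain with a single fixed bond length b, which is an idealization of the actual ECCB construction; the function name is hypothetical) answers whether a partially regrown segment can still close the gap:

```python
import numpy as np

def can_still_close(pos, end, bonds_left, b, tol=1e-9):
    """Feasibility test used when regrowing a gap one atom at a time:
    from 'pos', can a freely jointed chain of 'bonds_left' bonds of
    length b still reach 'end'?  With two or more bonds, any distance
    up to bonds_left * b is reachable (the chain can fold back on
    itself); the last bond must land exactly on a sphere of radius b,
    which becomes the crank-shaft circle mentioned in the text."""
    d = np.linalg.norm(np.asarray(end) - np.asarray(pos))
    if bonds_left == 0:
        return d <= tol
    if bonds_left == 1:
        return abs(d - b) <= tol
    return d <= bonds_left * b + tol
```

Rejecting candidate positions that fail this test is what keeps every partially regrown ECCB segment closable.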


Figure 19: Extended continuum configurational bias local move. The dark grey atoms are stationary, while the positions of the white atoms are updated. The light grey area illustrates the region of possible positions for a given atom. For simplicity, the algorithm is here presented in 2-D. In reality the grey areas are cones, and the two possible solutions for the position of atom 4 correspond to a circle of all solutions with the correct bond length distance to both atoms 3 and 5.

The ECCB method can be fairly easily implemented and generally results in larger structural changes than the concerted rotation type moves. While the move seems to work well for fully flexible chains, it has been reported to be problematic for more constrained representations (for instance when using fixed bond lengths and bond angles) [83, 84]. At least three methods have been proposed to overcome this problem. All three abandon the simple geometric approach of the ECCB method, and reintroduce the energy weighting scheme from the CB method to guide solutions towards low energy structures. To ensure that the chain can be closed, alternative strategies were required to bias the regrowing atoms towards the end of the gap. Uhlherr's internal configurational bias method (ICB) used a hypothetical spring potential, biasing each regrowing atom towards the smallest possible distance to the end point, and used a concerted rotation type approach for the last three atoms [85]. Chen and Escobedo used a similar approach, but tried to estimate the probability of finding a regrowing atom at a certain distance from the end point using a short preliminary simulation, and solved the positioning of the final atom as a weighted version of the crank-shaft move [84]. Finally, Wick and Siepmann proposed a similar approach using a self-adapting presimulation scheme [83].

Common Themes Despite their distinct backgrounds, the concerted rotation and configurational bias types of local move seem to be converging. The more recent configurational bias approaches all treat the last atoms of the local move segment as a special case, and thus effectively divide the local move problem into a prerotation and a postrotation step, similar to the concerted rotation approaches.

Page 35: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

29

Regarding the prerotation, a major difference between the two types of method has been that the prerotation played a minor role in concerted rotation type moves, while it was paramount to the configurational bias approaches. However, this distinction is disappearing. For instance, as we saw, Ulmschneider and Jorgensen proposed a concerted rotation type method with a large number of prerotation angles, with a similar range (maximum window size) as the configurational bias methods.

The term postrotation can conveniently be used in general to describe the special case of positioning the final atoms for both types of method. This is done either through a crank-shaft type move, or by the more complex numerical solution inspired by Go and Scheraga. The optimal choice of method depends on the degree to which the representation of the chain is constrained. If the chain is entirely flexible, the crank-shaft solution is quite efficient. However, in cases where bond angles and bond lengths are fixed, the crank-shaft approach breaks down, and a concerted rotation type approach becomes necessary.

The method that we present in Chapter 3 can be seen as a natural extension of both types of local moves. We investigate how best to incorporate a prior distribution into a prerotation move, an idea that is similar to the weighting schemes in the configurational bias methods. In our case, however, we have a generative prior, which makes it possible to do the prerotation without the step-wise procedure and the discretization of conformational space. For the postrotation, we use the usual concerted rotation ideas, but demonstrate that for our representation, in which bond angles are included as degrees of freedom, the problem has a simple analytical solution.

Heuristic Moves

In this section, I will cover a number of local moves for which detailed balance cannot be, or has not been, proven. While such moves are not ideal for MCMC simulations, they can be very useful in energy minimization schemes. The list of such methods is long, and I will therefore mention only a few representative examples.

The random tweak method by Levinthal and coworkers was one of the early attempts at a local move [86]. The method consisted of two steps. First, new dihedral angles were sampled uniformly for a segment of the chain, generally introducing a chain break. Subsequently, an iterative refinement technique was used to return the chain to a closed state, using Lagrange multipliers to express the necessary constraints. The new chain conformation was accepted when the constraint distances fell below some predefined cut-off value. A similar approach was employed in the LMProt algorithm by da Silva, Degrève and Caliri, but in their case, small deviations were introduced in Cartesian space, by directly modifying atom positions [87]. In their approach too, an iterative process using Lagrange multipliers was used to enforce the constraints.

Inspiration from the Loop Closure Field A well known problem from homology modeling, called the loop closure problem, is roughly equivalent to the local move problem, and the loop closure literature is therefore an abundant source of inspiration for the design of local move methods. The loop closure problem arises when one attempts to predict the structure of a protein from homologues for which the structure is known. For close homologues, large parts of the structure will generally be evolutionarily conserved, but for certain regions (typically loop regions), the alignment will fail, leaving gaps in the structure. To obtain a complete structure, it is necessary to bridge these gaps, a problem which is clearly equivalent to the local move problem. The homology modeling literature contains various methods to accomplish this task. However, in the context of loop closure, the property of detailed balance is typically not considered important. The goal is simply to find a number of low energy loop conformations that will bridge the gap. Loop closure algorithms are therefore not generally applicable as local moves in MCMC simulations, but they can be used directly for energy minimization purposes in protein structure prediction.

The fragment assembly approach that I described in the context of local structure prediction is also used frequently for loop closure. In fact, Jones and Thirup's original description of fragment assembly from 1986 included loop closure as an application [46]. Koehl and Delarue later followed their lead and developed a systematic loop closure method with significantly more data at their disposal [88]. Their idea was to scan a database for structural fragments with Cα-Cα distances similar to the distances of the atoms immediately preceding and succeeding a gap, attempting to bridge the gap with a single fragment. After being subjected to some additional filters, the best fragment was inserted, and an energy minimization scheme was used to relax the chain. It should be noted that the length of loops that could be closed by this method was rather limited, since individual fragments were required to span the entire gap.

Baker and coworkers used the Rosetta fragment assembler in a similar approach, but with a more advanced scheme of fragment selection and an explicit treatment of long loops [89]. In this method, fragments were selected based on geometric properties, sequence profile similarity, secondary structure similarity, and the known secondary structure of the atoms preceding and succeeding the gap. To optimize the fit of the resulting loops, several fine-tuning procedures were implemented. Convincing results were presented, both for short and long loops.

Finally, as an alternative to the fragment-based approaches, Jacobsen et al. proposed (φ, ψ) histograms as the basis for a loop closure method [90]. These histograms were used to create thousands of randomly generated conformations from each end of the gap, selecting those pairs of conformations that met at the middle. For each such pair, the atom position at the midpoint was calculated as the average of the corresponding position in the two fragments. The many generated solutions were then filtered, clustered, subjected to side-chain optimization and energy-minimized before the final candidate was selected. The authors demonstrated an improved accuracy over previous methods, but reported that the sampling method becomes problematic for loops longer than 15 residues.

Inspiration from Robotics It has recently become clear that some of the methods designed in the field of robotics are directly applicable as loop closure or local move algorithms. More precisely, the problem of moving the end point of a multi-jointed manipulator (robot arm) to a fixed position in space is virtually identical to the problem of bridging a gap in a protein structure. Taking advantage of this, Canutescu and Dunbrack used the ideas of the cyclic coordinate descent (CCD) algorithm from robotics [91] as inspiration for an efficient loop closure method with the same name [92].

The CCD algorithm is an iterative relaxation algorithm. Initially, a candidate structure is created for the atoms in the loop, including a three atom overlap anchor at the far end of the gap (see Figure 20). This initial structure can be generated in any fashion, for instance using random angles or an assembly of fragments. The segment is then improved through a greedy algorithm, iterating over the loop atoms one at a time. For each atom, the optimal dihedral rotation is calculated, bringing the anchor as close as possible to the target structure. The procedure is repeated until the distance between the anchor and the target drops below a predefined cut-off value. The move is intuitively clear and easily implemented. The performance depends on the loop segment length, but remarkably, in contrast to the methods described previously, the percentage of successfully closed loops increases with loop size. Canutescu and Dunbrack also presented a strategy to incorporate constraints in the CCD move in the form of Ramachandran probability maps, thus enforcing realistic local structure in the closed loops.
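The greedy iteration is simple enough to sketch for a planar chain. The toy version below (two-dimensional joints with a single rotational degree of freedom each, rather than the protein dihedral version with its three atom anchor; the function name is hypothetical) captures the core of CCD: visit each joint in turn and apply the locally optimal rotation that swings the end point towards the target:

```python
import numpy as np

def ccd_close(joints, target, tol=1e-3, max_iter=200):
    """Cyclic coordinate descent on a 2-D chain of unit-length links.
    Repeatedly visit each joint (from the end of the chain inwards) and
    rotate everything beyond it so that the end point lies on the ray
    towards the target -- the optimal single-joint update."""
    pts = np.asarray(joints, dtype=float).copy()
    tgt = np.asarray(target, dtype=float)
    for _ in range(max_iter):
        for i in range(len(pts) - 2, -1, -1):
            to_end = pts[-1] - pts[i]
            to_tgt = tgt - pts[i]
            a = (np.arctan2(to_tgt[1], to_tgt[0])
                 - np.arctan2(to_end[1], to_end[0]))
            c, s = np.cos(a), np.sin(a)
            R = np.array([[c, -s], [s, c]])
            pts[i + 1:] = pts[i] + (pts[i + 1:] - pts[i]) @ R.T
            if np.linalg.norm(pts[-1] - tgt) < tol:
                return pts, True
    return pts, False
```

Each update is a pure rotation about a joint, so the link lengths are preserved automatically, just as bond lengths are preserved in the dihedral formulation.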


Figure 20: A single step in the CCD move. One dihedral is changed in order to minimize the deviation between the anchor atoms at the end point. Note that the loop in this example is significantly shorter than the length recommended in the CCD paper.

In Chapter 1, I present a method very similar to the CCD algorithm, but for the Cα-only representation of proteins. Since this representation involves both a pseudo-bond angle and a pseudo-dihedral angle for each atom, the optimal update in each iteration involves a full rotation, which can be found using the singular value decomposition technique known from linear algebra.
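The SVD-based rotation fit referred to here is, in essence, the Kabsch superposition algorithm. A minimal sketch (generic point sets, not the FCCD-specific update with its angular constraints):

```python
import numpy as np

def optimal_rotation(P, Q):
    """Kabsch algorithm: the proper rotation R that best maps the
    centred rows of P onto the centred rows of Q in a least-squares
    sense. The sign flip on the smallest singular direction excludes
    improper rotations (reflections), so det(R) = +1."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)       # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # q_i ≈ R p_i (after centring)
```

Given two point sets related by an unknown rotation, the procedure recovers that rotation exactly; with noisy points it returns the least-squares optimum.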


Overview

In the initial phase of my Ph.D., my supervisor Thomas Hamelryck had begun the development of a probabilistic model for the local structure of proteins using the Cα representation (the FB5-HMM). At that time, the goal of our efforts was to develop a complete protein prediction method for the Cα representation, with the FB5-HMM at the core of our simulations.

To increase the efficiency of our method, we sought a local move algorithm that could be naturally combined with the FB5-HMM. A study by Canutescu and Dunbrack demonstrated that Ramachandran probability maps could be integrated into their CCD local move method [92], which inspired us to develop a similar method for the Cα representation, resulting in the FCCD algorithm described in Chapter 1.

While the FCCD algorithm worked as intended, we found that there were several problematic aspects of our approach to protein structure prediction. First, we were unable to demonstrate detailed balance for the FCCD algorithm, which rendered it less appealing than first assumed. Second, it proved to be difficult to design a good energy function for the Cα representation. In particular, we were not successful in obtaining β-sheets in our simulations, due to problems regarding the definition of hydrogen bond energies in this coarse-grained representation. These problems were the main driving force behind the development of the TorusDBN, a probabilistic model of local structure for the full backbone representation. The success of this model depended on efficient sampling and parameter estimation techniques for a bivariate von Mises distribution, which turned out to be more involved than for the FB5 distribution. With the help of Kanti Mardia and Charles Taylor, these issues were eventually resolved (Appendix A). Through a rigorous comparison to several of the established methods in the field, we demonstrated that the TorusDBN is comparable in performance to the current state of the art. In addition, the probabilistic nature of the model makes it an appealing alternative to fragment libraries in the context of MCMC simulations. The resulting paper (Chapter 2) represents the main contribution of my Ph.D.

While preparing for our current participation in CASP (CASP 8), different strategies were discussed for how the TorusDBN can most efficiently be used in a simulation. Some considerations on detailed balance are included in Appendix B. In collaboration with Jesper Ferkinghoff-Borg, I also revisited the local move problem, intent on finding a way to incorporate the probabilistic model so that the local move and the pivot-type moves of the TorusDBN both correspond to the same proposal distribution. This work led to the manuscript in Chapter 3.

Finally, some initial work has begun on analyzing the model structure of the TorusDBN, to uncover the level of structural detail captured by the model. Appendix C includes a first step in this direction, illustrating how transitions between different secondary structure regions are handled by the model.

A short description of each of the chapters follows here. Detailed introductions are included in the individual papers, and a summary is included in Concluding Remarks.

Chapter 1: Full cyclic coordinate descent: solving the protein loop closure problem in Cα space

This chapter describes a local move method designed for the Cα representation of proteins. It was inspired by the Cyclic Coordinate Descent (CCD) algorithm by Canutescu and Dunbrack [92], but involves a different scheme for calculating the optimal rotation at each step of the iterative procedure. We demonstrate that angular constraints can be employed to guide the closure algorithm towards locally realistic structures, which is relevant for combining it with the FB5-HMM. Finally, the method is shown to have a high rate of successful closures, and to be reasonably efficient despite the iterative procedure.

Chapter 2: A generative, probabilistic model of local protein structure

In Chapter 2, I present the TorusDBN, a probabilistic model of local protein structure for the full-backbone representation of proteins. Based on an amino acid input sequence, and optionally a predicted secondary structure sequence, the model can produce sampled protein structures with meaningful local structure. Short segments of proteins can also be resampled, corresponding to pivot-like moves biased towards correct local structure. The model represents protein structures using (φ, ψ) angles, modeled using a bivariate angular distribution. This makes it possible to capture structural motifs at high resolution, and avoids the discretization of structural space that characterizes fragment-based methods.

Chapter 3: Monte Carlo sampling of proteins: local moves constrained by a native-oriented structural prior

The TorusDBN provides us with a pivot-like move method, where the (φ, ψ) angles are constrained towards values that are locally feasible. This last chapter investigates how to design a local move method that adheres to the same local structural constraints, by incorporating a prior distribution in a concerted-rotation type local move method. Although the results presented in this chapter are preliminary, they give an indication of the potential of our proposed method.


Bibliography

[1] Bishop C (2006) Pattern Recognition and Machine Learning (Springer-Verlag New York, Inc., Secaucus, NJ, USA).

[2] Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis (Cambridge University Press).

[3] Poritz A (1988) Hidden Markov models: a guided tour. ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing pp 7–13.

[4] Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77: 257–286.

[5] Cawley SL, Pachter L (2003) HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19 Suppl 2: ii36–41.

[6] Hamelryck T, Kent J, Krogh A (2006) Sampling realistic protein conformations using local structural bias. PLoS Comput Biol 2: e131.

[7] Ghahramani Z (1998) Learning dynamic Bayesian networks. Lect Notes Comput Sci 1387: 168–197.

[8] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B Stat Meth 39: 1–38.

[9] Baum L, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41: 164–171.

[10] Nielsen SF (2000) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6: 457–89.

[11] Mardia K, Jupp P (2000) Directional statistics (Wiley, New York).

[12] von Mises R (1918) Über die "Ganzzahligkeit" der Atomgewichte und verwandte Fragen. Phys Z 19: 490–500.

[13] Kent J (1982) The Fisher-Bingham distribution on the sphere. J Roy Stat Soc B Stat Meth 44: 71–80.

[14] Mardia K (1975) Statistics of directional data. J Roy Stat Soc B Stat Meth 37: 349–393.

[15] Rivest L (1988) A distribution for dependent unit vectors. Comm Stat Theor Meth 17: 461–483.

[16] Singh H, Hnizdo V, Demchuk E (2002) Probabilistic model for two dependent circular variables. Biometrika 89: 719–723.


[17] Mardia KV, Taylor CC, Subramaniam GK (2007) Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics 63: 505–512.

[18] Anfinsen C (1973) Principles that govern the folding of protein chains. Science 181: 223–230.

[19] Corey R, Pauling L (1953) Fundamental dimensions of polypeptide chains. Proc R Soc Lond B Biol Sci 141: 10–20.

[20] Marti-Renom M, Stuart A, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29: 291–325.

[21] Duan Y, Wang L, Kollman P (1998) The early stage of folding of villin headpiece subdomain observed in a 200-nanosecond fully solvated molecular dynamics simulation. Proc Natl Acad Sci USA 95: 9897–9902.

[22] Jayachandran G, Vishal V, Pande V (2006) Using massively parallel simulation and Markovian models to study protein folding: Examining the dynamics of the villin headpiece. J Chem Phys 124: 164902.

[23] Daggett V, Fersht A (2003) The present view of the mechanism of protein folding. Nat Rev Mol Cell Biol 4: 497–502.

[24] Engh RA, Huber R (1991) Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallogr A 47: 392–400.

[25] Ramachandran GN, Ramakrishnan C, Sasisekharan V (1963) Stereochemistry of polypeptide chain configurations. J Mol Biol 7: 95–99.

[26] Levitt M (1976) A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol 104: 59–107.

[27] Oldfield T, Hubbard R (1994) Analysis of Cα geometry in protein structures. Proteins 18: 324–337.

[28] Dill K, Bromberg S, Yue K, Fiebig K, Yee D, Thomas P, Chan H (1995) Principles of protein folding–A perspective from simple exact models. Protein Sci 4: 561–602.

[29] Hutchinson E, Thornton J (1996) PROMOTIF–A program to identify and analyze structural motifs in proteins. Protein Sci 5: 212–220.

[30] Wright P, Dyson H, Lerner R (1988) Conformation of peptide fragments of proteins in aqueous solution: implications for initiation of protein folding. Biochemistry 27: 7167–7175.

[31] Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281: 565–577.


[32] Viguera A, Serrano L (1995) Experimental analysis of the Schellman motif. J Mol Biol 251: 150–160.

[33] Munoz V, Serrano L (1995) Analysis of i, i+5 and i, i+8 hydrophobic interactions in a helical model peptide bearing the hydrophobic staple motif. Biochemistry 34: 15301–15306.

[34] Blanco F, Serrano L (1995) Folding of protein G B1 domain studied by the conformational characterization of fragments comprising its secondary structure elements. Eur J Biochem 230: 634–649.

[35] de Alba E, Jimenez M, Rico M, Nieto J (1996) Conformational investigation of designed short linear peptides able to fold into beta-hairpin structures in aqueous solution. Fold Des 1: 133–144.

[36] Ilyina E, Milius R, Mayo K (1994) Synthetic peptide probe folding initiation sites in platelet factor-4: stable chain reversal found within the hydrophobic sequence LIATLKNGRKISL. Biochemistry 33: 13436–13444.

[37] Searle M, Williams D, Packman L (1995) A short linear peptide derived from the N-terminal sequence of ubiquitin folds into a water-stable non-native beta-hairpin. Nat Struct Biol 2: 999–1006.

[38] Sieber V, Moe G (1996) Interactions contributing to the formation of a beta-hairpin-like structure in a small peptide. Biochemistry 35: 181–188.

[39] Unger R, Harel D, Wherland S, Sussman J (1989) A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins 5: 355–373.

[40] Kabsch W (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 32: 922–923.

[41] Rooman M, Rodriguez J, Wodak S (1990) Automatic definition of recurrent local structure motifs in proteins. J Mol Biol 213: 327–336.

[42] Hunter L, States DJ (1992) Bayesian classification of protein structure. IEEE Expert: Intelligent Systems and Their Applications 7: 67–75.

[43] Han K, Baker D (1995) Recurring local sequence motifs in proteins. J Mol Biol 251: 176–187.

[44] Han K, Baker D (1996) Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc Natl Acad Sci USA 93: 5814–5818.

[45] Han K, Bystroff C, Baker D (1997) Three-dimensional structures and contexts associated with recurrent amino acid sequence patterns. Protein Sci 6: 1587–1590.

[46] Jones TA, Thirup S (1986) Using known substructures in protein model building and crystallography. EMBO J 5: 819–822.


[47] Rooman M, Kocher J, Wodak S (1991) Prediction of protein backbone conformation based on seven structure assignments. Influence of local interactions. J Mol Biol 221: 961–979.

[48] Vasquez M, Scheraga H (1988) Calculation of protein conformation by the build-up procedure. Application to bovine pancreatic trypsin inhibitor using limited simulated nuclear magnetic resonance data. J Biomol Struct Dyn 5: 705–755.

[49] Simon I, Glasser L, Scheraga H (1991) Calculation of protein conformation as an assembly of stable overlapping segments: application to bovine pancreatic trypsin inhibitor. Proc Natl Acad Sci USA 88: 3661–3665.

[50] Bowie J, Eisenberg D (1994) An evolutionary approach to folding small alpha-helical proteins that uses sequence information and an empirical guiding fitness function. Proc Natl Acad Sci USA 91: 4436–4440.

[51] Srinivasan R, Rose G (1995) LINUS: a hierarchic procedure to predict the fold of a protein. Proteins 22: 81–99.

[52] Moult J (2006) Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos Trans R Soc Lond B Biol Sci 361: 453–458.

[53] Jones D (1997) Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs. Proteins 1: 185–191.

[54] Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268: 209–225.

[55] Jauch R, Yeo H, Kolatkar P, Clarke N (2007) Assessment of CASP7 structure predictions for template free targets. Proteins 69 Suppl 8: 57–67.

[56] Camproux A, Tuffery P, Chevrolat J, Boisvieux J, Hazout S (1999) Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng Des Sel 12: 1063–1073.

[57] Camproux A, Gautier R, Tufféry P (2004) A hidden Markov model derived structural alphabet for proteins. J Mol Biol 339: 591–605.

[58] de Brevern A, Etchebest C, Hazout S (2000) Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 41: 271–287.

[59] Etchebest C, Benros C, Hazout S, de Brevern A (2005) A structural alphabet for local protein structures: improved prediction methods. Proteins 59: 810–827.

[60] Benros C, de Brevern A, Etchebest C, Hazout S (2006) Assessing a novel approach for predicting local 3D protein structures from sequence. Proteins 62: 865–880.


[61] Edgoose T, Allison L, Dowe D (1998) An MML classification of protein structure that knows about angles and sequence. Pac Symp Biocomput 3: 585–596.

[62] Bystroff C, Thorsson V, Baker D (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J Mol Biol 301: 173–190.

[63] Verdier P, Stockmayer W (1962) Monte Carlo calculations on the dynamics of polymers in dilute solution. J Chem Phys 36: 227–235.

[64] Kumar S, Vacatello M, Yoon D (1988) Off-lattice Monte Carlo simulations of polymer melts confined between two plates. J Chem Phys 89: 5206–5215.

[65] Go N, Scheraga H (1970) Ring closure and local conformational deformations of chain molecules. Macromolecules 3: 178–187.

[66] Dodd L, Boone T, Theodorou D (1993) A concerted rotation algorithm for atomistic Monte Carlo simulation of polymer melts and glasses. Mol Phys 78: 961–996.

[67] Wakana H, Wako H, Saito N (1984) Monte Carlo study on local and small-amplitude conformational fluctuation in hen egg white lysozyme. Int J Pept Protein Res 23: 315–23.

[68] Knapp E (1992) Long time dynamics of a polymer with rigid body monomer units relating to a protein model: Comparison with the Rouse model. J Comput Chem 13: 793–798.

[69] Knapp E, Irgens-Defregger A (1993) Off-lattice Monte Carlo method with constraints: long-time dynamics of a protein model without nonbonded interactions. J Comput Chem 14: 19–29.

[70] Pant P, Theodorou D (1995) Variable connectivity method for the atomistic Monte Carlo simulation of polydisperse polymer melts. Macromolecules 28: 7224–7234.

[71] Mavrantzas V, Boone T, Zervopoulou E, Theodorou D (1999) End-bridging Monte Carlo: a fast algorithm for atomistic simulation of condensed phases of long polymer chains. Macromolecules 32: 5072–5096.

[72] Hoffmann D, Knapp E (1996) Polypeptide folding with off-lattice Monte Carlo dynamics: the method. Eur Biophys J 24: 387–403.

[73] Hoffmann D, Knapp E (1996) Protein dynamics with off-lattice Monte Carlo moves. Phys Rev E 53: 4221–4224.

[74] Wedemeyer W, Scheraga H (1999) Exact analytical loop closure in proteins using polynomial equations. J Comput Chem 20: 819–844.

[75] Coutsias E, Seok C, Jacobson M, Dill K (2004) A kinematic view of loop closure. J Comput Chem 25: 510–528.


[76] Ulmschneider J, Jorgensen W (2003) Monte Carlo backbone sampling for polypeptides with variable bond angles and dihedral angles using concerted rotations and a Gaussian bias. J Chem Phys 118: 4261–4271.

[77] Favrin G, Irbäck A, Sjunnesson F (2001) Monte Carlo update for chain molecules: Biased Gaussian steps in torsional space. J Chem Phys 114: 8154–8158.

[78] Rosenbluth M, Rosenbluth A (1955) Monte Carlo calculation of the average extension of molecular chains. J Chem Phys 23: 356–359.

[79] Siepmann J, Frenkel D (1992) Configurational bias Monte Carlo: a new sampling scheme for flexible chains. Mol Phys 75: 59–70.

[80] Frenkel D, Mooij G, Smit B (1992) Novel scheme to study structural and thermal properties of continuously deformable molecules. J Phys Condens Matter 4: 3053–3076.

[81] de Pablo J, Laso M, Suter U (1992) Simulation of polyethylene above and below the melting point. J Chem Phys 96: 2395–2403.

[82] Escobedo F, de Pablo J (1995) Extended continuum configurational bias Monte Carlo methods for simulation of flexible molecules. J Chem Phys 102: 2636–2652.

[83] Wick C, Siepmann J (2000) Self-adapting fixed-end-point configurational-bias Monte Carlo method for the regrowth of interior segments of chain molecules with strong intramolecular interactions. Macromolecules 33: 7207–7218.

[84] Chen Z, Escobedo F (2000) A configurational-bias approach for the simulation of inner sections of linear and cyclic molecules. J Chem Phys 113: 11382–11392.

[85] Uhlherr A (2000) Monte Carlo conformational sampling of the internal degrees of freedom of chain molecules. Macromolecules 33: 1351–1360.

[86] Shenkin P, Yarmush D, Fine R, Wang H, Levinthal C (1987) Predicting antibody hypervariable loop conformation. I. Ensembles of random conformations for ring-like structures. Biopolymers 26: 2053–85.

[87] da Silva R, Degreve L, Caliri A (2004) LMProt: an efficient algorithm for Monte Carlo sampling of protein conformational space. Biophys J 87: 1567–1577.

[88] Koehl P, Delarue M (1995) A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modelling. Nat Struct Biol 2: 163–170.

[89] Rohl C, Strauss C, Chivian D, Baker D (2004) Modeling structurally variable regions in homologous proteins with Rosetta. Proteins 55: 656–677.

[90] Jacobson M, Pincus D, Rapp C, Day T, Honig B, Shaw D, Friesner R (2004) A hierarchical approach to all-atom protein loop prediction. Proteins 55: 351–367.


[91] Wang L, Chen C (1991) A combined optimization method for solving the inverse kinematics problems of mechanical manipulators. IEEE Trans Robot Autom 7: 489–499.

[92] Canutescu A, Dunbrack Jr R (2003) Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci 12: 963–972.


Chapter 1

Full cyclic coordinate descent: solving the protein loop closure problem in Cα space

Wouter Boomsma, Thomas Hamelryck

BMC Bioinformatics 6:159. 2005.


Methodology article (Open Access)

Full cyclic coordinate descent: solving the protein loop closure problem in Cα space

Wouter Boomsma and Thomas Hamelryck*

Address: Bioinformatics center, Institute of Molecular Biology and Physiology, University of Copenhagen, Universitetsparken 15, Building 10, DK-2100 Copenhagen, Denmark

Email: Wouter Boomsma - [email protected]; Thomas Hamelryck* - [email protected]

* Corresponding author

Abstract

Background: Various forms of the so-called loop closure problem are crucial to protein structure prediction methods. Given an N- and a C-terminal end, the problem consists of finding a suitable segment of a certain length that bridges the ends seamlessly.

In homology modelling, the problem arises in predicting loop regions. In de novo protein structure prediction, the problem is encountered when implementing local moves for Markov Chain Monte Carlo simulations.

Most loop closure algorithms keep the bond angles fixed or semi-fixed, and only vary the dihedral angles. This is appropriate for a full-atom protein backbone, since the bond angles can be considered as fixed, while the (φ, ψ) dihedral angles are variable. However, many de novo structure prediction methods use protein models that only consist of Cα atoms, or otherwise do not make use of all backbone atoms. These methods require an algorithm that alters both bond and dihedral angles, since the pseudo bond angle between three consecutive Cα atoms also varies considerably.

Results: Here we present a method that solves the loop closure problem for Cα-only protein models. We developed a variant of Cyclic Coordinate Descent (CCD), an inverse kinematics method from the field of robotics, which was recently applied to the loop closure problem. Since the method alters both bond and dihedral angles, which is equivalent to applying a full rotation matrix, we call our method Full CCD (FCCD). FCCD replaces CCD's vector-based optimization of a rotation around an axis with a singular value decomposition-based optimization of a general rotation matrix. The method is easy to implement and numerically stable.

Conclusion: We tested the method's performance on sets of random protein Cα segments between 5 and 30 amino acids long, and a number of loops of length 4, 8 and 12. FCCD is fast, has a high success rate and readily generates conformations close to those of real loops. The presence of constraints on the angles has only a small effect on performance. A reference implementation of FCCD in Python is available as supplementary information.

Published: 28 June 2005

BMC Bioinformatics 2005, 6:159 doi:10.1186/1471-2105-6-159

Received: 25 April 2005
Accepted: 28 June 2005

This article is available from: http://www.biomedcentral.com/1471-2105/6/159

© 2005 Boomsma and Hamelryck; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Many protein structure prediction methods require an algorithm that is capable of constructing a new conformation for a short segment of the protein, without affecting the rest of the molecule. In other words, a protein fragment needs to be generated that seamlessly closes the gap between two given, fixed end points. This problem is generally called the loop closure problem, and was introduced in a classic paper by Go and Scheraga more than 30 years ago [1]. It has been the continued subject of intensive research over many years due to its high practical importance in structure prediction.

The loop closure problem arises in at least two different structure prediction contexts. In homology modelling, it is often necessary to rebuild certain loops that differ between the protein being modelled and the template protein [2]. The modelled loop needs to bridge the gap between the end points of the template's loop.

In de novo prediction, local resampling or local moves can be considered as a variant of the loop closure problem. Typically, the conformation of a protein segment needs to be changed without affecting the rest of the protein as a sampling step in a Markov Chain Monte Carlo (MCMC) procedure [3]. In both homology and de novo structure prediction, the problem is however essentially the same.

The classic article by Go and Scheraga [1] describes an analytical solution for finding all possible solutions for a protein backbone of three residues. In this case, the degrees of freedom (DOF) comprise six dihedral angles, i.e. the backbone's (φ, ψ) angles. Another approach is to use a fragment library derived from the set of solved protein structures, and look for fragments or combinations of fragments that bridge the given fixed ends [4-6]. More recently, the loop closure problem has been tackled using algorithms borrowed from the field of robotics, in particular inverse kinematics methods [7-9]. Still other methods use various Monte Carlo chain perturbation approaches, often combined with analytical methods [10,11,3,12]. A good overview of loop closure methods and references can be found in Kolodny et al. (2005) [6].

Most methods assume that one is working with a full-atom protein backbone with fixed bond angles and bond lengths, so the DOF consist solely of the backbone's (φ, ψ) angles. However, in many cases not all the atoms of the protein backbone are present in the model. In particular, a large class of structure prediction, design and in silico folding methods makes use of drastically simplified models of protein structure [13,14].

A protein structure might for example be represented by a chain of Cα atoms or a chain of virtual atoms at the centers of mass of the side chain atoms [15]. In these models, there is obviously no full-atom model of the protein's backbone available.

In the case of Cα-only models, the structure can be described as a sequence of pseudo bonds, pseudo angles θ and pseudo dihedral angles τ [16]. Here, the term 'pseudo' indicates that the consecutive Cα's are not actually connected by chemical bonds. As in the case of the protein's backbone, the pseudo bond lengths can be considered fixed (typically 3.8 Å). In contrast, the pseudo bond angles between three consecutive Cα atoms are most definitely not fixed, but vary between 1.4 and 2.7 radians. Hence, a Cα-only model of N residues can be represented by a sequence of N - 2 pseudo bond angles θ and N - 3 pseudo dihedral angles τ (Figure 1).
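The (θ, τ) parameterization above is straightforward to compute from coordinates. The following is a minimal illustrative sketch (our own helper code, not the article's reference implementation): the pseudo bond angle θ at a Cα is the angle between the two pseudo bonds meeting there, and the pseudo dihedral τ is the torsion about the middle pseudo bond of four consecutive Cα's.

```python
import math

def _sub(u, v): return [u[i] - v[i] for i in range(3)]
def _dot(u, v): return sum(u[i] * v[i] for i in range(3))
def _cross(u, v):
    return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]
def _norm(u): return math.sqrt(_dot(u, u))

def pseudo_bond_angle(a, b, c):
    """Pseudo bond angle theta at b (radians), between bonds b->a and b->c."""
    u, v = _sub(a, b), _sub(c, b)
    return math.acos(_dot(u, v) / (_norm(u) * _norm(v)))

def pseudo_dihedral(a, b, c, d):
    """Pseudo dihedral tau (radians) about the b-c axis, via atan2."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)          # normals of the two planes
    b2u = [x / _norm(b2) for x in b2]                # unit vector along the axis
    m1 = _cross(n1, b2u)
    return math.atan2(_dot(m1, n2), _dot(n1, n2))
```

A planar zig-zag of Cα's, for example, gives θ = π/2 at a right-angle corner and τ = ±π for a trans configuration.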

Most inverse kinematics approaches assume that the DOF consist only of dihedral angles, and keep the bond angles fixed or semi-fixed. Hence, they cannot be readily applied to the Cα-only case without restricting the search space unnecessarily. In principle, fragment library based methods would apply, but here the problem of data sparsity arises [17,18]. Often, no suitable fragments can be found if the number of residues between the fixed ends becomes too high.

In order to solve the loop closure problem in Cα space, we extend a particularly attractive approach that was recently introduced by Canutescu & Dunbrack [8]. The algorithm is called Cyclic Coordinate Descent (CCD), and like many other loop closure algorithms it derives from the field of robotics [19]. As pointed out by Canutescu & Dunbrack, the CCD algorithm is meant as a black box method that generates plausible protein segments that bridge two given, fixed endpoints. The final choice is typically made based upon the occurrence of steric clashes, applicable constraints (for example side chain conformations) and evaluation of the energy.

Figure 1: A protein segment's Cα trace. The Cα positions are numbered, and the pseudo bond angles θ and pseudo dihedrals τ are indicated. The segment has length 5, and is thus fully described by two pseudo dihedral and three pseudo bond angles.

The CCD algorithm does not directly generate conformations that bridge a given gap, but alters the dihedral angles of a given starting segment that already overlaps at the N-terminus such that it also closes at the C-terminus. The starting segment can be generated in many ways, for example by using a fragment library derived from real structures or by constructing random artificial fragments with reasonable conformations. Surprisingly, most protein loops can be closed efficiently by CCD starting from artificial loops constructed with random (φ, ψ) dihedral angles [8].

The CCD algorithm alters the (φ, ψ) dihedral angles for every residue in the segment in an iterative way. In each step, the RMSD between the chain end and the overlap is minimized by optimizing one dihedral angle. Because only one dihedral angle is optimized at a time, the optimal rotation can be calculated efficiently using simple vector arithmetic.
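To see why this single-angle step has a closed form: each downstream point moves on a circle around the rotation axis, so the squared distance to the target points is a + b form in cos α and sin α, minimized at α = atan2(b, a). The sketch below is our own illustration of this standard CCD step (function and helper names are ours, not from the paper):

```python
import math

def _dot(u, v): return sum(x * y for x, y in zip(u, v))
def _sub(u, v): return [x - y for x, y in zip(u, v)]
def _cross(u, v):
    return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]

def ccd_angle(axis_point, axis_dir, moving, target):
    """Optimal rotation angle (radians) about the axis through `axis_point`
    with unit direction `axis_dir`, minimizing the summed squared distance
    between the rotated `moving` points and the corresponding `target` points."""
    a = b = 0.0
    for p, f in zip(moving, target):
        # decompose p into its projection o on the axis and a radial part r
        t = _dot(_sub(p, axis_point), axis_dir)
        o = [axis_point[i] + t * axis_dir[i] for i in range(3)]
        r = _sub(p, o)
        fo = _sub(f, o)
        a += _dot(fo, r)                     # cos(alpha) coefficient
        b += _dot(fo, _cross(axis_dir, r))   # sin(alpha) coefficient
    return math.atan2(b, a)
```

For a single point at (1, 0, 0) rotating about the z-axis towards a target at (0, 1, 0), the optimal angle is π/2.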

The list of advantages of CCD is impressive: it is conceptually simple and easy to implement, computationally fast, very flexible (i.e. capable of incorporating various restraints and/or constraints) and numerically stable. Therefore, we decided to adapt the CCD algorithm for use with Cα-only models. Here, we describe a new version of CCD that optimizes both dihedral angles and bond angles, while maintaining all the advantages of the CCD method. We call our method Full Cyclic Coordinate Descent (FCCD), where "Full" indicates that both dihedral angles and bond angles are optimized, while only the bond lengths remain fixed. At the heart of the FCCD method lies a procedure to superimpose point sets with minimal Root Mean Square Deviation (RMSD), based on singular value decomposition. As is the case for the CCD algorithm, FCCD is not a modelling method in itself. Rather, it can be used as a method to generate possible conformations that can be evaluated using some kind of energy function.

To test the algorithm, we selected random segments from a protein structure database, and evaluated the efficiency of closing the corresponding gaps starting from artificial segments with protein-like (θ, τ) angles. We show that FCCD is both fast and successful in solving the loop closure problem, even in the presence of angle constraints. Conformations close to those of real protein loops are readily generated. Finally, we discuss possible applications of the FCCD algorithm, and mention some possible disadvantages.

Results and discussion

Overview of the FCCD algorithm

Figure 2 illustrates the essence of the FCCD algorithm, and Table 3 provides detailed pseudo code. Here we define some of the terms that will be used throughout the article, and provide a high level overview of the FCCD algorithm.

The fixed segment is a list of Cα vector positions that specifies the gap that needs to be bridged. Only the first and last three Cα positions, with corresponding vectors (f0, f1, f2) and (fN-3, fN-2, fN-1), are relevant. We will call these two sets of vectors the N- and C-terminal overlaps, respectively. The moving segment is a list of Cα position vectors that will be manipulated by the FCCD algorithm to bridge the gap. The closed segment is the moving segment after its pseudo bond angles and pseudo dihedral angles were adjusted to bridge the N- and C-terminal overlaps of the fixed segment. The vectors describing the positions of the Cα atoms in a segment of N residues are labelled from 0 to N - 1.

Figure 2: The action of the FCCD algorithm in Cα space. The Cα traces of the moving, fixed and closed segments are shown in red, green and blue, respectively. The Cα atoms are represented as spheres. The labels f0, f1 and f2 indicate the three fixed vectors at the N-terminus that are initially common between the fixed and moving segments. The loop is closed when the three C-terminal vectors of the moving segment (labelled mN-3, mN-2, mN-1) superimpose with an RMSD below the given threshold on the three C-terminal vectors of the fixed segment (labelled fN-3, fN-2, fN-1). This figure and Figure 3 were made with PyMOL (http://www.pymol.org).


Initially, the first three vectors of the moving loop coincide with the first three vectors of the fixed segment, while the last three vectors are conceivably reasonably close to the last three vectors of the fixed loop. This last condition is however not very critical. The moving segment can be obtained using any algorithm that generates plausible Cα fragments, including deriving them from real protein structures. The fixed segment is typically derived from a real protein of interest, or a model in an MCMC simulation.

The FCCD algorithm changes the pseudo bond angles and pseudo dihedral angles of the moving loop in such a way that the RMSD between the last three vectors of the moving loop (mN-3, mN-2, mN-1) and the last three vectors of the fixed loop (fN-3, fN-2, fN-1) is minimized, thereby seamlessly closing the gap.

Note that we assume that the last three vectors of the moving and fixed segments can be superimposed with an RMSD of 0.0 Å (see Figure 2). In other words, the first and last pseudo bond angles in both segments are equal. It is however perfectly possible to use segments with different pseudo bond angles at these positions. Since the final possible minimum RMSD will obviously be greater than 0 in this case, the RMSD threshold needs to be adjusted accordingly.

The algorithm proceeds in an iterative way. In each itera-tion, a vector mi in the moving segment is chosen that willserve as a center of rotation. This chosen center of rotationwill be called the pivot throughout this article. Then, therotation matrix that rotates (mN-3, mN-2, mN-1) on (fN-3, fN-

2, fN-1) around the pivot and resulting in minimum RMSDis determined, and applied to all the vectors mj down-stream i (with i <j <N). In the next iteration, a new pivotis chosen, and the procedure is repeated. The vectors inthe chain can be traversed linearly, or they can be chosenat random in each iteration. The difference between FCCDand CCD is that the latter applies a general rotation to thechain using an atom in the chain as a pivot, while theformer only applies a rotation around a single axis. Theprocess is stopped when the RMSD falls below a giventhreshold.

Finding the optimal (with respect to the RMSD) rotationmatrix corresponds to finding one optimal pseudo bondangle and pseudo dihedral angle pair. We define θi as thebond angle of the vectors mi-1, mi, mi+1 and τi as the dihe-dral angle of the vectors mi-2, mi-1, mi, mi+1 (see Figure 1and [16]). These definitions have the intuitive interpreta-tion that altering (θi, τi) changes the positions of all Cα'sdownstream from position i. Conversely, using pivot miand applying a rotation matrix to all the positions down-

stream from position i corresponds to changing pseudobond angle θi and pseudo dihedral angle τi.

For a segment of N Cα's (with N > 3), the pseudo anglesrange from θ1 to θN-2 and the pseudo dihedrals range fromτ2 to τN-2. Since the first and last bond angles of the mov-ing segment are fixed, the pivot points range from posi-tion 2 to position N - 3 (with N > 4). The pseudo bondangle and pseudo dihedral angle pairs thus range from(θ2, τ2) to (θN-3, τN-3).

Finding the optimal rotation matrix with respect to theRMSD of the C-terminal overlaps can be efficiently solvedusing singular value decomposition, as described in detailin the following section.

Finding the optimal rotationIn this section we discuss solving the following subprob-lem arising in the FCCD algorithm: given a chosen pivotpoint i in the moving segment, find the optimal (θi, τi)pair that minimizes the RMSD between the last three Cαvectors in the moving segment and the last three Cα vec-tors in the fixed segment. Recall that the (θi, τi) pair atposition i corresponds to the pseudo bond angles andpseudo dihedral angles defined by vectors mi-1, mi, mi+1and mi-2, mi-1, mi, mi+1 respectively.

Finding the optimal (θi, τi) pair simply corresponds tofinding the optimal rotation matrix using Cα position i asthe center of rotation (see Figure 2). This reformulatedproblem can be solved by a variant of a well known algo-rithm to superimpose two point sets with minimumRMSD which makes use of singular value decomposition[20,21]. Below, we describe this adapted version of thealgorithm.

First, the C-terminal overlaps of the moving and the fixedsegment need to be translated to the new origin that willbe used as pivot for the optimal rotation. This new originis the pivot vector mi at Cα position i in the moving seg-ment. The new vector coordinates of the moving and thefixed segments are put in two matrices (respectively M andF), with the coordinates of the vectors positioned columnwise:

M = [mN-3 - mi | mN-2 - mi | mN-1 - mi]

F = [fN-3 - mi | fN-2 - mi | fN-1 - mi]

Then, the correlation matrix Σ is calculated using M and F:

Σ = FMT

48

Page 55: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

BMC Bioinformatics 2005, 6:159 http://www.biomedcentral.com/1471-2105/6/159

Page 5 of 10(page number not for citation purposes)

Any real n × m matrix A can be written as the product ofan orthogonal n × n matrix U, a diagonal n × m matrix Dand an orthogonal m × m matrix VT [22]. Such a factoriza-tion is called a singular value decomposition of A. The posi-tive diagonal elements of D are called the singular values.Hence, Σ can be written as:

Σ = UDVT

The optimal rotation Γ is then calculated as follows:

Γ = USVT

The value of the diagonal 3 × 3 matrix S is determined bythe product det(U)det(VT), which is either 1 or -1. If thisproduct is -1 then S = diag(1, 1, -1), else S is the 3 × 3 unitmatrix. The matrix S ensures that Γ is always a pure rota-tion, and not a rotation-inversion [21].

In order to apply to all the vectors that are downstreamfrom the pivot point i, these vectors are first translated tothe origin of the rotation (ie. pivot point mi), left multi-plied by Γ and finally translated back to the originalorigin:

where i <j <N.

Adding angle constraints to FCCDIt is straightforward to constrain the (θ, τ) angles to agiven probability distribution. For each rotation matrix Γ,the resulting new pseudo bond angles and dihedral anglescan easily be calculated. The new angles can for examplebe accepted or rejected using a simple rejection samplingMonte Carlo scheme, comparing the probabilities of theprevious pair (θprev, τprev) with that of the next pair (θnext,τnext). If P (θnext, τnext) > P (θprev, τprev) the change isaccepted, otherwise it is accepted with a chance propor-tional to P (θnext, τnext) / P (θprev, τprev). A similar approachwas used by Canutescu & Dunbrack [8], and we describe

its performance in combination with FCCD in the follow-ing section.

More advanced methods could take the probability of thesequence of angles into account as well, for example usinga Hidden Markov Model of the backbone [23]. Thepseudo code in Table 3 illustrates accepting/rejecting rota-tions using an unspecified 'accept' function, whose detailswill depend on the application.

FCCD's performanceIn order to evaluate the general efficiency of the method,we selected random fragments of various sizes from a rep-resentative database of protein structures, and used thesefragments as fixed segments. Hence, the evaluationdescribed below is not limited to loops, but extends torandom protein segments. This is a relevant test, sincelocal moves in a typical MCMC simulation are indeed per-formed on random segments.

The fixed segments were sampled from a dataset of foldrepresentatives (see Methods). First we selected a randomfold representative, and subsequently extracted a randomcontinuous fragment of suitable length. The lengths var-ied from 10 to 30 with a step size of 5. It should be notedthat the length of the segment here refers to the number ofCα atoms between the ends that need to be bridged.

The moving segments were generated using random dihe-dral and bond angles in regions accessible to proteins (seeprevious section). This was done by sampling the (θi, τi)pairs according to a probability distribution derived froma set of representative protein structures (see Methods).The bond length was fixed at 3.8 Å, in tune with the con-sensus Cα-Cα distance in protein structures. The last bondangle in the moving segment was chosen equal to the lastbond angle in the fixed loop to make a final RMSD of 0.0Å possible. The RMSD threshold was 0.1 Å. The maximumnumber of iterations was set to 1000, where one iterationis a sweep over all positions. We ran the FCCD programon 1000 different fixed segments. Table 1 summarizes theresults.

m m m mj j i iΓ Γ= − +( )

Table 1: Performance of the FCCD algorithm for various segment lengths. The first and second number in columns 2–4 refer to unconstrained and constrained FCCD, respectively. Columns 2 and 3 respectively show the average time and number of iterations needed for closing a single segment successfully. The percentage of loops successfully closed in under 1000 iterations is shown in the last column.

Segment length Average time (ms) Average iterations % Closed

5 4.5/51.7 14.0/27.0 99.90/86.5010 5.2/28.3 10.5/16.8 99.40/98.2015 5.6/28.6 7.8/12.1 99.60/99.4020 6.2/27.1 6.3/9.0 99.80/99.4025 7.6/31.7 5.5/7.6 99.00/99.9030 7.1/31.0 4.4/6.3 99.70/99.40

49

Page 56: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

BMC Bioinformatics 2005, 6:159 http://www.biomedcentral.com/1471-2105/6/159

Page 6 of 10(page number not for citation purposes)

A first observation is the effect of the angle constraints.These slow down FCCD with a factor of 10 for small seg-ments (5 residues) and roughly a factor of 5 for larger seg-ments (10 residues or more). Nonetheless FCCDincluding constraints remains quite speed efficient: smallfive residue segments are on average closed in about 50ms, while larger segments (from 10 to 30 residues) areclosed considerably faster (on average in about 30 ms).The explanation for this is of course that it is easier to closelarge segments because they have more DOF. Hence,FCCD, like CCD, is fast and easily handles large segmentsefficiently.

Overall, the success rate of FCCD is excellent, and very lit-tle affected by constraints. For 5 residue segments, addingconstraints diminishes the number of successfully closedsegments from 99.9% to 86.5%. This effect is howevermuch less pronounced for larger segments: more than98% percent of the moving/fixed segment pairs can besuccessfully closed. In short, FCCD is both speed efficientand has a high success rate, even in the presence ofconstraints.

Evaluation of FCCD's sampling spaceDoes FCCD potentially generate realistic protein confor-mations? FCCD could be used to propose possible confor-mations that are subsequently evaluated by an energyfunction. In this context, it is of course imperative to gen-erate realistic conformations. To answer this question, weevaluate FCCD's ability to generate closed segments thatare close to real protein loops. We used 30 real loops withlengths of 4, 8 and 12 residues as fixed segments. The looplength refers to the number of residues between the N-and C-terminal overlaps.

FCCD was applied using (θ, τ) constraints and an RMSDthreshold of 0.1 Å. The maximum number of iterationswas set to 1000. For each loop, we attempted to generateclosed segments from 1000 random moving segmentswithin the allowed number of iterations. The moving seg-ments were generated as described in the previous section.For all 30 loop cases, we then identified the closed seg-ment that resembled the input loop best as judged by theRMSD. For the calculation of the RMSD, we included theN-and C-terminal overlaps. The results are shown in Table2, and the best fitting loops for each loop size are shownin Figure 3.

It is clear that FCCD readily generates closed segmentsthat are reasonably close to the real loops, with an averageRMSD of about 0.6, 2.2 and 3.0 Å for loops of 4, 8 and 12residues, respectively. The highest minimum RMSDvalues for these loop lengths are 0.76, 2.42 and 3.37 Å,respectively, indicating that FCCD in general can come upwith a reasonably close conformation. Using more initial

moving segments will obviously increase the chance ofencountering a close conformation. Additionally, one canalso expect an even better performance with a morerefined way to constrain the (θ, τ) angles.

ConclusionIn this article, we introduce an algorithm that solves theloop closure problem for Cα only protein models. Themethod is conceptually similar to the CCD loop closuremethod introduced by Canutescu and Dunbrack [8], butoptimizes dihedral and bond angles simultaneously,while the former method only optimizes one angle at atime. At the heart of the method lies a modified algorithmto superimpose point sets with minimum RMSD, basedon singular value decomposition [20,21].

The algorithm is fast, numerically stable and leads to asolution for the great majority of loop closure problemsstudied here. Importantly, the method remains efficienteven in the presence of constraints on the dihedral andbond angles. FCCD readily handles large gaps, and poten-tially generates realistic conformations. Compared toother loop closure methods, FCCD is surprisingly easy toimplement provided a function is available to calculatethe singular value decomposition of a matrix.

A possible disadvantage is that FCCD has a tendency toinduce large changes to the pseudo angles at the start ofthe moving segment while angles near the end are lessaffected, which is also the case for CCD [8]. This can forexample be avoided by selecting the pivot points in a ran-dom fashion, or by limiting the allowed change in theangles per iteration. Occasionally the method gets stuck,which can be avoided by incorporating stochastic changesaway from the encountered local minimum. One can alsosimply try again with a new random moving segment. Webelieve that CCD and FCCD despite these disadvantagesare among the most efficient loop closure algorithms cur-rently available.

The FCCD algorithm proposed here has great potential foruse in structure prediction methods that only make use ofCα atoms, or that otherwise do not include all backboneatoms [15,13,14]. FCCD could be used for example toimplement local moves in a MCMC procedure. The mov-ing segments could be derived from a fragment databaseor generated from a probabilistic model of the proteinbackbone. The latter model could range from a primitiveprobability distribution over allowed (θ, τ) angle pairslike we used here to a Hidden Markov Model that alsomodels the sequence of (θ, τ) angle pairs.

We are planning to use the FCCD algorithm in combina-tion with a sophisticated probabilistic model of the pro-tein's backbone, which will steer both the generation of

50

Page 57: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

BMC Bioinformatics 2005, 6:159 http://www.biomedcentral.com/1471-2105/6/159

Page 7 of 10(page number not for citation purposes)

Loops generated by FCCD (blue) that are close to real protein loops (green)Figure 3Loops generated by FCCD (blue) that are close to real protein loops (green). The loops with lowest RMSD to a given loop of length 4 (top), 8 and 12 (bottom) are shown (loops 1qnr, A, 195–198, 3chb, D, 51–58 and 1ctq, A, 26–37). The N- terminus is at the left hand side.

51

Page 58: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

BMC Bioinformatics 2005, 6:159 http://www.biomedcentral.com/1471-2105/6/159

Page 8 of 10(page number not for citation purposes)

Table 2: Minimum RMSD (out of 1000 tries) between a fixed segment derived from a protein structure and a closed segment generated by FCCD. The length of the loops is shown between parentheses in the upper row.

Loop (4) RMSD Loop (8) RMSD Loop (12) RMSD

1dvj, A, 20–23 0.59 1cru, A, 85–92 2.31 1cru, A, 358–369 3.371dys, A, 47–50 0.67 1ctq, A, 144–151 2.22 1ctq, A, 26–37 2.401egu, A, 404–407 0.61 1d8w, A, 334–341 2.04 1d4o, A, 88–99 3.201ej0, A, 74–77 0.61 1ds1, A, 20–27 2.20 1d8w, A, 43–54 2.741i0h, A, 123–126 0.73 1gk8, A, 122–129 2.20 1ds1, A, 282–293 3.161id0, A, 405–408 0.66 1i0h, A, 145–152 2.42 1dys, A, 291–302 2.901qnr, A, 195–198 0.54 1ixh, 106–113 1.98 1egu, A, 508–519 3.061qop, A, 44–47 0.58 1lam, 420–427 2.16 1f74, A, 11–22 3.121tca, 95–98 0.76 1qop, B, 14–21 2.17 1q1w, A, 31–42 3.041thf, D, 121–124 0.56 3chb, D, 51–58 1.97 1qop, A, 175–186 2.97

Average RMSD 0.63 Average RMSD 2.17 Average RMSD 3.00

Table 3

maxit = maximum number of iterationsmoving = N × 3 matrix of Cα positions in moving segmentfixed = N × 3 matrix of Cα positions in fixed segmentthreshold = desired minimum RMSDN = length of the segmentsM = 3 × 3 matrix (centered coordinates along columns)F = 3 × 3 matrix (centered coordinates along columns)S = diag(1, 1, -1)repeat maxit:

# Start iteration over pivotsfor i from 2 to N-3:

pivot = moving[i,:]# Make pivot point originfor j from 0 to 2:

M [:,j] = moving [N-3+j,:]-pivotF [:,j] = fixed [N-3+j,:]-pivot

# Find the rotation Γ that minimizes RMSDΣ = FMT

U, D, VT = svd(Σ)# Check for reflectionif det(U)det(VT)<0:

U = USΓ = UVT

# Evaluate and apply rotationif accept(Γ):

# Apply the rotation to the moving segmentfor j from i+1 to N-1:

moving [j,:] = Γ (moving [j,:]-pivot)+pivotrmsd = calc_rmsd(moving [N-3,:], fixed [N-3,:])# Stop if RMSD below thresholdif rmsd<threshold:

return moving, rmsd# Failed: RMSD threshold not reached before maxitreturn 0

The accept function rejects or accepts the proposed rotation, based on the resulting (θ, τ) pair. The svd function performs singular value decomposition, and calc_rmsd calculates the RMSD between two lists of vectors.

52

Page 59: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

BMC Bioinformatics 2005, 6:159 http://www.biomedcentral.com/1471-2105/6/159

Page 9 of 10(page number not for citation purposes)

the initial moving loop and the acceptance/rejection ofthe angles. The performance of FCCD in this context willbe the subject of a future publication.

MethodsImplementationThe FCCD algorithm was implemented in C, using theLAPACK [24] function dgesvd for the calculation of thesingular value decomposition. Handling PDB files andcalculating the (θ, τ) angles [16] was done using Biopy-thon's Bio.PDB module [25]. We used a 2.5 GHz Pentiumprocessor to calculate the benchmarks. A reference imple-mentation of FCCD in Python is available as supplemen-tary information.

Structure databasesFor the calculation of the (θ, τ) probability distributionand the generation of random protein fragments, we usedthe SABMark 1.63 Twilight Zone database [26]. SABMarkTwilight Zone contains 2230 high quality protein struc-tures, divided over 236 different folds. All protein pairshave a BLAST E-value below 1, and thus presumablybelong to different superfamilies. A dataset of foldrepresentatives was generated by selecting a single struc-ture at random for each fold (see Table 4).

The loops used to evaluate FCCD's sampling space werederived from Canutescu & Dunbrack [8]. We shifted twoloops (1d8w, A, 46–57 and 1qop, A, 178–189) by three

residues to ensure that all loops had three flanking resi-dues on each side.

Calculation of the (θ, τ) probability distributionThe bond angle θ was subdivided in 18 bins and the dihe-dral angle τ in 36 bins, in both cases starting at 0 degreesand with a bin width of 10 degrees. All (θ, τ) angles wereextracted from all structures in the SABMark TwilightZone database that consisted of a polypeptide chain with-out breaks. In total, 257534 angle pairs were extracted.Each such (θ, τ) angle pair was assigned to a bin pair, andthe number of angle pairs assigned to each bin pair wasstored in a 18 × 36 count matrix. Finally, the normalizedcount matrix was used to assign a probability to any given(θ, τ) angle pair.

List of abbreviations• CCD: Cyclic Coordinate Descent

• DOF: Degrees Of Freedom

• FCCD: Full Cyclic Coordinate Descent

• MCMC: Markov Chain Monte Carlo

• RMSD: Root Mean Square Deviation

Table 4: SABMark identifiers of the 236 structures used as fold representatives

1ew6a_ 1ail__ 1l1la_ 1kid__ 1n8yc1 1gzhb1 1e5da1 1ep3b2 1ihoa_ 1m0wa11dhs__ 1gpua2 2lefa_ 1nsta_ 1eaf__ 1iiba_ 1d5ra2 1foha3 1gpua3 1crza23pvia_ 1i6pa_ 1e4ft1 1kx5d_ 2pth__ 1lu9a2 1dkla_ 1fsga_ 1m2oa3 2dpma_1ajsa_ 1fxoa_ 3tgl__ 1bx4a_ 1mtyg_ 1duvg2 1qopb_ 1iata_ 1k2yx2 1f0ka_1ayl_1 1toaa_ 8abp__ 1nh8a1 1bi5a2 2mhr__ 1a2pa_ 3lzt__ 1dkia_ 1e7la21bf4a_ 1bb8__ 1kpf__ 1mu5a2 1lfda_ 1gpea2 1jqca_ 1a2va2 1jfma_ 1ll7a21cjxa1 1lo7a_ 1fm0e_ 1fs1b2 1o0wa2 1dtja_ 1k0ra3 1evsa_ 1jpdx2 1qd1a11d5ya3 1h3fa2 1iq0a3 1tig__ 1xxaa_ 1ck9a_ 1gyxa_ 1e5qa2 1ivsa2 1qbea_3grs_3 1f08a_ 1c7ka_ 1lkka_ 1dq3a3 1uox_1 12asa_ 1bob__ 1m4ja_ 1dv5a_1f5ma_ 1k2ea_ 1ei1a2 1jdw__ 1ln1a_ 2pola2 1f0ia1 1rl6a1 1fvia2 1j7la_1is2a1 1e8ga2 1qr0a1 2dnja_ 1kuua_ 1qh5a_ 1ii7a_ 1b8pa2 1j7na3 1chua31f00i3 1grj_1 1nkd__ 1mwxa3 1jp4a_ 1ih7a2 1eula2 1gnla_ 1maz__ 2por__4htci_ 1es7b_ 1tocr1 1d1la_ 1fd3a_ 1i8na_ 1h8pa1 4sgbi_ 1fltv_ 1quba11d4va3 1tpg_2 1iuaa_ 1fv5a_ 1mdya_ 1zmec1 1fjgn_ 1eska_ 1i50i2 1fbva41dmc__ 1e53a_ 1ezvb1 1jeqa1 1k3ea_ 1rec__ 1lm5a_ 1k82a1 1jaja_ 1m0ka_1c0va_ 1kqfc_ 1ocrk_ 1h67a_ 2cpga_ 1ljra1 1brwa1 1hs7a_ 2cbla2 1jmxa21hyp__ 1cuk_2 1ecwa_ 1l9la_ 1g7da_ 1jkw_1 1dgna_ 1iqpa1 1pa2a_ 1ko9a11f1za1 1ks9a1 2sqca2 1d2ta_ 1h3la_ 1wer__ 1b3ua_ 1n1ba2 1poc__ 1e79i_1m1qa_ 1enwa_ 1g4ma1 1e5ba_ 1qhoa2 1kv7a2 1l4ia2 1c8da_ 1amm_1 1ca1_21phm_2 1d7pm_ 1jjcb2 1flca1 1gr3a_ 1mjsa_ 1a8d_1 1lf6a2 1fqta_ 1jb0e_1jh2a_ 1lcya1 1mgqa_ 1hcia1 1b3qa2 1jlxa1 1dar_1 1exma2 1ejea_ 1agja_1e79d2 2rspa_ 1h0ha1 1gtra1 2erl__ 1btn__ 1lf7a_ 1jmxa5 1crua_ 1m1xa41hx0a1 1goia1 1ciy_2 1daba_ 3tdt__ 1gg3a1 1pmi__ 1bdo__ 1h3ia2 1gppa_1f39a_ 1k6wa1 1jqna_ 1lu9a1 1m6ia1 1o94a3

53

Page 60: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

BMC Bioinformatics 2005, 6:159 http://www.biomedcentral.com/1471-2105/6/159

Page 10 of 10(page number not for citation purposes)

Authors' contributionsTH conceived the FCCD algorithm. WB implementedFCCD in the C language, and introduced various refine-ments and optimizations. Both authors read andapproved the article.

Additional material

AcknowledgementsWouter Boomsma is supported by the Lundbeckfond http://www.lund beckfonden.dk/. Thomas Hamelryck is supported by a Marie Curie Intra-European Fellowship within the 6th European Community Framework Pro-gramme. We acknowledge encouragement and support from Prof. Anders Krogh, Bioinformatics Center, Institute of Molecular Biology and Physiol-ogy, University of Copenhagen.

References1. Go N, Scheraga H: Ring closure and local conformational

deformations of chain molecules. Macromolecules 1970,3:178-187.

2. Koehl P, Delarue M: A self consistent mean field approach tosimultaneous gap closure and side-chain positioning inhomology modelling. Nat Struct Biol 1995, 2:163-70.

3. da Silva R, Degreve L, Caliri A: LMProt: an efficient algorithm forMonte Carlo sampling of protein conformational space. Bio-phys J 2004, 87:1567-77.

4. Jones T, Thirup S: Using known substructures in protein modelbuilding and crystallography. EMBO J 1986, 5:819-22.

5. Rohl C, Strauss C, Chivian D, Baker D: Modeling structurally var-iable regions in homologous proteins with rosetta. Proteins2004, 55:656-77.

6. Kolodny R, Guibas L, Levitt M, Koehl P: Inverse kinematics in biol-ogy: The protein loop closure problem. Int J Robotics Research2005, 24:151-163.

7. Manocha D, Canny J: Efficient inverse kinematics for general 6Rmanipulators. IEEE Trans Rob Aut 1994, 10:648-657.

8. Canutescu A, Dunbrack R Jr: Cyclic coordinate descent: Arobotics algorithm for protein loop closure. Protein Sci 2003,12:963-72.

9. Coutsias E, Seok C, Jacobson M, Dill K: A kinematic view of loopclosure. J Comput Chem 2004, 25:510-28.

10. Favrin G, Irbäck A, Sjunnesson F: Monte carlo update for chainmolecules: Biased Gaussian steps in torsional space. J ChemPhys 2001, 114:8154-8158.

11. Cahill M, Cahill S, Cahill K: Proteins wriggle. Biophys J 2002,82:2665-70.

12. Singh R, Bergert B: Chaintweak: sampling from the neighbour-hood of a protein conformation. Pac Symp Biocomput 2005 [http://helix-web.stanford.edu/ps605/singh.pdf].

13. Buchete N, Straub J, Thirumalai D: Development of novel statis-tical potentials for protein fold recognition. Curr Opin Struct Biol2004, 14:225-32.

14. Tozzini V: Coarse-grained models for proteins. Curr Opin StructBiol 2005, 15:144-50.

15. Kihara D, Lu H, Kolinski A, Skolnick J: TOUCHSTONE: an ab ini-tio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci USA 2001,98:10125-30.

16. Oldfield T, Hubbard R: Analysis of C alpha geometry in proteinstructures. Proteins 1994, 18:324-37.

17. Fidelis K, Stern P, Bacon D, Moult J: Comparison of systematicsearch and database methods for constructing segments ofprotein structure. Protein Eng 1994, 7:953-60.

18. van Vlijmen H, Karplus M: PDB-based protein loop prediction:parameters for selection and methods for optimization. JMol Biol 1997, 267:975-1001.

19. Wang L, Chen C: A combined optimization method for solvingthe inverse kinematics problem of mechanical manipulators.IEEE Trans Rob Aut 1991, 7:489-499.

20. Kabsch W: A discussion of the solution for the best rotationto relate two sets of vectors. Acta Cryst 1978, A34:827-828.

21. Umeyama S: Least squares estimation of transformationparameters between two point patterns. IEEE Trans PatternAnal Mach Intell 1991, 13:376-80.

22. Golub GH, Loan CFV: Matrix Computations 3rd edition. Baltimore,Maryland: Johns Hopkins University Press; 1996.

23. Bystroff C, Thorsson V, Baker D: HMMSTR: a hidden Markovmodel for local sequence-structure correlations in proteins.J Mol Biol 2000, 301:173-90.

24. Anderson E, Bai Z, Bischof C, Demmel J, Dongarra J, Croz JD, Green-baum A, Hammarling S, McKenney A, Ostrouchov S, Sorensen D:LAPACK's user's guide 1992 [http://www.netlib.org/lapack/lug/]. Phila-delphia, PA, USA: Society for Industrial and Applied Mathematics

25. Hamelryck T, Manderick B: PDB file parser and structure classimplemented in Python. Bioinformatics 2003, 19:2308-10.

26. Van Walle I, Lasters I, Wyns L: SABmark-a benchmark forsequence alignment that covers the entire known fold space.Bioinformatics 2005, 21:1267-8.

Additional File 1The file FCCD.py contains an implementation of the FCCD algorithm. The program was implemented in the interpreted, object oriented lan-guage Python http://www.python.org. The Numeric Python package http://numeric.scipy.org/, a Python module that implements many advanced mathematical operations efficiently in C and FORTRAN, provided imple-mentations of singular value decomposition and various matrix opera-tions. In addition, the Biopython toolkit, a set of Bioinformatics modules implemented in Python, was used to represent atomic coordinates as vector objects [25]. The core of the FCCD implementation comprises only 50 lines of Python code. Numeric Python and Biopython (version 1.4b) are needed to execute the sample code.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-159-S1.py]

54

Page 61: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

Chapter 2

A generative, probabilistic model of local proteinstructure

Wouter Boomsma, Kanti V. Mardia, Charles C. Taylor, Jesper Ferkinghoff-Borg, Anders Krogh, Thomas Hamelryck

The Proceedings of the National Academy of Sciences, USA 105:8932–8937. 2008.

55

Page 62: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

56

Page 63: Protein Structure - kupeople.binf.ku.dk/wb/dissertation.pdf · motivation for our approach will be clarified in this introduction, along with a review of relevant parts of the literature.

A generative, probabilistic model of localprotein structureWouter Boomsma*, Kanti V. Mardia†, Charles C. Taylor†, Jesper Ferkinghoff-Borg‡, Anders Krogh*,and Thomas Hamelryck*§

*Bioinformatics Centre, Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark; †Department of Statistics,University of Leeds, Leeds, West Yorkshire LS2 9JT, United Kingdom; and ‡DTU Elektro, Technical University of Denmark, 2800 Lyngby, Denmark

Edited by David Baker, University of Washington, Seattle, WA, and approved March 14, 2008 (received for review February 27, 2008)

Despite significant progress in recent years, protein structure predic-tion maintains its status as one of the prime unsolved problems incomputational biology. One of the key remaining challenges is anefficient probabilistic exploration of the structural space that correctlyreflects the relative conformational stabilities. Here, we present afully probabilistic, continuous model of local protein structure inatomic detail. The generative model makes efficient conformationalsampling possible and provides a framework for the rigorous analysisof local sequence–structure correlations in the native state. Ourmethod represents a significant theoretical and practical improve-ment over the widely used fragment assembly technique by avoidingthe drawbacks associated with a discrete and nonprobabilisticapproach.

conformational sampling � directional statistics � probabilistic model �TorusDBN � Bayesian network

Protein structure prediction remains one of the greatest chal-lenges in computational biology. The problem itself is easily

posed: predict the three-dimensional structure of a protein given itsamino acid sequence. Significant progress has been made in the lastdecade, and, especially, knowledge-based methods are becomingincreasingly accurate in predicting structures of small globularproteins (1). In such methods, an explicit treatment of localstructure has proven to be an important ingredient. The searchthrough conformational space can be greatly simplified through therestriction of the angular degrees of freedom in the protein back-bone by allowing only angles that are known to appear in the nativestructures of real proteins. In practice, the angular preferences aretypically enforced by using a technique called fragment assembly.The idea is to select a set of small structural fragments with strongsequence–structure relationships from the database of solved struc-tures and subsequently assemble these building blocks to formcomplete structures. Although the idea was originally conceived incrystallography (2), it had a great impact on the protein structure-prediction field when it was first introduced a decade ago (3).Today, fragment assembly stands as one of the most importantsingle steps forward in tertiary structure prediction, contributingsignificantly to the progress we have seen in this field in recentyears (4, 5).

Despite their success, fragment-assembly approaches generallylack a proper statistical foundation, or equivalently, a consistent wayto evaluate their contributions to the global free energy. When afragment-assembly method is used, structure prediction normallyproceeds by a Markov Chain Monte Carlo (MCMC) algorithm,where candidate structures are proposed by the fragment assemblerand then accepted or rejected based on an energy function. Thetheoretical basis of MCMC is the existence of a stationary proba-bility distribution dictating the transition probabilities of theMarkov chain. In the context of statistical physics, this stationarydistribution is given by the conformational free energy through theBoltzmann distribution. The problem with fragment-assemblymethods is that it is not possible to evaluate the proposal probabilityof a given structure, which makes it difficult to ensure an unbiasedsampling (which requires the property of detailed balance). Local

free energies could, in principle, be assigned to individual frag-ments, but there is no systematic way to combine them into a localfree energy for an assembly of fragments. In fact, because of edgeeffects, the assembly process often introduces spurious localstructural motifs that are not themselves present in the fragmentlibrary (3).

Significant progress has been made in the probabilistic modeling of local protein structure. With HMMSTR, Bystroff and coworkers (6) introduced a method to turn a fragment library into a probabilistic model but used a discretization of angular space, thereby sacrificing geometric detail. Other studies focused on strictly geometric models (7, 8). For these methods, the prime obstacle is their inability to condition the sampling on a given amino acid sequence. In general, it seems that none of these models has been sufficiently detailed or accurate to constitute a competitive alternative to fragment assembly. This is reflected in the latest CASP (critical assessment of techniques for protein structure prediction) exercise, where the majority of best-performing de novo methods continue to rely on fragment assembly for local structure modeling (5).

Recently, we showed that a first-order Markov model forms an efficient probabilistic, generative model of the Cα geometry of proteins in continuous space (9). Although this model allows sampling of Cα traces, it is of limited use in high-resolution de novo structure prediction, because this requires the representation of the full atomic detail of a protein's backbone, and the mapping from Cα to backbone geometry is one-to-many. Consequently, this model also cannot be considered a direct alternative to the fragment-assembly technique.

In the present study, we propose a continuous probabilistic model of the local sequence–structure preferences of proteins in atomic detail. The backbone of a protein can be represented by a sequence of dihedral angle pairs, φ and ψ (Fig. 1), that are well known from the Ramachandran plot (10). Two angles, both with values ranging from −180° to 180°, define a point on the torus. Hence, the backbone structure of a protein can be fully parameterized as a sequence of such points. We use this insight to model the angular preferences in their natural space using a probability distribution on the torus and thereby avoid the traditional discretization of angles that characterizes many other models. The sequential dependencies along the chain are captured by using a dynamic Bayesian network (a generalization of a hidden Markov model), which emits angle pairs, amino acid labels, and secondary structure labels. This allows us to directly sample structures compatible with a given sequence and resample parts of a structure while maintaining consistency

Author contributions: W.B. and T.H. designed research; W.B. performed research; K.V.M. and C.C.T. contributed new reagents/analytic tools; and W.B., J.F.-B., A.K., and T.H. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

§To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0801715105/DCSupplemental.

© 2008 by The National Academy of Sciences of the USA

8932–8937 | PNAS | July 1, 2008 | vol. 105 | no. 26 | www.pnas.org/cgi/doi/10.1073/pnas.0801715105


along the entire chain. In addition, the model makes it possible to evaluate the likelihood of any given structure. Generally, the sampled structures will not be globular, but will have realistic local structure, and the model can thus be used as a proposal distribution in structure-prediction simulations. The probabilistic and generative nature of the model also makes it directly applicable in the framework of statistical mechanics. In particular, because the probability before and after any resampling can be evaluated, unbiased sampling can be ensured.

We show that the proposed model accurately captures the angular preferences of protein backbones and successfully reproduces previously identified structural motifs. Finally, through a comparison with one of the leading fragment-assembly methods, we demonstrate that our model is highly accurate and efficient, and we conclude that our approach represents an attractive alternative to the use of fragment libraries in de novo protein-structure prediction.

Results and Discussion

TorusDBN—A Model of Protein Local Structure. Considering only the backbone, each residue in a protein chain can be represented by using two angular degrees of freedom, the φ and ψ dihedral bond angles (Fig. 1). The bond lengths and all remaining angles can be assumed to have fixed values (11). Even with this simple representation, the conformational search space is extremely large. However, as Ramachandran and coworkers (10) noted in 1963, not all values of φ and ψ are equally frequent, and many combinations are never observed because of steric constraints. In addition, strong sequential dependencies exist between the angle pairs along the chain. We define it as our goal to model precisely these local preferences.

We begin by stating a few necessary conditions for the model. First, we require that, given an amino acid sequence, our model should produce protein backbone chains with plausible local structure. In particular, the parameterization used in our model should be sufficiently accurate to allow direct sampling and the construction of complete protein backbones. Note that we do not expect sampled structures to be correctly folded globular proteins—we only require them to have realistic local structure. Secondly, it should be possible to seamlessly replace any stretch of a protein backbone with an alternative segment, thus making a small step in conformational space. Finally, we require that it is possible to compare the probability of a newly sampled candidate segment with the probability of the original segment, which is needed to enforce the property of detailed balance in MCMC simulations.

The resulting model is presented in Fig. 2. Formulated as a dynamic Bayesian network (DBN), it is a probabilistic model that ensures sequential dependencies through a sequence of hidden nodes. A hidden node represents a residue at a specific position in a protein chain. It is a discrete node that can adopt 55 states (see Methods). Each of these states, or h values, corresponds to a distinct emission distribution over dihedral angles [d = (φ, ψ)], amino acids (a), secondary structure (s), and the cis or trans conformation of the peptide bond (c). The angular emissions are modeled by bivariate von Mises distributions, whereas the ω dihedral angle (Fig. 1) is fixed at either 180° or 0°, depending on the trans/cis flag. Note that this model can also be regarded as a hidden Markov model with multiple outputs.

The joint probability of the model is a sum over each possible

Fig. 1. The φ, ψ angular degrees of freedom in one residue of the protein backbone. The ω dihedral angle can be assumed to be fixed at 180° (trans) or 0° (cis).

Fig. 2. The TorusDBN model. The circular nodes represent stochastic variables, whereas the rectangular boxes along the arrows illustrate the nature of the conditional probability distribution between them. The lack of an arrow between two nodes denotes that they are conditionally independent. A hidden node emits angle pairs, amino acid information, secondary structure labels (H, helix; E, strand; C, coil) and cis/trans information. One arbitrary hidden node value is highlighted in red and demonstrates how the hidden node value controls which mixture component is chosen.


hidden node sequence h = {h1, . . . , hN}, where N denotes the length of the protein:

$$P(\mathbf{d},\mathbf{a},\mathbf{s},\mathbf{c}) = \sum_{\mathbf{h}} P(\mathbf{d}\,|\,\mathbf{h})\,P(\mathbf{a}\,|\,\mathbf{h})\,P(\mathbf{s}\,|\,\mathbf{h})\,P(\mathbf{c}\,|\,\mathbf{h})\,P(\mathbf{h}) = \sum_{\mathbf{h}} \prod_{i} P(d_i\,|\,h_i)\,P(a_i\,|\,h_i)\,P(s_i\,|\,h_i)\,P(c_i\,|\,h_i)\,P(h_i\,|\,h_{i-1}).$$
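The sum over hidden sequences can be evaluated efficiently with the standard forward algorithm. The following is a minimal Python sketch for a toy hidden Markov model with multiple conditionally independent outputs; all numbers are made up for illustration, and this is not the Mocapy implementation used in the paper.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(xs)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_joint(obs_logp, log_trans, log_init):
    """Forward algorithm: log P(observations), i.e. the sum over all
    hidden paths of the joint probability above. obs_logp[i][h] holds
    log P(d_i|h) + log P(a_i|h) + log P(s_i|h) + log P(c_i|h): the
    emissions are conditionally independent given the hidden state,
    so their log-probabilities simply add."""
    K = len(log_init)
    alpha = [log_init[h] + obs_logp[0][h] for h in range(K)]
    for i in range(1, len(obs_logp)):
        alpha = [obs_logp[i][h]
                 + logsumexp([alpha[g] + log_trans[g][h] for g in range(K)])
                 for h in range(K)]
    return logsumexp(alpha)

# Tiny 2-state, 3-position example with made-up numbers.
log_init = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.2), math.log(0.8)]]
obs_logp = [[math.log(0.9), math.log(0.1)],
            [math.log(0.5), math.log(0.5)],
            [math.log(0.2), math.log(0.8)]]
log_p = log_joint(obs_logp, log_trans, log_init)
```

At this toy scale the result can be checked against brute-force enumeration of all hidden paths; the forward recursion computes the same sum in O(N K²) rather than O(Kᴺ).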

The four types of emission nodes (d, a, s, and c) can each be used either as input or output. In most cases, some input information is available (e.g., the amino acid sequence), and the corresponding emission nodes are subsequently fixed to specific values. These nodes are referred to as observed nodes. Sampling from the model then involves two steps: (i) sampling a hidden node sequence conditioned on the set of observed nodes and (ii) sampling emission values for the unobserved nodes conditioned on the hidden node sequence. The first step is most efficiently solved by using the forward–backtrack algorithm (12, 9) [see supporting information (SI) Text], which allows for the resampling of any segment of a chain. This resembles fragment insertion in fragment assembly-based methods, but the forward–backtrack approach has the advantage that it ensures a seamless resampling that correctly handles the transitions at the ends of the segment. Once a particular sequence of hidden node values has been obtained, emission values for the unobserved nodes are drawn from the corresponding conditional probability distributions (step ii). This is illustrated in Fig. 2, where the emission probability distributions for a particular h value are highlighted.
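The forward–backtrack idea can be sketched for a toy two-state model (the numbers below are hypothetical; this is not the authors' implementation). After a forward pass, hidden states are drawn backwards, each conditioned on the state just sampled to its right, which yields exact draws from P(h | observations).

```python
import math
import random

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def _draw(logits, rng):
    """Draw an index proportionally to exp(logits)."""
    m = max(logits)
    w = [math.exp(x - m) for x in logits]
    r = rng.random() * sum(w)
    for idx, wi in enumerate(w):
        r -= wi
        if r <= 0:
            return idx
    return len(w) - 1

def sample_hidden(obs_logp, log_trans, log_init, rng=random):
    """Forward--backtrack sampling of a hidden-state sequence from
    P(h | observations): forward messages first, then backwards sampling
    with P(h_i | h_{i+1}, obs) proportional to alpha_i(h_i) * trans(h_i, h_{i+1})."""
    K, N = len(log_init), len(obs_logp)
    alpha = [[0.0] * K for _ in range(N)]
    for h in range(K):
        alpha[0][h] = log_init[h] + obs_logp[0][h]
    for i in range(1, N):
        for h in range(K):
            alpha[i][h] = obs_logp[i][h] + logsumexp(
                [alpha[i - 1][g] + log_trans[g][h] for g in range(K)])
    h_seq = [0] * N
    h_seq[-1] = _draw(alpha[-1], rng)
    for i in range(N - 2, -1, -1):
        logits = [alpha[i][g] + log_trans[g][h_seq[i + 1]] for g in range(K)]
        h_seq[i] = _draw(logits, rng)
    return h_seq

# Toy 2-state model (hypothetical numbers, for illustration only).
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
obs_logp = [[math.log(0.8), math.log(0.2)],
            [math.log(0.8), math.log(0.2)],
            [math.log(0.3), math.log(0.7)]]
rng = random.Random(0)
draws = [sample_hidden(obs_logp, log_trans, log_init, rng) for _ in range(4000)]
freq0 = sum(1 for d in draws if d[0] == 0) / len(draws)
```

Resampling only an interior segment works the same way, except that the backtrack is additionally conditioned on the fixed hidden state just beyond the segment boundary, which is what makes the resampling seamless.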

The parameters of the model were estimated from the SABmark 1.65 (13) dataset (see Methods). From the 1,723 proteins, 276 were excluded during training and used for testing purposes (test set).

We conducted a series of experiments to evaluate the model's performance. Throughout this article, we will be comparing the results obtained with our model (TorusDBN) to the results achieved with one of the most successful fragment assembly-based methods currently available, the Rosetta fragment assembler (3). Because our interest in this study is limited to modeling local protein structure, we exclusively enabled Rosetta's initial fragment-assembly phase, disabling any energy evaluations apart from clash detection. In all cases, as input to Rosetta, we used the amino acid sequence of the query structure, multiple sequence information from PSI-BLAST (14), and a predicted secondary structure sequence using PSIPRED (15).

Angular Preferences. As a standard quality check of protein structure, a Ramachandran plot is often used by crystallographers to detect possible angular outliers. We investigated how closely the Ramachandran plot of samples from our model matched the Ramachandran plot for the corresponding native structures.

For each protein in the test set, we extracted the amino acid sequence and calculated a predicted secondary structure labeling using PSIPRED. We then sampled a single structure using the sequence and secondary structure labels as input and summarized the sampled angle pairs in a 2D histogram. Fig. 3 shows the histograms for the test set and the samples, respectively. The results are strikingly similar. Although the experiment reveals little about the detailed sequence–structure signal in our model, it provides a first indication that a mixture of bivariate von Mises distributions is an appropriate choice to model the angular preferences of the Ramachandran plot.

We proceeded with a comparison to Rosetta. For each protein in the test set, we created a single structure using Rosetta's fragment assembler and compared the resulting histogram to that of the test set. Also in this case, the produced plot is visually indistinguishable from the native one (plot not shown). However, by using the Kullback–Leibler (KL) divergence, a standard measure of distance between probability distributions, it becomes clear that the Ramachandran plot produced by the TorusDBN is closer to native than the plot produced by Rosetta (see SI Text and Table S1).

Fig. 3. Ramachandran plots displaying the distribution of the 42,654 angle pairs in the test set (Left), and an equal number of angle pairs from sampled proteins from the model (Right).

Fig. 4. Hidden node paths corresponding to well known structural motifs (panels: N capping box, Schellman C cap, Proline C cap, β-turn types I, II, and VIII, and β-hairpin types I′ and II′). The angular preferences for the hidden node paths are illustrated by using the mean φ (ψ) value as a circle (square), with the error bars denoting 1.96 standard deviations of the corresponding bivariate von Mises distribution. Because the angular distributions are approximately Gaussian at high concentrations, this corresponds to ≈95% of the angular samples from that distribution. In cases where ideal angular preferences for these motifs are known from the literature, they are specified in green. H, hidden node sequence; SS, secondary structure labeling (H, helix; E, strand; C, coil), with corresponding emission probabilities in parentheses.


Structural Motifs. The TorusDBN models the sequential dependencies along the protein backbone through a first-order Markov chain of hidden states. In such a model, we expect longer range dependencies to be modeled as designated high-probability paths through the model.

By manually inspecting the paths of length 4 with highest probability according to the model (based on their transition probabilities), we indeed recovered several well known structural motifs. Fig. 4 demonstrates how eight well known structural motifs appear as such paths in the model. Both the emitted angle pairs (Fig. 4) and the amino acid preferences (Fig. S1) have good correspondence with the literature (16, 17) (see SI Text). All reported paths are among the 0.25% most probable 4-state paths in the model (out of the 55⁴ possible paths).

Often, structural motifs will arise from combinations of several hidden node paths. By summing over the contributions of all possible paths [posterior decoding (18)], it is possible to extract this information from the model. To illustrate, we reversed the analysis of the structural motifs, by giving the ideal angles and secondary structure labeling of a motif as input to the model, and calculating the posterior distribution over amino acids at each position. Table 1 lists the top three preferred amino acids for each position in the different β-turn motifs. All of these amino acids have previously been reported to have high propensities at their specific positions (17).
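Posterior decoding can be sketched with forward–backward messages: the posterior over hidden states at each position, γᵢ(h), is combined with the per-state amino acid emission probabilities to give P(aᵢ | observed inputs). The two-state model and two-letter alphabet below are toy assumptions, not the trained TorusDBN.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def posterior_aa(obs_logp, log_trans, log_init, emit_aa):
    """For each position i, return P(a_i | observed inputs) as
    sum_h P(a|h) * gamma_i(h), with gamma from forward-backward."""
    K, N = len(log_init), len(obs_logp)
    fwd = [[0.0] * K for _ in range(N)]
    bwd = [[0.0] * K for _ in range(N)]
    for h in range(K):
        fwd[0][h] = log_init[h] + obs_logp[0][h]
    for i in range(1, N):
        for h in range(K):
            fwd[i][h] = obs_logp[i][h] + logsumexp(
                [fwd[i - 1][g] + log_trans[g][h] for g in range(K)])
    for i in range(N - 2, -1, -1):
        for h in range(K):
            bwd[i][h] = logsumexp(
                [log_trans[h][g] + obs_logp[i + 1][g] + bwd[i + 1][g]
                 for g in range(K)])
    result = []
    for i in range(N):
        logits = [fwd[i][h] + bwd[i][h] for h in range(K)]
        z = logsumexp(logits)
        gamma = [math.exp(l - z) for l in logits]
        result.append([sum(gamma[h] * emit_aa[h][a] for h in range(K))
                       for a in range(len(emit_aa[0]))])
    return result

# Toy inputs: 2 hidden states, 2 positions, a 2-letter "amino acid" alphabet.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.4), math.log(0.6)]]
obs_logp = [[math.log(0.9), math.log(0.1)],
            [math.log(0.2), math.log(0.8)]]
emit_aa = [[0.7, 0.3], [0.1, 0.9]]   # P(a|h), made-up numbers
aa_post = posterior_aa(obs_logp, log_trans, log_init, emit_aa)
```

Dividing each posterior by the stationary probability P(a) of the amino acid then yields a propensity in the sense of Table 1.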

Sampling Structures. We conclude with a demonstration of the model's performance beyond the scope of well defined structural motifs. In the context of de novo structure prediction, the role of the model is that of a proposal distribution, where repeated resampling of angles should lead to an efficient exploration of conformational space. In this final experiment, we therefore sampled dihedral angles for the proteins in our test set and investigated how closely the sampled angles match those of the native state.

For each protein in the test set, 100 structures were sampled, and the average angular deviation was recorded (see SI Text). This was done for an increasing amount of input to the model. Initially, samples were generated without using input information, resulting in unrestricted samples from the model. We then included the amino acid sequence of the protein, a predicted secondary structure labeling (using PSIPRED), and, finally, a combination of both. We ran the same test with Rosetta's fragment assembler for comparison.

Fig. 5 shows the distribution of the average angular distance over all proteins in the test set. Clearly, as more information becomes available, the samples lie more closely around the native state. When both amino acid and secondary structure information is used, the performance of the TorusDBN approaches that of the fragment assembler in Rosetta. Recall that Rosetta also uses both amino acid and secondary structure information in its predictions but, in addition, incorporates multiple sequence information directly, which TorusDBN does not. In this light, our model performs remarkably well in this comparison. The time necessary to generate a single sample, averaged over all of the proteins in the test set, was 0.08 s for our model and 1.30 s for Rosetta's fragment assembler. All experiments were run on a 2,800 MHz AMD Opteron processor.

To illustrate the effect of the different degrees of input, we include a graphical view of two representative fragments extracted from the samples on the test set (Fig. 6). Note how the sequence and secondary structure input provide distinct signals to the model. In the hairpin motif, the sequence-only signal creates structures with an excess of coil states around the hairpin, whereas the inclusion of only secondary structure input gets the secondary structure elements right but fails to make the turn correctly. Finally, with both inputs, the secondary structure boundaries of the motif are correct, and the quality of the turn is enhanced through the knowledge that the sequence motif Asp-Gly is found at the two coil positions, which is common for a type I′ hairpin (17).

Additional Evaluations. We conducted several additional experiments to evaluate other aspects of the model. First, we performed a detailed evaluation of TorusDBN's performance on local structure motifs using the I-sites library (19) (SI Text and Figs. S2–S4).

Fig. 5. Box-plots of the average angular deviation in radians (see SI Text) between native structures from the test set, and 100 sampled structures. From left to right, an increasing amount of information was given to the model: no input data, amino acid input data (Seq), predicted secondary structure input data (SS), and a combination of both (Seq+SS). The rightmost box corresponds to candidate structures generated by the fragment assembler in Rosetta.

Table 1. Amino acid propensities for turn motifs calculated by using TorusDBN

Name                Position  Input (φ, ψ)  Input SS  Output AA
β-Turn type I       1         (−60, −30)    C         P (3.2130)   S (1.5816)  E (1.3680)
β-Turn type I       2         (−90, 0)      C         D (2.4864)   N (2.1854)  S (1.5417)
β-Turn type II      1         (−60, 120)    C         P (3.9598)   K (1.4291)  E (1.4234)
β-Turn type II      2         (80, 0)       C         G (10.6031)  N (1.0152)
β-Turn type VIII    1         (−60, −30)    C         P (3.4599)   S (1.3431)  D (1.3290)
β-Turn type VIII    2         (−120, 120)   C         V (1.9028)   I (1.8459)  F (1.3373)
β-Hairpin type I′   1         (60, 30)      C         N (5.9596)   D (2.3904)  H (1.6610)
β-Hairpin type I′   2         (90, 0)       C         G (12.4208)
β-Hairpin type II′  1         (60, −120)    C         G (11.2226)
β-Hairpin type II′  2         (−80, 0)      C         N (2.9914)   D (2.8430)  H (1.5844)

The propensity of a particular amino acid (columns 5–7) at a certain position (column 2) in a motif (column 1) is calculated as the posterior probability P(a|d, s) divided by the probability of that amino acid according to the stationary distribution P(a) of the model. Angular and secondary structure input are listed in columns 3 and 4. The three most preferred amino acids (with propensities >1) are reported.


Second, we compared TorusDBN directly to HMMSTR in the recognition of decoy structures from native (SI Text and Tables S2 and S3), and finally, the length distributions of secondary structure elements in samples were analyzed (SI Text and Fig. S5). All these studies lend further support to the quality of the model.

Potential Applications. In closing, we list a few potential applications for the described model. First and foremost, it is in the context of de novo predictions that we expect the greatest benefits from our model. Seamless resampling and probability evaluations of proposed structures should provide a better sampling of conformational space, allowing calculations of thermodynamical averages in MCMC simulations (20). There are, however, several other potential areas of application: (i) homology modeling, where the model is potentially useful as a proposal distribution for loop closure tasks; (ii) quality verification of experimentally determined protein structures, where it is likely that the sequential signal in our model constitutes an advantage over the current widespread use of Ramachandran plots to detect outliers; and (iii) protein design, where the model might be used to predict or sample amino acid sequences that are locally compatible with a given structure (as was demonstrated for short motifs in Table 1).

Methods

Parameter Estimation. The model was trained by using the Mocapy DBN toolkit (21). As training data, we used the SABmark 1.65 twilight protein dataset, which for each different SCOP fold provides a set of structures with low sequence similarity (13). Training was done on structures from 180 randomly selected folds (1,447 proteins, 226,338 observations), whereas the remaining 29 folds (276 proteins, 42,654 observations) were used as a test set. Amino acid, trans/cis peptide bond, and angle pair information was extracted directly from the training data, whereas secondary structure was computed by using DSSP (22).

Because the hidden node values are inherently unobserved, an algorithm capable of dealing with missing data is required. Here, we used a stochastic version of the well known expectation-maximization (EM) algorithm (23, 24). The idea behind stochastic EM (25, 26) is to first fill in plausible values for all unobserved nodes (E-step), and then update the parameters as if the model was fully observed (M-step). Just as with classic EM, these two steps are repeated until the algorithm converges. In our case, for each observation in the training set, we sampled a corresponding h value, using a single sweep of Gibbs sampling: in random order, all h values were resampled based on their current left and right neighboring h values and the observed emission values at that residue. Computationally, stochastic EM is more efficient than classic EM. Furthermore, on large datasets, stochastic EM is known to avoid convergence to local maxima (26).
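The E-step described here (one sweep of Gibbs sampling over the hidden values) might be sketched as follows; the two-state model and its near-deterministic emissions are made-up numbers for illustration.

```python
import math
import random

def gibbs_sweep(h, obs_logp, log_trans, log_init, rng=random):
    """One Gibbs sweep over the hidden values: in random order, each h_i
    is resampled from P(h_i | h_{i-1}, h_{i+1}, observations at i),
    which is proportional to trans(h_{i-1}, v) * emit(v) * trans(v, h_{i+1})."""
    K, N = len(log_init), len(h)
    for i in rng.sample(range(N), N):   # random visitation order
        logits = []
        for v in range(K):
            lp = obs_logp[i][v]
            lp += log_init[v] if i == 0 else log_trans[h[i - 1]][v]
            if i < N - 1:
                lp += log_trans[v][h[i + 1]]
            logits.append(lp)
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        r = rng.random() * sum(weights)
        for v, w in enumerate(weights):
            r -= w
            if r <= 0:
                h[i] = v
                break
    return h

# Toy 2-state model with near-deterministic emissions (made-up numbers).
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.5), math.log(0.5)],
             [math.log(0.5), math.log(0.5)]]
obs_logp = [[0.0, -50.0], [-50.0, 0.0], [0.0, -50.0]]
rng = random.Random(0)
h = [0, 0, 0]
for _ in range(5):
    gibbs_sweep(h, obs_logp, log_trans, log_init, rng)
```

In stochastic EM the filled-in h values are then treated as fully observed data and the parameters are re-estimated by simple counting (the M-step), which is not shown here.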

The optimal size of the hidden node (i.e., the number of states that it can adopt) is a hyperparameter that is not automatically estimated by the EM procedure. We optimized this parameter by training models for a range of sizes, evaluating the likelihood for each model using the forward algorithm (18). Because the training procedure is stochastic in nature, we repeated this procedure several times. The best model was selected by using the Bayesian Information Criterion (BIC) (27), a score based on likelihood, which penalizes an excess of parameters and thereby avoids overfitting (see SI Text). As displayed in Fig. 7, the BIC reaches a maximum at a hidden node size of ≈55. The model, however, appears to be quite stable with regard to the choice of this parameter. Several of the experiments in our study were repeated with different h size models (size 40–80) without substantially affecting the results.

Fig. 6. Two representative examples of samples generated by TorusDBN on the proteins in the test set (1eeoA, position 2–14 and 1kzlA, position 46–56). Each image contains the native structure in blue and a cloud of 100 sampled structures. The sampled structure with minimum average distance to all other samples is chosen as representative and highlighted in red. From left to right, an increasing amount of input is given to the model. Note that the leftmost structures are sampled without any input information and are therefore not specific to these proteins. They are included here merely as a null model. Figures were created by using Pymol (29).

Fig. 7. BIC values for models with varying hidden node size. For each size, four independent models were trained. The model used for our analyses is highlighted in red.
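A minimal sketch of the BIC computation, in the "larger is better" form implied by the text (log-likelihood minus a complexity penalty). The parameter count for a TorusDBN-like model below is a rough assumption for illustration only, not the authors' exact bookkeeping (a careful count would, e.g., subtract normalization constraints).

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Schwarz's criterion, log L - (k/2) * ln(n): higher is better."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

def torusdbn_param_count(n_hidden, n_aa=20, n_ss=3):
    """Rough parameter count (an assumption for illustration): K*K
    transition entries, plus per state 5 von Mises parameters, n_aa
    amino acid probabilities, n_ss secondary structure probabilities,
    and one cis/trans probability."""
    return n_hidden * n_hidden + n_hidden * (5 + n_aa + n_ss + 1)

# Comparing two hypothetical model sizes on 226,338 observations:
score_small = bic(-1.70e6, torusdbn_param_count(40), 226338)
score_large = bic(-1.69e6, torusdbn_param_count(80), 226338)
```

The penalty term grows quadratically with the hidden node size, so a larger model must buy a substantial likelihood gain to be preferred.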

Angular Probability Distribution. The Ramachandran plot is well known in crystallography and biochemistry. The plot is usually drawn as a projection onto the plane, but because of the periodicity of the angular degrees of freedom, the natural space for these angle pairs is on the torus. To capture the angular preferences of protein backbones, a mixture of Gaussian-like distributions on this surface is therefore an appropriate choice. We turned to the field of directional statistics for a bivariate angular distribution with Gaussian-like properties that allows for efficient sampling and parameter estimation. From the family of bivariate von Mises distributions, we chose the cosine variant, which was especially developed for this purpose by Mardia et al. (28). The density function is given by

$$f(\phi,\psi) = c(\kappa_1,\kappa_2,\kappa_3)\,\exp[\kappa_1\cos(\phi-\mu) + \kappa_2\cos(\psi-\nu) - \kappa_3\cos(\phi-\mu-\psi+\nu)]. \qquad [1]$$

The distribution has five parameters: μ and ν are the respective means for φ and ψ, κ1 and κ2 their concentrations, and κ3 is related to their correlation (Fig. 8). The parameters can be efficiently estimated by using a moment-estimation technique. Efficient sampling from the distribution is achieved by rejection sampling, using a mixture of two von Mises distributions as a proposal distribution (see SI Text).
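Eq. 1 can be evaluated up to its normalizing constant in a few lines; the normalizer below is approximated by brute-force midpoint integration over the torus. This is a sketch only: the paper estimates parameters by moment estimation and samples by rejection sampling, neither of which is shown here.

```python
import math

def cosine_log_density(phi, psi, mu, nu, k1, k2, k3):
    """Log of the bivariate von Mises cosine density (Eq. 1), up to
    the normalizing constant c(k1, k2, k3)."""
    return (k1 * math.cos(phi - mu) + k2 * math.cos(psi - nu)
            - k3 * math.cos(phi - mu - psi + nu))

def inverse_normalizer(mu, nu, k1, k2, k3, n=200):
    """Approximate 1/c(k1,k2,k3) by midpoint integration of the
    unnormalized density over the torus [-pi, pi)^2."""
    step = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            phi = -math.pi + (i + 0.5) * step
            psi = -math.pi + (j + 0.5) * step
            total += math.exp(cosine_log_density(phi, psi, mu, nu, k1, k2, k3))
    return total * step * step
```

Two quick sanity checks: with all concentrations zero the density is uniform, so the integral is (2π)², and with κ3 = 0 the density is maximal at (μ, ν).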

Availability. The TorusDBN model is implemented as part of the backboneDBN package, which is freely available at http://sourceforge.net/projects/phaistos/.

ACKNOWLEDGMENTS. We thank Mikael Borg, Jes Frellsen, Tim Harder, Kasper Stovgaard, and Lucia Ferrari for valuable suggestions to the paper; John Kent for discussions on the angular distributions; Christopher Bystroff for help with HMMSTR and the newest version of I-sites; and the Bioinformatics Centre and the Zoological Museum, University of Copenhagen, for use of their cluster computer. W.B. was supported by the Lundbeck Foundation, and T.H. was funded by Forskningsrådet for Teknologi og Produktion ("Data Driven Protein Structure Prediction").

1. Dill KA, Ozkan SB, Weikl TR, Chodera JD, Voelz VA (2007) The protein folding problem: When will it be solved? Curr Opin Struct Biol 17:342–346.
2. Jones TA, Thirup S (1986) Using known substructures in protein model building and crystallography. EMBO J 5:819–822.
3. Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–225.
4. Chikenji G, Fujitsuka Y, Takada S (2006) Shaping up the protein folding funnel by local interaction: Lesson from a structure prediction study. Proc Natl Acad Sci USA 103:3141–3146.
5. Jauch R, Yeo H, Kolatkar P, Clarke N (2007) Assessment of CASP7 structure predictions for template free targets. Proteins 69(Suppl 8):57–67.
6. Bystroff C, Thorsson V, Baker D (2000) HMMSTR: A hidden Markov model for local sequence–structure correlations in proteins. J Mol Biol 301:173–190.
7. Edgoose T, Allison L, Dowe DL (1998) An MML classification of protein structure that knows about angles and sequence. Pac Symp Biocomput 3:585–596.
8. Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S (1999) Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng Des Sel 12:1063–1073.
9. Hamelryck T, Kent JT, Krogh A (2006) Sampling realistic protein conformations using local structural bias. PLoS Comput Biol 2:e131.
10. Ramachandran GN, Ramakrishnan C, Sasisekharan V (1963) Stereochemistry of polypeptide chain configurations. J Mol Biol 7:95–99.
11. Engh RA, Huber R (1991) Accurate bond and angle parameters for x-ray protein structure refinement. Acta Crystallogr A 47:392–400.
12. Cawley SL, Pachter L (2003) HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19(Suppl 2):ii36–ii41.
13. Van Walle I, Lasters I, Wyns L (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21:1267–1268.
14. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
15. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202.
16. Aurora R, Rose GD (1998) Helix capping. Protein Sci 7:21–38.
17. Hutchinson EG, Thornton JM (1994) A revised set of potentials for β-turn formation in proteins. Protein Sci 3:2207–2216.
18. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis (Cambridge Univ Press, Cambridge, UK).
19. Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence–structure motifs. J Mol Biol 281:565–577.
20. Winther O, Krogh A (2004) Teaching computers to fold proteins. Phys Rev E 70:030903.
21. Hamelryck T (2007) Mocapy: A Parallelized Toolkit for Learning and Inference in Dynamic Bayesian Networks. Manual (Univ of Copenhagen, Copenhagen).
22. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637.
23. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38.
24. Ghahramani Z (1998) Learning dynamic Bayesian networks. Lect Notes Comput Sci 1387:168–197.
25. Diebolt J, Ip EHS (1996) Markov Chain Monte Carlo in Practice, eds Gilks WR, Richardson S, Spiegelhalter DJ (Chapman & Hall/CRC), pp 259–273.
26. Nielsen SF (2000) The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli 6:457–489.
27. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464.
28. Mardia KV, Taylor CC, Subramaniam GK (2007) Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics 63:505–512.
29. DeLano WL (2002) The PyMOL User's Manual (DeLano Scientific, San Carlos, CA).

Fig. 8. Samples from two bivariate von Mises distributions, corresponding to two hidden node states. The red samples (h-value 20) represent a highly concentrated distribution (κ1 = 65.4, κ2 = 45.7, κ3 = 17.3, μ = −66.2, ν = 149.6), whereas the blue samples (h-value 39) are drawn from a less concentrated distribution (κ1 = 3.6, κ2 = 1.9, κ3 = −0.8, μ = 67.4, ν = 96.2).


Supporting Information

Boomsma et al. 10.1073/pnas.0801715105

SI Text

Quantitative Analysis of the Quality of the Ramachandran Plot. This section contains details on Angular Preferences in Results and Discussion. As described in the main text, for each protein in the test set, a structure was generated by using TorusDBN with sequence and predicted secondary structure as input. The angular preferences of the generated samples were visually compared to the native preferences by using Ramachandran plots (Fig. 3). This section contains a more quantitative evaluation of the similarity between the plots and includes a comparison with Rosetta's fragment assembler.

As a measure of similarity, we used the Kullback–Leibler (KL) divergence, or relative entropy (1, 2), which is a natural distance between two probability distributions. For discrete probability distributions P and Q, it is defined as

$$\mathrm{KL}(P, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)},$$

where the sum runs over all possible values. Typically, P is an empirical probability distribution obtained from observations or data, or a calculated reference probability distribution, whereas Q represents a model or an approximation of P. The divergence is a non-negative number, becoming zero if and only if the two distributions are equal. If the logarithm in the formula is taken to be base 2, the divergence is measured in bits, whereas for base e, the unit is the nat, which is used here.

The KL divergence allows us to judge the similarity between the Ramachandran plot of the test set on the one hand, and the Rosetta and TorusDBN samples on the other. A continuous version of the KL divergence exists, but it cannot be applied here, because only TorusDBN allows for the calculation of a probability distribution over continuous space. Therefore, to apply the discrete KL divergence formula, we covered the Ramachandran plot area with a square n × n grid and calculated the probability for a point to belong to each square in the grid for the three sets (native data, TorusDBN samples, and Rosetta samples). In this case, the sum in the KL divergence formula thus runs over the number of squares in the grid, the native data are represented by the probability distribution P, and the samples are represented by Q. Squares without observations were avoided by merging them with populated neighbors.

We find that the KL divergence between the native data and the TorusDBN samples is smaller (by approximately 0.04 nats) than the KL divergence between the native data and the Rosetta samples. This holds irrespective of the size of the grid (ranging from 100 × 100 to 1,000 × 1,000; see Table S1). Hence, the KL divergence analysis indicates that the Ramachandran plot derived from the TorusDBN samples is more similar to the native Ramachandran plot than the one derived from the Rosetta samples.
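The grid-based KL computation described above can be sketched as follows. This is a hedged illustration: `grid_kl` is a hypothetical helper name, and the merging of empty squares with populated neighbors is replaced here by the cruder step of masking out squares that are empty in either set.

```python
import numpy as np

def grid_kl(p_angles, q_angles, n=100):
    """Discrete KL divergence (in nats) between two sets of (phi, psi)
    pairs, histogrammed on an n x n grid over the Ramachandran plot.

    Sketch only: the neighbor-merging step of the original procedure is
    replaced by dropping squares that are empty in either set."""
    edges = np.linspace(-np.pi, np.pi, n + 1)
    p, _, _ = np.histogram2d(p_angles[:, 0], p_angles[:, 1], bins=[edges, edges])
    q, _, _ = np.histogram2d(q_angles[:, 0], q_angles[:, 1], bins=[edges, edges])
    p, q = p.ravel() / p.sum(), q.ravel() / q.sum()
    mask = (p > 0) & (q > 0)   # crude stand-in for neighbor merging
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Identical samples give zero divergence, by definition.
rng = np.random.default_rng(0)
native = rng.uniform(-np.pi, np.pi, size=(10000, 2))
print(grid_kl(native, native, n=50))  # -> 0.0
```

As in the analysis above, the result depends on the grid size n, so comparisons should be made at a fixed n.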

Length Distribution of Secondary-Structure Elements. To get a first impression of the sequential dependencies encoded in TorusDBN, we investigated the degree to which the model captures the correct length distribution of secondary-structure elements.

For each structure in the test set, we sampled a structure of the same length using TorusDBN. We sampled from the model without input information, because we are not interested in the position-wise correct identification of secondary structure but, rather, the average length distributions of regions of the three secondary structure types. Both samples and native structures were annotated with secondary structure by using P-SEA (3).

The overall percentages of α-helix (data, 33%; samples, 31%), β-strand (data, 26%; samples, 27%), coil (data, 41%; samples, 42%), and cis-peptides (data, 0.24%; samples, 0.24%) were remarkably similar. More importantly, the lengths of the observed secondary structure elements were also distributed similarly in the two sets (Fig. S5). We did observe some discrepancies for helix lengths. The model produces an excess of short helices compared with the dataset, which might be explained by a stabilization of helices through nonlocal interactions. However, for all secondary structure types, the general form of the length distribution is very well reproduced, indicating that the first-order Markov property of the model is sufficient to capture the necessary sequential dependencies, at least to a first approximation.

Note that the P-SEA algorithm assigns secondary structure without regard to hydrogen bonds and thus detects β-strands even when they are not stabilized in sheets. If a hydrogen bond-based assignment method [e.g., DSSP (4)] is used for the comparison, samples from our model are annotated with hardly any β-strands. This is to be expected, because our model contains only local information. In a full folding simulation, the formation of β-sheets would be the domain of a hydrogen-bond term in an energy function.

Structural Motifs. High-probability hidden node paths encoding structural motifs. Several high-probability paths of hidden nodes correspond to common structural motifs. These were listed in Fig. 4. Fig. S1 contains the corresponding amino acid preferences. The preferences were measured as the ratio between the emission probability and the probability of that amino acid according to the stationary distribution of the model (5). Only amino acids that are preferred over the baseline (ratio > 1) are included in the figure.

The following is known about these motifs in the literature (in parentheses, we indicate how the mentioned motif positions correspond to the positions in Fig. S1):

(A) The N-capping box normally consists of an S, T, N, or D residue at the capping position (position 2), often preceded by (position 1) a hydrophobic residue (L, I, V, M, or F) (6). P and E occur frequently at the first helix position (position 3), whereas the second helix position (position 4) has a high preference for E and D (7). Generally, G is known to occur frequently at the N-cap position (position 2) because it allows a better solvation of the amide groups (6).

(B) The Schellman motif has a very high propensity for G at the position after the cap (position 3) because of the requirement for a positive value of the backbone dihedral angle φ. The residue following G (position 4) is apolar, most frequently I, L, and F (8). The residue preceding the cap (position 1) is usually either A or a polar residue (9).

(C) The proline C-cap is mainly identified by the P at position C′ (position 3). At the capping position (position 2), H and N have been identified as the most preferred candidates (8).

(D–F) The amino acids with the highest preferences at each position in the β-turn type I have been identified as position 1: D,N,H,C; position 2: P,E,S; position 3: D,N,T,S; position 4: G (10). In the same study, Hutchinson and Thornton found the preferred amino acids in β-turn type II to be position 1: P,T; position 2: P,K; position 3: G,N; position 4: C,K,S, and for β-turn type VIII, position 1: P,G; position 2: P,D; position 3: N,D,V,F; position 4: P,T.

(G and H) Preferred amino acids in the two loop positions of β-hairpin type I′ have previously been identified as position 2: N,H,D,G and position 3: G (11, 10). In the loop positions of β-hairpin type II′, these are position 2: G and position 3: N,D,S (11, 10).

Boomsma et al. www.pnas.org/cgi/content/short/0801715105 1 of 12

In general, we observe an excellent agreement between the amino acid emission probabilities of the TorusDBN and the amino acid preferences identified for these motifs in previous studies.

Naturally, given the limited number of hidden nodes, the model must reuse the same hidden nodes in different contexts. This effect can be seen, for instance, in the recurrent presence of h = 4 in Fig. S1. We have also noted that the model tends to use the same path of hidden nodes to model the Schellman cap as it does to model the first part of the helix–turn–helix motif (data not shown). Such recycling of nodes explains the few discrepancies with the literature that we observe in Fig. S1. For instance, although M is not known to occur at the first position of the Schellman cap, it fits perfectly with the helix–turn–helix motif, where the following sequences are most frequently observed: position 1: M,A; position 2: L,M; position 3: G; position 4: V,M,I (12).

The I-sites library. Here, we extend the structural motif experiment from the article with a much larger set of motifs: the I-sites fragment library (version 16.10) (13). For each fragment, we compared samples from our model with the actual structure of the fragment [the so-called paradigm structure (13)]. To avoid introducing edge effects, for each motif, we sampled structures for the entire protein and then extracted the relevant fragment.

We ran the model with an increasing amount of input information (no input, sequence only, secondary structure only, sequence and secondary structure). For each fragment and for each input setting, 1,000 structures were sampled from the model, and the average RMSD to the fragment structure was calculated (Fig. S2). Two examples are highlighted to give a visual impression of the effect of the various types of input (Fig. S3).

As was the case in Fig. 5, the sampled structures are centered more tightly around the library fragment structure as more information becomes available. We also note that there is a significant signal in the secondary structure information, compared with the sequence alone. There are two plausible explanations for this observation. First, secondary structure prediction methods like PSIPRED (14) base their prediction on a window around the position that is currently being predicted and are thus capable of capturing certain nonlocal dependencies that occur within these windows. Second, the predictions of PSIPRED are based on the sequences of a range of homologous proteins rather than just the sequence of the target structure. Note that these two explanations correspond to very distinct sources of information: either a nonlocal signal corresponding to dependencies within a window, or the evolutionary variation in the amino acid sequence that occurs for a given structure. To identify the source of the signal, we turned matters around and used TorusDBN to predict secondary structure given sequence (using posterior decoding, as was done to predict amino acid labels in Table 1). We then compared the results with those obtained with the classic single-sequence window-based prediction method GOR-IV (15). If the additional signal originates from the nonlocal effects within a window, we would expect GOR-IV to significantly outperform TorusDBN. We measured the secondary structure prediction performance with the Q3 score, comparing predictions with the secondary structure assignment by DSSP (4). When evaluated on the entire test set, we see that the two methods have virtually identical performance (TorusDBN: 62.20%, GOR-IV: 62.22%). This indicates that the main source of the effect of the secondary structure input is probably the multiple sequence signal that is used by PSIPRED.
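The Q3 comparison above reduces to a simple per-position match count; a minimal sketch follows, with `q3_score` as a hypothetical helper name, assuming equal-length 3-state strings over the alphabet {H, E, C}.

```python
def q3_score(predicted, assigned):
    """Q3: percentage of positions where the predicted 3-state secondary
    structure label (H/E/C) matches the reference assignment.

    Minimal sketch; both arguments are assumed to be equal-length
    strings over {H, E, C}."""
    if len(predicted) != len(assigned):
        raise ValueError("sequences must have equal length")
    matches = sum(p == a for p, a in zip(predicted, assigned))
    return 100.0 * matches / len(assigned)

print(q3_score("HHHCCEE", "HHHCCCE"))  # one mismatch out of seven positions
```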

Fig. S2 also illustrates the performance of Rosetta's fragment assembler on the I-sites motifs. When both amino acid and secondary structure information is used, the performance of our model approaches that of the fragment assembler in Rosetta, which also uses both amino acid and secondary structure information in its predictions. The time necessary to generate a single sample, averaged over all of the structures containing I-sites fragments, was 0.17 seconds for our model and 7.77 seconds for Rosetta's fragment assembler. All experiments were run on a 2,800 MHz AMD Opteron processor.

There is a potential risk of overfitting in this study. Although we, in this experiment, exclude I-sites fragments from proteins that are present in the training set, this is generally not sufficient to avoid overfitting. Ideally, we should disqualify structural motifs with high sequence or structural similarity to our training set, but it is not trivial to decide which motifs qualify. In particular, the fact that each I-sites motif is a representative for a larger set of PDB structures (which is unknown to us) makes it difficult to make rigorous rules for exclusion. Furthermore, it is not immediately clear whether it is meaningful to exclude or include motifs based on the homology properties or fold classification of the proteins in which they are contained, which are often at least 10 times as long. On the other hand, exclusion based on the motifs themselves is problematic because of the short lengths of these fragments. Note, however, that the Rosetta results in Fig. S2 suffer from the same problem, probably to an even greater extent. The fairness of the comparison relies on the assumption that Rosetta uses an internal fragment library that is significantly different from the I-sites library. In reality, given their common origin, it is likely that some of the fragments in the I-sites library are actually contained within Rosetta's fragment library. We stress that the study that is included in our article (Figs. 5 and 6) does not have these overfitting issues, because the training and test sets are designed to be clearly separated with respect to both sequence and structure.

Sampling structures—secondary structure-specific analysis. Fig. S4 corresponds to Fig. 5 but breaks down the signal into the three types of secondary structure. It thus illustrates how the angle-prediction performance varies for α-helix, β-strand, and coil. We note that TorusDBN seems to outperform Rosetta's fragment assembler on the helix regions, whereas it performs worse in the coil regions. This is not surprising, because the shape of coils is, to a great extent, determined by nonlocal interactions and is therefore difficult to capture by local models. Some semilocal motifs involving coil regions do exist but are probably better captured with individual fragments in a library. For instance, although hairpin motifs are convincingly captured by TorusDBN (Fig. 6, lower, rightmost image), the model clearly has no notion of the hydrogen bonding that is necessary to make the strands line up perfectly. A hairpin fragment in a fragment library would, in general, capture the strand endpoints of an ideal hairpin more accurately, but any deviations from the ideal would have to be modeled by separate fragments.

In general, hydrogen bonding between strands is a nonlocal effect that is normally modeled with a nonlocal energy function. The fact that fragment libraries can capture such bonds in certain situations (e.g., in hairpins) is not as great an advantage as it might seem. To avoid scoring the same structural properties twice with both the (implicit) local energy (i.e., the fragment assembler) and the nonlocal energy function, one must have a clear division between the concepts of local and nonlocal structure.

Comparison with HMMSTR: Evaluation of Decoys. The TorusDBN was designed to be used as a proposal distribution in de novo protein structure-prediction simulations. In such applications, the model will often act as an implicit energy term in the simulation. Here, we have a look at the quality of this energy term and how it compares with the energy encoded in the HMMSTR method (16).

Over the years, a number of decoy sets have been published. The underlying idea is to hide the native structure of a protein among a set of artificially created structures. Energy functions can then be benchmarked on these sets by their ability to detect the native structure among the decoys. We will determine to what degree the TorusDBN is capable of making this distinction, based solely on the local structure of the decoys. As a comparative method, we use the established HMMSTR model. HMMSTR, like TorusDBN, is a probabilistic model of local structure, but, unlike TorusDBN, it uses a discretized representation of the conformational space. To the best of our knowledge, HMMSTR is the only publicly available probabilistic model of detailed local structure and sequence and thus forms the ideal subject for a direct comparison. Note that a comparison with conventional knowledge-based energy functions would not be fair, because they include nonlocal terms, which is not the case for HMMSTR and TorusDBN.

We tested the ability of both methods to recognize the native structure among a set of decoy structures for a standard set of 35 high-quality crystal structures (17). The decoys were generated by a wide array of methods, ranging from lattice enumeration to molecular dynamics (see ref. 17 for a discussion and a list of decoy set references). This standard decoy set was augmented with recent sets generated by Rosetta [the rosetta07 decoy set (18)] and TASSER (19), which both belong to the current state-of-the-art methods in de novo protein structure prediction. In total, 49 decoy sets were used. We used two measures to compare the performance of HMMSTR and TorusDBN. First, we evaluated how high the native structure is ranked among the decoys, according to the logarithm of the joint probability of local structure and sequence, log(P(X, A)). Second, we calculated the Z score, which is the number of standard deviations σ between log(P(X_n, A)) of the native structure X_n and the mean μ of log(P(X, A)) for the whole set:

$$Z = \frac{\log(P(X_n, A)) - \mu}{\sigma}. \tag{2}$$

A large, positive Z score is ideal. In order not to bias the comparison in our favor, we retrained the TorusDBN model so that the training set did not contain any folds present among the native structures of the decoy set.
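Eq. 2 is straightforward to compute. A minimal sketch follows; it assumes, as one reading of "the whole set", that the native structure is included when computing the mean and standard deviation, and the log-probabilities in the example are made up for illustration, not model output.

```python
import numpy as np

def decoy_z_score(native_log_p, decoy_log_ps):
    """Z score of Eq. 2: the number of standard deviations between the
    native structure's log(P(X_n, A)) and the mean over the whole set.

    Assumption: 'the whole set' includes the native structure itself."""
    all_log_ps = np.append(np.asarray(decoy_log_ps, dtype=float), native_log_p)
    mu, sigma = all_log_ps.mean(), all_log_ps.std()
    return (native_log_p - mu) / sigma

# Illustrative (made-up) values; a positive Z score means the native
# structure scores above the mean of the set.
decoys = [-1250.0, -1230.0, -1270.0, -1240.0]
z = decoy_z_score(-1180.0, decoys)
```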

The decoys created by Rosetta proved to be considerably more challenging than the remaining decoys, indicating that the local structure in these decoys is probably of higher quality. We therefore present the results separately for Rosetta in Table S3. The remaining results are listed in Table S2. When comparing the performance of HMMSTR and TorusDBN, it becomes clear that TorusDBN performs very favorably. In Table S2, TorusDBN recognizes the native structures in all 28 cases, whereas HMMSTR fails in 8. Also, the Z value of TorusDBN is higher than the one obtained from HMMSTR in all but one case. The Rosetta decoys are clearly more difficult. Here, TorusDBN recognizes the native state in only 6 of 21 cases but still outperforms HMMSTR's result of 4. Note that the negative cases are primarily from the rosetta07 decoy set.

Generally, given the fact that only local structure is taken into account, both HMMSTR and TorusDBN perform remarkably well in discriminating native structure from decoys (24 and 34 of 49 cases, respectively). This is true for all decoy-generating methods, with the exception of the most recent Rosetta decoys (rosetta07), for which both methods fail to identify the native structure in all 10 cases. This result confirms that the most recent version of the Rosetta method performs very well in the modeling of local structure. We conclude that TorusDBN compares very favorably with the established HMMSTR method and point out that TorusDBN is a generative model that can be used for sampling local structure directly in continuous space, which is not the case for HMMSTR because of its discretized nature.

Methods. Estimation of the number of hidden nodes: the BIC measure. The Bayesian Information Criterion (BIC) is a well-established statistical method for model selection (20, 21, 22). The score is calculated as

$$\mathrm{BIC} = -2\ln(L) + p\ln(n), \tag{3}$$

where L is the maximum likelihood of the model, p is the number of parameters, and n is the number of observations. Intuitively, for an increasing number of parameters, the maximum likelihood of a model will generally increase, making it impossible to do model selection based on likelihood alone. This measure therefore includes a term that penalizes any increase in the number of parameters that is not supported by a significant increase in likelihood. By optimizing this measure, one seeks to find the optimal number of parameters, given the available amount of data, in an attempt to avoid overfitted models. In practice, the BIC is known to slightly overpenalize the complexity of a model and thereby has a tendency toward underfitting (2, 23).

Bivariate von Mises distribution—parameter estimation and sampling. The parameters of the bivariate von Mises distribution (cosine variant) can be efficiently estimated by using a moment-estimation technique. The means μ and ν can be estimated simply as the empirical means of φ and ψ. For κ1, κ2, and κ3, we used the following technique. For large concentrations, we have

$$\cos(x) \approx 1 - \tfrac{1}{2}x^2, \tag{4}$$

and the density function can be approximated by a bivariate Gaussian with inverse covariance matrix (24)

$$C_{\phi,\psi}^{-1} = \begin{pmatrix} \kappa_1 - \kappa_3 & \kappa_3 \\ \kappa_3 & \kappa_2 - \kappa_3 \end{pmatrix}. \tag{5}$$

Using x ≈ sin(x), we have

$$C_{\phi,\psi} = \begin{pmatrix} \operatorname{var}(\sin\phi) & \operatorname{cov}(\sin\phi, \sin\psi) \\ \operatorname{cov}(\sin\phi, \sin\psi) & \operatorname{var}(\sin\psi) \end{pmatrix}, \tag{6}$$

which gives us the moment estimates

$$\hat{\kappa}_1 = \frac{\bar{S}_2 - \bar{S}_{12}}{\bar{S}_1\bar{S}_2 - \bar{S}_{12}^2}, \tag{7}$$

$$\hat{\kappa}_2 = \frac{\bar{S}_1 - \bar{S}_{12}}{\bar{S}_1\bar{S}_2 - \bar{S}_{12}^2}, \tag{8}$$

$$\hat{\kappa}_3 = \frac{-\bar{S}_{12}}{\bar{S}_1\bar{S}_2 - \bar{S}_{12}^2}, \tag{9}$$

where

$$\bar{S}_1 = \frac{1}{n}\sum_i \sin^2\phi_i, \tag{10}$$

$$\bar{S}_2 = \frac{1}{n}\sum_i \sin^2\psi_i, \tag{11}$$

$$\bar{S}_{12} = \frac{1}{n}\sum_i \sin\phi_i \sin\psi_i \tag{12}$$

are empirical estimates, and φ_i, ψ_i are measured from their empirical means. These moment estimates are not as accurate as maximum-likelihood estimates achieved by numerical optimization, but they are considerably faster and proved sufficiently accurate for our purposes.
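Eqs. 7–12 translate directly into code. The following is a sketch (the function name is hypothetical, and centering on the circular mean stands in for "measured from their empirical means"); the inversion of the 2 × 2 covariance matrix of Eq. 6 is written out explicitly.

```python
import numpy as np

def moment_estimate_kappas(phi, psi):
    """Moment estimates of kappa1, kappa2, kappa3 (Eqs. 7-9) for the
    bivariate von Mises cosine model, valid for concentrated data.

    Sketch under the stated large-concentration assumption; angles are
    in radians and are first centered on their empirical circular means."""
    phi = phi - np.angle(np.mean(np.exp(1j * phi)))   # center on circular mean
    psi = psi - np.angle(np.mean(np.exp(1j * psi)))
    s1 = np.mean(np.sin(phi) ** 2)                    # Eq. 10
    s2 = np.mean(np.sin(psi) ** 2)                    # Eq. 11
    s12 = np.mean(np.sin(phi) * np.sin(psi))          # Eq. 12
    det = s1 * s2 - s12 ** 2
    return (s2 - s12) / det, (s1 - s12) / det, -s12 / det   # Eqs. 7-9

# For independent, tightly concentrated angles, kappa3 should be near zero
# and kappa1, kappa2 should approach the inverse variances of sin(phi), sin(psi).
rng = np.random.default_rng(1)
k1, k2, k3 = moment_estimate_kappas(rng.normal(0.0, 0.1, 200000),
                                    rng.normal(0.0, 0.2, 200000))
```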

Sampling from the distribution involves several steps. The overall strategy is to draw ψ directly from the marginal distribution f(ψ), after which φ is drawn from the conditional von Mises distribution f(φ | ψ). The marginal density is given by (24)

$$f(\psi) = c(\kappa_1, \kappa_2, \kappa_3)\, 2\pi\, I_0(\kappa_{13}(\psi)) \exp(\kappa_2\cos(\psi - \nu)), \tag{13}$$

where

$$\kappa_{13}^2 = \kappa_1^2 + \kappa_3^2 - 2\kappa_1\kappa_3\cos(\psi - \nu), \tag{14}$$

and I_0 is the modified Bessel function of order zero. We sample from this distribution using rejection sampling. As the proposal distribution, we use a mixture of two von Mises distributions M(ν + δ, κ) and M(ν − δ, κ), where δ is a shift parameter that is different from zero only if the marginal density is bimodal. The concentration parameter κ is numerically optimized to minimize the maximum distance between the proposal distribution and the density function. The computational cost for this numerical calculation is not an issue, because the parameters of these proposal distributions can be precalculated for the trained model. Eq. 1 can also be used to calculate a value for the normalization constant c(κ1, κ2, κ3) (by using numerical integration). Again, these values can be precalculated for all mixture components and constitute no performance barrier once the model is trained. The angle φ can now be sampled from the conditional density f(φ | ψ), which is von Mises, M(μ + Δ(ψ), κ13(ψ)), with (24)

$$\tan(\Delta) = \frac{-\kappa_3\sin(\psi - \nu)}{\kappa_1 - \kappa_3\cos(\psi - \nu)}. \tag{15}$$
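The conditional sampling step (Eqs. 14 and 15) can be sketched with NumPy's von Mises sampler. This is a hypothetical helper covering only the second step: the rejection-sampling step for the marginal f(ψ) is omitted, and the sign convention for Δ follows the Gaussian-approximation derivation of the cosine model's conditional.

```python
import numpy as np

def sample_phi_given_psi(psi, mu, nu, k1, k3, rng):
    """Draw phi from the conditional f(phi | psi), which is von Mises
    with concentration kappa13(psi) (Eq. 14) and mean mu + Delta (Eq. 15).

    Sketch only; kappa2 enters the marginal of psi, not this conditional."""
    k13 = np.sqrt(k1 ** 2 + k3 ** 2 - 2.0 * k1 * k3 * np.cos(psi - nu))  # Eq. 14
    delta = np.arctan2(-k3 * np.sin(psi - nu),
                       k1 - k3 * np.cos(psi - nu))                       # Eq. 15
    return rng.vonmises(mu + delta, k13)

rng = np.random.default_rng(2)
# With k3 = 0 the conditional collapses to M(mu, k1), independent of psi.
draws = np.array([sample_phi_given_psi(1.0, 0.5, 2.5, 50.0, 0.0, rng)
                  for _ in range(2000)])
```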

The forward–backtrack algorithm. The main issue when sampling from the model is how to draw a (sub)sequence of hidden node values so that consistency is maintained along the chain. For this purpose, we use the forward–backtrack dynamic programming algorithm (25). Let o_i be the set of observable nodes {a_i, d_i, s_i, c_i}, where i refers to the sequence position, and

$$P(o_i \mid h_i) = P(a_i \mid h_i)\,P(d_i \mid h_i)\,P(s_i \mid h_i)\,P(c_i \mid h_i). \tag{16}$$

If some of the emission nodes are unobserved, the corresponding factor P(· | h_i) is set to 1. Let h′ = {h_{s+1}, h_{s+2}, …, h_{s+n}} be a subsequence of h of length n, starting at position s + 1, that we wish to resample, with h_s and h_t = h_{s+n+1} as the h values at the boundaries. Similarly, we define the subsequence of observations as o′ = {o_{s+1}, o_{s+2}, …, o_{s+n}}. The aim is now to draw a sample from the distribution P(h′ | o′, h_s, h_t). The sampling procedure proceeds by sampling h′_i one value at a time, for i = n, …, 1, from the distribution

$$P(h'_i \mid o'_{\leq i}, h_s, h'_{i+1}) = \frac{P(h'_i, o'_{\leq i}, h_s, h'_{i+1})}{P(o'_{\leq i}, h_s, h'_{i+1})} \propto P(h'_{i+1} \mid h'_i)\, P(h'_i, o'_{\leq i}, h_s), \tag{17}$$

where h′_{n+1} = h_t, and the second step is due to the independencies encoded in the graph of the model (Fig. 2). The P(h′_i, o′_{≤i}, h_s) values can be efficiently precalculated by using the well-known forward algorithm (26).
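For a plain discrete HMM, the forward–backtrack scheme of Eq. 17 can be sketched as follows. The names are hypothetical; TorusDBN's actual implementation handles the factorized emissions of Eq. 16, which are collapsed here into a single per-position likelihood table.

```python
import numpy as np

def forward_backtrack(trans, emit_lik, h_s, h_t, rng):
    """Sample hidden states h'_1..h'_n conditioned on fixed boundary states
    h_s and h_t, following the forward-backtrack scheme of Eq. 17.

    Sketch for a plain discrete HMM: trans[i, j] = P(next = j | current = i),
    and emit_lik[i, h] = P(o_i | h), i.e. the product of the observed factors
    of Eq. 16 (factors for unobserved nodes are simply 1)."""
    n, k = emit_lik.shape
    # Forward pass: alpha[i, h] proportional to P(h'_i, o'_{<=i} | h_s).
    alpha = np.empty((n, k))
    alpha[0] = trans[h_s] * emit_lik[0]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans) * emit_lik[i]
        alpha[i] /= alpha[i].sum()   # rescaling; cancels in the sampling step
    # Backtrack: for i = n..1, draw h'_i with weight
    # P(h'_{i+1} | h'_i) * alpha[i, h'_i], where h'_{n+1} = h_t.
    path = np.empty(n, dtype=int)
    nxt = h_t
    for i in range(n - 1, -1, -1):
        w = trans[:, nxt] * alpha[i]
        path[i] = rng.choice(k, p=w / w.sum())
        nxt = path[i]
    return path

rng = np.random.default_rng(3)
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit_lik = np.ones((5, 2))          # no observations: all factors are 1
path = forward_backtrack(trans, emit_lik, h_s=0, h_t=0, rng=rng)
```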

Measure of Angular Deviation. In analogy to the usual root mean-square deviation measure, we define the angular deviation D between two vectors of angles x1 and x2 as

$$D(x_1, x_2) = \sqrt{\frac{1}{n}\sum_i \left[\min\left(|x_{2i} - x_{1i}|,\ 2\pi - |x_{2i} - x_{1i}|\right)\right]^2}, \tag{18}$$

where the angles are measured in radians.
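Eq. 18 in code, as a minimal sketch; the angles are assumed to lie in [−π, π], so that the absolute difference never exceeds 2π.

```python
import numpy as np

def angular_deviation(x1, x2):
    """Angular deviation D of Eq. 18: an RMSD-style measure for two
    vectors of angles in radians (assumed to lie in [-pi, pi]),
    using the shortest wrap-around difference."""
    d = np.abs(np.asarray(x2, dtype=float) - np.asarray(x1, dtype=float))
    d = np.minimum(d, 2.0 * np.pi - d)   # shortest arc between each angle pair
    return float(np.sqrt(np.mean(d ** 2)))

# Wrap-around matters: 3.1 rad and -3.1 rad are ~0.083 rad apart, not ~6.2.
print(round(angular_deviation([3.1], [-3.1]), 3))  # -> 0.083
```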

1. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86.
2. Bishop CM (2006) Pattern Recognition and Machine Learning (Springer, New York).
3. Labesse G, Colloc'h N, Pothier J, Mornon JP (1997) P-SEA: A new efficient assignment of secondary structure from Cα trace of proteins. Comput Appl Biosci 13:291–295.
4. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637.
5. Bremaud P (1999) Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues (Springer, New York).
6. Serrano L (2000) The relationship between sequence and structure in elementary folding units. Adv Protein Chem 53:49–85.
7. Aurora R, Rose GD (1998) Helix capping. Protein Sci 7:21–38.
8. Gunasekaran K, Nagarajaram HA, Ramakrishnan C, Balaram P (1998) Stereochemical punctuation marks in protein structures: Glycine and proline containing helix stop signals. J Mol Biol 275:917–932.
9. Aurora R, Srinivasan R, Rose GD (1994) Rules for alpha-helix termination by glycine. Science 264:1126–1130.
10. Hutchinson EG, Thornton JM (1994) A revised set of potentials for β-turn formation in proteins. Protein Sci 3:2207–2216.
11. Gunasekaran K, Ramakrishnan C, Balaram P (1997) β-Hairpins in proteins revisited: Lessons for de novo design. Protein Eng 10:1131–1141.
12. Dodd IB, Egan JB (1990) Improved detection of helix–turn–helix DNA-binding motifs in protein sequences. Nucleic Acids Res 18:5019–5026.
13. Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence–structure motifs. J Mol Biol 281:565–577.
14. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202.
15. Garnier J, Gibrat JF, Robson B (1996) GOR secondary structure prediction method version IV. Methods Enzymol 266:540–553.
16. Bystroff C, Thorsson V, Baker D (2000) HMMSTR: A hidden Markov model for local sequence–structure correlations in proteins. J Mol Biol 301:173–190.
17. Gilis D (2004) Protein decoy sets for evaluating energy functions. J Biomol Struct Dyn 21:725–736.
18. Das R, et al. (2007) Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins 69(Suppl 8):118–128.
19. Wu S, Skolnick J, Zhang Y (2007) Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 5:17.
20. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464.
21. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588.
22. Cappe O, Moulines E, Ryden T (2005) Inference in Hidden Markov Models (Springer, New York).
23. Chickering DM, Heckerman D (1997) Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learn 29:181–212.
24. Mardia KV, Taylor CC, Subramaniam GK (2007) Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics 63:505–512.
25. Cawley SL, Pachter L (2003) HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19(Suppl 2):ii36–ii41.
26. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis (Cambridge Univ Press, Cambridge, UK).
27. DeLano WL (2002) The PyMOL User's Manual (DeLano Scientific, San Carlos, CA).


[Fig. S1 panels A–H: hidden node paths, secondary structure labels, and amino acid propensities for (A) N-capping box, (B) Schellman C-cap, (C) proline C-cap, (D) β-turn type I, (E) β-turn type II, (F) β-turn type VIII, (G) β-hairpin type I′, and (H) β-hairpin type II′.]

Fig. S1. Hidden node paths corresponding to common structural motifs. H, hidden node sequence; SS, most probable secondary structure labeling (emission probabilities in parentheses); AA, the three most preferred amino acid emissions for each hidden node. For the amino acid emissions, the values in parentheses denote the propensity, calculated as the emission probability P(a | h) divided by the probability of that amino acid according to the stationary distribution P(a). Only values with propensities above 1 are reported. All reported hidden node paths are among the 0.25% most probable 4-state paths in the model.



Fig. S2. Box plots of the average RMSD between 1,000 sampled structures and the fragment structure for all structures in the I-sites database. From left to right, an increasing amount of information was given to the models: no input data, amino acid input data (seq), predicted secondary structure input data (SS), and a combination of both (seq+SS). The rightmost box corresponds to candidate structures produced by the fragment assembler in Rosetta.



Fig. S3. Two representative examples of samples generated by TorusDBN (I-sites motifs 10040 and 15002). Each image contains the native I-sites structure in blue and a cloud of 100 sampled structures. The sampled structure with minimum average distance to all other samples is chosen as representative and highlighted in red. From left to right, an increasing amount of input is given to the model. Note that the leftmost structures are sampled without any input information and are therefore not specific to these proteins. They are included here merely as a null model. Figures were created using PyMOL (27).



Fig. S4. Box plots of the average angular deviation (angular RMSD) between native structures from the test set and 100 sampled structures. The three figures illustrate the performance on regions of helix, strand, and coil, respectively. In each figure, from left to right, an increasing amount of information was given to the models: no input data, amino acid input data (Seq), predicted secondary structure input data (SS), and a combination of both sequence information and predicted secondary structure (SeqSS). The rightmost box corresponds to candidate structures produced by the fragment assembler in Rosetta.


[Panels: Helix (H), Strand (E), Coil (C)]

Fig. S5. Length distributions of secondary structure content in sampled structures (blue) compared with structures in the test data (black).


Table S1. KL divergence (in nats) between the native data and samples

n       KL(R)   KL(T)   KL(R) − KL(T)
100     0.118   0.081   0.037
200     0.208   0.171   0.037
300     0.280   0.245   0.036
400     0.331   0.289   0.041
500     0.362   0.319   0.043
600     0.383   0.341   0.042
700     0.405   0.357   0.048
800     0.413   0.371   0.042
900     0.425   0.385   0.039
1,000   0.434   0.394   0.040

n, size of the grid; KL(R), KL divergence between the native data and the Rosetta samples; KL(T), KL divergence between the native data and the TorusDBN samples; KL(R) − KL(T), Rosetta KL divergence minus the TorusDBN KL divergence.


Table S2. Results of applying TorusDBN and HMMSTR to 28 decoy sets not generated by Rosetta

PDB ID   Method          ZT      ZH     RankT   RankH   N
1beo     lattice_ssfit   13.79   7.22   1       1       2,001
1col     hg_structural   4.55    4.05   1       1       30
1crn     wang            2.94    1.58   1       10      201
1csp     tasser          7.19    5.32   1       1       1,251
1hoe     wang            4.90    2.33   1       1       301
1igd     lmds            5.61    4.15   1       1       501
1lhm     wang            4.70    3.29   1       1       201
1pga     protG10         3.16    1.87   1       3       700
         protG2          5.65    2.94   1       1       700
         protG5          2.17    1.20   1       45      700
         protG8          3.90    1.65   1       1       700
         protG9          2.66    0.60   1       214     700
1pgb     lattice_ssfit   10.14   8.42   1       1       2,001
1pgx     tasser          6.40    4.55   1       1       1,251
1r69     4state_reduced  4.97    3.28   1       1       676
         tasser          3.03    2.72   1       2       2,001
1shf     lmds            8.09    3.84   1       1       438
         tasser          6.01    4.97   1       1       2,001
1ubq     wang            3.58    2.08   1       1       301
1vcc     tasser          7.49    4.87   1       1       1,250
2cro     4state_reduced  4.77    2.60   1       2       675
         lmds            3.90    4.72   1       1       501
2ovo     lmds            6.63    5.07   1       1       348
         wang            4.93    2.45   1       1       301
2rhe     wang            8.37    3.02   1       1       201
4pti     4state_reduced  4.96    2.79   1       4       689
         lmds            4.88    1.72   1       14      344
         wang            3.24    2.07   1       1       301

PDB ID, PDB identifier of the native structure; Method, method used to generate the decoys (17, 19). Decoys with obvious errors such as large chain breaks or decoys whose PDB files contained nontrivial format errors were left out. ZT, Z value of TorusDBN; ZH, Z value of HMMSTR; RankT, rank of native according to TorusDBN; RankH, rank of native according to HMMSTR; N, total number of structures in the set.


Table S3. Results of applying TorusDBN and HMMSTR to 21 decoy sets generated by Rosetta

PDB ID   Method       ZT      ZH      RankT   RankH   N
1a32     rosetta07    −2.72   −1.33   119     97      120
         rosettab     2.67    0.33    2       716     1,611
1aa2     rosetta      2.76    2.64    3       4       1,000
1acf     rosetta      3.03    3.30    2       1       2,000
         rosetta07    −2.26   1.61    119     3       120
1ail     rosetta07    0.89    −0.31   22      67      120
         rosettab     6.17    1.78    1       51      1,808
1cei     rosetta07    −1.39   −0.98   112     105     120
         rosettab     3.72    0.52    1       567     1,898
1gvp     rosetta      2.95    4.48    6       1       998
         rosetta07    −0.53   1.60    82      6       120
1kte     rosetta      4.50    5.09    1       1       999
1lzl     rosetta      3.07    1.77    1       39      1,000
1pgx     rosetta07    0.70    1.41    30      8       120
1tul     rosetta      −0.58   1.84    715     36      1,000
         rosetta07    −1.11   −0.11   95      61      120
1utg     rosetta07    −0.05   0.01    71      63      120
         rosetta_lee  2.09    0.47    1       11      31
1vcc     rosetta07    0.86    0.46    24      35      120
1who     rosetta      3.50    3.73    1       1       1,000
         rosetta07    1.78    1.59    4       4       120

PDB ID, PDB identifier of the native structure; Method, method used to generate the decoys (17, 19). Decoys with obvious errors such as large chain breaks or decoys whose PDB files contained nontrivial format errors were left out. ZT, Z value of TorusDBN; ZH, Z value of HMMSTR; RankT, rank of native according to TorusDBN; RankH, rank of native according to HMMSTR; N, total number of structures in the set.


Chapter 3

Monte Carlo sampling of proteins: local moves constrained by a native-oriented structural prior

Wouter Boomsma, Sandro Bottaro, Thomas Hamelryck, Jesper Ferkinghoff-Borg

Unsubmitted Manuscript


Monte Carlo sampling of proteins: local moves constrained by a native-oriented structural prior

Wouter Boomsma∗, Sandro Bottaro†‡, Thomas Hamelryck∗, and Jesper Ferkinghoff-Borg†

∗Bioinformatics Centre, Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark; †DTU Elektro, Technical University of Denmark, 2800 Lyngby, Denmark; ‡Department of Physics, University of Milano, 20133 Milano, Italy

We propose a new local move method for Markov Chain Monte Carlo simulations of proteins. A general technique for integrating prior distributions into concerted-rotation type local moves is presented, and the power of this approach is demonstrated using a recently published model of local protein structure (TorusDBN). Formal correctness of the method is verified, and we show preliminary results indicating a significantly improved sampling efficiency.

Introduction

Markov Chain Monte Carlo (MCMC) is often the algorithm of choice for protein folding simulation studies. By simulating only the statistical properties of a system, rather than the detailed molecular dynamics, the technique allows for a rapid exploration of conformational space, making it possible to simulate at time scales characterizing the folding process of proteins.

The construction of an MCMC algorithm involves designing a good set of moves. In contrast to the moves used for molecular dynamics simulations, the design of MCMC moves is not constrained by the physical dynamics of the system, only by the requirement that detailed balance (microscopic reversibility) is fulfilled. One of the simplest MCMC moves is the pivot move, where a single dihedral angle in the protein chain is modified at a time. While such point-wise modifications of angular degrees of freedom of the backbone generally lead to an efficient exploration of conformational space, the global changes caused by such moves are too dramatic for an efficient sampling around the more densely packed native state. Many MCMC methods therefore supplement the pivot-type moves with a local move. These moves are designed to work within a small segment of the chain, keeping the positions of all atoms outside the segment fixed.

A variety of local move methods have been proposed in the literature. Go and Scheraga gave a theoretical solution to the geometrical problem already in 1970 [1]. They formulated the constraints necessary to fix the positions of the atoms outside the region of the local move, and demonstrated that this gave rise to six equations, which could be reduced to a single equation in one variable. Dodd and coworkers turned these ideas into an MCMC move called the concerted rotation local move, by working out the necessary requirements for detailed balance [2]. Several variants of this original approach have been proposed [3, 4, 5, 6]. Recently, the geometric problem was recast to a different form, where the solutions can be efficiently determined as roots of a polynomial equation [7, 8]. Finally, it was demonstrated by Ulmschneider and Jorgensen that the efficiency of the concerted rotation type of move could be enhanced by including bond-angle degrees of freedom [9].

Along a different branch, the so-called configurational bias local move methods were built on the idea of completely regrowing a segment of a chain one atom at a time. While originally only an end-move [10, 11], this idea was turned into a local move by Escobedo and de Pablo [12] and later extended in other studies incorporating various look-ahead strategies to decrease the number of rejected growth attempts [13, 14]. Other recent studies use combinations of the two strategies. Uhlherr, for instance, uses a regrowth procedure for the positioning of the initial atoms in the segment, and then applies a concerted rotation type algorithm to determine the final, more constrained, atom positions [15].

In general, the efficiency of an MCMC simulation can be improved by choosing a proposal distribution that is closely related to the limiting distribution of the Markov chain. For protein simulations, it would be ideal to sample directly from the Boltzmann distribution corresponding to a given energy function. Although this cannot generally be done, it is possible to design moves that encapsulate part of the global energy function. The frequent use of fragment assembly in the field of ab initio protein structure prediction is an example of an attempt to bias the search towards candidates with good local structure. Due to the non-probabilistic and discrete nature of fragment-assembly methods, it is, however, difficult to create proper MCMC moves from such methods. Recently, we proposed a probabilistic model of local structure which represents a solution to this problem [16]. The model captures the angular preferences of the native state of proteins, and is therefore a natural component of an energy function for ab initio structure prediction. Using the model to resample angles for a segment of a protein corresponds to a pivot-like move guided by a distribution that is also present as a term in the global energy function, resulting in a high probability of acceptance.

In the current study, we investigate how to incorporate a structural prior like the TorusDBN into a local move method. The combination of a local move and a pivot move guided by the same structural prior distribution would provide a rigorous alternative to the use of fragment assembly in ab initio structure prediction.

The Setup

We base our method on the ideas of the concerted rotation type local move, which provides the most flexible framework for the incorporation of a prior distribution. For the choice of protein representation, we follow the lead of Ulmschneider and Jorgensen [9]. In the description of their CRA method, they demonstrate clear benefits when bond angles are included as degrees of freedom in addition to the dihedral angles.

Like other concerted rotation inspired methods, we divide the local move problem into a prerotation and a postrotation step (Figure 1). During prerotation, new angles are proposed for a small segment of the chain, introducing a break of the chain at the end of the segment. The role of the postrotation step is then to find the necessary compensating changes in the remaining degrees of freedom of the chain in order to return to a closed state.

Since our representation includes bond angles as degrees of freedom, the postrotation step only involves the break-point residue and its immediate neighbor (see Figure 1). To maintain a reasonable length of the move, the number of prerotation angles must be relatively high. This makes it necessary to introduce a strategy which limits the gap size introduced during prerotation, since large gaps will cause the postrotation to fail. One possible strategy to handle this problem was introduced by Favrin et al. [17]. By expressing the displacement of the last atom in the move region (end point) as a first order polynomial in the change of angular variables, a multivariate Gaussian distribution was constructed which was biased toward small deviations at the end point. This distribution was combined with a multivariate, uncorrelated Gaussian distribution, using a weight parameter that determined the relative strength of the bias. For small values of the parameter, the proposed angular changes corresponded to independent Gaussians around the current angular values. For large values of the parameter, the samples were heavily constrained to values that cause small displacement of the end point. While the strategy was designed as a semi-local move, Ulmschneider and Jorgensen used the approach for their prerotation step, ensuring a high success rate for their postrotation.

In probabilistic terms, the Gaussian used for regularization by Favrin et al. can be viewed as a prior distribution, where each degree of freedom is independently Gaussian distributed, centered around the current conformation. For our purpose, we propose a different prior distribution for the prerotation step. Rather than being centered around the current conformation, our distribution is an informative prior modeling our knowledge about the local structure of proteins. In the next section, we will present this structural prior in the context of the current application, and demonstrate how priors in general can be incorporated into concerted-rotation type moves.

For the postrotation, we use a similar scheme as previous concerted rotation methods. However, we demonstrate that when bond angles are included as degrees of freedom, this problem has a simple analytical solution. This significantly improves the efficiency of the method and makes it considerably easier to implement.

Throughout the paper, we will refer to the new loop closure method as the CRISP method (Concerted Rotation Including Structural Priors).

The algorithm

The TorusDBN structural prior

Recently, we developed a probabilistic model of local protein structure. Simply stated, the model (TorusDBN) makes it possible to sample structures (or segments of a structure), given an amino acid and/or secondary structure sequence as input. The sampled structures will not, generally, be globular proteins, but on a local scale their structure will be realistic, with motifs such as β-strands and α-helices. The model can be understood as a probabilistic version of the fragment assembly methods that have become increasingly important in the field of ab initio structure prediction in the last decade [18, 19].

Figure 1: Local move schematic for a 5 residue move, demonstrating the angular degrees of freedom affected by a local move. Atoms in grey maintain their position during the move. The dotted circles denote angular degrees of freedom (prerotation: white, postrotation: grey).

TorusDBN models the sequential signal in proteins through a Markov chain of hidden states. Each of these states defines a distribution over (φi, ψi) dihedral angle pairs, amino acid labels and secondary structure labels. Sampling from the model, given for instance an amino acid input sequence, involves two steps: (i) sampling a hidden state for each position, and (ii) sampling angle values for each position given the current hidden state at that position. The first step can be implemented as a separate MCMC move, and we can therefore assume the hidden state sequence to be fixed during the execution of a single local move. In this case, the density function over a sequence of dihedral pairs θ is simply the product of the probabilities of the individual (φi, ψi) = (θi1, θi2) pairs.

PT(θ) = ∏i fT(θi1, θi2 | κi1, κi2, κi3, µi1, µi2)

where κi1, κi2, κi3 are the concentration parameters of the angular distribution associated with the hidden node value at position i, (µi1, µi2) are the corresponding mean values, and fT(θ1, θ2) is the density function of a bivariate angular distribution on the torus [16]

fT(θ1, θ2 | κ1, κ2, κ3, µ1, µ2) = c(κ1, κ2, κ3) exp(κ1 cos(θ1 − µ1) + κ2 cos(θ2 − µ2) − κ3 cos((θ1 − µ1) − (θ2 − µ2))).   (1)
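As a concrete illustration, the bivariate angular density of equation (1) can be evaluated numerically. Since the normalization constant c(κ1, κ2, κ3) has no simple closed form, the sketch below (not part of the original work; all names are ours) normalizes on a discrete torus grid:

```python
import numpy as np

def cosine_vm_unnorm(phi, psi, k1, k2, k3, mu1, mu2):
    """Unnormalized bivariate cosine-model density of eq. (1)."""
    return np.exp(k1 * np.cos(phi - mu1) + k2 * np.cos(psi - mu2)
                  - k3 * np.cos((phi - mu1) - (psi - mu2)))

def cosine_vm_density(phi, psi, k1, k2, k3, mu1, mu2, n_grid=200):
    """Density of eq. (1), normalized numerically on an n_grid x n_grid torus grid."""
    grid = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    pg, sg = np.meshgrid(grid, grid, indexing="ij")
    cell = (2 * np.pi / n_grid) ** 2  # area of one grid cell
    c = 1.0 / (cosine_vm_unnorm(pg, sg, k1, k2, k3, mu1, mu2).sum() * cell)
    return c * cosine_vm_unnorm(phi, psi, k1, k2, k3, mu1, mu2)
```

The grid approximation of c is accurate for moderate concentration parameters; for very peaked densities a finer grid is needed.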

The TorusDBN only models the dihedral angles of a protein backbone, disregarding the bond angles. In the context of the current application, we also consider bond angles as degrees of freedom with an associated probability distribution. Based on the ideal values for the mean and standard deviation determined experimentally by Engh and Huber [20], we construct atom type dependent Gaussian distributions for this purpose. For ease of reference, when referring to the TorusDBN in the present study, we will implicitly include these Gaussian distributions. Specifically, let χ = {θ, b} denote the collection of all angular degrees of freedom, where b are all backbone bond angles (i.e. b1 = ∠(N1 Cα1 C1), b2 = ∠(Cα1 C1 N2), . . .). The extended TorusDBN density is

PT(χ) = PT(θ) ∏j (1 / (σj √(2π))) exp(−(bj − b0j)² / (2σj²))   (2)

where the index j ranges over all bond angles, b0j denotes the ideal mean value and σj the standard deviation for the particular bond angle at that position.

At high concentrations, we have cos(x) ≈ 1 − x²/2, and equation (1) can be approximated by a bivariate normal distribution with inverse covariance matrix

ΣT−1 ≈ | κ1 − κ3   κ3      |
       | κ3        κ2 − κ3 |

Consequently, equation (2) can be approximated by a multivariate Gaussian

PG(χ) = (√det(Σ−1) / (2π)^(n/2)) exp(−(1/2)(χi − µi) Σ−1ij (χj − µj))

where repeated indices denote an implicit summation (i.e. Einstein notation). Depending on the index i, µi refers to the ideal value of either a bond or a dihedral angle. Similarly, Σ−1 is the inverse covariance matrix

Σ−1ij =
  κi1 − κi3   if i = j and i corresponds to a φ-angle
  κi2 − κi3   if i = j and i corresponds to a ψ-angle
  κi3         if i ≠ j and i, j are dihedral angles in the same residue
  1/σi²       if i = j and i is a bond angle
  0           otherwise

It will be convenient to rewrite the distribution into an expression involving deviations δχi from the previous value. Let χ0 be the initial values of the degrees of freedom. We define δχi = χi − χ0i. Then

PG(δχ) = (√det(Σ−1) / (2π)^(n/2)) exp(−(1/2)(δχi − µ0i) Σ−1ij (δχj − µ0j))   (3)

where µ0i = µi − χ0i equals the average displacement ⟨δχi⟩ from the reference configuration χ0i in an unbiased sampling.
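The piecewise definition of Σ−1 above can be assembled mechanically. The sketch below (illustrative only; the ordering of degrees of freedom and all names are our own assumptions, not the original implementation) builds the precision matrix for a toy segment with all dihedral pairs first and all bond angles last:

```python
import numpy as np

def build_precision(kappas, sigmas):
    """Assemble the Gaussian-approximation precision matrix Sigma^-1.

    kappas: list of (k1, k2, k3) per residue, one triple per (phi, psi) pair
    sigmas: list of bond-angle standard deviations
    Ordering: (phi_1, psi_1, phi_2, psi_2, ..., then all bond angles).
    """
    n_dih = 2 * len(kappas)
    n = n_dih + len(sigmas)
    P = np.zeros((n, n))
    for i, (k1, k2, k3) in enumerate(kappas):
        p, q = 2 * i, 2 * i + 1     # indices of phi_i, psi_i
        P[p, p] = k1 - k3           # diagonal entry for phi
        P[q, q] = k2 - k3           # diagonal entry for psi
        P[p, q] = P[q, p] = k3      # dihedral coupling within the residue
    for j, s in enumerate(sigmas):
        P[n_dih + j, n_dih + j] = 1.0 / s**2  # independent bond angles
    return P
```

Note that the matrix is block diagonal: one 2 × 2 block per residue plus a diagonal bond-angle block, which keeps the later Cholesky factorization cheap.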

Prerotation

We will consider how the TorusDBN prior distribution is most naturally altered to incorporate locality constraints on the anchor atoms at the end of the prerotation segment. The constraints are necessary in order to obtain a reasonable success rate in the postrotation step. This problem can be conveniently solved using Jaynes' maximum entropy principle, which is a constructive principle for incorporating additional information into a probability distribution in a minimally biased way [21]. Formulated in this framework, the task is to find the distribution P̃T that minimizes the divergence to our prior distribution PT, under locality constraints of the last three atoms.

The Kullback-Leibler divergence is given by:

DKL(P̃T | PT) = ∫ P̃T(δχ) log(P̃T(δχ) / PT(δχ)) dδχ

We formulate the locality constraints in terms of the expected value of the squared distance of the atom positions at the end of the prerotation segment. Although the positions of the C and the N atom in Figure 2 are not updated by the prerotation, we include as constraints their new positions if they had been updated, thereby enforcing a constraint on the orientation of the Cα atom (Figure 2).

We incorporate the constraints using the Lagrange formalism, leading to a modified divergence expression that we wish to minimize

D̃KL(P̃T | PT) = DKL(P̃T | PT) + ∑(m=1..3) λm ⟨dm²⟩


Figure 2: Constraints on the last atoms of the prerotation segment. The positions of the C and N atoms in grey are not updated on a prerotation move, but a constraint on their position is an effective constraint on the orientation of the preceding Cα atom.

with m corresponding to the three anchor positions in Figure 2. Taking the functional derivative ∂D̃KL/∂P̃T(δχ) and setting it to zero, we obtain

P̃T(δχ) = (1/Z) PT(δχ) exp(−∑m λm dm²)   (4)

where Z is the normalization constant. To first order, the displacement vector d⃗m can be expressed as

d⃗m = (∂a⃗m/∂χi)|χ0 δχi

where a⃗m is the position of the mth anchor atom (Figure 2). Correspondingly, the square displacement is given by

dm² = (∂a⃗m/∂δχi) δχi · (∂a⃗m/∂δχj) δχj = δχi Imij δχj   (5)

where Imij = (∂a⃗m/∂δχi) · (∂a⃗m/∂δχj) by construction is symmetric and positive semi-definite. Equation (5) allows us to express the distribution P̃T as a new multivariate Gaussian distribution. Defining

Σ̄ij = Σ−1ij + 2λm Imij   (6)

and

µ̄i = Σ̄−1ij Σ−1jk µ0k   (7)

equation (4) takes the form

P̃T ≈ P̃G = (√det(Σ̄) / (2π)^(n/2)) exp(−(1/2)(δχi − µ̄i) Σ̄ij (δχj − µ̄j))   (8)

We treat the three anchoring positions indiscriminately, and set λ1 = λ2 = λ3 = λdd. By tuning the λdd parameter, we can adjust the weight of the locality constraint.

With P̃T approximated by a multivariate Gaussian, samples for the prerotation can be obtained in the usual fashion, using the Cholesky decomposition of Σ̄.
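Equations (6)-(8) translate directly into a small sampling routine: add the quadratic locality penalties to the prior precision, solve for the tilted mean, and draw from the resulting Gaussian via a Cholesky factor. The sketch below is a hypothetical stand-in for the actual implementation; the anchor Jacobians (rows ∂a⃗m/∂δχ) are assumed to be given:

```python
import numpy as np

def sample_prerotation(prec, mu0, jacobians, lam, rng):
    """Draw delta-chi from the constrained Gaussian of eqs. (6)-(8).

    prec:      prior precision matrix Sigma^-1 (n x n)
    mu0:       prior mean displacement (n,)
    jacobians: list of (3 x n) anchor Jacobians d a_m / d chi
    lam:       locality weight lambda_dd
    """
    # Eq. (6): add the locality penalties; I_m = J_m^T J_m per eq. (5)
    prec_bar = prec.copy()
    for J in jacobians:
        prec_bar += 2.0 * lam * J.T @ J
    # Eq. (7): mean of the tilted Gaussian
    mu_bar = np.linalg.solve(prec_bar, prec @ mu0)
    # Eq. (8): with prec_bar = L L^T, mu_bar + L^-T z has covariance prec_bar^-1
    L = np.linalg.cholesky(prec_bar)
    z = rng.standard_normal(len(mu0))
    return mu_bar + np.linalg.solve(L.T, z)
```

Increasing `lam` shrinks the sampled displacements of the anchors, at the cost of a more conservative move; the λdd value is a tuning parameter, as discussed above.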

Incidentally, we note that the derivation leading to equation (8) holds for any multivariate Gaussian prior. In particular, if one chooses a Gaussian prior with a diagonal covariance matrix and centered around the current structure, we have µ0k = 0, and we obtain the expression for the prerotation distribution that is used in the CRA method [9] and the semi-local move by Favrin et al. [17]. This verifies that the regularization procedure employed in these two methods can indeed be considered simply as a particular choice of prior distribution.

Postrotation

In principle, the postrotation problem of our method is identical to that of the CRA method, of which the origin traces back to the study by Go and Scheraga [1]. In the CRA method, this problem was solved numerically. However, we demonstrate here that in the case of a representation including bond angles, a simple analytic solution is available.

Figure 3 illustrates the degrees of freedom involved in the postrotation. The leftmost N and Cα atoms are the last positions that are updated by the prerotation, and will remain fixed during postrotation. By construction, the positions of the rightmost N and Cα atoms (and the remainder of the chain) should be unaffected by the local move. Only the position of the C atom will be updated during postrotation, resulting in new values for the bond angles α1, α2, α3, and dihedrals ω1, ω2 and ω4.

Figure 3: The postrotation step. This step proceeds by calculating the position of the C atom, from which all angular degrees of freedom can be determined.

Let i represent the atom number along the backbone, and r⃗i be the corresponding position vectors relative to the origin of the global coordinate system. By assumption, the lengths pi of all bond vectors p⃗i = r⃗i − r⃗i−1 are known. Since r⃗1 and r⃗3 are known, the vector q⃗ = r⃗3 − r⃗1 is also known. The dihedral around bond vector p⃗3 is assumed to be either 0 or π (corresponding to a cis or trans state), which implies a coplanarity of the bond vectors p⃗2, p⃗3 and p⃗4.

We define an orthonormal basis e⃗1, e⃗2, e⃗3 as

e⃗1 = q⃗ / q
e⃗2 = (p⃗4 − (p⃗4 · e⃗1) e⃗1) / |p⃗4 − (p⃗4 · e⃗1) e⃗1|
e⃗3 = e⃗1 × e⃗2

The position r⃗2 of the C atom can now be written as

r⃗2 = r⃗1 + p2 cos(β) e⃗1 ± p2 √(1 − cos²(β)) e⃗2   (9)


where β is the angle between p⃗2 and q⃗. In other words, to place the C atom, it is sufficient to determine a value for cos(β). This can, however, easily be done using the law of cosines on the triangle formed by p⃗2, p⃗3 and q⃗:

cos(β) = (q² + p2² − p3²) / (2 q p2)   (10)

Equation (10) provides us with two possible positions. As originally demonstrated by Dodd and coworkers, multiple solutions can be dealt with by calculation of all possible reverse moves and a weighting scheme based on their Jacobian factors [2]. For efficiency and simplicity, the CRA method avoided this problem by enforcing tolerances on the changes in postrotational degrees of freedom, which leads to a single unique postrotation solution. We propose the even simpler approach of choosing the solution that has the same sign in equation (9) as the original structure. This strategy is certainly consistent, it avoids arbitrary choices of tolerance levels, and is slightly more efficient. An empirical study demonstrated that this choice is equivalent to the CRA method in 97% of the cases (results not shown).
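The analytic postrotation of equations (9)-(10) amounts to a few vector operations. A minimal sketch with hypothetical names, choosing the candidate nearest the previous C position as a proxy for the same-sign rule proposed above:

```python
import numpy as np

def place_c_atom(r1, r3, p4_vec, p2_len, p3_len, old_r2):
    """Analytic postrotation: new position of the C atom, eqs. (9)-(10).

    r1, r3:  fixed positions flanking the C atom
    p4_vec:  bond vector following r3 (defines the e2 direction)
    old_r2:  previous C position, used to pick the branch of eq. (9)
    """
    q_vec = r3 - r1
    q = np.linalg.norm(q_vec)
    e1 = q_vec / q
    # Gram-Schmidt: orthonormalize p4 against e1 to obtain e2
    p4_perp = p4_vec - (p4_vec @ e1) * e1
    e2 = p4_perp / np.linalg.norm(p4_perp)
    # Law of cosines on the triangle (p2, p3, q), eq. (10)
    cos_b = (q**2 + p2_len**2 - p3_len**2) / (2 * q * p2_len)
    sin_b = np.sqrt(max(0.0, 1.0 - cos_b**2))
    # Two candidate positions, eq. (9); keep the branch nearest the old C atom
    cands = [r1 + p2_len * (cos_b * e1 + s * sin_b * e2) for s in (1.0, -1.0)]
    return min(cands, key=lambda r: np.linalg.norm(r - old_r2))
```

When |cos β| > 1 the triangle inequality is violated, i.e. the gap left by the prerotation cannot be closed; a full implementation would reject the move in that case.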

Monte Carlo scheme

When using the CRISP move in MCMC simulations, care must be taken to ensure that the property of detailed balance (microscopic reversibility) is fulfilled. Generally, for a move from conformation χ to χ′, the requirement takes the form

π(χ) S(χ → χ′) dχ = π(χ′) S(χ′ → χ) dχ′   (11)

where π(χ) is the probability of being in state χ according to the equilibrium distribution, and S(χ → χ′) is the probability of proposing state χ′ when currently in state χ. Let us consider a change involving a set of prerotational angles χ0 and postrotational angles χ1 = {ω1, α1, ω2, α2, α3, ω4}. Letting u⃗4 denote the unit vector p⃗4/|p⃗4|, χ1 is a unique function of the constraints C = {r⃗3, u⃗4, ω4} and r⃗1 = r⃗1(χ0), i.e. χ1 = χ1(χ0, C). We may therefore express the volume element dχ1 in terms of dC,

dχ1 = dω1 dα1 dω2 dα2 dα3 dω4 = |1/det(A(χ))| dr⃗3 du⃗4 dω4   (12)

where A is the 6 × 6 matrix evaluated at χ = {χ0, χ1}

A = | dr⃗3/dω1    dr⃗3/dα1    dr⃗3/dω2    dr⃗3/dα2    dr⃗3/dα3    dr⃗3/dω4   |
    | du4,x/dω1  du4,x/dα1  du4,x/dω2  du4,x/dα2  du4,x/dα3  du4,x/dω4 |
    | du4,y/dω1  du4,y/dα1  du4,y/dω2  du4,y/dα2  du4,y/dα3  du4,y/dω4 |
    | 0          0          0          0          0          1         |

(the entries of the first row are 3-vectors of derivatives, accounting for three of the six rows)

Note that since u⃗4 is a unit vector, only two of its components are included as degrees of freedom. Here, we arbitrarily chose the x and y components. In practice, a check should be implemented to determine whether the two rows are linearly dependent, in that case choosing the z component instead.

The derivatives of matrix A can be quite easily calculated from the rotation vector of each of the angular degrees of freedom. For a given angular degree of freedom χ, we define the rotation vector Φ⃗ as

Φ⃗(χ) =
  w⃗/w                    if χ is a dihedral along vector w⃗
  (w⃗1 × w⃗2)/|w⃗1 × w⃗2|   if χ is a bond angle determined by two vectors w⃗1 and w⃗2

The calculation of det(A) is slightly simplified by the fact that both r⃗3 and u⃗4 are independent of ω4, and that the last column and row of the determinant can thus be eliminated. The resulting 5 × 5 determinant can be calculated as

det(A) = det | r⃗3 × Φ(ω1)     r⃗3 × Φ(α1)     r⃗3 × Φ(ω2)     r⃗3 × Φ(α2)     r⃗3 × Φ(α3)    |
             | (u⃗4 × Φ(ω1))x  (u⃗4 × Φ(α1))x  (u⃗4 × Φ(ω2))x  (u⃗4 × Φ(α2))x  (u⃗4 × Φ(α3))x |
             | (u⃗4 × Φ(ω1))y  (u⃗4 × Φ(α1))y  (u⃗4 × Φ(ω2))y  (u⃗4 × Φ(α2))y  (u⃗4 × Φ(α3))y |

Inserting Eq. (12) into Eq. (11), the detailed balance requirement can be written as

π(χ) S(χ → χ′) / det(A(χ)) dχ0 dC = π(χ′) S(χ′ → χ) / det(A(χ′)) dχ′0 dC.

Since all prerotational variables are free, dχ0 = dχ′0, and the requirement for the transition probabilities S becomes

π(χ) S(χ → χ′) / det(A(χ)) = π(χ′) S(χ′ → χ) / det(A(χ′))

which corresponds to the Metropolis-Hastings acceptance probability

Pa = min(1, π(χ′) J(χ′) S(χ′ → χ) / (π(χ) J(χ) S(χ → χ′)))   (13)

where J(χ) is the Jacobian factor 1/det(A(χ)).
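In code, equation (13) is an ordinary Metropolis-Hastings test with an extra Jacobian ratio; working in log space avoids numerical overflow. A hedged sketch (names are ours; the densities and determinants are assumed precomputed):

```python
import numpy as np

def accept_crisp(log_pi_old, log_pi_new, log_S_fwd, log_S_rev,
                 det_A_old, det_A_new, rng):
    """Metropolis-Hastings acceptance test of eq. (13), with J = 1/det(A).

    log_pi_*:            log target density at the old/new conformation
    log_S_fwd/log_S_rev: log proposal densities for chi -> chi' and chi' -> chi
    det_A_*:             determinants of the 6x6 postrotation matrix A
    """
    # log of [pi(chi') J(chi') S(chi'->chi)] / [pi(chi) J(chi) S(chi->chi')]
    # with J = 1/det(A), the Jacobian ratio is det(A_old)/det(A_new)
    log_ratio = ((log_pi_new - log_pi_old)
                 + (np.log(abs(det_A_old)) - np.log(abs(det_A_new)))
                 + (log_S_rev - log_S_fwd))
    return np.log(rng.uniform()) < min(0.0, log_ratio)
```

Because the prerotation proposal already tracks a term of the target, the log ratio is typically close to zero, which is what yields the high acceptance rates discussed above.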

Outline of the algorithm. With all ingredients in place, we sum up the steps involved in one complete CRISP move for a protein of length l, with fixed move size m.

(a) Select a start index s in [−(m − 1), l − 1], and set the end index t to t = s + m

(b) Prerotation.


• If the chosen range lies fully within the protein: Execute a prerotation move for [sCα, (t − 2)Cα[

• If the chosen range lies partially outside the chain: Make an end move consisting of a direct resampling of angles from the TorusDBN in the range [s, t[. Accept with probability min(1, π(χ′)PT(χ) / (π(χ)PT(χ′))). If π(χ) contains the TorusDBN potential, the acceptance rate can be simplified by letting the identical terms cancel.

(c) Postrotation.

Find the new position for (t − 2)C, and update the angular degrees of freedom at (t − 2) and (t − 1) as described in the previous section.

(d) Calculate the proposal bias S(χ′ → χ)/S(χ → χ′) and the Jacobian ratio J(χ′)/J(χ).

(e) Accept move with the probability given by Eq. (13).
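Step (a) of the outline, including the classification into internal moves and end moves, can be sketched as follows (a toy illustration with names of our own; index conventions follow the text, with the range [s, t) allowed to extend past either terminus):

```python
import numpy as np

def select_move_region(chain_length, move_size, rng):
    """Step (a) of the CRISP move: pick the residue range [s, t).

    Start indices may place the range partially outside the chain, in which
    case the move is handled as an end move (direct TorusDBN resampling).
    """
    # s uniform in [-(m - 1), l - 1]; the upper bound of integers() is exclusive
    s = int(rng.integers(-(move_size - 1), chain_length))
    t = s + move_size
    internal = (s >= 0) and (t <= chain_length)
    return s, t, internal
```

Allowing s to be negative (and t to exceed l) gives every residue, including the termini, a chance of being resampled, which is needed for the chain ends to equilibrate.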

One item in the procedure is worth highlighting. Note that we are forced to compensate with the proposal distribution S in the acceptance probability. In the ideal case, if our proposal distribution had been exactly identical to a factor in the limiting distribution π, we would not need to evaluate either of these terms, as they would cancel. Unfortunately, this is not the case. First of all, we recall that the prior PG(χ) in equation (3) is only an approximation to the actual distribution of the TorusDBN. Second, this distribution is modified to incorporate a locality constraint. Third, the postrotation step is deterministic, and we have no control over the changes to these last angular degrees of freedom of each move. All these changes are, however, small perturbations to the original distribution, and the deviation between S(χ → χ′) and the corresponding factor in π(χ′) is expected to be quite small.

Results

Detailed balance

As a sanity check of the implementation of an MCMC move, the detailed balance property should be verified. This is often evaluated by checking that, if no global energy function is provided, the angular degrees of freedom are distributed uniformly. In the case of CRISP, it is more natural to check the correctness of the method directly in the context of the prior distribution of the TorusDBN. We therefore set π(χ) = PT(χ). If the method works as intended, asymptotically, the angular degrees of freedom should be distributed exactly as if we had sampled directly from PT(χ).

Naturally, given their local nature, the CRISP moves will converge quite slowly to the TorusDBN distribution. As a test case, we therefore only consider a short segment of a protein. We chose a segment of protein G (2gb1, positions 31–46) that contains a helix-coil-strand motif, to demonstrate the differences in angular distribution for the various secondary structure regions.

Figure 4 illustrates the position-specific dihedral angle distributions for 2 × 10⁹ conformations generated with both CRISP (grey) and TorusDBN (white) moves. For each angular degree of freedom, a histogram was created with 16 bins. We see that the results are indeed similar, verifying that we have detailed balance in our sampling.
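A check of this kind can be sketched as follows. This is an illustrative example (assuming NumPy); `histogram_overlap` and the von Mises test data are hypothetical stand-ins for the actual TorusDBN samples:

```python
import numpy as np

def histogram_overlap(angles_a, angles_b, n_bins=16):
    """Compare two samples of dihedral angles (radians, in [-pi, pi))
    by binning them into the same angular histogram, in the spirit of
    the detailed-balance check of Figure 4. Returns the total
    variation distance between the binned distributions: 0 for
    identical histograms, 1 for disjoint ones."""
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    h_a, _ = np.histogram(angles_a, bins=bins, density=True)
    h_b, _ = np.histogram(angles_b, bins=bins, density=True)
    bin_width = 2.0 * np.pi / n_bins
    return 0.5 * np.sum(np.abs(h_a - h_b)) * bin_width

# Two large samples from the same angular distribution should agree
# closely, up to binning noise.
rng = np.random.default_rng(0)
a = rng.vonmises(mu=-1.0, kappa=4.0, size=200_000)
b = rng.vonmises(mu=-1.0, kappa=4.0, size=200_000)
print(histogram_overlap(a, b) < 0.05)  # → True
```

In practice the two argument arrays would hold the angles recorded at one position of the chain under the CRISP moves and under direct TorusDBN sampling, respectively.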

Efficiency

We proceed with a direct comparison to the CRA method. As we argued previously, the CRA method falls out as a special case of the prerotation derivation presented above when using a simple prior around the current conformation. A direct comparison between the two methods will therefore reveal the effect of the TorusDBN prior on the efficiency of the local move method.

A good measure of the efficiency of an MCMC move is the relaxation time of the system, which is equivalent to the correlation length of the generated conformations, and thereby the effective move size. The relaxation time can be evaluated using a block averaging procedure, calculating the statistical inefficiency s:

s = lim_{n_b → ∞} n_b σ²(⟨A⟩_b) / σ²(A)   (14)

where A is some observable property of our system, and ⟨A⟩_b denotes the average of this observable over a block b. The statistical inefficiency can be understood as the compensating factor that describes how much less efficient samples from an MCMC scheme are compared to genuinely independent samples. In general, if we had M_i independent stochastic variables A_i with equal variance σ²(A), the variance of their mean could be calculated as

σ²(⟨A⟩) = σ²(A) / M_i   (15)

However, this assumes that the samples are independent. In the case of MCMC simulations, samples can generally only be considered independent if separated by a certain interval. The statistical inefficiency is defined as the factor by which the number of samples M should be reduced in order for equation (15) to hold:

σ²(⟨A⟩) = σ²(A) / (M/s)   (16)

By splitting the samples up into a number of blocks, it is possible to evaluate σ²(⟨A⟩) directly, and thereby obtain an estimate for s. The idea is to divide the total number of samples M from the system into b blocks of size n_b, and obtain estimates for the property A of interest for each of these blocks. If n_b is sufficiently large, the individual block estimates ⟨A⟩_b can be assumed to be independent, and the variance of the mean of these estimates is



Figure 4: Angular histograms for a 16 amino acid long segment of the 2gb1 protein showing a helix-coil-strand motif (4 helix, 6 coil, 4 strand), recorded over 2 × 10⁹ moves using 16 bins. The white bars correspond to moves directly from the TorusDBN distribution, where random segments of angles are resampled and all positions in the chain are updated. The grey bars represent CRISP moves using TorusDBN as a prior.

σ²(⟨A⟩_b) = (1/b) Σ_{i=1}^{b} (⟨A⟩_i − ⟨A⟩)²

Inserting into equation (14), we can evaluate s. Note that convergence of s should be ensured by choosing n_b sufficiently large.
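The block averaging procedure can be sketched as follows. This is an illustrative example only, using an AR(1) trace as a stand-in for a correlated MCMC observable:

```python
import numpy as np

def statistical_inefficiency(samples, block_size):
    """Estimate s = n_b * var(<A>_b) / var(A), Eq. (14), by block
    averaging: chop the trace into blocks of length block_size and
    compare the variance of the block means with the variance of the
    raw samples."""
    n_blocks = len(samples) // block_size
    trimmed = np.asarray(samples[: n_blocks * block_size])
    block_means = trimmed.reshape(n_blocks, block_size).mean(axis=1)
    return block_size * block_means.var() / trimmed.var()

# A correlated AR(1) trace mimics an MCMC observable; its exact
# statistical inefficiency is (1 + rho) / (1 - rho) = 19 for rho = 0.9.
rng = np.random.default_rng(1)
rho, n = 0.9, 200_000
noise = rng.normal(size=n)
trace = np.empty(n)
trace[0] = noise[0]
for t in range(1, n):
    trace[t] = rho * trace[t - 1] + noise[t]

print(statistical_inefficiency(trace, block_size=5_000))
```

The estimate converges toward the true value as the block size grows, at the cost of fewer blocks and hence a noisier variance estimate.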

We estimate the statistical inefficiency using a simulation on the same test chain as previously, a helix-coil-strand motif in protein G. As an observable, we use the cosine of a dihedral angle in the middle of the segment. We found that the statistical inefficiency has generally converged after at most 100 × 10⁶ iterations.

The efficiency of the local move methods depends on the parameters of the method and the energy function used during simulation. We use only the TorusDBN energy for this comparison. The CRA method has three tunable parameters C1, C2, C3. After verifying that the choice of C2 and C3 has little effect on efficiency (results not shown), we settled for the same values as in the original paper (C2 = 8, C3 = 20). The optimal efficiency for the CRA method is then found by plotting the converged statistical inefficiency as a function of C1. A similar test was done for CRISP. Here, we only have a single tunable parameter: the value of the Lagrange multiplier λ_dd, which determines the strength of the locality constraint. Figure 5 illustrates how the statistical inefficiency for the two methods depends on the choice of the C1 and λ_dd parameters, respectively.

We see that CRISP clearly has a higher optimal efficiency than CRA with this energy function. As expected, the incorporation of part of the stationary distribution as a proposal distribution improves the efficiency of the sampling.

Figure 5: Statistical inefficiency as a function of the tunable parameters λ_dd and C1 in the two loop closure methods.

We also note that the inefficiencies for the CRA method presented here are significantly higher than those of the original paper. This is most likely due to the different energy functions used. The extended TorusDBN energy has a quite narrow bond-angle potential, simply given by a Gaussian with mean and variance as described by Engh and Huber [20]. The OPLS-AA energy [22] used in the original paper is significantly broader, which leads to an increased acceptance rate.

Discussion

We developed a new type of local MCMC move for protein simulations, incorporating a prior distribution that governs the local structure of candidate structures. The use of an informative prior is demonstrated to improve the sampling efficiency considerably. In addition, a new and simple analytical technique is presented for calculating the postrotation step in cases where bond angles are included as degrees of freedom. This avoids the tedious numerical calculations that have been used previously [9].

The TorusDBN local model that was used in this study can be viewed as a secondary structure dependent prior. As can be seen in Figure 4, the variances of the angular distributions of the TorusDBN depend heavily on the local structural context. Helices have very narrow distributions, while coil states are rather flexible with broad distributions. By the design of the CRISP method, this will automatically affect the type of local move that is proposed in these regions. In the core of a helix, structural variations will be small, while coil regions will experience larger deviations. We see this effect very clearly in Figure 6, which summarizes the structural variance over 10,000 CRISP moves for the complete protein G (2gb1) structure. End moves were excluded in this figure to make the structural alignment clearer. Note, in particular, how the longest coil region sees the most dramatic changes, and how the hairpins, although quite flexible, are more heavily constrained. We also see some flexibility in the strand regions, which would probably be reduced if a global energy function with a hydrogen-bond term were included in the simulation. These changes seem to reflect the dynamics of the protein quite well, although further studies should be conducted to clarify this in detail.

Acknowledgments

The authors thank Jes Frellsen and Tim Harder for discussions in the preparation of this paper. WB was supported by the Lundbeck Foundation; TH was funded by Forskningsrådet for Teknologi og Produktion ("Data driven protein structure prediction").

References

[1] Go N, Scheraga H (1970) Ring closure and local conformational deformations of chain molecules. Macromolecules 3: 178–187.

[2] Dodd L, Boone T, Theodorou D (1993) A concerted rotation algorithm for atomistic Monte Carlo simulation of polymer melts and glasses. Mol Phys 78: 961–996.

[3] Hoffmann D, Knapp E (1996) Polypeptide folding with off-lattice Monte Carlo dynamics: the method. Eur Biophys J 24: 387–403.

[4] Hoffmann D, Knapp E (1996) Protein dynamics with off-lattice Monte Carlo moves. Phys Rev E 53: 4221–4224.

[5] Pant P, Theodorou D (1995) Variable connectivity method for the atomistic Monte Carlo simulation of polydisperse polymer melts. Macromolecules 28: 7224–7234.

[6] Mavrantzas V, Boone T, Zervopoulou E, Theodorou D (1999) End-bridging Monte Carlo: a fast algorithm for atomistic simulation of condensed phases of long polymer chains. Macromolecules 32: 5072–5096.

[7] Wedemeyer W, Scheraga H (1999) Exact analytical loop closure in proteins using polynomial equations. J Comput Chem 20: 819–844.

[8] Coutsias E, Seok C, Jacobson M, Dill K (2004) A kinematic view of loop closure. J Comput Chem 25: 510–528.

[9] Ulmschneider J, Jorgensen W (2003) Monte Carlo backbone sampling for polypeptides with variable bond angles and dihedral angles using concerted rotations and a Gaussian bias. J Chem Phys 118: 4261–4271.

[10] Frenkel D, Mooij G, Smit B (1992) Novel scheme to study structural and thermal properties of continuously deformable molecules. J Phys Condens Matter 4: 3053–3076.

[11] de Pablo J, Laso M, Suter U (1992) Simulation of polyethylene above and below the melting point. J Chem Phys 96: 2395–2403.

[12] Escobedo F, de Pablo J (1995) Extended continuum configurational bias Monte Carlo methods for simulation of flexible molecules. J Chem Phys 102: 2636–2652.

[13] Chen Z, Escobedo F (2000) A configurational-bias approach for the simulation of inner sections of linear and cyclic molecules. J Chem Phys 113: 11382–11392.

[14] Wick C, Siepmann J (2000) Self-adapting fixed-endpoint configurational-bias Monte Carlo method for the regrowth of interior segments of chain molecules with strong intramolecular interactions. Macromolecules 33: 7207–7218.

[15] Uhlherr A (2000) Monte Carlo conformational sampling of the internal degrees of freedom of chain molecules. Macromolecules 33: 1351–1360.

[16] Boomsma W et al. (2008) A generative, probabilistic model of local protein structure. Proc Natl Acad Sci USA 105: 8932–8937.

[17] Favrin G, Irbäck A, Sjunnesson F (2001) Monte Carlo update for chain molecules: biased Gaussian steps in torsional space. J Chem Phys 114: 8154–8158.

[18] Chikenji G, Fujitsuka Y, Takada S (2006) Shaping up the protein folding funnel by local interaction: lesson from a structure prediction study. Proc Natl Acad Sci USA 103: 3141–3146.

[19] Jauch R, Yeo H, Kolatkar P, Clarke N (2007) Assessment of CASP7 structure predictions for template free targets. Proteins 69 Suppl 8: 57–67.

[20] Engh RA, Huber R (1991) Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallogr A 47: 392–400.

[21] Jaynes E (2003) Probability Theory: The Logic of Science (Cambridge University Press).

[22] Jorgensen W, Maxwell D, Tirado-Rives J (1996) Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118: 11225–11236.

Figure 6: The structural variance induced by 10,000 local moves on the native structure of 2gb1. The width of the chain indicates the position-specific structural variance. The consensus structure was built from 1000 structures (every 100th sample was recorded).


Appendix A: Parameter Estimation and Sampling for the Torus Distribution

The cosine variant of the bivariate von Mises distribution (torus distribution) has density function

f_c(φ, ψ) = C_c exp(κ1 cos(φ − φ0) + κ2 cos(ψ − ψ0) − κ3 cos((φ − φ0) − (ψ − ψ0))).

As demonstrated in the introduction, for high concentrations it can be approximated by the bivariate Gaussian distribution

f_c(φ, ψ) ≃ C_c exp(κ1 + κ2 − κ3) exp(−½((κ1 − κ3)(φ − φ0)² + (κ2 − κ3)(ψ − ψ0)² + 2κ3(φ − φ0)(ψ − ψ0))).   (13)
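For reference, the cosine-model density above can be evaluated directly up to its normalization constant; a minimal sketch (assuming NumPy; the function name is illustrative):

```python
import numpy as np

def log_torus_density_unnorm(phi, psi, phi0, psi0, k1, k2, k3):
    """Unnormalized log-density of the bivariate von Mises cosine
    model; the constant C_c is omitted, since MCMC ratios and
    likelihood maximization only need the density up to
    normalization."""
    return (k1 * np.cos(phi - phi0)
            + k2 * np.cos(psi - psi0)
            - k3 * np.cos((phi - phi0) - (psi - psi0)))

# When the unimodality condition k3 < k1*k2/(k1+k2) holds, the density
# peaks at (phi0, psi0).
peak = log_torus_density_unnorm(-1.0, 2.0, -1.0, 2.0, 5.0, 5.0, 1.0)
off = log_torus_density_unnorm(0.5, 0.0, -1.0, 2.0, 5.0, 5.0, 1.0)
print(peak > off)  # → True
```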

Parameter Estimation

There is no convenient analytical expression for the maximum likelihood estimator of the bivariate von Mises cosine distribution. We are therefore forced to numerically maximize the likelihood (or log-likelihood) function. For a data set of n angle pairs, the log-likelihood is

LL(φ0, ψ0, κ1, κ2, κ3) = n log(C_c) + Σ_{i=1}^{n} [κ1 cos(φi − φ0) + κ2 cos(ψi − ψ0) − κ3 cos((φi − φ0) − (ψi − ψ0))]   (14)

The normalization constant C_c is a function of the parameters, C_c = C_c(φ0, ψ0, κ1, κ2, κ3). Again, no analytical expression for this value is available. However, we see that the marginal density of a single angle shares the same normalization constant [1]

f_c(φ) = C_c 2π I0(√(κ1² + κ3² − 2κ1κ3 cos(φ − φ0))) exp(κ2 cos(φ − φ0))   (15)

and C_c can thus be determined by numerically integrating a single-variable expression from −π to π. The optimization of (14) was conducted using Powell's method (see Numerical Recipes [2]). It should be noted that since we are optimizing the parameters


φ0, ψ0, κ1, κ2, κ3, the normalization constant must be reevaluated for every evaluation of (14), which makes it a rather tedious process.
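This numerical step can be sketched as follows (assuming NumPy; the grid size and the uniform-grid quadrature are illustrative choices, not necessarily those used here):

```python
import numpy as np

def normalization_constant(phi0, k1, k2, k3, n_grid=4096):
    """Recover C_c numerically: integrate the single-variable marginal
    expression of Eq. (15) over [-pi, pi) on a uniform periodic grid;
    C_c is the reciprocal of that integral."""
    phi = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    bessel_arg = np.sqrt(k1**2 + k3**2 - 2.0*k1*k3*np.cos(phi - phi0))
    integrand = 2.0*np.pi * np.i0(bessel_arg) * np.exp(k2*np.cos(phi - phi0))
    # Mean times interval length = uniform-grid quadrature, which is
    # spectrally accurate for smooth periodic integrands.
    return 1.0 / (integrand.mean() * 2.0*np.pi)

# Sanity check against the independent case: with k3 = 0 the density
# factorizes into two von Mises terms, so C_c = 1/(4 pi^2 I0(k1) I0(k2)).
cc = normalization_constant(0.0, k1=2.0, k2=3.0, k3=0.0)
exact = 1.0 / (4.0*np.pi**2 * np.i0(2.0) * np.i0(3.0))
print(abs(cc - exact) / exact < 1e-10)  # → True
```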

To speed up the parameter estimation procedure, I experimented with using method of moments estimators instead of maximum likelihood estimators. The moment estimates for φ0 and ψ0 are simply the directional sample means (accounting for the periodicity of the data [3]):

φ̂0 = arctan(Sφ/Cφ)        if Cφ ≥ 0
φ̂0 = arctan(Sφ/Cφ) + π     if Cφ < 0

where Sφ and Cφ are the sample means of sin(φ) and cos(φ), respectively.

I did not succeed in finding an expression for the second circular moments (cos² and sin²) directly from the density function. However, using the small angle approximation x ≃ sin(x), we can write the covariance matrix for a bivariate Gaussian distribution as

Σ ≃ [ var(sin(φ))            cov(sin(φ), sin(ψ)) ]
    [ cov(sin(φ), sin(ψ))    var(sin(ψ))         ]

and set it equal to the covariance matrix corresponding to the Gaussian approximation for the cosine model (equation (13)). This gives us the moment estimates

κ̂1 = (Sψψ − Sφψ) / (SφφSψψ − Sφψ²)
κ̂2 = (Sφφ − Sφψ) / (SφφSψψ − Sφψ²)
κ̂3 = −Sφψ / (SφφSψψ − Sφψ²)

where Sφφ, Sψψ, and Sφψ are the circular moments

Sφφ = (1/n) Σᵢ sin² φi
Sψψ = (1/n) Σᵢ sin² ψi
Sφψ = (1/n) Σᵢ sin φi sin ψi

Clearly, these estimates are not ideal. They rely on the small angle (Gaussian) approximation, which is not generally valid for our purposes (especially during model training). However, they can be used as starting points for the numerical maximum likelihood optimization described above, thereby speeding up the parameter estimation considerably.
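The moment scheme can be sketched as follows. This is a sketch, not the thesis implementation: angles are centered at the directional means before taking second moments (which the small-angle argument implicitly assumes), and κ3 carries the sign implied by the Gaussian approximation:

```python
import numpy as np

def moment_estimates(phi, psi):
    """Method-of-moments starting values for the cosine model: the
    directional sample means for (phi0, psi0), and the small-angle
    (Gaussian) formulas for (k1, k2, k3)."""
    phi0 = np.arctan2(np.sin(phi).mean(), np.cos(phi).mean())
    psi0 = np.arctan2(np.sin(psi).mean(), np.cos(psi).mean())
    s_phi, s_psi = np.sin(phi - phi0), np.sin(psi - psi0)
    s_ff = (s_phi**2).mean()          # S_phiphi
    s_pp = (s_psi**2).mean()          # S_psipsi
    s_fp = (s_phi * s_psi).mean()     # S_phipsi
    det = s_ff * s_pp - s_fp**2
    return (phi0, psi0,
            (s_pp - s_fp) / det,      # k1
            (s_ff - s_fp) / det,      # k2
            -s_fp / det)              # k3

# Independent angles (k3 = 0): the concentrations should be recovered
# to within the Gaussian approximation, and k3 should be near zero.
rng = np.random.default_rng(2)
phi = rng.vonmises(1.0, 10.0, size=100_000)
psi = rng.vonmises(-2.0, 10.0, size=100_000)
p0, q0, k1, k2, k3 = moment_estimates(phi, psi)
print(abs(p0 - 1.0) < 0.05, abs(k3) < 0.5)  # → True True
```

These values would then seed the Powell maximization of the log-likelihood.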


Finally, it should be noted that this variant of the bivariate von Mises distribution is not necessarily unimodal, as one might expect. The necessary condition for bimodality was determined by Mardia and coworkers to be

κ3 > κ1κ2 / (κ1 + κ2)   (16)

when κ1 > κ3 > 0 and κ2 > κ3 > 0 [1]. Using the following reparameterization scheme during parameter estimation, this situation was avoided, thus ensuring unimodality:

κ′1 = |κ1|
κ′2 = |κ2|
κ′3 = (1 − exp(−α|κ3|)) · min(κ′1, κ′2, κ′1κ′2/(κ′1 + κ′2))

where α is a free parameter, which I manually tuned to the value 0.001 in our parameter estimation scheme.
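A sketch of this reparameterization (illustrative only; edge cases such as κ1 = κ2 = 0 are not handled):

```python
import numpy as np

def reparameterize(k1, k2, k3, alpha=0.001):
    """Map unconstrained optimizer variables (k1, k2, k3) into the
    unimodal region k3' < k1'k2'/(k1'+k2'), mirroring the scheme
    above; alpha controls how quickly k3' approaches its bound."""
    k1p, k2p = abs(k1), abs(k2)
    bound = min(k1p, k2p, k1p * k2p / (k1p + k2p))
    k3p = (1.0 - np.exp(-alpha * abs(k3))) * bound
    return k1p, k2p, k3p

# k3' always stays strictly below k1'k2'/(k1'+k2'), ensuring
# unimodality no matter what value the optimizer proposes for k3.
k1p, k2p, k3p = reparameterize(-4.0, 6.0, 1000.0)
print(k3p < k1p * k2p / (k1p + k2p))  # → True
```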

Sampling

Sampling a (φ, ψ) angle pair from the torus distribution is done by first using a rejection sampler for the marginal distribution f(φ), giving us a sample for φ. The ψ angle is then sampled from the conditional distribution f(ψ|φ). This can be done efficiently, since the conditional distribution is a univariate von Mises distribution, for which a direct sampling method exists [4].

A rejection sampler requires an envelope distribution, from which samples can be drawn directly. Given a target distribution density f(x), the envelope distribution g(x) must fulfill the requirement

f(x) < K g(x)

for some constant K. Samples x are drawn from g(x) and accepted with probability f(x)/(K g(x)). Clearly, the efficiency of this type of sampler depends on the fit of the envelope to the target distribution.

The marginal distribution f(φ) (equation (15)) is not von Mises [1]. However, the von Mises distribution is still an obvious choice for an envelope distribution. To determine the optimal concentration parameter κ for the envelope distribution, another numerical optimization is necessary, again accomplished by the Powell optimization scheme. For each κ value, the optimal value for K is found by calculating the ratio f(x)/g(x) over the range from −π to π at intervals of 0.01, returning the maximum value of this ratio to the optimization scheme. This minimizes the maximum difference between the two distributions, automatically returning the value K as the optimal function value upon termination.
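The rejection step can be sketched as follows, under the assumption that the envelope parameters and the bound K have already been tuned (e.g. by the Powell step described above); the toy target below is illustrative:

```python
import numpy as np

def rejection_sample(log_f_unnorm, mu, kappa, log_K, size, rng):
    """Rejection sampler with a von Mises envelope g = M(mu, kappa).
    log_f_unnorm evaluates the unnormalized target, and log_K is a
    pre-tuned constant with f(x) <= K g(x) on [-pi, pi)."""
    out = []
    while len(out) < size:
        x = rng.vonmises(mu, kappa, size=size)
        log_g = kappa * np.cos(x - mu) - np.log(2.0*np.pi*np.i0(kappa))
        # Accept with probability f(x) / (K g(x)), computed in log space.
        accept = np.log(rng.random(size)) < log_f_unnorm(x) - log_K - log_g
        out.extend(x[accept])
    return np.array(out[:size])

# Toy target: an unnormalized von Mises(0, 2), enveloped by a broader
# von Mises(0, 1.5); the ratio f/g is maximized at x = 0, which fixes K.
rng = np.random.default_rng(3)
log_f = lambda x: 2.0 * np.cos(x)
log_K = 0.5 + np.log(2.0*np.pi*np.i0(1.5))
s = rejection_sample(log_f, mu=0.0, kappa=1.5, log_K=log_K, size=50_000, rng=rng)
# E[cos x] under von Mises(0, 2) is I1(2)/I0(2), roughly 0.698.
print(abs(np.cos(s).mean() - 0.698) < 0.02)  # → True
```

For the torus distribution, `log_f_unnorm` would evaluate the integrand of equation (15) up to C_c.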

As mentioned, there are cases where f(φ, ψ) is bimodal. Correspondingly, f(φ) is not always unimodal. The conditions under which this occurs have been described in detail


by Mardia and coworkers [1]. For sampling purposes, in the bimodal case it becomes necessary to use a mixture of two von Mises distributions as the envelope distribution. These mixture components should have their means at the two modes. The location of these modes can be found by differentiating (15) and finding the roots. However, since we restricted ourselves to unimodal distributions during parameter estimation, this situation does not arise for the TorusDBN.

Once a φ angle has been sampled, the corresponding ψ angle can be sampled from a von Mises distribution M(µ, κ), with [1]

µ = arctan( −κ3 sin(φ − φ0) / (κ1 − κ3 cos(φ − φ0)) )
κ = √(κ1² + κ3² − 2κ1κ3 cos(φ − φ0))

As an alternative to the sampling strategy described here, a Gibbs sampler could be constructed. Since both conditional distributions f(ψ|φ) and f(φ|ψ) are simple von Mises distributions, such a strategy would be much easier to implement than the approach presented above. However, for the model presented in Chapter 2, we require efficient sampling, which an iterative approach such as the Gibbs sampler cannot provide. Although the sampling procedure described in this section involves a significant amount of numerical optimization, it is important to realize that this only has to be done once for each parameter setting. For fixed parameter values, the parameters for the corresponding envelope distributions can be precalculated and saved, and subsequent sampling can be done extremely efficiently.
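The conditional step for ψ can be sketched as follows. This sketch follows the κ-indexing of the expressions quoted above verbatim, and takes the mean relative to ψ0 (an assumption); the self-check below deliberately chooses κ1 = κ2 so that it is insensitive to that indexing:

```python
import numpy as np

def sample_psi_given_phi(phi, phi0, psi0, k1, k3, rng):
    """Draw psi from the conditional von Mises M(mu, kappa) given phi,
    using the mean and concentration expressions quoted above; the
    arctan is evaluated quadrant-aware via arctan2."""
    d = phi - phi0
    mu = psi0 + np.arctan2(-k3 * np.sin(d), k1 - k3 * np.cos(d))
    kappa = np.sqrt(k1**2 + k3**2 - 2.0*k1*k3*np.cos(d))
    return rng.vonmises(mu, kappa)

# Cross-check the circular mean of the draws against the conditional
# slice of the full density, evaluated on a grid (k1 = k2 = 3, k3 = 1,
# phi0 = 0.5, psi0 = -1, conditioning on phi = 1).
rng = np.random.default_rng(4)
draws = np.array([sample_psi_given_phi(1.0, 0.5, -1.0, 3.0, 1.0, rng)
                  for _ in range(20_000)])
grid = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
w = np.exp(3.0*np.cos(grid + 1.0) - 1.0*np.cos(0.5 - (grid + 1.0)))
mean_grid = np.arctan2((w*np.sin(grid)).sum(), (w*np.cos(grid)).sum())
mean_draws = np.arctan2(np.sin(draws).mean(), np.cos(draws).mean())
print(abs(mean_draws - mean_grid) < 0.05)  # → True
```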

Bibliography

[1] Mardia KV, Taylor CC, Subramaniam GK (2007) Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics 63: 505–512.

[2] Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical Recipes in C: The Art of Scientific Computing (Cambridge University Press, New York, NY, USA).

[3] Mardia K, Jupp P (2000) Directional Statistics (Wiley, New York).

[4] Best D, Fisher N (1979) Efficient simulation of the von Mises distribution. Appl Statist 28: 152–157.


Appendix B: TorusDBN MCMC Sampling Strategies

There are several ways to incorporate the TorusDBN local model as part of an MCMC protein folding simulation. Here I describe two such strategies. The first is the standard approach, where the contribution of the local model to the complete energy involves summing over all hidden node sequences. In the second strategy, the hidden node sequence is considered part of the state, and all probability evaluations of the local model are done conditioned on the current hidden node sequence. In this scenario, the hidden state must be separately resampled.

Notation:

PL(x):    TorusDBN local probability of angle sequence x, summing over all hidden node sequences. Any dependency on amino acids or secondary structure is not specified explicitly.
PL(x|h):  TorusDBN local probability of angle sequence x, given hidden node sequence h. Any dependency on amino acids or secondary structure is not specified explicitly.
PG(x):    The non-local global contributions to the energy function for angle sequence (structure) x.
xa:b:     Sub-sequence of a vector x ranging from index a to b.
xC:       Complement of xa:b, i.e. the elements of x not included in xa:b.

Strategy 1

This strategy assumes that the state in the simulation is defined only by the current angle sequence. For any move from state x to x′, the detailed balance requirement states

P(x) P(x → x′) = P(x′) P(x′ → x)   (17)

If we split the probability P(x → x′) into a proposal (selection) probability S(x → x′) and an acceptance probability A(x → x′), we have

P(x) S(x → x′) A(x → x′) = P(x′) S(x′ → x) A(x′ → x)   (18)

which gives us the following acceptance ratio

A(x → x′) / A(x′ → x) = [P(x′) S(x′ → x)] / [P(x) S(x → x′)]   (19)


Assuming independence between local and non-local probability terms, the probability of a structure can be written as a product of the local and global contributions

P(x) = PL(x) PG(x)   (20)

and the acceptance ratio can thus be written as

A(x → x′) / A(x′ → x) = [PL(x′) PG(x′) S(x′ → x)] / [PL(x) PG(x) S(x → x′)]   (21)

Resampling angles and hidden node sequence

Naturally, it is possible to resample the entire angle sequence from the local model. This is done by first resampling a hidden node sequence according to P(h), and subsequently resampling the angles according to P(x|h). The proposal distribution is

S(x → x′) = PL(x′) = Σ_h PL(x′|h) P(h)   (22)

and the acceptance ratio becomes

A(x → x′) / A(x′ → x) = [PG(x′)/PG(x)] · [PL(x′)/PL(x)] · [PL(x)/PL(x′)] = PG(x′)/PG(x)   (23)

As expected, the acceptance probability is simply the ratio of the global energy in the new and old configuration.
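The cancellation in Eq. (23) can be illustrated on a toy discrete state space; all numbers below are hypothetical, chosen only for illustration:

```python
import numpy as np

# When states are proposed directly from the local model P_L, the
# local factors cancel in the Metropolis-Hastings ratio, and the
# acceptance reduces to the ratio of the global factors P_G.
rng = np.random.default_rng(5)
p_local = np.array([0.5, 0.3, 0.2])
p_global = np.array([0.1, 0.6, 0.3])
target = p_local * p_global          # stationary distribution P_L * P_G
target /= target.sum()

n_steps = 200_000
proposals = rng.choice(3, size=n_steps, p=p_local)   # proposal S = P_L
uniforms = rng.random(n_steps)
x = 0
counts = np.zeros(3)
for x_new, u in zip(proposals, uniforms):
    if u < p_global[x_new] / p_global[x]:            # only P_G remains
        x = x_new
    counts[x] += 1

print(np.allclose(counts / n_steps, target, atol=0.02))  # → True
```

The empirical visit frequencies match the intended stationary distribution, even though the acceptance test never evaluates the local model.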

Using the forward backtrack algorithm, it is possible to resample only a subsequence of the angles. We let hC denote the hidden node sequence that is not resampled, and demonstrate that, for any such sequence, the detailed balance equation

P(x|hC) P(x → x′|hC) = P(x′|hC) P(x′ → x|hC)   (24)

is fulfilled. The proposal distribution can be written as

S(x → x′|hC) = PL(x′a:b|xC, hC) = Σ_{ha:b} PL(x′a:b|ha:b) PL(ha:b|hC)   (25)

and P(x|hC) reads

P(x|hC) = PG(x) PL(xa:b, xC|hC)
        = PG(x) Σ_{ha:b} PL(xa:b, xC|ha:b, hC) PL(ha:b|hC)
        = PG(x) Σ_{ha:b} PL(xa:b|ha:b) PL(xC|hC) PL(ha:b|hC)
        = PG(x) PL(xC|hC) Σ_{ha:b} PL(xa:b|ha:b) PL(ha:b|hC)   (26)


This gives us the following acceptance probability

A(x → x′) / A(x′ → x)
  = [PG(x′)/PG(x)] · [PL(xC|hC) Σ_{ha:b} PL(x′a:b|ha:b) PL(ha:b|hC)] / [PL(xC|hC) Σ_{ha:b} PL(xa:b|ha:b) PL(ha:b|hC)] · [Σ_{ha:b} PL(xa:b|ha:b) PL(ha:b|hC)] / [Σ_{ha:b} PL(x′a:b|ha:b) PL(ha:b|hC)]
  = PG(x′)/PG(x)   (27)

which, again, is just the ratio of the global energy terms.

Strategy 2

In some cases, it is convenient to treat the hidden sequence of the model as an explicit part of the state. This makes it possible to create separate moves for updating the hidden state sequence and the angle sequence. In particular, the strategy allows us to assume a fixed hidden state sequence when designing a move. This assumption is used in the design of the CRISP method, presented in Chapter 3.

Considering the detailed balance properties for this scenario, we have

P(x, h) P((x, h) → (x′, h′)) = P(x′, h′) P((x′, h′) → (x, h))   (28)

which gives rise to the following acceptance criterion

A((x, h) → (x′, h′)) / A((x′, h′) → (x, h)) = [P(x′, h′) S((x′, h′) → (x, h))] / [P(x, h) S((x, h) → (x′, h′))]   (29)

I now explore the acceptance probabilities of various types of moves that can be implemented using this scheme.

Resampling the hidden node sequence

First, we consider a move that resamples a sub-sequence of the hidden node sequence given the current angle sequence. We let ha:b denote the sub-sequence of h from index


a to b, and hC the remainder of the chain. Since x′ = x, the acceptance ratio becomes

A((x, h) → (x′, h′)) / A((x′, h′) → (x, h))
  = [P(x, h′) S((x, h′) → (x, h))] / [P(x, h) S((x, h) → (x, h′))]
  = [P(x|h′) P(h′)] / [P(x|h) P(h)] · S((x, h′) → (x, h)) / S((x, h) → (x, h′))
  = [PL(x|h′) PG(x|h′) P(h′)] / [PL(x|h) PG(x|h) P(h)] · S((x, h′) → (x, h)) / S((x, h) → (x, h′))

The global probability factor cancels, since PG(x|h′) = PG(x):

  = [PL(x|h′a:b, hC) PL(h′a:b, hC)] / [PL(x|ha:b, hC) PL(ha:b, hC)] · S((x, h′a:b, hC) → (x, ha:b, hC)) / S((x, ha:b, hC) → (x, h′a:b, hC))

With the proposal distribution S((x, ha:b, hC) → (x, h′a:b, hC)) = PL(h′a:b|x, hC), we have

  = [PL(x|h′a:b, hC) PL(h′a:b, hC)] / [PL(x|ha:b, hC) PL(ha:b, hC)] · PL(ha:b|x, hC) / PL(h′a:b|x, hC)
  = [PL(x|h′a:b, hC) PL(h′a:b, hC)] / [PL(x|ha:b, hC) PL(ha:b, hC)] · [PL(x|ha:b, hC) PL(ha:b|hC) / PL(x|hC)] / [PL(x|h′a:b, hC) PL(h′a:b|hC) / PL(x|hC)]
  = [PL(x|h′a:b, hC) PL(h′a:b, hC)] / [PL(x|ha:b, hC) PL(ha:b, hC)] · [PL(x|ha:b, hC) PL(ha:b, hC) / PL(hC)] / [PL(x|h′a:b, hC) PL(h′a:b, hC) / PL(hC)]
  = 1   (30)

All moves of this type will thus be accepted.

If we wish to resample a hidden node sub-sequence without conditioning on the current angle values (i.e. with proposal distribution S((x, ha:b, hC) → (x, h′a:b, hC)) = PL(h′a:b|hC)), the acceptance ratio becomes

A((x, h) → (x′, h′)) / A((x′, h′) → (x, h))
  = [PL(x|h′a:b, hC) PL(h′a:b, hC)] / [PL(x|ha:b, hC) PL(ha:b, hC)] · PL(ha:b|hC) / PL(h′a:b|hC)
  = [PL(x|h′a:b, hC) PL(h′a:b, hC)] / [PL(x|ha:b, hC) PL(ha:b, hC)] · PL(ha:b, hC) / PL(h′a:b, hC)
  = PL(x|h′a:b, hC) / PL(x|ha:b, hC)
  = Π_{i∈a:b} PL(xi|h′i) / PL(xi|hi)   (31)

Resampling the hidden nodes unconditionally can thus be done, but this merely moves the dependency on the current angles from the proposal distribution to the acceptance probability of the move.
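The per-position acceptance ratio of Eq. (31) can be sketched as follows; `log_emission` is a hypothetical stand-in for the TorusDBN per-position emission density:

```python
def log_accept_ratio(x, h_old, h_new, log_emission):
    """Log of the Eq. (31) acceptance ratio for resampling a hidden
    sub-sequence without conditioning on the angles: the product over
    the resampled positions of P_L(x_i|h'_i) / P_L(x_i|h_i),
    accumulated in log space for numerical stability."""
    return sum(log_emission(xi, hn) - log_emission(xi, ho)
               for xi, ho, hn in zip(x, h_old, h_new))

# Illustrative Gaussian emissions with hidden-state-dependent means.
means = [0.0, 1.0]
log_em = lambda xi, hi: -0.5 * (xi - means[hi])**2
ratio = log_accept_ratio([0.0, 1.0], h_old=[0, 0], h_new=[1, 1],
                         log_emission=log_em)
print(ratio)  # → 0.0
```

In this symmetric example the two positions exactly trade off, so the move is accepted with probability one.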


Resampling the angle sequence

The angles can be resampled in two ways: either based on the current hidden node sequence (P(x|h)), or simultaneously with a resampling of the hidden nodes (P(x, h)). We start with the first case; the latter, which is probably the most efficient, will be covered in the next section.

The move resamples a sub-sequence xa:b of the angle sequence x, given the current hidden node sequence h. The proposal distribution is:

S((xa:b, xC, h) → (x′a:b, xC, h)) = PL(x′a:b|xC, h) = PL(x′a:b|h)   (32)

The corresponding acceptance ratio is:

A((x, h) → (x′, h)) / A((x′, h) → (x, h))
  = [P(x′, h) S((x′, h) → (x, h))] / [P(x, h) S((x, h) → (x′, h))]
  = [PG(x′)/PG(x)] · [PL(x′a:b, xC, h) / PL(xa:b, xC, h)] · [PL(xa:b|h) / PL(x′a:b|h)]
  = [PG(x′)/PG(x)] · [PL(x′a:b|xC, h) P(xC, h)] / [PL(xa:b|xC, h) P(xC, h)] · [PL(xa:b|h) / PL(x′a:b|h)]
  = PG(x′)/PG(x)   (33)

since P(xa:b|xC, h) = P(xa:b|h). Again, we see that only the global factors remain in the acceptance ratio.

Resampling angles and hidden node sequence

Resampling angles and hidden nodes simultaneously in one move is likely to give the largest conformational changes, and is therefore the most efficient candidate for a probabilistic pivot move.

The proposal distribution of the move is:

S((xa:b, xC, ha:b, hC) → (x′a:b, xC, h′a:b, hC)) = PL(x′a:b, h′a:b|xC, hC) = PL(x′a:b, h′a:b|hC)   (34)

Also for this move, the acceptance rate is simply the ratio of the global probabilities:

A((x, h) → (x′, h′)) / A((x′, h′) → (x, h))
  = [P(x′, h′) S((x′, h′) → (x, h))] / [P(x, h) S((x, h) → (x′, h′))]
  = [PG(x′)/PG(x)] · [PL(x′a:b, xC, h′a:b, hC) / PL(xa:b, xC, ha:b, hC)] · [PL(xa:b, ha:b|xC, hC) / PL(x′a:b, h′a:b|xC, hC)]
  = [PG(x′)/PG(x)] · [PL(x′a:b, h′a:b|xC, hC) P(xC, hC)] / [PL(xa:b, ha:b|xC, hC) P(xC, hC)] · [PL(xa:b, ha:b|xC, hC) / PL(x′a:b, h′a:b|xC, hC)]
  = PG(x′)/PG(x)   (35)


Appendix C: TorusDBN Network Topology


Figure C.1: The state diagram of the TorusDBN. All transitions with probability > 5% are shown. The states are colored according to secondary structure preference (the emission label with highest probability): Helix (red), Strand (yellow), and Coil (green).


Concluding Remarks

Since the first CASP experiment was held in 1994, the field of protein structure prediction has advanced dramatically. In the case of ab initio structure prediction, the field has moved from producing virtually random structures in the first CASP experiments to frequent predictions of meaningful structures for small proteins in the latest experiments [1]. The invention of fragment assembly has played a dominating role in this development [2, 3]. The Rosetta method, by the Baker group, is perhaps the best example of this progress. Apart from consistently performing well in CASP, Baker and coworkers have in the last decade developed successful methods for homology modeling [4], docking [5], and protein design [6], all using the idea of assembling fragments of local structure.

Many of the current fragment assembly methods are based on optimization techniques such as simulated annealing [7]. It is problematic to formulate a correct Markov chain Monte Carlo (MCMC) simulation scheme based on fragment assembly, because the property of detailed balance cannot be ensured. If this requirement is ignored, the stationary distribution of the Markov chain will not correspond to the specified energy function. Effectively, some unquantifiable additional energy term, corresponding to the local structural bias of the fragments, will implicitly be present in the simulation.

One of the main goals of my Ph.D. has been to investigate how local structural bias could be incorporated in MCMC simulations in a rigorous manner. The result is a probabilistic model that fulfills two goals: (i) it encompasses much of the knowledge that we have about the local structure of proteins, and the underlying probability distribution is therefore a natural component of an energy function; (ii) it is a generative model that can be used to (re)sample structures or segments of a structure, much like the fragment insertion technique of fragment assembly based methods. This dual nature of the model makes it ideal as a proposal distribution in an MCMC scheme. In contrast to fragment-assembly based methods, the effect of the proposal distribution can be directly quantified.

The angle resampling scheme of the TorusDBN corresponds to a pivot-like move. It is well known that such moves are typically inefficient around the densely packed native state, because many proposed conformations will contain clashes. Local moves are often introduced as a solution to this problem. Here too, fragment-based approaches are common, but again they do not provide a satisfying solution for MCMC simulations, because of the lack of detailed balance. As demonstrated in Chapter 3, the probabilistic nature of the TorusDBN can also in this case provide us with a potential solution.

In the last months, the TorusDBN has been incorporated into a full protein structure prediction framework, based on the Muninn MCMC method developed by Jesper Ferkinghoff-Borg [8]. This method is currently being used to predict structures for a number of targets in the ongoing CASP exercise. Undoubtedly, this work will lead to further insights into the full potential of the methods presented in this thesis. However, already at this point, several topics for further studies have suggested themselves.

One obvious disadvantage of the TorusDBN is its dependency on a secondary structure input. It is clear from Chapter 2 that the performance of the method is significantly enhanced when a predicted secondary structure signal is included, compared to using the amino acid sequence alone. This additional signal seems to originate from the fact that secondary structure prediction methods typically use multiple sequence information in their prediction (see Chapter 2, Supporting Information). For a given target sequence, the sequence databases are searched for close homologues, resulting in multiple sequence alignments that contain valuable information on the sequence variability at each position in the chain. It would be a great advantage if information of this type could be included in the TorusDBN directly during training. Currently, the data set used for training has been specifically constructed to exclude homologous sequences within the same fold [9]. A first step towards resolving this issue could therefore simply be to retrain the model on a dataset with more sequence variability for a given structure.

Several extensions to the model are currently being implemented. The most obvious is perhaps the inclusion of a measure of solvent exposure at each position, which should give the model a better chance at capturing, for instance, amphipathic signals along the chain. The inclusion of additional backbone angular degrees of freedom in the model has also been discussed, effectively incorporating the Gaussian bond-angle distributions presented in Chapter 3 directly in the model.

I have focused on the sampling properties of the TorusDBN in this dissertation. The model has a few other areas of application that are worth exploring in the future. It seems that the model can convincingly distinguish native from non-native structures in many of the well-known decoy datasets (Chapter 2, Supporting Information). Decoy sets are designed as a benchmark for energy functions, and it is remarkable that a model based only on local structure performs so well. It has been argued that the model is simply picking up artifacts from the way that these decoys have been constructed. And indeed, if the decoys are generated by slowly diverging away from the native state, one might expect the quality of local structure to decrease gradually as the distance to the native state becomes larger. However, the result does illustrate two points: (i) that the local structure treatment of some of the existing methods is not of sufficient quality, and (ii) that the TorusDBN is capable of detecting structures with poor local structure. Using a position-specific probability score, the model might be used as a tool for detecting errors in experimentally determined structures, where it should have an advantage over the use of simple Ramachandran plots to detect outliers.

Bibliography

[1] Moult J (2006) Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos Trans R Soc Lond B Biol Sci 361: 453–458.


[2] Chikenji G, Fujitsuka Y, Takada S (2006) Shaping up the protein folding funnel by local interaction: lesson from a structure prediction study. Proc Natl Acad Sci USA 103: 3141–3146.

[3] Jauch R, Yeo H, Kolatkar P, Clarke N (2007) Assessment of CASP7 structure predictions for template free targets. Proteins 69 Suppl 8: 57–67.

[4] Misura K, Chivian D, Rohl C, Kim D, Baker D (2006) Physically realistic homology models built with Rosetta can be more accurate than their templates. Proc Natl Acad Sci USA 103: 5361–5366.

[5] Wang C, Bradley P, Baker D (2007) Protein–protein docking with backbone flexibility. J Mol Biol 373: 503–519.

[6] Kuhlman B, Dantas G, Ireton G, Varani G, Stoddard B, Baker D (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302: 1364–1368.

[7] Kirkpatrick S, Gelatt C, Vecchi M (1983) Optimization by simulated annealing. Science 220: 671–680.

[8] Ferkinghoff-Borg J (2002) Optimized Monte Carlo analysis for generalized ensembles. Eur Phys J B Condens Matter 29: 481–484.

[9] Van Walle I, Lasters I, Wyns L (2005) SABmark – a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21: 1267–1268.


Acknowledgments

I would like to take this opportunity to thank all the people who have helped me in various ways during my Ph.D.

First and foremost, Thomas Hamelryck, without whom none of the work in this dissertation would have been possible. He has been a truly great supervisor. Thanks for inviting me aboard those many years ago, for the enthusiasm and inspiration, for always having time for a good discussion, and of course, for teaching me the ways of the samurai scientist.

Anders Krogh, for hiring me on the spot after our first meeting, and giving me the opportunity to do a Ph.D. in his group. Thanks for the guidance and inspiration, and for providing the optimal working conditions for a Ph.D. student.

Jesper Ferkinghoff-Borg, for being a fantastic host and supervisor during my time at DTU, and for the infinite amounts of enthusiasm and optimism that he always brings to any scientific challenge.

Our collaborators in Leeds, in particular Kanti Mardia for introducing me to the bivariate von Mises distribution, and for the many helpful email discussions.

The entire structure group at the Bioinformatics Centre, for the great discussions on anything from structure prediction to beer. Mikael Borg and Thomas Hamelryck, and of course the other Ph.D. students: Jes Frellsen, Kasper Stovgaard and Tim Harder. Jes in particular for always being available for a scientific discussion on messenger and for proof-reading a large part of my dissertation.

All the other people at the Bioinformatics Centre, in particular Ida Moltke for the tea and the talks, and Kasper Munch for the classic movie afternoons in Cinemateket in the early stages of my Ph.D., and for the many snails we shared during coffee breaks.

The students at DTU: Sandro Bottaro, Lucia Ferrari and Kristoffer Johansson, it has been a pleasure working with you.

Eske Willerslev and Rasmus Nielsen, for letting me participate in an interesting side project on ancient DNA, involving a crash course in phylogeny and resulting in a nice publication for the CV.

Also a thanks to my dear grandmother for the regular telephone calls reminding me not to over-do it, and my father, mother and sister for their valuable support throughout the years; in particular my father for his everlasting bag of Ph.D. tips & tricks.

Last, but by no means least, I would like to thank my wonderful girlfriend Trine, for her tremendous patience the last few months, for her ability to always make me smile, for reading through the entire dissertation, and for demonstrating that the pedantic and cautious approach of a mathematician occasionally has its merits.

Funding

The first year of my Ph.D. was funded by the Lundbeck Foundation. This grant has been of great importance for the realization of this Ph.D.


Curriculum Vitae

Personal Information

Name: Wouter Boomsma
Date of Birth: 17 March, 1979
Gender: Male
Nationality: Dutch

Education & Employment

2005–2008 Ph.D. student, Bioinformatics Centre, Department of Biology, University of Copenhagen. Supervisors: Thomas Hamelryck and Anders Krogh

2004–2005 Research Assistant, Bioinformatics Centre, Department of Biology, University of Copenhagen

2001–2004 Masters degree in computer science, University of Aarhus

1999–2003 Employment as a student counselor, Faculty of Science, University of Aarhus

1997–2001 Bachelor degree in computer science (with physics), University of Aarhus

1994–1997 Marselisborg Gymnasium, Aarhus

Publications

Publications during Ph.D. (not included in dissertation)

K. Munch, W. Boomsma, J. Huelsenbeck, E. Willerslev & R. Nielsen. Statistical assignment of DNA sequences using Bayesian phylogenetics. Syst. Biol. In press.

E. Willerslev, E. Cappellini, W. Boomsma, R. Nielsen, M. B. Hebsgaard, T. B. Brand, M. Hofreiter, M. Bunce, H. N. Poinar, D. Dahl-Jensen, S. Johnsen, J. P. Steffensen, O. Bennike, S. Funder, J.-L. Schwenninger, R. Nathan, S. Armitage, J. Barker, M. Sharp, K. E. H. Penkman, J. Haile, P. Taberlet, M. T. P. Gilbert, A. Casoli, E. Campani & M. J. Collins (2007). Ancient Biomolecules from Deep Ice Cores Reveal a Forested Southern Greenland. Science, 317: 111–114.

W. Boomsma, J.T. Kent, K.V. Mardia, C.C. Taylor & T. Hamelryck (2006) Graphical models and directional statistics capture protein structure. Interdisciplinary Statistics and Bioinformatics, eds S. Barber, P.D. Baxter, K.V. Mardia & R.E. Walls, pp. 91–94.

Publications prior to Ph.D.

René Thomsen and Wouter Boomsma (2004) Multiple Sequence Alignment Using SAGA: Investigating the Effects of Operator Scheduling, Population Seeding, and Crossover Operators, Proceedings of EvoWorkshops 2004 (EvoBIO), Springer's Lecture Notes in Computer Science, 3005: 113–122.

Wouter Boomsma (2004) An Investigation of Adaptive Operator Scheduling Methods on the Traveling Salesman Problem, Proceedings of EvoCOP 2004, Springer's Lecture Notes in Computer Science, 3004: 31–40.

Wouter Boomsma (2003) Using adaptive operator scheduling on problem domains with an operator manifold: Applications to the Travelling Salesman Problem, Proceedings of the 2003 Congress on Evolutionary Computation (CEC2003), 2: 1274–1279.

Presentations

Wouter Boomsma (2008), A generative, probabilistic model of protein structure, Machine Learning in Structural Bioinformatics (MLSB08). Talk.

Wouter Boomsma (2007), A graphical model of the local structure of proteins, Machine Learning Summer School (MLSS07). Talk.

Wouter Boomsma & Thomas Hamelryck (2006) Full Cyclic Coordinate Descent: Solving The Protein Loop Closure Problem In Cα Space, European Conference on Computational Biology (ECCB06). Poster.

Teaching

Linux and Python Programming, 2007. Shared course.
Linux and Python Programming, 2006. Full course.
Linux and Python Programming, 2005. Full course.
Introductory Course: Computing (Perl), 2004. Full course.