
ICES REPORT 14-25

August 2014

Learning optimized scoring models for protein-protein docking

by

Muhibur Rasheed, Qiming Yuan, Chandrajit Bajaj

The Institute for Computational Engineering and Sciences
The University of Texas at Austin
Austin, Texas 78712

Reference: Muhibur Rasheed, Qiming Yuan, Chandrajit Bajaj, "Learning optimized scoring models for protein-protein docking," ICES REPORT 14-25, The Institute for Computational Engineering and Sciences, The University of Texas at Austin, August 2014.


Learning optimized scoring models for protein-protein docking

Muhibur Rasheed, Qiming Yuan, Chandrajit Bajaj

Center for Computational Visualization

Institute for Computational Engineering and Sciences

The University of Texas

Austin, Texas 78712

Abstract

We present a multi-stage machine learning approach to automatically choose the weighting and threshold-filter parameters used in multi-term scoring functions and filters. Such scoring functions and filters are common in protein-protein and protein-ligand (small molecule) docking software. In particular, we apply the approach to, and compare the improvement of docking results achieved by, our Fast Fourier based docking and re-ranking software F2Dock. We additionally tabulate and compare the learning-based results achieved by both linear summation learning models (via quadratic program optimization) and non-linear learning models (using random forest classifiers).

1 Introduction

Protein-protein docking is often modeled as an optimization problem involving a multi-term scoring function, where each term may also contain multiple internal parameters. In general, the parameters of most scoring functions fall into two groups: (1) internal parameters, which are typically used to calibrate each statistical or biophysical scoring term; and (2) inter-weight parameters, which assign a normalized weight to each scoring term in a multi-term linear combination. While internal parameters are often derived from empirical data (e.g., the parameters of the Amber force field model), most inter-weight parameters are still assigned manually and are far from optimal. With a large number of terms it is usually infeasible to search for optimal parameters by hand, which motivates the design of automated learning schemes.

Several machine learning based approaches have been proposed recently. Ravikant and Elber [22] used quadratic programming [20] to learn 441 inter-weight parameters for residue contact potentials, with an objective function that maximizes a distance metric between the correct solution and decoys. Similarly, Andersson et al. [3] used a multivariate approach to optimize linear parameter sets. Hetenyi et al. [15] used linear regression models to learn the linear relationship between multiple score terms and empirical binding free energy. Beyond linear models, Teramoto et al. [27] used a random forest classifier as a supervised scoring method, and Gozalbes et al. [14] learned threshold parameters from statistics of empirical data. Adaptive parameter sweeping is another type of learning approach, in which the initial parameter values are randomly sampled and then updated iteratively. Pham et al. [21] and Seifert et al. [23] used gradient descent and line search for high-dimensional searching, and Antes et al. [4] used a neural network to parameterize the relationship between the parameter vector and the objective function (RMSD of the top 5 results). Genetic programming [24] has also been applied to parameter space search, where evolution and mutation are exploited to update the generation of parameter vectors. Another score-driven approach is that of Yang et al. [29], in which multiple criteria (Z-scores etc.) are used to guide the iterative steps.

Note that the learning approaches presented in all of the previous work assume homogeneity of the parameters being learned, i.e., they are applicable either to linear sums of weighted terms, or to interval thresholds, etc. In this paper, we identify the different types of parameter spaces and the need to tailor the learning method to each. We design a multi-stage mixed learning model, a combination of quadratic programming and random forest classifiers, for optimizing the parameters of the scoring functions used in our Fast Fourier based protein-protein docking software F2Dock [6, 10]. Several comparative learning-based results, achieved by both linear learning models based on quadratic programming and non-linear learning models based on random forest classifiers, are presented.


Figure 1: High-level overview of rigid-body protein-protein docking using F2Dock and GB-rerank.


2 Overview of F2Dock

Figure 1 gives a high-level overview of F2Dock. Given two proteins A and B with M_A and M_B atoms respectively, where M_A ≥ M_B, i.e., A is the larger of the two proteins, we refer to A as the "receptor" and B as the "ligand". The algorithm proceeds in the separate phases described below.

2.1 Phase I (Exhaustive 6D Scoring and Search with FFT):

F2Dock exhaustively searches over a discretized SO(3) x R^3 space. First, it samples the rotational space SO(3) uniformly [19] and applies each rotation to the ligand after placing its mass center at the origin. Then it scores each relative position of the rotated ligand w.r.t. the stationary receptor over the set of translations in R^3. FFT (Fast Fourier Transform) based 3D convolution is used to compute scores for all points on the uniform 3D translational grid. The current version uses uniform FFT, but exploits the sparsity of the FFT grids for faster execution, and also restricts its search to a narrow band around the larger molecule. The top several thousand poses from this FFT-based scoring phase are inserted into a priority queue sorted by score.
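For concreteness, the following is a minimal sketch of this FFT-based translational scan in Python/NumPy. It is not F2Dock's implementation: the rasterize callback, the grid handling, and the top-k bookkeeping are illustrative placeholders, and the sparsity and narrow-band optimizations mentioned above are omitted.

import numpy as np

def fft_translational_scan(grid_receptor, grid_ligand):
    # Correlation theorem: IFFT(FFT(f) * conj(FFT(g))) evaluates the correlation
    # of the two affinity grids at every cyclic translation of the ligand grid.
    F = np.fft.fftn(grid_receptor)
    G = np.fft.fftn(grid_ligand)
    return np.fft.ifftn(F * np.conj(G))

def exhaustive_search(grid_receptor, ligand_atoms, rotations, rasterize, top_k=2000):
    # rotations: uniformly sampled 3x3 rotation matrices; rasterize: a user-supplied
    # callback (placeholder) that maps rotated atom coordinates to an affinity grid.
    best = []
    for R in rotations:
        rotated = ligand_atoms @ R.T          # ligand pre-centered at its mass center
        scores = fft_translational_scan(grid_receptor, rasterize(rotated)).real
        idx = np.argpartition(scores, -top_k, axis=None)[-top_k:]
        for flat in idx:
            t = np.unravel_index(flat, scores.shape)
            best.append((float(scores[t]), R, t))  # (score, rotation, translation)
    best.sort(key=lambda entry: -entry[0])         # priority queue sorted by score
    return best[:top_k]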

The scoring function is a weighted combination of shape complementarity, electrostatics, and interface propensity (or hydrophobicity) based affinity terms. We briefly introduce the terms here (refer to [6, 10] for further details).

2.1.1 Shape Complementarity:

Shape complementarity was originally introduced to model the lock-and-key matching idea of docking, i.e., the proteins have complementary shapes at the binding interface. Energetically, this models the van der Waals interaction to some extent.

F2Dock uses an improved double-skin layer approach to shape complementarity. Atoms of the ligand which are exposed to the solvent are considered skin atoms, and the remaining atoms are considered core. All atoms of the receptor are core, and a layer of atoms is added outside the solvent excluded surface (SES) of the receptor; these extra atoms are considered the skin of the receptor. The shape complementarity function is designed to maximize skin-skin overlaps and minimize skin-core and especially core-core overlaps, since skin-skin overlap indicates that the ligand is coming close to the receptor without penetrating it. This can be achieved numerically by assigning positive real affinity values to the skin atoms and positive imaginary affinity values to the core atoms. Hence, a convolution generates positive real contributions S_SS from skin-skin overlaps, negative real contributions S_CC from core-core overlaps, and positive imaginary contributions S_SC from skin-core overlaps. The complete shape complementarity score is defined as a weighted combination of these terms:


$$ S_{shape} = W_{SS}\,S_{SS} + W_{CC}\,S_{CC} + W_{SC}\,S_{SC}, $$

where W_SS, W_CC and W_SC control the relative importance of the different terms. For example, a high W_CC would heavily penalize any penetrations.

F2Dock improves on the above double-skin idea by further refining the affinity values assigned to the atoms. First, F2Dock uses a depth-based scheme for assigning affinity values to the core atoms, which penalizes deeper penetrations more than shallow ones. Second, F2Dock assigns the affinity values of skin atoms based on curvature, to promote binding near pockets and mouths. Finally, the receptor's skin atoms do not touch its SES, but are placed a small distance away to define a 'floating' skin. This mimics the fact that in most complexes the SESs of the ligand and receptor do not actually touch, but rather stay at a very close distance.
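As a small illustration of the double-skin encoding, the sketch below assigns complex affinities (real for skin, imaginary for core) and evaluates the grid-point-wise product for a single pose. The depth and curvature scalings are placeholders, and F2Dock evaluates these products for all translations at once via FFT rather than pose by pose.

import numpy as np

def complex_affinities(is_skin, depth, curvature, w_skin=1.0, w_core=1.0):
    # Skin atoms: positive real affinity (here scaled by a curvature factor).
    # Core atoms: positive imaginary affinity (here scaled by burial depth).
    a = np.where(is_skin, w_skin * curvature, 0.0).astype(complex)
    a += 1j * np.where(is_skin, 0.0, w_core * depth)
    return a

def shape_terms(grid_A, grid_B):
    # Grid-point-wise product of the two complex affinity grids for one pose:
    # skin*skin adds to the real part (+), core*core subtracts from it (i*i = -1),
    # and skin*core lands in the imaginary part.
    s = np.sum(grid_A * grid_B)
    return s.real, s.imag      # (S_SS - S_CC, S_SC)

With the affinities rasterized onto grids, a single complex FFT convolution yields these summed products for every relative shift simultaneously.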

2.1.2 Electrostatics:

F2Dock models long-distance electrostatic interactions using the simplified model for Coulombic electrostatics proposed by Gabb et al. [12], which allows efficient FFT-based computation of the term during the docking search. Two affinity functions f_A^E and f_B^E are defined for molecules A and B, respectively:

$$ f^E_A(x) = \sum_{k \in A} \frac{q_k}{E(x - c_k)\,\|x - c_k\|}\; g^E_{A,k}(x) \qquad\text{and}\qquad f^E_B(x) = \sum_{k \in B} q_k\, \delta(x - c_k)\; g^E_{B,k}(x), $$

where q_k is the Coulombic charge on atom k (charge assignments are made using PDB2PQR [2]), δ(x) is the Kronecker delta function with value 1 at ||x|| = 0 and 0 everywhere else, g^E_{A,k}(x) = g^E_{B,k}(x) = 1, and E(x) is the distance-dependent dielectric constant [12]. The convolution of the two affinity functions produces the electrostatic score S_elec.

2.1.3 Interface Propensity or Hydrophobicity:

Unlike the shape complementarity (vdW) and electrostatics (Coulombic) terms, which are based on molecular free energies, the interface propensity or hydrophobicity term is based entirely on statistical and empirical observations. It has been observed that hydrophobic residues tend to be found near the core of a molecule, while hydrophilic residues are found on the surface. However, if there are hydrophobic residues on the surface of a protein, they tend to act as binding sites so that they can get buried by forming a complex with another protein. Based on this idea, we reward docking poses where the binding interface contains hydrophobic residues. We use per-residue hydrophobicity values from [8] to define weights on the surface atoms and then compute the convolution using FFT. Refer to [10] for details about the formulation and FFT adaptation.

However, other factors also affect the propensity of a residue to be on the binding interface. Jones and Thornton [17, 16] studied 63 protein-protein interfaces and computed an interface propensity value IP for each residue type. The interface propensity is defined as the log-normalized probability of a residue being on the binding interface given that it is present in the molecule. The IP values for the 20 amino acid residues lie between -0.38 (for ASP) and 0.83 (for TRP). A residue with a higher IP value is likely to occur more frequently in a protein-protein interface than one with a lower IP value. After assigning the IP values to atoms (based on their residues), we follow the same technique used for hydrophobicity to compute a score S_IP for a given pose.

In F2Dock, we treat interface propensity and hydrophobicity as alternate ways of modeling the same phenomenon, and the user has the option to select either one.

2.1.4 Overall score of Phase I

The overall score of Phase I is

$$ S_{phaseI} = S_{shape} W_{shape} + S_{elec} W_{elec} + S_{IP} W_{IP}, $$

where W_shape, W_elec and W_IP control the relative importance of the different scoring terms. We have observed that the contributions of the terms vary significantly for different classes of complexes (Enzymes, Antibodies, Others, etc.), and hence the weights must be learned separately for each class.



2.2 Phase II (Bioinformatic scoring terms computed by fast multipole methods):

In this phase, the top several thousand poses, inserted into the priority queue in Phase I based on their S_phaseI, are evaluated using several statistical scoring terms to prune away false positives. Each term rewards or penalizes a pose by updating its score. When all the terms have been applied, the final updated scores are used to re-rank the poses. The terms are computed using fast multipole type recursive spatial decomposition techniques [11].

We briefly introduce these terms below. Refer to [10, 11] for details.

2.2.1 Proximity Clustering

The poses in the priority queue are examined for structural similarity based on how close the geometric centers of B are to each other (note that A is kept static). If a structurally similar docking pose with a better score exists, then the docking pose with the lower score is further penalized by reducing its score.
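A brute-force sketch of this penalization is given below; the distance radius and penalty factor are illustrative placeholders, not the values used by F2Dock, and the scores are assumed to be positive.

import numpy as np

def proximity_penalty(centers, scores, radius=5.0, factor=0.5):
    # centers: (n, 3) ligand geometric centers; scores assumed positive.
    # radius (angstroms) and factor are illustrative placeholders.
    centers = np.asarray(centers, dtype=float)
    out = np.asarray(scores, dtype=float).copy()
    order = np.argsort(-out)                  # best-scoring pose first
    for rank, i in enumerate(order):
        for j in order[:rank]:                # every better-scoring pose
            if np.linalg.norm(centers[i] - centers[j]) < radius:
                out[i] *= factor              # penalize the lower-scoring duplicate
                break
    return out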

2.2.2 Lennard-Jones:

Penalizes a docking pose if its Lennard-Jones potential is above a threshold. We approximate the Lennard-Jones (LJ) potential between molecules A and B_{t,r} by the following expression:

$$ LJ(A, B_{t,r}) = \sum_{i \in A,\, j \in B_{t,r}} \left( \frac{a_{ij}}{r_{ij}^{12}} - \frac{b_{ij}}{r_{ij}^{6}} \right), $$

where r_ij is the distance between atoms i in A and j in B_{t,r}, and the constants a_ij and b_ij depend on the types (e.g., C, H, O, etc.) of the two atoms involved. For any fixed pair of atom types, a_ij and b_ij are fixed and are calculated from the Amber force field.
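A direct, vectorized evaluation of this sum is shown below; here a and b are assumed to be precomputed per-pair coefficient matrices (looked up from the Amber tables by atom type), which is an illustrative simplification.

import numpy as np

def lennard_jones(coords_A, coords_B, a, b):
    # coords_A: (m, 3), coords_B: (n, 3); a, b: (m, n) per-pair coefficients,
    # assumed to have been looked up from the Amber tables by atom type.
    diff = coords_A[:, None, :] - coords_B[None, :, :]
    r = np.linalg.norm(diff, axis=-1)         # pairwise distances r_ij
    r6 = r ** 6
    return float(np.sum(a / (r6 * r6) - b / r6))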

2.2.3 Steric Clash:

Penalizes all docking poses whose number of steric (atom-atom) collisions is above a threshold. Two atoms a in A and b in B with van der Waals radii r_a and r_b, respectively, are said to be in a clash if the distance between their centers is smaller than α(r_a + r_b), where α is a user-defined positive constant.
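A short sketch of the clash count, using a k-d tree to limit the pairwise distance checks; the value of α shown is only a placeholder for the user-defined constant.

import numpy as np
from scipy.spatial import cKDTree

def count_clashes(coords_A, radii_A, coords_B, radii_B, alpha=0.6):
    # Count atom pairs (a, b) with ||c_a - c_b|| < alpha * (r_a + r_b);
    # alpha = 0.6 is only a placeholder for the user-defined constant.
    tree = cKDTree(coords_B)
    search_r = alpha * (np.max(radii_A) + np.max(radii_B))   # conservative radius
    clashes = 0
    for c_a, r_a in zip(coords_A, radii_A):
        for j in tree.query_ball_point(c_a, search_r):
            if np.linalg.norm(c_a - coords_B[j]) < alpha * (r_a + radii_B[j]):
                clashes += 1
    return clashes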

2.2.4 Interface Area (Dispersion):

Penalizes a docking pose if the interface area is outside an acceptable range. The interface area is computed by defining a smooth surface representation (triangulated mesh) of the proteins. Gaussian quadrature points are then sampled on the triangles such that the weight of a quadrature point corresponds to the area of the triangle supporting it. Hence, the interface area is the sum of the weights of the quadrature points which are on the interface (i.e., have a neighboring quadrature point on the other surface within a distance threshold).
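The sketch below mirrors this definition for precomputed quadrature points and weights on the two surfaces; the 1.5 Å proximity cutoff is an assumed placeholder for the distance threshold.

import numpy as np
from scipy.spatial import cKDTree

def interface_area(pts_A, wts_A, pts_B, wts_B, cutoff=1.5):
    # pts_*: quadrature points sampled on each triangulated surface;
    # wts_*: their weights (~ areas of the supporting triangles).
    # cutoff is an assumed placeholder for the distance threshold.
    wts_A, wts_B = np.asarray(wts_A, dtype=float), np.asarray(wts_B, dtype=float)
    tree_A, tree_B = cKDTree(pts_A), cKDTree(pts_B)
    on_iface_A = np.array([len(tree_B.query_ball_point(p, cutoff)) > 0 for p in pts_A])
    on_iface_B = np.array([len(tree_A.query_ball_point(p, cutoff)) > 0 for p in pts_B])
    return wts_A[on_iface_A].sum() + wts_B[on_iface_B].sum()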

2.2.5 Interface Propensity or Hydrophobicity:

Using statistical information from [17], this term computes a score for each pose which, at a high level, can be viewed as the ratio of the interface area of the pose corresponding to residues that typically appear with high frequency in protein interfaces to the interface area corresponding to residues that appear with low frequency. A docking pose is penalized if this ratio is below a threshold. Alternatively, hydrophobicity values from [8] can be used.

2.2.6 Residue-Residue Contact:

It was observed in [13] that large hydrophobic residue pairs typically have high contact preferences, while the smallest contact preferences were observed between pairs of residues that are small in size. Interfaces do not seem to favor contacts between hydrophobic and polar residues, or between charged residues that do not have charge complementarity. F2Dock uses the pairwise contact preference values listed in either Table III (without volume normalization) or Table IV (normalized w.r.t. residue volumes) of [13]. This term penalizes a pose if the sum of residue-residue contact values for the pose falls below a threshold. Two residues are considered to be in contact if the distance between their Cβ atoms (Cα for Gly) is less than 6 Å.
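A direct sketch of this term, assuming the Cβ (Cα for Gly) coordinates and residue types have been extracted and the pairwise preference table from [13] is available as a dictionary:

import numpy as np

def contact_score(cb_A, res_A, cb_B, res_B, preference, cutoff=6.0):
    # cb_*: C-beta (C-alpha for Gly) coordinates; res_*: residue type names.
    # preference: dict mapping a residue-type pair to its tabulated value from [13].
    total = 0.0
    for xa, ra in zip(cb_A, res_A):
        for xb, rb in zip(cb_B, res_B):
            if np.linalg.norm(np.asarray(xa) - np.asarray(xb)) < cutoff:
                total += preference[(ra, rb)]
    return total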


2.2.7 Antibody-Antigen Contact:

This term uses statistical information on antibody-antigen contact preferences derived in [18, 1]. It is based on the observation that in each antibody, each of the following three regions makes at least one antigen contact: (1) either CDR-L1 or CDR-H1, (2) CDR-L3, and (3) CDR-H3.

2.2.8 Glycine Richness:

This term exploits the observation that enzyme active sites are rich in glycines, particularly G-X-Y and Y-X-G oligopeptides, where X and Y are polar and non-polar residues, respectively, and G is glycine [28].

2.3 Phase III (Solvation Energy Based Reranking):

The ranked docking poses obtained from Phase I are re-scored and re-ranked based on the change in solvation energy caused by each pose. The polar part of the solvation energy is approximated using the surface-based formulation of Generalized Born (GB) energy [7], implemented using a fast octree-based approximation scheme which we describe in detail in [11]. Among the non-polar parts, the dispersion energy is also approximated using octrees, while the cavity-forming energy is approximated by computing an approximate interface area of the two molecules using our fast linear-space Dynamic Packing Grid (DPG) data structure described in [5].

3 Quadratic Programming based Parameter Optimization

The quadratic programming approach aims to maximize the separation between the scores of correct and incorrect solutions. For a transformation τ, recall the scoring function

$$ E(\tau) = w^T \cdot P_\tau. $$

The parameter vector we need to train is

$$ w = [\,w_{ss},\, w_{sc},\, w_{cc},\, w_{elec},\, w_{hbond},\, w_{ip}\,], $$

and P_τ is the corresponding feature vector for a given transformation τ. The training data (input) of the algorithm is a set of n receptor-ligand pairs X = {X_1, X_2, ..., X_n}, where X_i = {X_{i1}, X_{i2}}, together with their corresponding correct transformations τ = {τ_1, τ_2, ..., τ_n}. The algorithm input also includes a constant C, a tolerated error v, and the size ε of a region in output space.

At the beginning of the algorithm, we sample a set of clusters Γ_i^0 = {G_1, G_2, ...} of incorrect transformations for each receptor-ligand pair (X_{i1}, X_{i2}). In this definition, Γ_i is a set of clusters, G_k is a cluster, and τ^j is a transformation. Then we compute the set of constraints for each set of clusters such that

$$ S_i \leftarrow \Bigl\{\, \forall G_k \in \Gamma_i^0\ \ \forall \tau_i^j \in G_k :\ \ w^T \bigl(P_{\tau_i} - P_{\tau_i^j}\bigr) \;\ge\; 1 - \frac{\delta_{ik}}{\Delta(\tau_i, \tau_i^j)} \,\Bigr\}. $$

In this step, we compute a slack variable δ_{ik} for each cluster G_k so that the score margin between the correct transformation τ_i and every transformation τ_i^j in G_k cannot fall below the threshold 1 - δ_{ik}/Δ(τ_i, τ_i^j), where Δ(τ_i, τ_i^j) is the RMSD between τ_i and τ_i^j. We solve the QP

$$ (w, \delta) = \arg\min_{w,\delta}\ \ \frac{1}{2}\,\|w\|^2 \;+\; \frac{C}{n} \sum_{i,k} \delta_{ik} $$

to get the minimum value of each slack variable δ_{ik}, which maximizes the separation between the scores of correct and incorrect transformations. At this point we have assigned initial values to the variables that will be updated during the iterations:

• Γ_i^0: Initial set of clusters

• w: Initial values for parameter vectors


• δ: Initial values for slack variables

Then we begin the iterations. In the α-th iteration, we find the set of top violated transformations T_i^α in each set of clusters Γ_i^{α-1} and re-cluster them. Each cluster G_k in T_i^α is used to generate the k-th cluster Γ_{ik}^α of the new set of clusters Γ_i^α. The generating process is as follows. For each transformation τ_i^j in G_k, if it is possible for it to violate the constraint for the k-th cluster, i.e.,

$$ \Delta(\tau_i, \tau_i^j)\,\Bigl(1 - w^T \bigl(P_{\tau_i} - P_{\tau_i^j}\bigr)\Bigr) \;>\; 0, $$

we determine whether to add this transformation to the new cluster Γ_{ik}^α by the following two criteria:

• If the k-th cluster exists in the set Γ_i^{α-1} but transformation τ_i^j violates the slack variable δ_{ik}, i.e.,

$$ w^T \bigl(P_{\tau_i} - P_{\tau_i^j}\bigr) \;<\; 1 - \frac{\delta_{ik} + v}{\Delta(\tau_i, \tau_i^j)}, $$

then add τ_i^j to Γ_{ik}^α.

• If the k-th cluster does not exist in the set Γ_i^{α-1}, add τ_i^j to Γ_{ik}^α.

If either criterion is met, we add the transformation to Γ_{ik}^α. After all the new clusters are generated, we have the new set of clusters Γ_i^α. As at the beginning, we set a constraint with slack δ_{ik} for each cluster G_k in Γ_i^α such that

$$ S_i \leftarrow \Bigl\{\, \forall G_k \in \Gamma_i^\alpha\ \ \forall \tau_i^j \in G_k :\ \ w^T \bigl(P_{\tau_i} - P_{\tau_i^j}\bigr) \;\ge\; 1 - \frac{\delta_{ik}}{\Delta(\tau_i, \tau_i^j)} \,\Bigr\} $$

and solve the QP

$$ (w, \delta) = \arg\min_{w,\delta}\ \ \frac{1}{2}\,\|w\|^2 \;+\; \frac{C}{n} \sum_{i,k} \delta_{ik} $$

again to update w and δ_{ik}. At this point the three variables Γ, w and δ have all been updated, and we use them to begin iteration α + 1. The iterations terminate when no new constraint is generated during an iteration, and the w at that time is our final output.
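The report does not name a specific QP solver. As one way to pose the inner QP of each iteration, the sketch below uses CVXPY over the currently active constraints; the data layout (feature_diffs, rmsds, cluster_ids) is an assumption made for illustration, and non-negativity of the slacks is assumed.

import cvxpy as cp

def solve_inner_qp(feature_diffs, rmsds, cluster_ids, n_pairs, C=1.0):
    # feature_diffs[m] = P_{tau_i} - P_{tau_i^j} for the m-th active constraint,
    # rmsds[m] = Delta(tau_i, tau_i^j), and cluster_ids[m] indexes the slack
    # variable delta_{ik} shared by that constraint's cluster.
    d = feature_diffs.shape[1]
    n_slacks = int(cluster_ids.max()) + 1
    w = cp.Variable(d)
    delta = cp.Variable(n_slacks, nonneg=True)   # slacks assumed non-negative
    constraints = [
        feature_diffs[m] @ w >= 1 - delta[cluster_ids[m]] / rmsds[m]
        for m in range(feature_diffs.shape[0])
    ]
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / n_pairs) * cp.sum(delta))
    cp.Problem(objective, constraints).solve()
    return w.value, delta.value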

4 Random Forest Classifier

The random forest classifier [9, 26] consists of multiple decision tree classifiers, each of which satisfies the conditions below:

• Nodes are constructed from a subset of data. Root node contains all data. Each data item is a vector.

• At each node, search through all variables to find the best split into two child nodes.

• Split all the way down, then prune the tree back up to get minimal test error.

The generation process of the random forest is as follows:

• The root node contains a bootstrap sample of the data of the same size as the original data; each tree has a different bootstrap sample.

• At each node, a random subset of the predictor variables is selected, and the best split is found by searching over these variables. The search algorithm is classification and regression tree (CART) search with the Gini criterion. The Gini criterion is a measure of the statistical dispersion of a set of data points. In the CART algorithm, each split depends on only one predictor value, i.e., one dimension of the training sample. For a training set with k training samples, there are k possible splits in each dimension.


• Finding each dimension's best split: sort the values of this dimension from smallest to largest. Then go through the sorted values and examine each candidate split point v (if x ≤ v, the case goes to the left child node; otherwise it goes to the right) to determine the best. The best split point is the one that maximizes the splitting criterion when the node is split according to it.

• The Gini criterion for node t is defined as

$$ \Delta i(s, t) = i(t) - P_L\, i(t_L) - P_R\, i(t_R), \qquad\text{where}\quad i(t) = 1 - \sum_j p^2(j \mid t). $$

The Gini criterion describes the purity of the node, and the split that maximizes Δi(s, t) is chosen (a minimal best-split sketch is given after this list). Note that for n k-dimensional examples there are nk possible splits.

• Normally a large number of trees (> 100) are generated and the final prediction result is the average of the votes of all trees.
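The best-split sketch referenced above, for a single predictor dimension; it is a generic CART/Gini illustration rather than the exact tree-growing code used in the experiments.

import numpy as np

def gini(labels):
    # Gini impurity i(t) = 1 - sum_j p(j|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split_1d(x, y):
    # Best split along one predictor: maximize
    # delta_i(s, t) = i(t) - P_L * i(t_L) - P_R * i(t_R) over candidate thresholds.
    x, y = np.asarray(x), np.asarray(y)
    parent = gini(y)
    best_gain, best_v = -np.inf, None
    for v in np.unique(x)[:-1]:               # every unique value except the largest
        left, right = y[x <= v], y[x > v]
        p_left = len(left) / len(y)
        gain = parent - p_left * gini(left) - (1 - p_left) * gini(right)
        if gain > best_gain:
            best_gain, best_v = gain, v
    return best_v, best_gain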

For a docking exercise, we represent each relative pose by a vector containing the score for each individual term of the multi-term scoring model. These score vectors are used as the data items for the random forest. During the training step, we label score vectors corresponding to transformations with RMSD smaller than 5 Å as positive training samples, and all transformations with RMSD greater than 20 Å as negative. We used 1342 positive training samples and 214,532 negative training samples in total, drawn from all complexes. The number of decision trees in the forest is set to 500. Given a test pose, we simply compute the rf-score as the average of the votes of all trees, with the corresponding score vector used as input. Finally, for each complex, we take the originally ranked results (top 10,000) and use their rf-scores for re-ranking.
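A sketch of this training and re-ranking step using scikit-learn's RandomForestClassifier (the report does not state which random forest implementation was used); the labeling of poses as positive/negative samples is assumed to have been done beforehand.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_rf(train_scores, train_labels, n_trees=500, seed=0):
    # train_scores: one row per pose, one column per scoring term;
    # train_labels: 1 for near-native (low-RMSD) poses, 0 for decoys.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(train_scores, train_labels)
    return rf

def rerank(rf, pose_scores):
    # rf-score: averaged tree predictions for the positive class
    # (column 1 of predict_proba, assuming 0/1 labels); rerank best first.
    rf_scores = rf.predict_proba(pose_scores)[:, 1]
    order = np.argsort(-rf_scores)
    return order, rf_scores[order]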

5 Results

First we compare the total number of hits found by the untrained scoring model with the quadratic programming and random forest based learning models (Figures 2 to 4). For the different types of complexes, we compare the number of hits found within different ranges of the ranked solutions. We see that both learning methods increase the total number of hits (compared with the baseline untrained method) for each complex type, and for almost all of the intervals/ranges. However, they slightly underperform for the 'Other' type of complexes, where they fail to push the ranks of good solutions above 10; hence, even though they find more hits in total, they compare unfavorably in the first three intervals shown in Figure 4. We believe this is due to the high level of variability in the set of complexes of the 'Other' type. Next, we show a complex-wise breakdown of the total number of hits found in the top 1000 results in Figure 5. This figure shows that the learning methods improve the results for most of the complexes. It also shows a few exceptions and tradeoffs. For example, quadratic programming fails to find any hits for 1FFW and 2I25, for which the untrained version found 1 hit; on the other hand, quadratic programming finds a hit for 2ABZ and 2FD6, for which none of the other methods found any hits. Random forest always improves on or maintains similar results to the untrained model, except for 1HCF, 1KXQ and 1B6C.

However, finding more hits is not the only objective; we also need to ensure that the hits are ranked highly. Therefore, we have plotted ROC curves [25] for the quadratic programming and random forest approaches (Figures 6 to 8). These curves plot the fraction of hits detected within different fractions of the top-ranked solutions. We can see that the learning approaches significantly improve the ranks of the correct results. For example, for the antibodies, we need to consider the top 17%, 15% and 7% of the results to find 50% of the hits if the untrained, quadratic programming and random forest based scoring models are used, respectively. Hence, random forest ranks the correct results highest. Random forest and quadratic programming offer similar improvements for the 'Other' type of complexes. However, the quadratic programming approach does not perform as well for the Enzymes.
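For reference, a small helper that computes the quantity plotted in Figures 6 to 8 (the fraction of all hits found within each top fraction of the ranked list), under the assumption that the ranked list is given as a boolean hit/miss sequence:

import numpy as np

def hits_vs_fraction(ranked_is_hit, fractions=np.linspace(0.01, 1.0, 100)):
    # ranked_is_hit: boolean sequence, ordered by rank (best first), True for hits.
    ranked_is_hit = np.asarray(ranked_is_hit, dtype=bool)
    total_hits = int(ranked_is_hit.sum())
    if total_hits == 0:
        return [(float(f), 0.0) for f in fractions]
    n = len(ranked_is_hit)
    return [(float(f), ranked_is_hit[: max(1, int(f * n))].sum() / total_hits)
            for f in fractions]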


Figure 2: Total number of hits in range for antibody complexes.

Figure 3: Total number of hits in range for enzyme complexes.

Figure 4: Total number of hits in range for other complexes.

6 Conclusion

We have applied two supervised machine learning techniques to optimize the prediction accuracy of scoring functions used in molecular docking, and in particular the scoring function parameters of our F2Dock program. In most cases we achieved significant prediction improvements in our docking results. We measured the improvement on different types of complexes and present the hit charts and ROC curves. From these experimental results, we conclude that learning techniques can help optimize scoring functions and refine the scoring-based ranking of predicted complexes. We also compared the two machine learning techniques we applied, namely the linear weighted summation learning model (using quadratic programming) and the non-linear learning model (using random forest classifiers). From the experiments we conclude that the non-linear model performed better, as it combined the learning of the parameters used in weighting multiple scoring terms with the learning of the parameters of the scoring and ranking filters. For future work, we propose to analyze additional machine learning algorithms, including non-linear SVM and genetic programming.




References

[1] Antibody-antigen contacts. http://www.bioinf.org.uk/abs/allContacts.html.

[2] PDB2PQR: An automated pipeline for the setup, execution, and analysis of Poisson-Boltzmann electrostatics calculations. http://pdb2pqr.sourceforge.net/.

[3] C. D. Andersson, E. Thysell, A. Lindström, M. Bylesjö, F. Raubacher, and A. Linusson. A multivariate approach to investigate docking parameters' effects on docking performance. Journal of Chemical Information and Modeling, 47(4):1673–1687, 2007.

[4] I. Antes, C. Merkwirth, and T. Lengauer. POEM: Parameter optimization using ensemble methods: application to target specific scoring functions. Journal of Chemical Information and Modeling, 45(5):1291–1302, 2005.

[5] C. Bajaj, R. A. Chowdhury, and M. Rasheed. A dynamic data structure for flexible molecular maintenance and informatics. Bioinformatics, 27(1):55–62, 2010.

[6] C. Bajaj, R. A. Chowdhury, and V. Siddavanahalli. F2Dock: Fast Fourier protein-protein docking. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1):45–58, 2011.

[7] C. Bajaj and W. Zhao. Fast molecular solvation energetics and forces computation. SIAM Journal on Scientific Computing, 31(6):4524–4552, 2010.

[8] S. Black and D. Mould. Development of hydrophobicity parameters to analyze proteins which bear post- or cotranslational modifications. Analytical Biochemistry, 193:72–82, 1991.

[9] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[10] R. Chowdhury, M. Rasheed, D. Keidel, M. Moussalem, A. Olson, M. Sanner, and C. Bajaj. Protein-protein docking with F2Dock 2.0 and GB-rerank. PLOS ONE, under review, 2012.


Figure 5: Number of hits in the top 1000 results for each complex. For 5 complexes, the learning methods find a hit even when the untrained version found no hits at all. However, quadratic programming loses all hits for 3 complexes where the untrained scoring model finds some hits. Though the random forest learning sometimes gets fewer total hits than the untrained model, it improves the results in general.

Figure 6: ROC curve for Antibody complexes. The curve plots the fraction of correct results found within different fractions of all results. The results indicate that Random Forest learning finds most of the correct solutions near the top ranks: it finds almost 50% of all correct solutions within the top 5% of all results it reports.

[11] R. A. Chowdhury and C. Bajaj. Algorithms for faster molecular energetics, forces and interfaces. ICES Report 10-32, Institute for Computational Engineering & Sciences, The University of Texas at Austin, August 2010.

[12] H. A. Gabb, R. M. Jackson, and M. J. E. Sternberg. Modelling protein docking using shape complementarity, electrostatics and biochemical information. Journal of Molecular Biology, 272(1):106–120, 1997.


Figure 7: ROC curve for Enzyme complexes. Again, random forest offers the most improvement.

Figure 8: ROC curve for Other complexes. The learning methods do not seem to offer much improvement, and actually have slightly lower accuracy than the untrained scoring function near the top of the ranks.


[13] F. Glaser, D. M. Steinberg, I. A. Vakser, and N. Ben-Tal. Residue frequencies and pairing preferences at protein-protein interfaces. PROTEINS: Structure, Function, and Genetics, 43:89–102, 2001.

[14] R. Gozalbes, L. Simon, N. Froloff, E. Sartori, C. Monteils, and R. Baudelle. Development and experimental validation of a docking strategy for the generation of kinase-targeted libraries. Journal of Medicinal Chemistry, 51(11):3124–3132, 2008.

[15] C. Hetényi, G. Paragi, U. Maran, Z. Timár, M. Karelson, and B. Penke. Combination of a modified scoring function with two-dimensional descriptors for calculation of binding affinities of bulky, flexible ligands to proteins. Journal of the American Chemical Society, 128(4):1233–1239, 2006.


[16] S. Jones and J. M. Thornton. Principles of protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America, 93(1):13–20, 1996.

[17] S. Jones and J. M. Thornton. Analysis of protein-protein interaction sites using surface patches. Journal of Molecular Biology, 272(1):121–132, 1997.

[18] R. MacCallum, A. Martin, and J. Thornton. Antibody-antigen interactions: contact analysis and binding site topography. Journal of Molecular Biology, 262(5):732–745, 1996.

[19] J. C. Mitchell. Personal communication, University of Wisconsin-Madison.

[20] J. Nocedal and S. Wright. Numerical Optimization. Springer, 2nd edition, 2006.

[21] T. A. Pham and A. N. Jain. Parameter estimation for scoring protein-ligand interactions using negative training data. Journal of Medicinal Chemistry, 49(20):5856–5868, 2006.

[22] D. V. S. Ravikant and R. Elber. PIE: efficient filters and coarse grained potentials for unbound protein-protein docking. Proteins, 78(2):400–419, 2010.

[23] M. H. J. Seifert. Optimizing the signal-to-noise ratio of scoring functions for protein-ligand docking. Journal of Chemical Information and Modeling, 48(3):602–612, 2008.

[24] R. Smith, R. E. Hubbard, D. A. Gschwend, A. R. Leach, and A. C. Good. Analysis and optimization of structure-based virtual screening protocols. (3). New methods and old problems in scoring function design. Journal of Molecular Graphics and Modelling, 22(1):41–53, 2003.

[25] K. A. Spackman. Signal detection theory: valuable tools for evaluating inductive learning. In A. M. Segre, editor, Proceedings of the Sixth International Workshop on Machine Learning, pages 160–163. Morgan Kaufmann Publishers Inc., 1989.

[26] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947–1958, 2003.

[27] R. Teramoto and H. Fukunishi. Structure-based virtual screening with supervised consensus scoring: evaluation of pose prediction and enrichment factors. Journal of Chemical Information and Modeling, 48(4):747–754, 2008.

[28] B. Yan and Y. Sun. Glycine residues provide flexibility for enzyme active sites. Journal of Biological Chemistry, 272(6):3190, 1997.

[29] Y. D. Yang, C. Park, and D. Kihara. Threading without optimizing weighting factors for scoring function. Proteins, 73(3):581–596, 2008.
