Supplementary Information for “FRAGSION: ultra-fast protein fragment library generation by IOHMM sampling”

Debswapna Bhattacharya1, Badri Adhikari1, Jilong Li1 and Jianlin Cheng1, 2, 3, *

1Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
2Informatics Institute, University of Missouri, Columbia, MO 65211, USA
3Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA

*To whom correspondence should be addressed. Phone: (573)-882-7306. Fax: (573)-882-8318. E-mail: [email protected].

Supplementary Item Title

Supplementary Methods

Supplementary Results

Supplementary Figure 1 Architecture of Input-Output Hidden Markov Model (IOHMM).

Supplementary Figure 2 Training and optimal model selection.

Supplementary Figure 3 Density of TM-score and RMSD of the FRAGSION and ROSETTA models.

Supplementary Figure 4 Target by target comparison between FRAGSION and ROSETTA in terms of precision.

Supplementary Figure 5 Target by target comparison between FRAGSION and ROSETTA in terms of coverage.

Supplementary Figure 6 Target by target comparison between FRAGSION and ROSETTA in terms of RMSD.

Supplementary Figure 7 Target by target comparison between FRAGSION and ROSETTA in terms of computation time.

Supplementary Table 1 Template Free Modeling (FM) targets for CASP 11 experiment.

Supplementary Table 2 Mean and standard deviation of TM-score and RMSD of the FRAGSION and ROSETTA models.

Supplementary Table 3 Highest TM-score and lowest RMSD for each target by FRAGSION and ROSETTA.

Supplementary Table 4 TM-score and RMSD of the lowest energy model for each target by FRAGSION and ROSETTA.

Supplementary Methods

Description of IOHMM

FRAGSION is built on our recently proposed Input-Output Hidden Markov Model (IOHMM) (Bhattacharya and Cheng, 2015). The model captures sequential dependencies between the sequence space (input) and the structural space (output) of proteins through a Markov chain of hidden states. In each slice, as shown in Supplementary Fig. 1, an input node (A) captures the sequence space. It represents eight groups of residues with distinct structural behavior, selected from the twenty standard residue types as previously found through analysis of high-resolution experimental structures (Karplus, 1996; Lovell, et al., 2003). These eight classes are: (1) glycines not preceding prolines, (2) prolines not preceding prolines, (3) β-branched amino acid residues, isoleucines and valines, not preceding prolines, (4) all amino acids except glycines, prolines, isoleucines, and valines not preceding prolines, (5) glycines preceding prolines, (6) prolines preceding prolines, (7) β-branched residues isoleucines and valines preceding prolines, and (8) all amino acids except glycines, prolines, isoleucines, and valines preceding prolines. Connections between the input nodes represent the transition probabilities between residues along the protein chain. Output (i.e., emission) nodes correspond to the structural space, modeled using secondary structure (S), dihedral angle pair (D: ϕ, ψ), and peptide bond conformation (P: ω). The secondary structure node (S) is a discrete node that can assume three states (helix, strand, and coil). We model backbone torsion angle pairs (ϕ, ψ) using mixtures of bivariate von Mises distributions (Mardia, et al., 2007) and the ω dihedral angle of the peptide bond using mixtures of univariate von Mises distributions (Mardia and Jupp, 2009). The output emission nodes can be flagged as observed or hidden for a specific sequence position.
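The eight input classes can be illustrated with a minimal Python sketch. The function names `residue_class` and `encode` are hypothetical and not part of FRAGSION; the sketch also assumes that the last residue of a chain counts as "not preceding proline", which the text does not specify.

```python
def residue_class(sequence: str, i: int) -> int:
    """Return the input class (1-8) of residue i in a one-letter sequence."""
    aa = sequence[i]
    # A residue "precedes proline" when the next residue in the chain is P.
    # Assumption: the C-terminal residue is treated as not preceding proline.
    before_pro = i + 1 < len(sequence) and sequence[i + 1] == "P"
    if aa == "G":
        base = 1          # glycine
    elif aa == "P":
        base = 2          # proline
    elif aa in "IV":
        base = 3          # beta-branched isoleucine or valine
    else:
        base = 4          # every other residue type
    # Classes 5-8 mirror classes 1-4 for residues that precede a proline.
    return base + 4 if before_pro else base


def encode(sequence: str) -> list:
    """Map a whole sequence to its list of input classes."""
    return [residue_class(sequence, i) for i in range(len(sequence))]


print(encode("GPPIVPA"))
```

For the toy sequence above, the glycine and the first proline both precede a proline and therefore fall into classes 5 and 6.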
Sampling the sequence of hidden nodes H and the emission nodes marked as hidden, O_hidden, from the conditional distribution P(H, O_hidden | O_obs, I) is achieved using the forward-backtrack algorithm (Cawley and Pachter, 2003), where the input nodes I and the observed emission nodes O_obs are given. This lets us handle noise in the sequence-derived predicted secondary structure by flagging secondary structure as observed only at residue positions with highly confident predictions and leaving the rest hidden. Furthermore, using a probabilistic model makes it possible to sample a potentially unlimited number of backbone angle sequences accessible to proteins, with associated probabilities, for a given stretch of sequence.
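The idea of forward-backtrack sampling with partly hidden emissions can be sketched on a deliberately simplified model: a plain discrete HMM with a single categorical emission per position and no input dependence. This is not the FRAGSION or Mocapy++ implementation; all names (`forward_backtrack`, `init`, `trans`, `emit`) are hypothetical, and in the IOHMM the transition and emission tables would additionally depend on the residue class.

```python
import random


def forward_backtrack(init, trans, emit, obs, rng):
    """Sample hidden states and fill in hidden emissions.

    init[k]     : probability of starting in state k
    trans[j][k] : probability of moving from state j to state k
    emit[k][s]  : probability of emitting symbol s in state k
    obs[t]      : observed symbol index, or None if flagged as hidden
    """
    n_states = len(init)
    # Forward pass: alpha[t][k] is proportional to
    # P(h_t = k, observed emissions up to position t).
    alpha = []
    for t, o in enumerate(obs):
        row = []
        for k in range(n_states):
            if t == 0:
                prior = init[k]
            else:
                prior = sum(alpha[t - 1][j] * trans[j][k] for j in range(n_states))
            # A hidden emission contributes no evidence (likelihood 1).
            like = 1.0 if o is None else emit[k][o]
            row.append(prior * like)
        total = sum(row)
        alpha.append([v / total for v in row])  # rescale to avoid underflow

    # Backtrack: sample the last state, then each earlier state
    # conditioned on the state already sampled to its right.
    states = [0] * len(obs)
    states[-1] = rng.choices(range(n_states), weights=alpha[-1])[0]
    for t in range(len(obs) - 2, -1, -1):
        nxt = states[t + 1]
        weights = [alpha[t][k] * trans[k][nxt] for k in range(n_states)]
        states[t] = rng.choices(range(n_states), weights=weights)[0]

    # Emissions flagged as hidden are sampled from the chosen state.
    n_symbols = len(emit[0])
    outputs = [
        o if o is not None else rng.choices(range(n_symbols), weights=emit[h])[0]
        for h, o in zip(states, obs)
    ]
    return states, outputs


# Toy model: two states, each emitting exactly one symbol.
toy_init = [0.5, 0.5]
toy_trans = [[0.5, 0.5], [0.5, 0.5]]
toy_emit = [[1.0, 0.0], [0.0, 1.0]]
toy_states, toy_outputs = forward_backtrack(
    toy_init, toy_trans, toy_emit, [0, None, 1], random.Random(0)
)
```

In the toy call, the first and last emissions are observed and pin down their hidden states, while the middle position is sampled freely, which mirrors the use of high-confidence secondary structure as observed and the remaining positions as hidden.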

Supplementary Figure 1. Architecture of Input-Output Hidden Markov Model (IOHMM). In each slice, an input node indicates one of eight classes of residues in the amino acid sequence (A), and a Markov chain of hidden nodes (H) captures the sequential dependencies along the peptide chain. Each hidden node corresponds to three kinds of emission distributions: (1) three-state secondary structure labels (S): helix (H), strand (E), and coil (C), (2) backbone (ϕ, ψ) dihedral angle pairs, and (3) ω angles associated with peptide bonds.

Training Data

To train the IOHMM, we collected 1,740 non-redundant protein domains from the SABmark dataset, version 1.65 (Van Walle, et al., 2005). The eight residue classes and the three backbone dihedral angles were calculated directly from the training proteins, and three-state secondary structures (helix, strand, and coil) were assigned using DSSP (Kabsch and Sander, 1983). The training dataset contains 270,350 observations.

Training and Optimal Model Selection

We trained the IOHMM using the Stochastic Expectation-Maximization (S-EM) algorithm (Nielsen, 2000), as implemented in the Mocapy++ software package (Paluszewski and Hamelryck, 2010). Choosing the optimal hidden node size is crucial for the model to succeed. If the size is too low, the model will be too coarse; if it is too high, the model will overfit. We estimated the optimal hidden node size using the Akaike Information Criterion (AIC) (Burnham and Anderson, 2002), a widely used model selection criterion:

$$\mathrm{AIC} = -2 \ln L(\theta \mid d) + 2n$$

where L(θ|d) is the likelihood of the model given the data d, and n is the number of parameters. The AIC reaches its minimum for the optimal model. We computed AIC values for hidden node sizes ranging from 10 to 100 (with a step size of 5). For each hidden node size, we repeated the training four times with different starting conditions to avoid getting stuck in local optima. The AIC reached its minimum for a model with a hidden node size of 30, which has 7,812 parameters (Supplementary Fig. 2). We chose this model as the optimal one.
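The selection rule can be sketched in a few lines of Python using the standard AIC definition, −2 ln L + 2n. The function names and the candidate values below are hypothetical placeholders, not results from the training runs described here.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 ln L + 2 n."""
    return -2.0 * log_likelihood + 2.0 * n_params


def select_hidden_size(candidates):
    """Return the hidden node size with the lowest AIC.

    candidates: iterable of (hidden_size, log_likelihood, n_params) tuples,
    one per trained model.
    """
    best = min(candidates, key=lambda c: aic(c[1], c[2]))
    return best[0]


# Hypothetical log-likelihoods and parameter counts, for illustration only.
example_models = [
    (10, -1000.0, 50),
    (30, -900.0, 100),
    (50, -890.0, 200),
]
print(select_hidden_size(example_models))
```

When several models are trained per hidden node size, as in the four restarts used here, each run can simply be passed as its own tuple.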

Supplementary Figure 2. Training and optimal model selection. (a) AIC values versus hidden node size, with four models trained for each hidden node size. The curved line is a trend line obtained by fitting a sixth-degree polynomial to the data. The minimum AIC value corresponds to the optimal model (highlighted by a red circle). (b) Convergence of the log likelihood of the completed data during training with respect to the number of S-EM iterations.

Test dataset

We tested the accuracy of the fragment library using 30 CASP11 FM targets. The sequences and the experimental PDB structures were downloaded from the CASP11 website at http://predictioncenter.org/download_area/CASP11/targets/. The domain definitions and the PDB accession codes were provided by the CASP assessors at http://predictioncenter.org/casp11/domains_summary.cgi. A summary of the targets is provided in Supplementary Table 1.

Supplementary Table 1. Template Free Modeling (FM) targets for CASP 11 experiment.

#  Target  Domain  Residue range  Residues in domain  PDB

1 T0761 T0761-D1 62-149 88 4pw1

2 T0761 T0761-D2 150-178,202-285 113 4pw1

3 T0763 T0763-D1 31-160 130 4q0y

4 T0767 T0767-D2 133-312 180 4qpv

5 T0771 T0771-D1 27-76,91-191 151 4qe0

6 T0777 T0777-D1 18-362 345 -

7 T0781 T0781-D1 41-240 199 4qan

8 T0785 T0785-D1 3-114 112 4d0v

9 T0789 T0789-D1 6-113,117-151 143 4w4i

10 T0789 T0789-D2 152-277 126 4w4i

11 T0790 T0790-D1 1-135 135 4l4w

12 T0790 T0790-D2 136-265 130 4l4w

13 T0791 T0791-D1 6-44,52-161 149 4kxr

14 T0791 T0791-D2 162-262,264-300 138 4kxr

15 T0794 T0794-D2 291-462 172 4cyf

16 T0806 T0806-D1 1-256 256 -

17 T0808 T0808-D2 150-418 269 4qhw

18 T0810 T0810-D1 24-136 113 -

19 T0814 T0814-D1 23-159 137 4r7f

20 T0814 T0814-D2 160-242,387-419 116 4r7f

21 T0820 T0820-D1 2-91 90 -

22 T0824 T0824-D1 2-109 108 -

23 T0827 T0827-D2 212-328,337-369 150 -

24 T0831 T0831-D2 109-168,183-261,295-352 197 4qn1

25 T0832 T0832-D1 10-218 209 4rd8

26 T0834 T0834-D1 2-37,130-192 99 4r7q

27 T0834 T0834-D2 38-65,72-129 86 4r7q

28 T0836 T0836-D1 1-204 204 -

29 T0837 T0837-D1 1-121 121 -

30 T0855 T0855-D1 5-119 115 2mqd
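As a reading aid for the "Residue range" column, the following sketch counts the residues spanned by a range string such as "150-178,202-285". The helper name `domain_length` is hypothetical. Comparing its result with the "Residues in domain" column is only a consistency check; a mismatch would flag a row worth checking against the CASP11 domain summary.

```python
def domain_length(ranges: str) -> int:
    """Count residues covered by a comma-separated list of inclusive ranges."""
    total = 0
    for segment in ranges.split(","):
        start, end = (int(x) for x in segment.split("-"))
        total += end - start + 1  # both endpoints are included
    return total


print(domain_length("150-178,202-285"))
```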

Fragment library generation using ROSETTA

We used the fragment picker application of ROSETTA 3.5 (Leaver-Fay, et al., 2011) with default parameter settings to generate the ROSETTA fragment libraries. For each target, we first predicted secondary structure using PSIPRED (Jones, 1999) and supplied it to ROSETTA (by setting ‘psipred_ss2’ to the appropriate file path).

The fragment picker command used was:

./rosetta-3.5/rosetta_source/rosetta_tools/fragment_tools/make_fragments.pl \
    -rundir DIR-TARGET \
    -id TARGET \
    -nopsipred \
    -nohoms \
    -psipredfile TARGET.ss2 \
    -frag_sizes 3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 \
    TARGET.fasta

Model Generation using ROSETTA

A locally installed ROSETTA 3.5 (Leaver-Fay, et al., 2011) was used to build three-dimensional models using the fragment files predicted by ROSETTA and FRAGSION as inputs. For each target, we first predicted secondary structure using PSIPRED and then generated 100 models (by setting the ‘nstruct’ option to 100) with all default parameters as input to the ‘AbinitioRelax’ program. Supplying Rosetta’s default input of three-residue and nine-residue fragments (parameters ‘in:file:frag3’ and ‘in:file:frag9’), we ran a single thread of ‘AbinitioRelax’ for short targets and three parallel threads for long targets.

The ‘AbinitioRelax’ command used was:

./rosetta-3.5/rosetta_source/bin/AbinitioRelax.linuxgccrelease \
    -database DIR-ROSETTA-DB \
    -in:file:fasta ./TARGET.fasta \
    -in:file:frag3 ./TARGET.200.3mers \
    -in:file:frag9 ./TARGET.200.9mers \
    -psipred_ss2 TARGET.ss2 \
    -nstruct 100 \
    -abinitio:relax \
    -relax:fast \
    -abinitio::increase_cycles 10 \
    -abinitio::rg_reweight 0.5 \
    -abinitio::rsd_wt_helix 0.5 \
    -abinitio::rsd_wt_loop 0.5 \
    -use_filters true \
    -out:pdb

Supplementary Results

Assessment of the overall accuracy of predicted models

The protein models predicted by FRAGSION and ROSETTA were analyzed at the domain level, as done in the CASP experiments. Residues in the predicted models that are missing from the experimental structures were removed, and the models were superposed onto the experimental structures for the 30 CASP11 domains. TM-score and RMSD of the models were calculated by the TM-score program (Zhang and Skolnick, 2004). Supplementary Table 2 reports the mean and standard deviation of TM-score and RMSD of the FRAGSION and ROSETTA models. Supplementary Fig. 3 shows the density of TM-score and RMSD of the FRAGSION and ROSETTA models. The average TM-scores of the FRAGSION and ROSETTA models are 0.198 and 0.259, and the average RMSDs are 19.980 Å and 17.995 Å, respectively. The average TM-score of the FRAGSION models is ~23.55% lower than that of the ROSETTA models, and the average RMSD of the FRAGSION models is ~2 Å higher.
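The two summary figures follow directly from the reported averages, as this short check shows. The variable names are illustrative only; the four input values are the averages stated above.

```python
# Average values reported for the FRAGSION and ROSETTA models.
frag_tm, ros_tm = 0.198, 0.259
frag_rmsd, ros_rmsd = 19.980, 17.995

# Relative TM-score reduction, expressed against the ROSETTA average.
tm_drop_pct = (ros_tm - frag_tm) / ros_tm * 100
# Absolute RMSD difference in angstroms.
rmsd_gap = frag_rmsd - ros_rmsd

print(f"TM-score: {tm_drop_pct:.2f}% lower; RMSD: {rmsd_gap:.3f} Å higher")
```

The relative difference is taken with respect to the ROSETTA average (0.061 / 0.259), and the RMSD gap of 1.985 Å is what the text rounds to ~2 Å.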

Supplementary Table 2. Mean and standard deviation of TM-score and RMSD of the FRAGSION and ROSETTA models.

Columns: Target; FRAGSION TM-score (Mean, STD); FRAGSION RMSD in Å (Mean, STD); ROSETTA TM-score (Mean, STD); ROSETTA RMSD in Å (Mean, STD)

T0761-D1 0.213 0.032 17.112 2.676 0.283 0.035 15.528 3.409

T0761-D2 0.213 0.016 19.165 2.834 0.224 0.021 18.040 2.883

T0763-D1 0.187 0.024 19.000 3.124 0.209 0.026 17.646 2.648

T0767-D2 0.179 0.022 22.799 2.441 0.224 0.036 20.711 2.473

T0771-D1 0.191 0.025 19.259 2.215 0.244 0.030 18.359 2.493

T0777-D1 0.196 0.023 23.650 1.946 0.227 0.033 22.283 2.950

T0781-D1 0.174 0.018 25.566 4.015 0.188 0.020 23.384 2.273

T0785-D1 0.187 0.020 16.001 1.613 0.217 0.023 15.258 1.766

T0789-D1 0.193 0.025 19.392 2.366 0.278 0.033 16.729 1.851

T0789-D2 0.195 0.028 19.098 2.451 0.289 0.034 15.792 2.242

T0790-D1 0.210 0.028 18.593 2.625 0.382 0.052 13.102 2.192

T0790-D2 0.198 0.025 18.506 2.424 0.290 0.055 15.661 2.230

T0791-D1 0.191 0.033 19.735 2.507 0.248 0.033 19.037 2.503

T0791-D2 0.188 0.025 20.787 2.758 0.243 0.028 17.853 2.482

T0794-D2 0.137 0.020 28.234 4.515 0.181 0.023 24.343 3.574

T0806-D1 0.180 0.020 22.561 1.765 0.214 0.030 20.518 2.016

T0808-D2 0.153 0.018 28.436 2.857 0.203 0.028 24.898 2.232

T0810-D1 0.161 0.019 18.920 2.705 0.267 0.037 16.562 2.651

T0814-D1 0.147 0.017 25.855 3.886 0.182 0.026 22.469 3.037

T0814-D2 0.158 0.022 25.206 4.493 0.194 0.028 23.645 4.106

T0820-D1 0.264 0.028 14.864 1.697 0.301 0.033 15.152 2.485

T0824-D1 0.206 0.023 14.547 1.177 0.259 0.021 14.306 1.146

T0827-D2 0.201 0.031 18.667 2.178 0.277 0.041 16.813 1.685

T0831-D2 0.214 0.023 26.869 5.949 0.243 0.026 23.496 3.312

T0832-D1 0.212 0.025 19.669 2.463 0.290 0.035 18.893 2.338

T0834-D1 0.207 0.023 17.356 2.391 0.291 0.050 16.969 2.574

T0834-D2 0.218 0.026 13.909 1.899 0.279 0.032 12.698 1.673

T0836-D1 0.247 0.035 17.226 2.660 0.283 0.034 16.403 2.769

T0837-D1 0.249 0.034 14.602 1.957 0.365 0.055 11.884 2.539

T0855-D1 0.267 0.037 13.806 1.741 0.379 0.067 11.413 2.634

Average 0.198 0.025 19.980 2.678 0.259 0.034 17.995 2.505

Supplementary Figure 3. Density of TM-score and RMSD of the FRAGSION and ROSETTA models. The X-axis represents TM-score (a) and RMSD (b), and the Y-axis represents the density of models. The mean TM-scores for FRAGSION and ROSETTA are 0.198 and 0.259, with standard deviations of 0.025 and 0.034, respectively. The mean RMSDs for FRAGSION and ROSETTA are 19.980 Å and 17.995 Å, with standard deviations of 2.678 Å and 2.505 Å, respectively.

Evaluation of the best predictions

To investigate how the quality of the best prediction is affected by the choice of fragment library, we identified the highest TM-score and lowest RMSD predictions generated by FRAGSION and ROSETTA after comparing them with the corresponding experimental domains. Supplementary Table 3 reports the performance of FRAGSION and ROSETTA in terms of the best prediction. The assessment offers some interesting insights. For four targets, ROSETTA achieved a TM-score higher than 0.5, indicating a correct overall fold, while FRAGSION’s TM-scores for those targets were lower than ROSETTA’s. For targets T0837-D1 and T0855-D1, the best models produced by FRAGSION reach TM-scores of 0.3874 and 0.3905. Over the entire dataset, ROSETTA outperformed FRAGSION in terms of TM-score. In terms of RMSD, FRAGSION outperformed ROSETTA for six targets. For example, for target T0836-D1, FRAGSION generated a model with an RMSD of 11.6 Å, while ROSETTA’s best prediction has an RMSD of 15.1 Å.

Supplementary Table 3. Highest TM-score and lowest RMSD for each target by FRAGSION and ROSETTA. Numbers in bold indicate that the best prediction by FRAGSION is better than ROSETTA.

Target  FRAGSION TM-score  FRAGSION RMSD (Å)  ROSETTA TM-score  ROSETTA RMSD (Å)
T0761-D1  0.2904  15.152  0.3552  9.799
T0761-D2  0.2749  16.455  0.2867  13.103
T0763-D1  0.2597  16.977  0.2768  16.065
T0767-D2  0.2475  17.802  0.3339  18.178
T0771-D1  0.2526  14.48  0.3812  12.214
T0777-D1  0.26  21.603  0.3321  18.986
T0781-D1  0.23  21.209  0.2467  22.501
T0785-D1  0.2457  11.299  0.2817  14.253
T0789-D1  0.2972  19.925  0.3602  13.554
T0789-D2  0.2636  18.517  0.3779  8.938
T0790-D1  0.2901  16.768  0.5716  9.502
T0790-D2  0.2664  13.574  0.5131  6.896
T0791-D1  0.3011  18.496  0.3304  16.432
T0791-D2  0.2602  19.145  0.3222  10.363
T0794-D2  0.2027  20.633  0.2494  18.704
T0806-D1  0.2559  17.219  0.3076  20.628
T0808-D2  0.2033  22.8  0.3015  20.354
T0810-D1  0.2105  21.442  0.384  9.568
T0814-D1  0.2059  24.283  0.2777  18.983
T0814-D2  0.2218  23.425  0.2877  20.553
T0820-D1  0.3386  12.592  0.4433  11.923
T0824-D1  0.2843  13.067  0.319  15.194
T0827-D2  0.3209  18.583  0.3712  14.908
T0831-D2  0.2777  25.916  0.3468  19.362
T0832-D1  0.2688  21.315  0.4285  18.178
T0834-D1  0.2802  18.811  0.4267  11.662
T0834-D2  0.312  12.322  0.3985  8.534
T0836-D1  0.344  11.621  0.3679  15.051
T0837-D1  0.3874  8.994  0.518  7.947
T0855-D1  0.3905  11.343  0.5477  5.3

Assessment of the lowest-energy predictions

To further examine the effect of fragment library on the quality of the best models, we analyzed the lowest energy models generated by FRAGSION and ROSETTA using ROSETTA’s scoring function. This analysis is much more realistic than the analysis based on best models, particularly in blind structure prediction scenarios. In Supplementary Table 4, we report the performance of FRAGSION and ROSETTA in terms of lowest-energy prediction. FRAGSION outperformed ROSETTA for ten targets in terms of RMSD. For three targets, FRAGSION achieved TM-score higher than ROSETTA.

Supplementary Table 4. TM-score and RMSD of the lowest energy model for each target by FRAGSION and ROSETTA. Numbers in bold indicate that the lowest energy model by FRAGSION is better than ROSETTA.

Target  FRAGSION TM-score  FRAGSION RMSD (Å)  ROSETTA TM-score  ROSETTA RMSD (Å)
T0761-D1  0.1961  15.741  0.2219  11.94
T0761-D2  0.2  17.757  0.2111  18.704
T0763-D1  0.2028  17.131  0.2198  16.226
T0767-D2  0.1526  22.611  0.1863  23.534
T0771-D1  0.2238  18.676  0.2405  18.23
T0777-D1  0.1833  24.693  0.2476  18.281
T0781-D1  0.1902  20.468  0.2055  22.139
T0785-D1  0.1928  14.399  0.1953  15.0
T0789-D1  0.1479  24.265  0.2458  17.642
T0789-D2  0.1912  18.436  0.3332  16.383
T0790-D1  0.1895  18.12  0.3467  15.766
T0790-D2  0.1921  19.35  0.2896  14.4
T0791-D1  0.188  20.723  0.2481  20.853
T0791-D2  0.1531  24.013  0.2392  24.116
T0794-D2  0.1723  37.499  0.1553  29.063
T0806-D1  0.2197  23.53  0.2695  16.616
T0808-D2  0.1563  30.448  0.207  23.654
T0810-D1  0.1444  21.869  0.293  12.353
T0814-D1  0.1445  27.472  0.1656  22.183
T0814-D2  0.1283  24.247  0.2098  19.712
T0820-D1  0.2943  15.31  0.2823  15.001
T0824-D1  0.2843  13.067  0.2451  13.974
T0827-D2  0.2103  21.881  0.2846  13.198
T0831-D2  0.22  21.711  0.2512  23.117
T0832-D1  0.2096  18.784  0.2772  21.337
T0834-D1  0.1917  16.025  0.2546  20.177
T0834-D2  0.2107  12.965  0.3086  14.781
T0836-D1  0.2469  18.624  0.3626  16.68
T0837-D1  0.2638  14.599  0.4043  11.813
T0855-D1  0.2568  14.803  0.3909  10.213

Supplementary Figure 4. Target by target comparison between FRAGSION and ROSETTA in terms of precision. Precision at various RMSD cutoffs for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

Supplementary Figure 5. Target by target comparison between FRAGSION and ROSETTA in terms of coverage. Coverage at various RMSD cutoffs for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

Supplementary Figure 6. Target by target comparison between FRAGSION and ROSETTA in terms of RMSD. RMSD at different fragment lengths for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

Supplementary Figure 7. Target by target comparison between FRAGSION and ROSETTA in terms of computation time. Computation time at different fragment lengths for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

References

Bhattacharya, D. and Cheng, J. (2015) De novo protein conformational sampling using a probabilistic graphical model, Scientific reports, 5.
Burnham, K.P. and Anderson, D.R. (2002) Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.
Cawley, S.L. and Pachter, L. (2003) HMM sampling and applications to gene finding and alternative splicing, Bioinformatics, 19, ii36-ii41.
Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices, Journal of molecular biology, 292, 195-202.
Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers, 22, 2577-2637.
Karplus, P.A. (1996) Experimentally observed conformation-dependent geometry and hidden strain in proteins, Protein Science, 5, 1406-1420.
Leaver-Fay, A., et al. (2011) ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules, Methods in enzymology, 487, 545.
Lovell, S.C., et al. (2003) Structure validation by Cα geometry: ϕ, ψ and Cβ deviation, Proteins: Structure, Function, and Bioinformatics, 50, 437-450.
Mardia, K.V. and Jupp, P.E. (2009) Directional statistics. John Wiley & Sons.
Mardia, K.V., Taylor, C.C. and Subramaniam, G.K. (2007) Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data, Biometrics, 63, 505-512.
Nielsen, S.F. (2000) The stochastic EM algorithm: estimation and asymptotic results, Bernoulli, 457-489.
Paluszewski, M. and Hamelryck, T. (2010) Mocapy++-A toolkit for inference and learning in dynamic Bayesian networks, BMC bioinformatics, 11, 126.
Van Walle, I., Lasters, I. and Wyns, L. (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space, Bioinformatics, 21, 1267-1268.
Zhang, Y. and Skolnick, J. (2004) Scoring function for automated assessment of protein structure template quality, Proteins: Structure, Function, and Bioinformatics, 57, 702-710.
