Post on 11-May-2015
Computational Protein Design
2. Computational Protein Design Techniques
Pablo Carbonell
pablo.carbonell@issb.genopole.fr
iSSB, Institute of Systems and Synthetic Biology
Genopole, University d’Évry-Val d’Essonne, France
mSSB: December 2010
Pablo Carbonell (iSSB) Computational Protein Design mSSB: December 2010 1 / 45
Outline
1 Introduction
2 Computational Protein Descriptors
3 Sequence-based CPD
4 Structure-based CPD
5 Search Algorithms in CPD
6 De Novo Design
7 Challenges in Sequence and Structure-Based CPD
Computational Protein Design
A Blueprint of CPD Approaches
∗RS: research studies
Molecular Signature Descriptors
A 2D representation of the molecular graph as an undirected colored graph G(V, E, C), with V: atoms, E: bonds, C: atom types
The signature descriptor of height h of atom x in the molecular graph G, or ^hσ(x), is a canonical representation of the subgraph of G containing all atoms that are within distance h of x
Molecular signature, as the sum of the atomic signatures:
$$ {}^{h}\sigma(G) = \sum_{x \in V} {}^{h}\sigma(x) \tag{1} $$
The signature is a systematic codification of the molecular graph [Faulon et al., 2004]
σ(methylcyclopropane) =
  1 [C]([H][C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H]))
  2 [C]([H][H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H]))
  1 [C]([H][H][H][C]([H][C]([H][H][C,0])[C,0]([H][H])))
  1 [H]([C]([C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H])))
  4 [H]([C]([H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H])))
  3 [H]([C]([H][H][C]([H][C]([H][H][C,0])[C,0]([H][H]))))
Molecular Signature of Reactions and Proteins
Signature of a reaction. The signature of a reaction R
$$ S_1 + S_2 + \ldots + S_n \rightarrow P_1 + P_2 + \ldots + P_m \tag{2} $$
that transforms n substrates into m products is given by the difference between the signature of the products and the signature of the substrates:
$$ {}^{h}\sigma(R) = \sum_{p \in P} {}^{h}\sigma(p) - \sum_{s \in S} {}^{h}\sigma(s) \tag{3} $$
Signature of protein sequences. The protein P is represented by the linear chain given by its collapsed graph at residue level, a reduced molecular graph representation G(V, E, C) known as the string signature, where V: residues a ∈ A, E: contiguity in sequence, C: amino acid type
$$ {}^{h}\sigma(P) = \sum_{a \in A} {}^{h}\sigma(a) \tag{4} $$
Protein Contact Maps
The protein contact map is a graph representation of the 3D interactions at residue level G(V, E, C), where V: residues, E: contacts, C: amino acid type
Two residues are considered to interact when atoms of the two residues are at a distance lower than a predetermined threshold (typically 4.5 to 5 Å)
Contact maps can account for long-range interactions and conformational states
From Song et al. [2010]
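The contact definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the cited work: the input format (one coordinate array per residue) and the function name `contact_map` are assumptions made for the example, and the 4.5 Å default is taken from the threshold range quoted above.

```python
import numpy as np

def contact_map(residue_atoms, threshold=4.5):
    """Boolean residue-residue contact matrix.

    residue_atoms: list of (n_atoms, 3) arrays in Å, one per residue
    (hypothetical input format chosen for this sketch).
    """
    n = len(residue_atoms)
    contacts = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            # All pairwise atom-atom distances between residues i and j.
            diff = residue_atoms[i][:, None, :] - residue_atoms[j][None, :, :]
            d_min = np.sqrt((diff ** 2).sum(axis=-1)).min()
            # Two residues are in contact if any atom pair is closer than the threshold.
            contacts[i, j] = contacts[j, i] = d_min < threshold
    return contacts
```

The resulting matrix is the adjacency matrix of the graph G(V, E, C); the residue types C would be kept in a separate sequence.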
Sequence and Structure-Based CPD
Sequence-based CPD methods are in some cases a good trade-off between complexity of the model and accuracy of the predictions
Sequence-based Knowledge-based potentials
The simplest way to score a protein and to identify active regions is through amino acid scales or indexes
AAindex is a database of
  544 amino acid indexes
  94 amino acid matrices
  47 amino acid pair-wise contact potentials
Examples: hydrophobicity, accessibility, van der Waals volume, secondary structure propensity, flexibility
This approach is widely used when analyzing conserved motifs and correlated mutations in protein fold families through multiple alignments
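As an illustration of scoring a sequence with an amino acid scale, the sketch below computes a sliding-window average over a sequence. The Kyte-Doolittle hydropathy values are used as an example scale; that choice, the window length and the function name `window_profile` are assumptions of this example and do not come from the slides.

```python
# Kyte-Doolittle hydropathy values (example scale chosen for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_profile(sequence, scale=KYTE_DOOLITTLE, window=5):
    """Average scale value over each window of consecutive residues."""
    values = [scale[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Peaks in such a profile point to hydrophobic stretches; any other per-residue index can be substituted by passing a different `scale` dictionary.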
Quantitative Structure-Activity Relationship (QSAR) Techniques
QSAR is a statistical method used extensively by the chemical and pharmaceutical industries in small-molecule and peptide optimization
The goal is to model causal relationships between
  structures of interacting molecules
  measurable properties of scientific or commercial interest such as ADME/Tox (absorption, distribution, metabolism, excretion, and toxicity) of drugs
QSAR Model Evaluation
Model predictability is generally evaluated through the leave-one-out (LOO) cross-validation correlation coefficient q²
Partial least-squares (PLS) regression is commonly used
Additional nonlinear terms can be added through the use of nonlinear regressionor machine learning techniques (kernel methods, random forests, etc)
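A minimal sketch of the LOO q² computation follows, using the common definition q² = 1 − PRESS / SS_tot. Ordinary least squares stands in here for PLS regression to keep the example dependency-free; that substitution and the function name `loo_q2` are assumptions of this example.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for a linear model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])  # add intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        # Fit on all samples except i (ordinary least squares, not PLS).
        coef, *_ = np.linalg.lstsq(Xb[keep], y[keep], rcond=None)
        # Accumulate the squared prediction error on the held-out sample.
        press += (y[i] - Xb[i] @ coef) ** 2
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - press / ss_tot
```

A q² close to 1 indicates that held-out activities are predicted well; values near or below 0 indicate a model no better than predicting the mean.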
QSAR Modeling Workflow
The ProSAR Algorithm
An extension of SAR-based approaches to CPD
It formalizes the decision-making processes about which mutations to include in combinatorial libraries
$$ y = \sum_{i=1}^{N} \sum_{j \in A} c_{ij} x_{ij} \tag{5} $$
y: the predicted function (activity) of the protein sequence
c_ij: the regression coefficients corresponding to the mutational effect of having residue j, among the 20 amino acids A, at position i
x_ij: binary variable indicating the presence or absence of residue j at position i
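The additive model of Eq. (5) can be sketched as a one-hot encoding followed by a linear fit. The slides do not state how the coefficients c_ij are estimated, so ordinary least squares is used here purely as a stand-in; the function names `one_hot`, `fit_coefficients` and `predict` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Binary variables x_ij: 1 if residue j is present at position i."""
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for i, aa in enumerate(sequence):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def fit_coefficients(sequences, activities):
    """Estimate c_ij by least squares (assumed estimator, minimum-norm solution)."""
    X = np.array([one_hot(s) for s in sequences])
    c, *_ = np.linalg.lstsq(X, np.asarray(activities, dtype=float), rcond=None)
    return c.reshape(len(sequences[0]), len(AMINO_ACIDS))

def predict(c, sequence):
    """Predicted activity y = sum_i sum_j c_ij * x_ij."""
    return float((c * one_hot(sequence).reshape(c.shape)).sum())
```

Mutations with large positive fitted coefficients would be candidates to keep in the next combinatorial library, and those with negative coefficients candidates to drop.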
Improving Catalytic Function by ProSAR-driven Enzyme Evolution
Codexis Inc.
Statistical analysis of protein sequence-activity relationships
Bacterial biocatalysis of Atorvastatin (Lipitor), a cholesterol-lowering drug
Structure-based CPD
Energy functions and molecular force fields
Local conformational restrictions
Predicting entropic factors
Protein topological properties
From Narasimhan et al. [2010]
Energy Functions and Molecular Force Fields
In structure-based CPD, folds are usually represented by the spatial coordinates of the backbone atoms or design scaffold
Protein design is done by placing amino acid side chains along the scaffold
Side chains are only permitted to assume a discrete set of statistically preferred conformations: rotamers
Rotamer/backbone and rotamer/rotamer interaction energies are tabulated
These potential energies can then be approximated by using any of the standard force fields: CHARMM, AMBER, GROMOS
Molecular Force Fields
AMBER: a classical force field for energy and MD calculations:
$$ V(r^N) = \sum_{\text{bonds}} \tfrac{1}{2} k_b (l - l_0)^2 + \sum_{\text{angles}} \tfrac{1}{2} k_a (\theta - \theta_0)^2 + \sum_{\text{torsions}} \tfrac{1}{2} V_n \left[ 1 + \cos(n\omega - \gamma) \right] + \sum_{j=1}^{N-1} \sum_{i=j+1}^{N} \left\{ \varepsilon_{i,j} \left[ \left( \frac{r_{0ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{r_{0ij}}{r_{ij}} \right)^{6} \right] + \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}} \right\} \tag{6} $$
1 Σ_bonds(·): energy between covalently bonded atoms.
2 Σ_angles(·): energy due to the geometry of electron orbitals involved in covalent bonding.
3 Σ_torsions(·): energy for twisting a bond due to bond order (e.g. double bonds) and neighboring bonds or lone pairs of electrons.
4 Σ_{j=1}^{N−1} Σ_{i=j+1}^{N}(·): non-bonded energy between all atom pairs:
    1 van der Waals energies
    2 Electrostatic energies
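The non-bonded double sum of Eq. (6) can be evaluated directly for a small set of atoms. The sketch below is illustrative only: it loops over all atom pairs as written in the equation, whereas production force fields typically exclude or scale pairs separated by one to three bonds. The Coulomb prefactor (a value for 1/(4πε0) in kcal·Å/(mol·e²)) and the function name `nonbonded_energy` are assumptions of this example.

```python
import numpy as np

# Assumed unit convention: energies in kcal/mol, distances in Å, charges in e.
COULOMB_CONST = 332.0637

def nonbonded_energy(coords, charges, eps, r0, coulomb_const=COULOMB_CONST):
    """Sum of 12-6 van der Waals and Coulomb terms over all atom pairs.

    eps[i][j] and r0[i][j] are the pair well depth and equilibrium distance.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    total = 0.0
    for j in range(n - 1):
        for i in range(j + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            ratio6 = (r0[i][j] / r) ** 6
            # van der Waals term: eps * [(r0/r)^12 - 2 (r0/r)^6]
            lj = eps[i][j] * (ratio6 ** 2 - 2.0 * ratio6)
            # Electrostatic term: q_i q_j / (4 pi eps0 r)
            coulomb = coulomb_const * charges[i] * charges[j] / r
            total += lj + coulomb
    return float(total)
```

At r = r0 the van der Waals term equals −ε, which is a quick sanity check on the 12-6 form used in Eq. (6).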
Structure-based Knowledge-based Potentials
They are built by performing a large-scale statistical study of structural databases such as PDB (Protein Data Bank)
  Rotamer libraries (∼ 150 rotameric states)
  Binary patterning: only some types of amino acids are allowed based on the hydrophobic environment
  An implicit solvation model
  Secondary structure propensity
  Frequency of small segments in the PDB
  Pairwise potentials
  van der Waals interactions
  Hydrogen bonding
  Electrostatics
  Entropy-based penalties for flexible side chains
From Boas and Harbury [2007]
Energy Functions
Design along the backbone or scaffold
Rotamer/backbone and rotamer/rotamer interaction energies are tabulated
Precomputed from molecular force fields: CHARMM, AMBER, GROMOS
Total energy of the protein
$$ E_{TOT} = \sum_{k} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{7} $$
N: length of the protein
r_k: the rotamer of the kth side chain
E_k(r_k): the self-energy of a particular rotamer r_k
E_kl(r_k, r_l): the pair energy of rotamers r_k, r_l
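Given tabulated self and pair energies, the total energy of one rotamer assignment is a table lookup. The sketch below assumes that each unordered pair of positions is counted once (k < l), a common convention; a literal sum over k ≠ l with a symmetric table would count each pair twice. The data layout (a dictionary keyed by position pairs) and the function names are hypothetical.

```python
import itertools

def total_energy(rotamers, self_e, pair_e):
    """E_TOT for one assignment.

    rotamers[k]: index of the rotamer chosen at position k.
    self_e[k][r]: self-energy E_k of rotamer r at position k.
    pair_e[(k, l)][r][s]: pair energy E_kl for k < l (assumed layout).
    """
    n = len(rotamers)
    energy = sum(self_e[k][rotamers[k]] for k in range(n))
    for k in range(n):
        for l in range(k + 1, n):
            energy += pair_e[(k, l)][rotamers[k]][rotamers[l]]
    return energy

def best_by_enumeration(self_e, pair_e):
    """Brute-force minimum over all assignments; only feasible for tiny problems."""
    choices = [range(len(e)) for e in self_e]
    return min(
        itertools.product(*choices),
        key=lambda a: total_energy(a, self_e, pair_e),
    )
```

Because every term is precomputed, evaluating a candidate design costs only O(N²) lookups, which is what makes exhaustive pruning and stochastic searches over rotamers practical.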
The Role of Dynamics
Besides protein structure, protein dynamics can play a direct role in molecularrecognition
Flexible proteins recognize their targets through induced fit or conformationalselection, likely showing promiscuity
Binding is commonly enthalpy-driven, but in some cases entropy is important, for instance:
  Proteins with multiple binding sites
  Small hydrophobic molecules
Two types of sources of protein motions:
  Protein flexibility: intraconformational dynamics (fast time scale motions)
  Conformational heterogeneity: interconformational dynamics
Gibbs free energy:
$$ \Delta G = \Delta H - T \Delta S \tag{8} $$
$$ \Delta S = \Delta S_{solv} + \Delta S_{conf} + \Delta S_{rt} \tag{9} $$
∆S_solv: solvent entropy
∆S_conf: conformational entropy of protein and ligand
∆S_rt: rotational and translational degrees of freedom
Predicting Side-chain Dynamics from Structural Descriptors
The Lipari-Szabo model-free approach quantifies motions from NMR experiments by computing the generalized order parameter S²
  Protein backbone dynamics: 15NH and 13CαH NMR relaxation methods
  Protein side chain methyl dynamics: 13CαH NMR relaxation methods (side-chain motions in the picosecond-to-nanosecond time regime)
From the BMRB we compiled S² data for 18 proteins, including 10 proteins in 2 or more different states: calmodulin, barnase, PDZ, MUP, DHFR, staphylococcal nuclease, Pin1, SH3 domain, MSG
This technique provides measurements only for the Cα of methyl groups in side chains: ALA, LEU, ILE, MET, THR, VAL
Structural Descriptors of Methyl Dynamics
We consider the following parameters influencing side-chain dynamics:
  Packing density at the methyl site i and its neighboring residues j within a sphere of r = 5 Å
$$ P_i = \sum_{r_{ij} < 5\,\text{Å}} C_j \, e^{-r_{ij}} = \sum_{r_{ij} < 5\,\text{Å}} \left( \sum_{r_{jk} < 5\,\text{Å}} e^{-r_{jk}} \right) e^{-r_{ij}} \tag{10} $$
  Side chain stiffness: number of dihedral angles separating the backbone from the methyl carbon, weighted by the side-chain packing
  Rotameric state: angular distance ∆χ = χ − χ0 to the closest rotameric state χ0 in the library
  Elongation: distance from the methyl site to the Cα
  Pairwise contact potential: a knowledge-based potential of the frequency of contacts between residues at several distances, computed from the PDB
  Solvation effect: DSSP accessibility and residue hydrophobicity
  Van der Waals contacts
  Hydrogen bonds (in the case of threonine)
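The packing density of Eq. (10) can be computed from a set of site coordinates with a distance matrix. This is a sketch under stated assumptions: each site is reduced to a single point, self-distances are excluded from both sums, and the function name `packing_density` is hypothetical.

```python
import numpy as np

def packing_density(coords, cutoff=5.0):
    """P_i = sum_j C_j * exp(-r_ij), with C_j = sum_k exp(-r_jk), both within the cutoff.

    coords: (n, 3) array of site coordinates in Å (one point per site, assumed).
    """
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Neighbours within the cutoff, excluding each site itself (assumption).
    near = (d < cutoff) & ~np.eye(len(coords), dtype=bool)
    w = np.where(near, np.exp(-d), 0.0)
    c = w.sum(axis=1)  # C_j: local density around each neighbour j
    return w @ c       # P_i: neighbour densities weighted by exp(-r_ij)
```

The nested sum means that a methyl site scores high not only when it has close neighbours but also when those neighbours are themselves tightly packed.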
Predicting Methyl Side-chain Dynamics
Algorithm: neural network
Cross-validation: r = 0.71 ± 0.029 (p-value = 4.6 × 10^−87)

Protein      MD method    r (MD)   r (nnet)
ubiquitin    AMBER99SB    0.81     0.81
TNfn3        CHARMM 22    0.62     0.79
FNfn10       CHARMM 22    0.51     0.64
barnase      OPLS-AA/L    0.55     0.64
calmodulin   FDPB         0.60     0.72

Example: experimental and predicted changes ∆S² of barnase after binding barstar
  ∆S² > 0: rigidification
  ∆S² < 0: flexibilization
[Carbonell and del Sol, 2009]
Search Algorithms in CPD
Search Algorithms
Objective: finding the best design within the space of all possible amino acid/rotameric states
A vast search space: 20^N or p^N
  N: number of positions to mutate
  p: number of rotameric states
Strategies
  Deterministic algorithms
    Dead-end elimination (DEE) algorithm: a pruning method.
    Some accelerations of the DEE algorithm: upper-bound estimation; the “magic bullet” metric; conformational splitting; background optimization
  Stochastic algorithms
    Monte Carlo
    Simulated annealing
    Genetic algorithms
The DEE Algorithm
It assumes that the energy of the protein can be written as
$$ E_{TOT} = \sum_{k} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{11} $$
N: length of the protein
r_k: the rotamer of the kth side chain
E_k(r_k): the self-energy of a particular rotamer r_k
E_kl(r_k, r_l): the pair energy of the rotamers r_k, r_l
Complexity:
  Single search scales quadratically with the total number of rotamers: O((p × N)^2)
  Pair search scales cubically: O((p × N)^3)
  Brute force enumeration: O(p^N)
The DEE Algorithm
Single rotamers and rotamer pairs are eliminated during the computational cycles
Single elimination: eliminate rotamer r_k^A if some other rotamer r_k^B at the same position always gives a better (lower) energy
$$ E_k(r_k^A) + \sum_{l=1,\, l \neq k}^{N} \min_X E_{kl}(r_k^A, r_l^X) > E_k(r_k^B) + \sum_{l=1,\, l \neq k}^{N} \max_X E_{kl}(r_k^B, r_l^X) \tag{12} $$
Pairs elimination: eliminate a pair of rotamers (r_k^A, r_l^B) at two positions if there exists another pair (r_k^C, r_l^D) that always gives a better energy
$$ U_{kl}^{AB} \overset{\text{def}}{=} E_k(r_k^A) + E_l(r_l^B) + E_{kl}(r_k^A, r_l^B) \tag{13} $$
$$ U_{kl}^{AB} + \sum_{i=1,\, i \neq k,l}^{N} \min_X \left( E_{ki}(r_k^A, r_i^X) + E_{li}(r_l^B, r_i^X) \right) > U_{kl}^{CD} + \sum_{i=1,\, i \neq k,l}^{N} \max_X \left( E_{ki}(r_k^C, r_i^X) + E_{li}(r_l^D, r_i^X) \right) \tag{14} $$
Values are precomputed and stored in energy matrices
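The single-elimination criterion can be sketched as a loop that repeatedly removes any rotamer whose best-case energy is still worse than the worst-case energy of a competitor at the same position. This is a minimal illustration, not an optimized DEE implementation: the data layout (`pair_e[k][l][a][x]` for k ≠ l) and the function name `dee_singles` are assumptions, and the pairs criterion and the accelerations are omitted.

```python
def dee_singles(self_e, pair_e):
    """Iterate the singles criterion until no rotamer can be eliminated.

    self_e[k][a]: self-energy of rotamer a at position k.
    pair_e[k][l][a][x]: pair energy between rotamer a at k and x at l (k != l).
    Returns, per position, the list of surviving rotamer indices.
    """
    n = len(self_e)
    alive = [list(range(len(e))) for e in self_e]
    changed = True
    while changed:
        changed = False
        for k in range(n):
            for a in list(alive[k]):
                for b in alive[k]:
                    if a == b:
                        continue
                    # Best case for a: most favourable surviving partner everywhere.
                    best_a = self_e[k][a] + sum(
                        min(pair_e[k][l][a][x] for x in alive[l])
                        for l in range(n) if l != k
                    )
                    # Worst case for b: least favourable surviving partner everywhere.
                    worst_b = self_e[k][b] + sum(
                        max(pair_e[k][l][b][x] for x in alive[l])
                        for l in range(n) if l != k
                    )
                    if best_a > worst_b:
                        # a can never be part of the global minimum: prune it.
                        alive[k].remove(a)
                        changed = True
                        break
    return alive
```

Since a rotamer is removed only when it is provably worse than a competitor in every context, the global minimum-energy assignment always survives; the remaining combinations can then be enumerated or searched stochastically.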
Stochastic Algorithms
Search in the space of feasible designs by making a series of combinations of random and directed moves
Monte Carlo Metropolis: a move consists of exchanging one rotamer for another at a randomly chosen position; a modification is accepted if it lowers the energy, and otherwise accepted with a probability that decreases with the energy increase
Simulated annealing starts at a high temperature, which allows the search to explore broadly in the initial cycles, and then gradually lowers it
Genetic algorithms: a population of models is propagated (evolved) throughout the course of the run, and genetic operators, such as recombination, are used to create new models from existing parents
They are fast and can be scaled up to problems of large complexity
They are not guaranteed to converge to the optimal solution
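A minimal sketch of a Metropolis search with simulated annealing over rotamer assignments is given below. The geometric cooling schedule, the temperature range, the energy-table layout (`pair_e[(k, l)]` for k < l) and the function names are assumptions made for this example.

```python
import math
import random

def total_energy(assign, self_e, pair_e):
    """Self plus pair energies, each unordered pair counted once."""
    n = len(assign)
    e = sum(self_e[k][assign[k]] for k in range(n))
    e += sum(
        pair_e[(k, l)][assign[k]][assign[l]]
        for k in range(n) for l in range(k + 1, n)
    )
    return e

def anneal(self_e, pair_e, steps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Return the best assignment found and its energy."""
    rng = random.Random(seed)
    n = len(self_e)
    current = [rng.randrange(len(self_e[k])) for k in range(n)]
    e_cur = total_energy(current, self_e, pair_e)
    best, e_best = list(current), e_cur
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Move: swap the rotamer at one randomly chosen position.
        k = rng.randrange(n)
        trial = list(current)
        trial[k] = rng.randrange(len(self_e[k]))
        e_new = total_energy(trial, self_e, pair_e)
        delta = e_new - e_cur
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, e_cur = trial, e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best
```

Tracking the best assignment separately from the current one means the returned energy never gets worse during the run, although it may still be above the global minimum.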
The SCHEMA Algorithm
Equivalent to an in silico directed evolution
Consists of scoring libraries of hybrid protein sequences against the parental sequence
Scoring:
  Calculate the number of interactions between residues (contacts within 4.5 Å) that are disrupted in the creation of hybrid proteins
  Hybrids are scored for stability by counting the number of disruptions
  Protein is partitioned into blocks that should not be interrupted by crossovers (analogous to genetic algorithms)
From [Meyer et al., 2006]
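The disruption count can be sketched as follows. The example assumes the usual SCHEMA convention that a contact counts as disrupted when the residue pair found in the hybrid at the two contacting positions does not occur at those positions in any parent; aligned, equal-length sequences, a precomputed contact list and the function name `schema_disruption` are further assumptions of this sketch.

```python
def schema_disruption(hybrid, parents, contacts):
    """Number of contacts whose residue pair is not seen in any parent.

    hybrid, parents: aligned sequences of equal length.
    contacts: list of (i, j) position pairs in contact (e.g. within 4.5 Å).
    """
    disrupted = 0
    for i, j in contacts:
        pair = (hybrid[i], hybrid[j])
        # The interaction is preserved if some parent has the same pair.
        if not any((p[i], p[j]) == pair for p in parents):
            disrupted += 1
    return disrupted
```

Hybrids whose crossovers fall between blocks of mutually contacting residues keep most pairs intact and therefore get a low disruption score.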
The OPTCOM and IPRO Algorithms for Library Design
The OPTCOM algorithm:
  Balances size and quality of the library
The IPRO algorithm:
  Identifies point mutations in the parent sequences using energy-based scoring functions
  Residue and rotamer choices are driven by a mixed-integer linear programming (MILP) formulation
From [Saraf et al., 2006]
Some Web Resources
IPRO: Iterative Protein Redesign and Optimization. http://maranas.che.psu.edu/IPRO.htm
EGAD: A Genetic Algorithm for protein Design. http://egad.ucsd.edu/software.php
RosettaDesign: A software package. http://rosettadesign.med.unc.edu/
SCHEMA: A pair-wise energy function for scoring protein chimeras made from homologous proteins. http://www.che.caltech.edu/groups/fha/schema-tools/schema-overview.html
SHARPEN: Systematic Hierarchical Algorithms for Rotamers and Proteins on an Extended Network. http://koko.che.caltech.edu/sharpenabout.html
WHAT IF: Software for protein modelling, design, validation, and visualisation. http://swift.cmbi.ru.nl/whatif/
FoldX: A force field for energy calculations and protein design. http://foldx.crg.es/
De Novo-Designed Proteins
In de novo designs, some assumptions are needed in order to make the search space tractable
Usually we start from some basic motifs or domains as scaffolds for the design
Examples:
  βαβ motif resembling a zinc finger
  3 and 4 helix bundles
  Helical coiled-coils
Helix bundle motifs can be parametrized using a few global variables that describe the global structure
Applications:
  New metal-binding sites
  Nonbiological cofactors for novel biomaterials and electromechanical devices
  Novel enzymatic activities
Example: De Novo Design of a Metalloprotein
Computational de novo design of a four-helix bundle (108 residues) containing the non-biological cofactor iron diphenyl porphyrin (DPP-Fe) [Bender et al., 2007]
  The initial helix bundle was selected as a low-energy structure computed with MCSA
  STITCH: a program to select loops connecting helices from PDB Select
  CHARMM and PROCHECK for removing overlaps
  4 His and 4 Thr residues to support the 6-point coordination of the Fe(III) cations
  SCADS: provides site-dependent amino acid probabilities in each round
Challenges in Sequence and Structure-Based CPD
Modeling
  Greater availability of 3D protein structural information
  More accurate energy functions
  Improvement of rigid and flexible docking
Design
  Improvement in search algorithms
  Parametrization for non-natural amino acids
Prediction
  Beyond additive models: using machine-learning algorithms
  More complete environment descriptors
Bibliography I
Gretchen M. Bender, Andreas Lehmann, Hongling Zou, Hong Cheng, H. Christopher Fry, Don Engel, Michael J. Therien, J. Kent Blasie, Heinrich Roder, Jeffrey G. Saven, and William F. DeGrado. De Novo Design of a Single-Chain Diphenylporphyrin Metalloprotein. Journal of the American Chemical Society, 129(35):10732–10740, September 2007. ISSN 0002-7863. doi: 10.1021/ja071199j. URL http://dx.doi.org/10.1021/ja071199j.
F. Edward Boas and Pehr B. Harbury. Potential energy functions for protein design. Current Opinion in Structural Biology, 17(2):199–204, April 2007. ISSN 0959-440X. doi: 10.1016/j.sbi.2007.03.006. URL http://dx.doi.org/10.1016/j.sbi.2007.03.006.
Pablo Carbonell and Antonio del Sol. Methyl side-chain dynamics prediction based on protein structure. Bioinformatics, pages btp463+, July 2009. doi: 10.1093/bioinformatics/btp463. URL http://dx.doi.org/10.1093/bioinformatics/btp463.
Jean-Loup L. Faulon, Michael J. Collins, and Robert D. Carr. The signature molecular descriptor. 4. Canonizing molecules using extended valence sequences. Journal of Chemical Information and Computer Sciences, 44(2):427–436, 2004. ISSN 0095-2338. doi: 10.1021/ci0341823. URL http://dx.doi.org/10.1021/ci0341823.
Michelle M. Meyer, Lisa Hochrein, and Frances H. Arnold. Structure-guided SCHEMA recombination of distantly related β-lactamases. Protein Engineering Design and Selection, 19(12):563–570, December 2006. ISSN 1741-0126. doi: 10.1093/protein/gzl045. URL http://dx.doi.org/10.1093/protein/gzl045.
Diwahar Narasimhan, Mark R. Nance, Daquan Gao, Mei-Chuan Ko, Joanne Macdonald, Patricia Tamburi, Dan Yoon, Donald M. Landry, James H. Woods, Chang-Guo Zhan, John J. G. Tesmer, and Roger K. Sunahara. Structural analysis of thermostabilizing mutations of cocaine esterase. Protein Engineering Design and Selection, 23(7):537–547, July 2010. doi: 10.1093/protein/gzq025. URL http://dx.doi.org/10.1093/protein/gzq025.
Manish C. Saraf, Gregory L. Moore, Nina M. Goodey, Vania Y. Cao, Stephen J. Benkovic, and Costas D. Maranas. IPRO: an iterative computational protein library redesign and optimization procedure. Biophysical Journal, 90(11):4167–4180, June 2006. ISSN 0006-3495. doi: 10.1529/biophysj.105.079277. URL http://dx.doi.org/10.1529/biophysj.105.079277.
Jiangning Song, Kazuhiro Takemoto, Hongbin Shen, Hao Tan, Michael M. Gromiha, and Tatsuya Akutsu. Prediction of Protein Folding Rates from Structural Topology and Complex Network Properties. IPSJ Transactions on Bioinformatics, 3:40–53, 2010. doi: 10.2197/ipsjtbio.3.40. URL http://dx.doi.org/10.2197/ipsjtbio.3.40.