Computational Protein Design. 2. Computational Protein Design Techniques

Pablo Carbonell, pablo.carbonell@issb.genopole.fr

iSSB, Institute of Systems and Synthetic Biology, Genopole, University d’Évry-Val d’Essonne, France

mSSB: December 2010

Outline

1 Introduction

2 Computational Protein Descriptors

3 Sequence-based CPD

4 Structure-based CPD

5 Search Algorithms in CPD

6 De Novo Design

7 Challenges in Sequence and Structure-Based CPD

Computational Protein Design

A Blueprint of CPD Approaches

∗RS: research studies

Molecular Signature Descriptors

A 2D representation of the molecule as an undirected colored graph G(V, E, C), with V: atoms, E: bonds, C: atom types

The signature descriptor of height h of atom x in the molecular graph G, written $^{h}\sigma(x)$, is a canonical representation of the subgraph of G containing all atoms within distance h of x

Molecular signature, the sum of the atomic signatures:

$$^{h}\sigma(G) = \sum_{x \in V} {}^{h}\sigma(x) \tag{1}$$

The signature is a systematic codification of the molecular graph [Faulon et al., 2004]

σ(methylcyclopropane) =

1 [C]([H][C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H]))
2 [C]([H][H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H]))
1 [C]([H][H][H][C]([H][C]([H][H][C,0])[C,0]([H][H])))
1 [H]([C]([C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H])))
4 [H]([C]([H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H])))
3 [H]([C]([H][H][C]([H][C]([H][H][C,0])[C,0]([H][H]))))

Molecular Signature of Reactions and Proteins

Signature of a reaction. The signature of a reaction R

$$S_1 + S_2 + \dots + S_n \rightarrow P_1 + P_2 + \dots + P_m \tag{2}$$

that transforms n substrates into m products is given by the difference between the signature of the products and the signature of the substrates:

$$^{h}\sigma(R) = \sum_{p \in P} {}^{h}\sigma(p) - \sum_{s \in S} {}^{h}\sigma(s) \tag{3}$$

Signature of protein sequences. The protein P is represented by the linear chain given by its collapsed graph at residue level, a reduced molecular graph representation G(V, E, C) known as the string signature, where V: residues a ∈ A, E: contiguity in sequence, C: amino acid type

$$^{h}\sigma(P) = \sum_{a \in A} {}^{h}\sigma(a) \tag{4}$$
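A minimal sketch of equations (3) and (4), under simplifying assumptions: each residue signature of height h is taken to be the sequence window of up to h neighbours on each side, read left to right, with no further canonicalization. The function names `string_signature` and `reaction_signature` are hypothetical, and molecule signatures are assumed to be given as plain counters of atomic signature strings.

```python
from collections import Counter

def string_signature(seq, h):
    """Height-h string signature of a sequence: a multiset of residue windows.

    Assumption: the signature of residue i is the substring covering
    positions i-h .. i+h (truncated at the chain ends).
    """
    signature = Counter()
    for i in range(len(seq)):
        window = seq[max(0, i - h): i + h + 1]
        signature[window] += 1
    return signature

def reaction_signature(substrates, products):
    """Products minus substrates, as in equation (3).

    Each molecule is a Counter mapping atomic signatures to occurrence counts.
    Entries that cancel out are dropped from the result.
    """
    diff = Counter()
    for p in products:
        diff.update(p)
    for s in substrates:
        diff.subtract(s)
    return {sig: n for sig, n in diff.items() if n != 0}

# Usage with toy data
protein_sig = string_signature("ACA", h=1)
reaction_sig = reaction_signature(
    substrates=[Counter(a=2, b=1)],
    products=[Counter(a=1, c=1)],
)
```

Because both signatures are plain count vectors, they can be used directly as sparse feature vectors in a regression or kernel method.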

Protein Contact Maps

The protein contact map is a graph representation of the 3D interactions at residue level, G(V, E, C), where V: residues, E: contacts, C: amino acid type

Two residues are considered to interact when atoms from the two residues lie closer than a predetermined threshold (typically 4.5 to 5 Å)

Contact maps can account for long-range interactions and conformational states

Song et al. [2010]
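The contact definition above can be sketched as follows. This is an illustrative, brute-force version: the input format (one list of atom coordinates per residue) and the function name `contact_map` are assumptions, and the 4.5 Å default is the lower end of the range quoted above.

```python
from itertools import combinations
from math import dist

def contact_map(residues, threshold=4.5):
    """Return the set of residue index pairs (i, j), i < j, that are in contact.

    residues: list of residues, each a list of (x, y, z) atom coordinates in Å.
    Two residues are in contact if any atom pair, one atom from each residue,
    is closer than `threshold`.
    """
    contacts = set()
    for i, j in combinations(range(len(residues)), 2):
        if any(dist(a, b) < threshold for a in residues[i] for b in residues[j]):
            contacts.add((i, j))
    return contacts

# Usage with three toy single-atom residues placed on a line
toy = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
edges = contact_map(toy)
```

The returned pairs are the edge set E of the contact graph; for real structures a spatial grid or k-d tree would avoid the all-against-all distance loop.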

Sequence and Structure-Based CPD

Sequence-based CPD methods are in some cases a good trade-off between complexity of the model and accuracy of the predictions

Sequence-based Knowledge-based potentials

The simplest way to score a protein and to identify active regions is through amino acid scales or indexes

AAindex is a database of
  544 amino acid indexes
  94 amino acid matrices
  47 amino acid pair-wise contact potentials

Examples: hydrophobicity, accessibility, van der Waals volume, secondary structure propensity, flexibility

This approach is widely used when analyzing conserved motifs and correlated mutations in protein fold families through multiple alignments
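As an illustration of scoring with an amino acid scale, the sketch below computes a sliding-window profile using the Kyte-Doolittle hydropathy values. The choice of this particular scale, the window length, and the function name `scale_profile` are assumptions made for the example; any index from AAindex could be substituted.

```python
# Kyte-Doolittle hydropathy scale (one value per amino acid)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def scale_profile(seq, scale, window=5):
    """Average the scale over every full window of the sequence.

    Returns len(seq) - window + 1 values; an empty list if the sequence
    is shorter than the window.
    """
    values = [scale[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Usage: high values flag hydrophobic stretches
profile = scale_profile("MKTLLVAGGIVLA", KYTE_DOOLITTLE, window=5)
```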

Quantitative Structure-Activity Relationship (QSAR) Techniques

QSAR is a statistical method used extensively by the chemical and pharmaceutical industries in small-molecule and peptide optimization

The goal is to model causal relationships between
  structures of interacting molecules
  measurable properties of scientific or commercial interest, such as ADME/Tox (absorption, distribution, metabolism, excretion, and toxicity) of drugs

QSAR Model Evaluation

Model predictability is generally evaluated through the leave-one-out (LOO) cross-validation correlation coefficient q²

Partial least-squares (PLS) regression is commonly used

Additional nonlinear terms can be added through the use of nonlinear regression or machine learning techniques (kernel methods, random forests, etc.)
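A minimal sketch of the LOO q² statistic, assuming the common definition q² = 1 − PRESS / SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the total sum of squares about the mean activity. For brevity the model is a one-descriptor least-squares line rather than PLS; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loo_q2(xs, ys):
    """Leave-one-out cross-validated q2 for the one-descriptor linear model."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without sample i, then predict sample i
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Usage with toy descriptor/activity data
q2 = loo_q2([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1])
```

A q² close to 1 indicates good predictive ability; unlike the fitted r², q² can become negative when the model predicts worse than the mean.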

QSAR Modeling Workflow

The ProSAR Algorithm

An extension of SAR-based approaches to CPD

It formalizes the decision-making process about which mutations to include in combinatorial libraries

$$y = \sum_{i} \sum_{j \in A} c_{ij}\, x_{ij} \tag{5}$$

$y$: the predicted function (activity) of the protein sequence
$c_{ij}$: the regression coefficient corresponding to the mutational effect of having residue j, among the 20 amino acids A, at position i
$x_{ij}$: binary variable indicating the presence or absence of residue j at position i
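The additive model of equation (5) can be sketched as below. The coefficients here are made-up illustrative numbers, not fitted values, and the names `predict_activity` and `rank_mutations` are hypothetical; in practice the coefficients would come from a regression on measured variants.

```python
def predict_activity(seq, coef):
    """Equation (5): sum the coefficient c_ij of the residue found at each position.

    coef maps (position, residue) to a regression coefficient; pairs that are
    absent from the model contribute zero.
    """
    return sum(coef.get((i, aa), 0.0) for i, aa in enumerate(seq))

def rank_mutations(coef):
    """List (position, residue, coefficient) with positive effect, best first."""
    beneficial = [(pos, aa, c) for (pos, aa), c in coef.items() if c > 0]
    return sorted(beneficial, key=lambda item: item[2], reverse=True)

# Usage with hypothetical coefficients for a two-residue toy sequence
coef = {(0, "A"): 0.5, (1, "G"): -0.2, (1, "L"): 0.3}
activity = predict_activity("AL", coef)
candidates = rank_mutations(coef)
```

Mutations with positive coefficients are candidates to keep in the next library round, and those with negative coefficients are candidates to drop.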

Improving Catalytic Function by ProSAR-driven Enzyme Evolution

Codexis Inc.

Statistical analysis of protein sequence-activity relationships

Bacterial biocatalysis of atorvastatin (Lipitor), a cholesterol-lowering drug

Structure-based CPD

Energy functions and molecular force fields

Local conformational restrictions

Predicting entropic factors

Protein topological properties

From Narasimhan et al. [2010]

Energy Functions and Molecular Force Fields

In structure-based CPD, folds are usually represented by the spatial coordinates of the backbone atoms, or design scaffold

Protein design is done by placing amino acid side chains along the scaffold

Side chains are only permitted to assume a discrete set of statistically preferred conformations: rotamers

Rotamer/backbone and rotamer/rotamer interaction energies are tabulated

These potential energies can then be approximated by using any of the standard force fields: CHARMM, AMBER, GROMOS

Molecular Force Fields

AMBER: a classical force field for energy and MD calculations:

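$$V(r^N) = \sum_{\text{bonds}} k_b (l - l_0)^2 + \sum_{\text{angles}} k_a (\theta - \theta_0)^2 + \sum_{\text{torsions}} V_n \left[ 1 + \cos(n\omega - \gamma) \right] + \sum_{j=1}^{N-1} \sum_{i=j+1}^{N} \left\{ \varepsilon_{i,j} \left[ \left( \frac{r_{0ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{r_{0ij}}{r_{ij}} \right)^{6} \right] + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right\} \tag{6}$$

1. $\sum_{\text{bonds}}(\cdot)$: energy between covalently bonded atoms.
2. $\sum_{\text{angles}}(\cdot)$: energy due to the geometry of electron orbitals involved in covalent bonding.
3. $\sum_{\text{torsions}}(\cdot)$: energy for twisting a bond due to bond order (e.g. double bonds) and neighboring bonds or lone pairs of electrons.
4. $\sum_{j=1}^{N-1} \sum_{i=j+1}^{N}(\cdot)$: non-bonded energy between all atom pairs:
   1. van der Waals energies
   2. Electrostatic energies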
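A small numeric sketch of the non-bonded term may help. It assumes a 12-6 van der Waals form written in terms of the well depth ε and the minimum-energy distance r0, plus a Coulomb term, and it simplifies heavily: a single (ε, r0) pair is used for every atom pair, and no bonded pairs are excluded. The constant 332.06 is the approximate value of 1/(4πε0) in kcal·Å/(mol·e²); all names are hypothetical.

```python
from itertools import combinations
from math import dist

COULOMB_K = 332.06  # approx. 1/(4*pi*eps0) in kcal*Angstrom/(mol*e^2)

def lennard_jones(r, eps, r0):
    """12-6 van der Waals energy with well depth eps at distance r0."""
    s6 = (r0 / r) ** 6
    return eps * (s6 * s6 - 2.0 * s6)

def coulomb(r, qi, qj, k=COULOMB_K):
    """Electrostatic energy between point charges qi and qj at distance r."""
    return k * qi * qj / r

def nonbonded_energy(coords, charges, eps, r0):
    """Sum van der Waals and electrostatic energies over all atom pairs.

    Simplification: every pair shares the same eps and r0, and pairs that a
    real force field would exclude (bonded neighbours) are not filtered out.
    """
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = dist(coords[i], coords[j])
        total += lennard_jones(r, eps, r0) + coulomb(r, charges[i], charges[j])
    return total

# Usage: two neutral atoms at the minimum-energy distance give -eps
e_min = nonbonded_energy([(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)], [0.0, 0.0], eps=0.1, r0=3.5)
```

In this form the van der Waals energy equals −ε at r = r0, which makes the two parameters easy to interpret.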

Structure-based Knowledge-based Potentials

They are built by performing a large-scale statistical study of structural databases such as the PDB (Protein Data Bank)

Rotamer libraries (∼ 150 rotameric states)
Binary patterning: only some types of amino acids are allowed, based on the hydrophobic environment
An implicit solvation model
Secondary structure propensity
Frequency of small segments in the PDB
Pairwise potentials
van der Waals interactions
Hydrogen bonding
Electrostatics
Entropy-based penalties for flexible side-chains

From Boas and Harbury [2007]

Energy Functions

Design along the backbone or scaffold

Rotamer/backbone and rotamer/rotamer interaction energies are tabulated

Precomputed from molecular force fields: CHARMM, AMBER, GROMOS

Total energy of the protein

$$E_{TOT} = \sum_{k=1}^{N} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{7}$$

$N$: length of the protein

$r_k$: the rotamer of the kth side chain

$E_k(r_k)$: the self-energy of a particular rotamer $r_k$

$E_{kl}(r_k, r_l)$: the pair energy of rotamers $r_k$, $r_l$
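The table lookup behind the total energy can be sketched as follows. The layout of the tables is an assumption: `self_E[k][a]` holds the self-energy of rotamer a at position k, and `pair_E[(k, l)][a][b]` holds the pair energy for k < l, so each unordered pair of positions is counted once.

```python
def total_energy(rotamers, self_E, pair_E):
    """Total energy of one rotamer assignment from precomputed tables.

    rotamers: list with the chosen rotamer index at each position.
    self_E:   self_E[k][a], self-energy of rotamer a at position k.
    pair_E:   pair_E[(k, l)][a][b] for k < l, pair energy of rotamer a at
              position k with rotamer b at position l.
    """
    n = len(rotamers)
    energy = sum(self_E[k][rotamers[k]] for k in range(n))
    for k in range(n):
        for l in range(k + 1, n):
            energy += pair_E[(k, l)][rotamers[k]][rotamers[l]]
    return energy

# Usage with a toy two-position, two-rotamer problem (made-up energies)
self_E = [[1.0, 2.0], [0.5, 0.0]]
pair_E = {(0, 1): [[0.1, -1.0], [0.3, 0.2]]}
e = total_energy([0, 1], self_E, pair_E)
```

Because every term is a lookup, evaluating a candidate design costs O(N²) additions regardless of how expensive the underlying force field was to compute.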

The Role of Dynamics

Besides protein structure, protein dynamics can play a direct role in molecular recognition

Flexible proteins recognize their targets through induced fit or conformational selection, likely showing promiscuity

Binding is commonly enthalpy-driven, but in some cases entropy is important, for instance:
  Proteins with multiple binding sites
  Small hydrophobic molecules

Two sources of protein motions:
  Protein flexibility: intraconformational dynamics (fast time scale motions)
  Conformational heterogeneity: interconformational dynamics

Gibbs free energy:

$$\Delta G = \Delta H - T \Delta S \tag{8}$$

$$\Delta S = \Delta S_{solv} + \Delta S_{conf} + \Delta S_{rt} \tag{9}$$

$\Delta S_{conf}$: conformational entropy of protein and ligand

$\Delta S_{rt}$: rotational and translational degrees of freedom
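A short numeric illustration of equations (8) and (9). The values are invented for the example, with enthalpy in kcal/mol and entropies in kcal/(mol·K); the function name `binding_free_energy` is hypothetical.

```python
def binding_free_energy(dH, dS_solv, dS_conf, dS_rt, T=298.15):
    """Gibbs free energy of binding from enthalpy and the entropy components."""
    dS = dS_solv + dS_conf + dS_rt
    return dH - T * dS

# Usage: a favourable solvent entropy gain outweighs the conformational and
# rotational/translational entropy losses, making dG more negative than dH
dG = binding_free_energy(dH=-10.0, dS_solv=0.02, dS_conf=-0.01, dS_rt=-0.005, T=300.0)
```

With these numbers the net entropy change is 0.005 kcal/(mol·K), so the entropic term contributes −1.5 kcal/mol on top of the enthalpy.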

Predicting Side-chain Dynamics from Structural Descriptors

The Lipari-Szabo model-free approach quantifies motions from NMR experiments by computing the generalized order parameter S²

Protein backbone dynamics: ¹⁵NH and ¹³CαH NMR relaxation methods
Protein side-chain methyl dynamics: ¹³CαH NMR relaxation methods (side-chain motions in the picosecond-to-nanosecond time regime)

From the BMRB we compiled S² data for 18 proteins, including 10 proteins in 2 or more different states: calmodulin, barnase, PDZ, MUP, DHFR, staphylococcal nuclease, Pin1, SH3 domain, MSG

This technique provides measurements only for the Cα of methyl groups in side chains: ALA, LEU, ILE, MET, THR, VAL

Structural Descriptors of Methyl Dynamics

We consider the following parameters influencing side-chain dynamics:

Packing density at the methyl site i and its neighboring residues j within a sphere of r = 5 Å

$$\sum_{r_{ij} < 5\,\text{Å}} C_j\, e^{-r_{ij}} = \sum_{r_{ij} < 5\,\text{Å}} \left( \sum_{r_{jk} < 5\,\text{Å}} e^{-r_{jk}} \right) e^{-r_{ij}} \tag{10}$$

Side chain stiffness: number of dihedral angles separating the backbone from the methyl carbon, weighted by the side-chain packing
Rotameric state: angular distance ∆χ = χ − χ0 to the closest rotameric state χ0 in the library
Elongation: distance from the methyl site to the Cα
Pairwise contact potential: a knowledge-based potential of the frequency of contacts between residues at several distances, computed from the PDB
Solvation effect: DSSP accessibility and residue hydrophobicity
Van der Waals contacts
Hydrogen bonds (in the case of threonine)
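The packing density descriptor of equation (10) can be sketched as below. It assumes that the sites are given as a flat list of 3D points in Å, that a site is never counted as its own neighbour, and that distances enter the exponentials in Å; the function names are hypothetical.

```python
from math import dist, exp

def neighbor_weight(points, j, cutoff=5.0):
    """C_j: sum of exp(-r_jk) over the neighbours k of site j within the cutoff."""
    total = 0.0
    for k in range(len(points)):
        if k == j:
            continue
        r_jk = dist(points[j], points[k])
        if r_jk < cutoff:
            total += exp(-r_jk)
    return total

def packing_density(points, i, cutoff=5.0):
    """Equation (10): sum of C_j * exp(-r_ij) over the neighbours j of site i."""
    total = 0.0
    for j in range(len(points)):
        if j == i:
            continue
        r_ij = dist(points[i], points[j])
        if r_ij < cutoff:
            total += neighbor_weight(points, j, cutoff) * exp(-r_ij)
    return total

# Usage with three toy sites on a line, 1 Å apart
sites = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
density = packing_density(sites, 0)
```

The nested sum means that a site scores highly when its neighbours are themselves densely packed, so the descriptor captures the second shell of the environment as well as the first.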

Predicting Methyl Side-chain Dynamics

Algorithm: neural network
Cross-validation: r = 0.71 ± 0.029 (p-value = 4.6 × 10⁻⁸⁷)

Protein      MD method    r (MD)   r (nnet)
ubiquitin    AMBER99SB    0.81     0.81
TNfn3        CHARMM 22    0.62     0.79
FNfn10       CHARMM 22    0.51     0.64
barnase      OPLS-AA/L    0.55     0.64
calmodulin   FDPB         0.60     0.72

Example: experimental and predicted changes in ∆S² of barnase after binding barstar

∆S² > 0: rigidification; ∆S² < 0: flexibilization

[Carbonell and del Sol, 2009]

Search Algorithms in CPD

Search Algorithms

Objective: finding the best design within the space of all possible amino acid/rotameric states

A vast search space: 20^N or p^N
  N: number of positions to mutate
  p: number of rotameric states

Strategies
  Deterministic algorithms
    Dead-end elimination (DEE) algorithm: a pruning method
    Some accelerations of the DEE algorithm: upper-bound estimation; the “magic bullet” metric; conformational splitting; background optimization
  Stochastic algorithms
    Monte Carlo
    Simulated annealing
    Genetic algorithms

The DEE Algorithm

It assumes that the energy of the protein can be written as

$$E_{TOT} = \sum_{k=1}^{N} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{11}$$

$N$: length of the protein

$r_k$: the rotamer of the kth side chain

$E_k(r_k)$: the self-energy of a particular rotamer $r_k$

$E_{kl}(r_k, r_l)$: the pair energy of the rotamers $r_k$, $r_l$

Complexity:
  Single search scales quadratically with the total number of rotamers: O((p × N)^2)
  Pair search scales cubically: O((p × N)^3)
  Brute force enumeration: O(p^N)

The DEE Algorithm

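Single rotamers and rotamer pairs are eliminated during the computational cycles

Single elimination: eliminate rotamer $r_k^A$ if some other rotamer $r_k^B$ at the same position gives a better energy, whatever the rotamers $r_l^X$ chosen at the other positions

$$E_k(r_k^A) + \sum_{l \neq k}^{N} \min_X E_{kl}(r_k^A, r_l^X) > E_k(r_k^B) + \sum_{l \neq k}^{N} \max_X E_{kl}(r_k^B, r_l^X) \tag{12}$$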

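Pairs elimination: eliminate a pair of rotamers at two positions if there exists another pair that gives a better energy

$$U_{kl}^{AB} \overset{\text{def}}{=} E_k(r_k^A) + E_l(r_l^B) + E_{kl}(r_k^A, r_l^B) \tag{13}$$

$$U_{kl}^{AB} + \sum_{i \neq k,l} \min_X \left( E_{ki}(r_k^A, r_i^X) + E_{li}(r_l^B, r_i^X) \right) > U_{kl}^{CD} + \sum_{i \neq k,l} \max_X \left( E_{ki}(r_k^C, r_i^X) + E_{li}(r_l^D, r_i^X) \right)$$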

Values are precomputed and stored in energy matrices
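A minimal sketch of the single-elimination cycle, assuming the classical DEE criterion: rotamer A at position k is dead-ending if its best-case energy (minimum pair energies with the surviving rotamers elsewhere) is still higher than the worst-case energy (maximum pair energies) of some competitor B at the same position. The table layout `pair_E[k][l][a][b]`, stored for both orderings of k and l, and the function name are assumptions.

```python
def dee_singles(self_E, pair_E):
    """Iteratively prune dead-ending rotamers; return surviving sets per position.

    self_E[k][a]:       self-energy of rotamer a at position k.
    pair_E[k][l][a][b]: pair energy of rotamer a at k with rotamer b at l (k != l).
    """
    n = len(self_E)
    alive = [set(range(len(energies))) for energies in self_E]

    def bound(k, a, pick):
        # pick=min gives the best case for rotamer a, pick=max the worst case
        return self_E[k][a] + sum(
            pick(pair_E[k][l][a][x] for x in alive[l])
            for l in range(n) if l != k
        )

    changed = True
    while changed:  # repeat until a full cycle eliminates nothing
        changed = False
        for k in range(n):
            for a in list(alive[k]):
                for b in alive[k] - {a}:
                    if bound(k, a, min) > bound(k, b, max):
                        alive[k].discard(a)
                        changed = True
                        break
    return alive

# Usage with a toy two-position, two-rotamer problem (made-up energies)
self_E = [[0.0, 10.0], [0.0, 0.0]]
p01 = [[0.0, 1.0], [0.0, 1.0]]   # indexed [rotamer at 0][rotamer at 1]
p10 = [[0.0, 0.0], [1.0, 1.0]]   # transpose of p01
pair_E = [[None, p01], [p10, None]]
survivors = dee_singles(self_E, pair_E)
```

A rotamer is only removed when a surviving competitor exists at the same position, so every position keeps at least one rotamer and the global minimum is never pruned.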

Stochastic Algorithms

Search in the space of feasible designs by making a series of combinations of random and directed moves

Monte Carlo Metropolis: a move consists of exchanging one rotamer for another at a randomly chosen position; a modification is accepted if it lowers the energy, and otherwise with probability exp(−∆E/kT)

Simulated annealing: a high initial temperature allows higher-energy nearby solutions to be explored during the initial cycles of the search

Genetic algorithms: a population of models is propagated (evolved) throughout the course of the run, and genetic operators, such as recombination, are used to create new models from existing parents

They are fast and can be scaled up to problems of large complexity

They are not guaranteed to converge to the optimal solution
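A minimal sketch of a Metropolis search with simulated annealing over rotamer assignments. It assumes the usual Metropolis rule (downhill moves are always accepted, uphill moves with probability exp(−∆E/T)), a geometric cooling schedule, and an energy callback supplied by the caller; the parameter values and the name `anneal` are illustrative only.

```python
import math
import random

def anneal(energy, n_rotamers, steps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Return the lowest-energy assignment seen and its energy.

    energy:     function mapping a list of rotamer indices to a number.
    n_rotamers: number of available rotamers at each position.
    """
    rng = random.Random(seed)
    state = [rng.randrange(p) for p in n_rotamers]
    current = energy(state)
    best, best_energy = list(state), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        k = rng.randrange(len(state))      # randomly chosen position
        old = state[k]
        state[k] = rng.randrange(n_rotamers[k])
        proposed = energy(state)
        delta = proposed - current
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = proposed             # accept the move
            if current < best_energy:
                best, best_energy = list(state), current
        else:
            state[k] = old                 # reject and restore
    return best, best_energy

# Usage with a toy energy whose minimum is rotamer 2 at every position
toy_energy = lambda s: sum((r - 2) ** 2 for r in s)
best, best_e = anneal(toy_energy, [5, 5, 5, 5])
```

Tracking the best assignment seen separately from the current one means the result never gets worse as the walk continues, even though individual uphill moves are accepted.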

The SCHEMA Algorithm

Equivalent to an in silico directed evolution

Consists of scoring libraries of hybrid protein sequences against the parental sequence

Scoring:
  Calculate the number of interactions between residues (contacts within 4.5 Å) that are disrupted in the creation of hybrid proteins
  Hybrids are scored for stability by counting the number of disruptions
  The protein is partitioned into blocks that should not be interrupted by crossovers (analogous to genetic algorithms)

From [Meyer et al., 2006]
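The disruption count can be sketched as follows. The sketch assumes aligned parent and hybrid sequences of equal length, a precomputed list of contacting position pairs, and the rule that a contact is disrupted when the residue pair found in the hybrid does not occur at those two positions in any parent. The function name is hypothetical.

```python
def schema_disruption(hybrid, parents, contacts):
    """Count contacts whose residue pair in the hybrid is absent from all parents.

    hybrid:   hybrid sequence, aligned to the parents.
    parents:  list of aligned parental sequences.
    contacts: iterable of (i, j) position pairs in contact in the structure.
    """
    broken = 0
    for i, j in contacts:
        pair = (hybrid[i], hybrid[j])
        if all((p[i], p[j]) != pair for p in parents):
            broken += 1
    return broken

# Usage: a crossover between positions 1 and 2 breaks both toy contacts
parents = ["AAAA", "CCCC"]
contacts = [(0, 3), (1, 2)]
score = schema_disruption("AACC", parents, contacts)
```

Contacts between residues inherited from the same parent are never broken, which is why crossovers placed between blocks of interacting residues give low disruption.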

The OPTCOM and IPRO Algorithms for Library Design

The OPTCOM algorithm:
  Balances size and quality of the library

The IPRO algorithm:
  Identifies point mutations in the parent sequences using energy-based scoring functions
  Residue and rotamer choices are driven by a mixed-integer linear programming (MILP) formulation

From [Saraf et al., 2006]

Some Web Resources

IPRO: Iterative Protein Redesign and Optimization. http://maranas.che.psu.edu/IPRO.htm

EGAD: A Genetic Algorithm for protein Design. http://egad.ucsd.edu/software.php

RosettaDesign: A software package. http://rosettadesign.med.unc.edu/

SCHEMA: A pair-wise energy function for scoring protein chimeras made from homologous proteins. http://www.che.caltech.edu/groups/fha/schema-tools/schema-overview.html

SHARPEN: Systematic Hierarchical Algorithms for Rotamers and Proteins on an Extended Network. http://koko.che.caltech.edu/sharpenabout.html

WHAT IF: Software for protein modelling, design, validation, and visualisation. http://swift.cmbi.ru.nl/whatif/

FoldX: A force field for energy calculations and protein design. http://foldx.crg.es/

De Novo-Designed Proteins

In de novo designs, some assumptions are needed in order to make the search space tractable

Usually we start from some basic motifs or domains as scaffolds for the design

Examples:
  βαβ motif resembling a zinc finger
  3 and 4 helix bundles
  Helical coiled-coils

Helix bundle motifs can be parametrized using a few global variables that describe the global structure

Applications:
  New metal-binding sites
  Nonbiological cofactors for novel biomaterials and electromechanical devices
  Novel enzymatic activities

Example: De Novo Design of a Metalloprotein

Computational de novo design of a four-helix bundle (108 residues) containing the non-biological cofactor iron diphenyl porphyrin (DPP-Fe) [Bender et al., 2007]

The initial helix bundle was selected as a low-energy structure computed with MCSA
STITCH: a program to select loops connecting helices from PDB Select
CHARMM and PROCHECK for removing overlaps
4 His and 4 Thr residues to support the 6-point coordination of the Fe(III) cations
SCADS: provides site-dependent amino acid probabilities in each round

Challenges in Sequence and Structure-Based CPD

Modeling
  Greater availability of 3D protein structural information
  More accurate energy functions
  Improvement of rigid and flexible docking

Design
  Improvement in search algorithms
  Parametrization for non-natural amino acids

Prediction
  Beyond additive models: using machine-learning algorithms
  More complete environment descriptors

Bibliography

Gretchen M. Bender, Andreas Lehmann, Hongling Zou, Hong Cheng, H. Christopher Fry, Don Engel, Michael J. Therien, J. Kent Blasie, Heinrich Roder, Jeffrey G. Saven, and William F. DeGrado. De Novo Design of a Single-Chain Diphenylporphyrin Metalloprotein. Journal of the American Chemical Society, 129(35):10732–10740, September 2007. ISSN 0002-7863. doi: 10.1021/ja071199j. URL http://dx.doi.org/10.1021/ja071199j.

F. Edward Boas and Pehr B. Harbury. Potential energy functions for protein design. Current Opinion in Structural Biology, 17(2):199–204, April 2007. ISSN 0959-440X. doi: 10.1016/j.sbi.2007.03.006. URL http://dx.doi.org/10.1016/j.sbi.2007.03.006.

Pablo Carbonell and Antonio del Sol. Methyl side-chain dynamics prediction based on protein structure. Bioinformatics, pages btp463+, July 2009. doi: 10.1093/bioinformatics/btp463. URL http://dx.doi.org/10.1093/bioinformatics/btp463.

Jean-Loup L. Faulon, Michael J. Collins, and Robert D. Carr. The signature molecular descriptor. 4. Canonizing molecules using extended valence sequences. Journal of Chemical Information and Computer Sciences, 44(2):427–436, 2004. ISSN 0095-2338. doi: 10.1021/ci0341823. URL http://dx.doi.org/10.1021/ci0341823.

Michelle M. Meyer, Lisa Hochrein, and Frances H. Arnold. Structure-guided SCHEMA recombination of distantly related β-lactamases. Protein Engineering Design and Selection, 19(12):563–570, December 2006. ISSN 1741-0126. doi: 10.1093/protein/gzl045. URL http://dx.doi.org/10.1093/protein/gzl045.

Diwahar Narasimhan, Mark R. Nance, Daquan Gao, Mei-Chuan Ko, Joanne Macdonald, Patricia Tamburi, Dan Yoon, Donald M. Landry, James H. Woods, Chang-Guo Zhan, John J. G. Tesmer, and Roger K. Sunahara. Structural analysis of thermostabilizing mutations of cocaine esterase. Protein Engineering Design and Selection, 23(7):537–547, July 2010. doi: 10.1093/protein/gzq025. URL http://dx.doi.org/10.1093/protein/gzq025.

Manish C. Saraf, Gregory L. Moore, Nina M. Goodey, Vania Y. Cao, Stephen J. Benkovic, and Costas D. Maranas. IPRO: an iterative computational protein library redesign and optimization procedure. Biophysical Journal, 90(11):4167–4180, June 2006. ISSN 0006-3495. doi: 10.1529/biophysj.105.079277. URL http://dx.doi.org/10.1529/biophysj.105.079277.

Jiangning Song, Kazuhiro Takemoto, Hongbin Shen, Hao Tan, Michael M. Gromiha, and Tatsuya Akutsu. Prediction of Protein Folding Rates from Structural Topology and Complex Network Properties. IPSJ Transactions on Bioinformatics, 3:40–53, 2010. doi: 10.2197/ipsjtbio.3.40. URL http://dx.doi.org/10.2197/ipsjtbio.3.40.
