Protein folding
Process of folding
Modeling the process of folding
Evolution vs. folding
Impact of function on protein evolution
Process
Local Interactions
Secondary Structure Elements (SSE)
Assembly of SSE
Equilibrium Structure
Protein folding
http://www.blueprint.org/proteinfolding/trades/details/trades_movies.html
Protein folding
Important thing to note
It is possible that residues that are not doing anything in the folded protein were
actually critically important to get the peptide folded in the
first place.
Protein folding
Simulation studies show that the most common protein folds are those that can
withstand the most sequence variation over time without
changing their topology. The prion protein is a poster-child example
of the opposite.
Protein Evolution
Evolutionary meaning
Most common folds are those able to
withstand point mutations the best.
These are known as designable folds.
Protein folding
Marginal stability
The most stable folds are not necessarily those
with the lowest energy,
but those that maximally penalize switching to an alternative conformation.
Protein Evolution
Marginal stability
Evolutionary implication(s)
There is thus selective pressure on residues in
a protein not only to maintain important
interactions, but also to make sure that some interactions NEVER
happen.
Summary
Proteins fold into energetically stable conformations.
For one chain, there are a large number of possible conformations, however.
The biological conformation is selected during folding: not necessarily the “best” conformation.
Role of biology on structures
A few examples using mapping of rate of evolution.
The fitness of a protein is ultimately its biological function, not its structure.
We’ll have a look at their structural requirements.
Structural Biology
Outline
How genetics encodes structure.
What makes a protein fold.
Role of biological function in preserving a fold.
Comparing two structures for similarities.
Genetic information and proteins
3D information is encoded into (1D) sequences.
STKKKPLTQEQLEDARRLKA IYEKKKNELGLSQESVADKM GMGQSGVGALFNGINALNAY NAALLAKILKVSVEEFSPSIAREIYEMYEA
Protein structure of the λ repressor (cI) N-terminal domain from phage Lambda, PDB: 1LMB
?
Genetic information and proteins
The encoding can only be indirect
Because there is nothing in the DNA
that tells each amino acid where to go.
Genetic information and proteins
However,
A few types of physical interactions dominate
the process of protein folding.
Amino acids
Components
Main Chain
Side Chains
Side Chains
Responsible for the “name”.
Can be clustered based on:
- chemical properties
- structure
This ultimately determines the evolutionary interchangeability.
Protein folding
Van der Waals forces
The electron clouds around the nuclei are more
stable if they can lightly interact with other electron clouds.
Makes atoms sticky relative to each other.
Protein folding
Electrostatic forces
Long range interactions.
Pull/Push over longer distances.
Protein folding
Hydrogen bonds
Electrostatic. Short range, not flexible
Can be seen as the velcro holding proteins
together.
Protein folding
Hydrophobic interactions
Water molecules in the liquid pack so as to minimize their
energy.
This implies that water molecules are, more often than not, hydrogen-bonded
to their neighbors.
Protein folding
Hydrophobic interactions
If you introduce a droplet of oil in solution, many hydrogen bonds will have to be broken at the interface, at an energy cost.
This is why hydrophobic and hydrophilic groups look like they are avoiding
each other.
Protein folding
During folding,
The polypeptide has to follow a strict
sequence of events in order to find the
correct conformation in a timely fashion.
Protein folding
Secondary Structures
Stable because of local H-bonds.
They form larger blocks with fewer degrees of freedom
of movement.
Protein folding
Geometry plays a very important role.
Because there are only a few angles that can
change along the backbone, there is a
limited number of ways a protein can
fold onto itself.
Protein structures are organized in a Hierarchical fashion
Secondary structures - Geometry
Dihedral Angle
Because most main-chain atoms are constrained in a planar amide (peptide) bond, the entire trajectory of the chain can be defined by a pair of angles for each amino acid: $\phi$ and $\psi$.
This can be represented with a “Ramachandran Plot”, from which it is obvious that the observed angle pairs cluster in a few regions.
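As a minimal sketch of how such a backbone angle can be computed from atomic coordinates, the function below returns the dihedral angle defined by four points. The function name and the toy coordinates are assumptions made for illustration; for $\phi$ the four atoms would be C(i-1), N, CA, C and for $\psi$ they would be N, CA, C, N(i+1).

```python
import math

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (degrees) around the p1-p2 axis, from four 3D points."""
    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]

    def dot(a, b):
        return sum(a[i] * b[i] for i in range(3))

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    b1 = [c / norm for c in b1]                    # unit vector along the central bond
    v = sub(b0, [dot(b0, b1) * c for c in b1])     # b0 projected onto the plane normal to b1
    w = sub(b2, [dot(b2, b1) * c for c in b1])     # b2 projected onto the same plane
    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (hypothetical): cis, perpendicular and trans arrangements.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
perpendicular = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

Applying this to every residue of a chain and plotting the resulting angle pairs is what produces a Ramachandran plot.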
Protein structures are organized in a Hierarchical fashion
Secondary structures – The alpha helix
The Hydrogen Bond
Again, a helix is an ideal setup to place our “velcro” H-bonds always at the right place.
Periodicity
To the delight of statisticians and computer scientists.
Protein structures are organized in a Hierarchical fashion
Secondary structures – The beta strand (beta sheets)
Another periodic pattern
Responsible for super-structure rigidity and some truly amazing patterns.
Protein structures are organized in a Hierarchical fashion
Secondary structures – The myth of “random” coil.
Random structures in proteins are extremely rare.
Many use the expression anyway to refer to the “rest” of the protein.
Other minor secondary structures
Turns, loops, bridges. These do not have the critical periodicity found in α and β structures, however.
Protein structures are organized in a Hierarchical fashion
Tertiary structures – Why care about secondary structures?
Secondary structures are building blocks
Detecting and predicting secondary structures is a key process in structural biology.
Other uses
Visualization, classification…
Protein Diversity
The current release of PDB contains 28,000
structure entries.
26,000 are proteins
There are an estimated 600-8000 possible
unique protein folds.
http://www.jacquesdeshaies.com/expositions/virtual/new-virtual/uppsala-invit.html
PDB
Overview
Repository of structures
Proteins, nucleotides, complexes, mutants
Quality improves over time
Data validation tools are getting better. More redundant structures are available for cross-reference.
Small number of folds
Does this mean that all proteins
come from a small set of ancestor
molecules?
Perhaps, but not necessarily.
Fast Slow
Maximum-Likelihood Site-Rates are Biologically Relevant
Rhodopsin-like G-protein receptors
Pfam (dataset 1Tml_7) 69 taxa
Maximum-Likelihood Site-Rates are Biologically Relevant
Tubulin
34 taxa 33 taxa
The constraints imposed by co-evolution far outweigh the
structural constraints.
Fast Slow
Phylogenetic mapping of structures
Predicting rates of evolution
This experiment was conducted to see if we could predict the rate of evolution in
the enzyme Enolase.
Phylogenetic mapping of structures
Predicting rates of evolution
The most important factor to predict
evolutionary constraints was the
presence of the active site.
Evolutionarily constrained by the active site.
Summary
Structures are rigid templates to provide some biological function.
It takes a lot of structure to position a few atoms in an enzyme.
Structural Homology
Because 1 structure is made of thousands of coherent interactions:
The probability to see a new structure emerge from a random sequence is close to 0.
Therefore: similar structures are likely to be homologous.
Use of structural similarity in evolutionary studies
Homology can be detected via sequence identity
Structures drift at a much slower rate. In fact, do they drift at all?
Structural similarity can be used to detect homology, although there is evidence that
convergence is much more common in structure than in sequence.
Structural Convergence
There are only so many different ways to fold a dozen secondary structure elements.
Some folds are much more likely to evolve because they are more robust to mutations.
Designability
Protein Similarity
VAST
Aligns secondary structures only.
Considers the geometric transformation that brings as
many helices and strands together as possible.
CE
Breaks each structure down into
peptides of 8 residues and finds the best match against a reference
protein. The final alignment is the transformation that aligns as many continuous residues as
possible.
Comparing and aligning structures
Expanding into detection methods
What about remote, yet significant, similarities?
Example on the right
There is a significant similarity between a single domain in two distinct proteins (yellow and orange).
Are they homologous?
Comparing and aligning structures
Difficulties in aligning structures.
In some cases, the order of the elements that superimpose has been shuffled by circular permutation.
There are many cases of structurally similar proteins with no more than a random degree of identity at the sequence level.
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
Probably the most used service for protein structure alignment, since it runs off the NCBI web site and has already been run on every available structure.
1 – Given two proteins A and B.
2 – Given that each structure has a collection
of secondary structure elements (SSE), where H is a helix and S a strand:
$A = \{H_1, S_2, H_3, \ldots, H_n\}$, with $|SSE_A| = n$
$B = \{H_1, H_2, S_3, \ldots, S_m\}$, with $|SSE_B| = m$
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
3 – Find the rotation and translation to apply to each helix/strand in A to align it with each element in B.
These transformations can be summarized by a matrix, where $T_{A_iB_j}$ is the transformation that aligns element $i$ of A with element $j$ of B:
$$
\begin{pmatrix}
T_{A_1B_1} & T_{A_2B_1} & \cdots & T_{A_nB_1} \\
T_{A_1B_2} & T_{A_2B_2} & \cdots & T_{A_nB_2} \\
\vdots & \vdots & \ddots & \vdots \\
T_{A_1B_m} & T_{A_2B_m} & \cdots & T_{A_nB_m}
\end{pmatrix}
$$
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
4 – If two structures are identical, each helix/strand will be part of a pair with a common transformation, which is simply the transformation that aligns the whole proteins.
In remotely similar structures, not all helices/strands will have a match.
The best set of rotation/translation will be the one that is shared by the largest number of secondary structure pairs.
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
4 – Sharing has to be defined a bit more formally, where $\alpha$ is some kind of tolerance cutoff used to decide whether two transformations $T_i$ and $T_j$ are identical:
$|T_i - T_j| < \alpha$
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
4a – Every time we have a “match”, we draw a link between the two transformations. The result is a graph with connections only between similar sets of rotation/translation.
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
5 – Once the problem is abstracted into a graph, it is possible to use the computational bag of tricks to figure out which set of connected matrices forms the largest group. The average rotation/translation in this group is the one that best superimposes proteins A and B.
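The graph step can be sketched on a toy problem. This is a simplified illustration, not the actual VAST implementation: each transformation is reduced to a single rotation angle in degrees, “identical” means the angles differ by less than a tolerance, and the “largest group” is taken to be the largest connected component. The function name, the SSE pair labels and the angle values are all hypothetical.

```python
from collections import defaultdict

def largest_consistent_group(transforms, alpha):
    """transforms maps an SSE pair to a toy transformation (one angle, degrees).
    Returns the largest set of pairs connected by links between transformations
    that differ by less than alpha."""
    pairs = list(transforms)
    links = defaultdict(set)
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            if abs(transforms[pairs[a]] - transforms[pairs[b]]) < alpha:
                links[pairs[a]].add(pairs[b])
                links[pairs[b]].add(pairs[a])

    best, seen = [], set()
    for start in pairs:
        if start in seen:
            continue
        # Depth-first walk over one connected component of the graph.
        group, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in links[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(group) > len(best):
            best = group
    return sorted(best)

# Hypothetical transformations for five candidate SSE pairs.
toy = {("A1", "B1"): 30.0, ("A2", "B2"): 31.5, ("A3", "B4"): 29.0,
       ("A1", "B3"): 120.0, ("A4", "B2"): 200.0}
group = largest_consistent_group(toy, alpha=5.0)
mean_rotation = sum(toy[p] for p in group) / len(group)
```

Here three pairs agree on roughly the same rotation, so their average is the one that would be used to superimpose the two toy structures; the two outlying pairs are ignored.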
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
7 – The alignment is performed irrespective of the sequence order of the structural elements. This is good because it can catch circularly permuted proteins. But it also increases the chances of finding a match by accident.
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
8 – Statistical validation. This is a very important step since there is only a limited number of ways a small number of SSE can interact. Thus, sampling a large database of random structures would still return a distribution of “hits”.
This is second-hand information:
The p-value is the probability of observing a similar score by chance, multiplied by the number of possible alternative substructures within the comparison.
The default cutoff is 0.05, which should be regarded as a noise-reduction cutoff, not a bulletproof jacket.
Comparing and aligning structures
CE (Combinatorial extension)
CE does not use secondary structure elements as its basic aligning unit. Instead, it seeks the optimal path amongst all possible n-mers between two query proteins.
1 – Given two proteins A and B of length nA and nB, CE will search for the longest continuous path P of aligned fragment pairs (AFP) of length m.
Comparing and aligning structures
CE (Combinatorial extension)
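4 – Some distance metric has to be defined to score an AFP alignment. Here $d^{A}_{x,y}$ is the distance between residues $x$ and $y$ of protein A, and $p_i^{A}$ is the starting position of AFP $i$ in protein A (likewise for B).
Each residue is counted once:
$D_{ij} = \frac{1}{m}\sum_{k=0}^{m-1}\left| d^{A}_{p_i^{A}+k,\,p_j^{A}+k} - d^{B}_{p_i^{B}+k,\,p_j^{B}+k} \right|$
Each residue is counted against all:
$D_{ij} = \frac{1}{m^{2}}\sum_{k=0}^{m-1}\sum_{l=0}^{m-1}\left| d^{A}_{p_i^{A}+k,\,p_j^{A}+l} - d^{B}_{p_i^{B}+k,\,p_j^{B}+l} \right|$
Using RMSD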
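The two distance-based scores can be sketched as follows. This is only an illustration under stated assumptions: coordinates are lists of C-alpha positions, AFP start positions are 0-based indices, and the function names and toy chains are hypothetical rather than taken from the CE software.

```python
import math

def afp_distance_once(ca_a, ca_b, pa_i, pa_j, pb_i, pb_j, m):
    """Score where residue k of AFP i is compared only with residue k of AFP j."""
    total = 0.0
    for k in range(m):
        d_a = math.dist(ca_a[pa_i + k], ca_a[pa_j + k])
        d_b = math.dist(ca_b[pb_i + k], ca_b[pb_j + k])
        total += abs(d_a - d_b)
    return total / m

def afp_distance_all(ca_a, ca_b, pa_i, pa_j, pb_i, pb_j, m):
    """Score where every residue of AFP i is compared with every residue of AFP j."""
    total = 0.0
    for k in range(m):
        for l in range(m):
            d_a = math.dist(ca_a[pa_i + k], ca_a[pa_j + l])
            d_b = math.dist(ca_b[pb_i + k], ca_b[pb_j + l])
            total += abs(d_a - d_b)
    return total / (m * m)

# Toy chains (hypothetical): straight lines of C-alpha atoms.
chain_a = [(i * 3.8, 0.0, 0.0) for i in range(20)]
chain_b = [(i * 3.8, 5.0, 0.0) for i in range(20)]   # chain_a translated
chain_c = [(i * 4.8, 0.0, 0.0) for i in range(20)]   # stretched chain

same = afp_distance_once(chain_a, chain_b, 0, 8, 0, 8, m=8)
stretched = afp_distance_once(chain_a, chain_c, 0, 8, 0, 8, m=8)
same_all = afp_distance_all(chain_a, chain_b, 0, 8, 0, 8, m=8)
```

Because both scores compare intramolecular distances, a rigid translation of one protein leaves them at zero, so no superposition is needed at this stage.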
Comparing and aligning structures
CE (Combinatorial extension)
4 – Pathfinding
There is a substantial decrease in the size of the search space by restricting the value of G
1 – Select all possible next AFPs under a certain (self) threshold.
2 – Consider the path to choose the best next AFP.
3 – Choose whether to pursue the extension or not.
Comparing and aligning structures
CE (Combinatorial extension)
4 – The statistics use a z-score, which compares paths of similar length and score to a random sampling from a reference database.
A z-score of 3.5 corresponds to a p-value of 10^-3.
So, given about 2000 different protein folds, such a threshold would imply two fortuitous hits. Visual inspection must be done, and a more restrictive threshold should be used.
Comparing and aligning structures
CE (Combinatorial extension)
Structural similarity between Acetylcholinesterase and Calmodulin found using CE (Tsigelny et al, Prot Sci, 2000, 9:180)
SCOP database
http://scop.berkeley.edu/
Seen as the gold standard for protein structure classification
Query for structures given a protein sequence
Browse protein architecture organized in a hierarchical fashion.
Keyword search for structures.
Fold: common topology of secondary structures
Superfamilies: probable common evolutionary origin, low sequence identity
Families: common evolutionary origin
Domains: individual domains
CATH database
http://www.biochem.ucl.ac.uk/bsm/cath/
Involves manual inspection and classification, especially at more abstract levels such as the architecture level.
Class: secondary structure composition
Architecture: (what would be known as fold in SCOP)
Topology: (what would be known as superfamily)
Sequence level
Summary
Aligning protein structures can detect homologous relationships that are deeper than those found by sequence alignment, because structures are more stable over time.
VAST abstracts proteins into secondary structure elements (SSE) and finds the set of rotation/translation that maximizes the number of paired SSE.
CE looks for the best alignment frame to superimpose one protein onto another.
Statistics are important because it is likely that small unrelated structures will resemble each other.
Summary
A distribution of random protein scores can be generated by aligning unrelated proteins in the databases. An alignment score must be significantly larger than the score expected from this distribution.
This type of analysis is used to classify protein folds and to infer relationships between structural evolution and biological activity.
Try to find structural neighbors of the protein 1AZT while browsing the NCBI website ( www.ncbi.nlm.nih.gov ).
Molecular Modeling
Lecture 4
Why modeling proteins
Example of applications
Modeling the binding site of the anticodon on eRF3
Modeling substrate binding in the active site of Mandelate racemase.
Solving X-ray and NMR structures.
The theory behind the calculation
Parametrizing protein models
Molecular mechanics as an optimization problem
Molecular mechanics as a time simulation
Conceptual clash between protein folding and molecular mechanics.
Why modeling proteins
Anticodon binding site on eRF3
2 possibilities.
From phylogenetic information, a few residues were identified as players.
Use molecular mechanics to “see” whether the surface of the protein can accommodate
an anticodon.
Why modeling proteins
Modeling a weird substrate into an active site.
Mandelate racemase can bind a substrate with two rings! Is there room for this in the wild type active site?
The answer is yes, although a bit counter-intuitive.
How structures are viewed
Pre-computer days
Sir John Kendrew and his model of myoglobin, 1958
How structures are resolved
X-ray diffraction
Principle
Create a lattice of protein in a crystal.
Collect thousands of diffraction patterns over all rotational degrees of freedom.
Subtract the phase between the layers in the lattice.
Compile into a 3D volume based on the density of reflective material (electrons in this case).
Thread the model into the density map and optimize the geometry, using the density map as an additional criterion.
How structures are resolved
NMR spectroscopy
Principle
Use magnetic fields and “radio”-frequency photons to detect shifts in nuclear states.
Assign shifts to a model along the chain.
Correlate the mutual effects of the elements on each other to come up with a list of constraints (typically distances).
Optimize the trajectory of the modeled chain, given this list of constraints.
Physical simulation in Molecular Modeling
Jensen, F., 1999, Introduction to Computational Chemistry, Wiley,
Chichester, UK
Why is it useful to you?
Modeling is used often by experimental biochemists and is a staple in structural biology.
The complexity of the simulation is far beyond the complexity of the interface. This necessarily conveys a false sense that the “default” settings will do fine.
Physical simulation in Molecular Modeling
Limitations
True atoms and bonds are probabilistic constructs. The computation of the resulting geometries is a very involved process for which the analytical equations are not fully worked out.
Luckily, the observable behavior is much more predictable and thus can be modeled under a limiting set of assumptions.
Physical simulation in Molecular Modeling
Assumptions
Newtonian physics is used to simulate molecules under a set of restrictions which for proteins would be:
1. In solution (or vacuum).
2. Near room temperature.
3. Chemically inert.
Physical simulation in Molecular Modeling
Abstraction
Each atom has a fixed geometry constrained by a somewhat arbitrary energy scoring scheme.
The problem thus boils down to finding the set of coordinates for all atoms that minimizes the energy.
There is no absolute correspondence between this scoring scheme and experimentally measurable energy values.
Molecular Modeling in Bioinformatics
Modeling
Only a small subset of all possible atoms end up in biological molecules, but each atom can exist in a set of different states. These states are referred to as types in molecular modeling.
Molecular Modeling in Bioinformatics
Energy function
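The energy function is used to evaluate the energy and to calculate the derivatives used to optimize a structure.
$E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{VdW} + E_{el} + E_{cross}$
Computational cost of each term, in the same order: $O(N)$, $O(N)$, $O(N)$, $O(N^2)$, $O(N^2)$, $O(N^2)$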
Molecular Modeling in Bioinformatics
Computational efficiency and limitations of the model
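The energy function is used to evaluate the energy and to calculate the derivatives used to optimize a structure.
Second-order (P2) stretch term:
$E_{str}(R^{AB} - R_0^{AB}) = k^{AB}\,(R^{AB} - R_0^{AB})^2$
Fourth-order (P4) stretch term:
$E_{str}(R^{AB} - R_0^{AB}) = k_2^{AB}(R^{AB} - R_0^{AB})^2 + k_3^{AB}(R^{AB} - R_0^{AB})^3 + k_4^{AB}(R^{AB} - R_0^{AB})^4$
Morse stretch term:
$E_{str}^{Morse}(R^{AB} - R_0^{AB}) = D^{AB}\left[1 - e^{-\alpha^{AB}(R^{AB} - R_0^{AB})}\right]^2$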
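To see why the choice of stretch term matters, the sketch below compares a harmonic term with a Morse term. The force constant and dissociation energy are made-up numbers, and choosing alpha = sqrt(k / D) is an assumption that makes the two curves agree for small displacements when the harmonic term is written as k * dR^2.

```python
import math

def harmonic_stretch(dr, k):
    """Second-order stretch energy for a displacement dr from the ideal length."""
    return k * dr ** 2

def morse_stretch(dr, k, d):
    """Morse stretch energy; d is the dissociation energy (hypothetical value)."""
    alpha = math.sqrt(k / d)   # matches k * dr^2 to second order around dr = 0
    return d * (1.0 - math.exp(-alpha * dr)) ** 2

K = 300.0   # hypothetical force constant
D = 100.0   # hypothetical dissociation energy

small_ratio = morse_stretch(0.001, K, D) / harmonic_stretch(0.001, K)
far_harmonic = harmonic_stretch(10.0, K)   # keeps growing without bound
far_morse = morse_stretch(10.0, K, D)      # levels off at the dissociation energy
```

Near the ideal bond length the two models are nearly indistinguishable, which is why the cheaper polynomial is usually good enough. Far from it, only the Morse form levels off, which matters when a search method samples very distorted geometries.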
Molecular Modeling in Bioinformatics
Parameterization nightmare
Can someone come up with all these numbers?
Generalization
How robust is the simulation in a range of conditions.
Computational cost
The longer it takes to perform a single task, the fewer iterations will be computed in the same amount of time.
Molecular Modeling in Bioinformatics
Parameterization nightmare
Can someone come up with all these numbers?
For the MM2 force field (71 atom types):
Term        Parameters (est.)   Determined
E(VdW)      142                 142
E(str)      900                 290
E(bnd)      27,000              824
E(tors)     1,215,000           2,466
E(cross)    10^7-10^8           ?
hc
E
Molecular Modeling in Bioinformatics
Generalization
How robust is the simulation over a range of conditions?
In the example to the left, the Exp-6 model causes “nuclear fusion” at unrealistically short distances.
Such unrealistic distances will be encountered in Monte Carlo, genetic algorithm and simulated annealing experiments.
Molecular Modeling in Bioinformatics
Lennard-Jones
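Is actually a computational stunt: there is no need to compute R (which requires a square root), only R^n where n is even, which can be obtained directly from R^2.
$E_{LJ}(R) = \varepsilon\left[\left(\frac{R_0}{R}\right)^{12} - 2\left(\frac{R_0}{R}\right)^{6}\right]$
$R_{ij}^2 = (x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2$
$E_{Exp\text{-}6}(R) = A e^{-BR} - \frac{C}{R^6}$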
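The “stunt” can be sketched in a few lines: the Lennard-Jones energy is computed from the squared distance alone, so the square root is never taken. The form used is E = eps * [(R0/R)^12 - 2 (R0/R)^6], which has its minimum of -eps at R = R0; the parameter values and function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared distance between two atoms; no square root needed."""
    return sum((a[i] - b[i]) ** 2 for i in range(3))

def lennard_jones_from_r2(r2, r0, eps):
    """Lennard-Jones energy computed directly from the squared distance r2."""
    s6 = (r0 * r0 / r2) ** 3          # (R0/R)^6 obtained from R^2
    return eps * (s6 * s6 - 2.0 * s6)

R0, EPS = 3.5, 0.2                     # hypothetical parameters

at_minimum = lennard_jones_from_r2(squared_distance((0, 0, 0), (3.5, 0, 0)), R0, EPS)

# Cross-check against the textbook form that does use R explicitly.
r2 = squared_distance((0, 0, 0), (1, 2, 2))      # R = 3
fast = lennard_jones_from_r2(r2, R0, EPS)
direct = EPS * ((R0 / 3.0) ** 12 - 2.0 * (R0 / 3.0) ** 6)
```

An Exp-6 term cannot use this shortcut because the exponential needs R itself, which is part of the reason Lennard-Jones is preferred for large systems.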
Molecular Modeling in Bioinformatics
Lennard-Jones
In practice, Lennard-Jones is optimized to reproduce validated results (and works out satisfactorily).
Molecular Modeling in Bioinformatics
Electrostatic Models
… are really ugly.
Why does this matter?
Electrostatic interactions decay with 1/distance, which makes them the longest-ranged interactions.
Examples
Coulomb’s Law
$E_{el}(R^{AB}) = \frac{Q^A Q^B}{\varepsilon R^{AB}}$
Molecular Modeling in Bioinformatics
Electrostatic Models
… are really ugly.
Why does this matter?
Electrostatic interactions decay with 1/distance, which makes them the longest-ranged interactions.
Examples
Dipolar moment interactions
$E_{el}(R^{AB}) = \frac{\mu^A \mu^B}{\varepsilon (R^{AB})^3}\left(\cos\chi - 3\cos\alpha_A \cos\alpha_B\right)$
Molecular Modeling in Bioinformatics
Computational cost of non-bonded energy (VdW, El)
~99.88% of the computation in protein-sized models. Most of these contributions are very small and do not contribute significantly to the total energy.
Computational tricks
Cutoff -> blending function -> neighbor list*
*must be updated, which is O(N^2)
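A neighbor list with a distance cutoff can be sketched as below. This is a deliberately naive version that shows where the O(N^2) update cost comes from; the function name and the toy coordinates are assumptions, and no blending function is applied at the cutoff.

```python
def build_neighbor_list(coords, cutoff):
    """Return all atom pairs (i, j), i < j, closer than the cutoff.
    Every pair is visited once, hence the O(N^2) cost of each update."""
    cutoff2 = cutoff * cutoff          # compare squared distances, no square root
    neighbors = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((coords[i][k] - coords[j][k]) ** 2 for k in range(3))
            if d2 <= cutoff2:
                neighbors.append((i, j))
    return neighbors

# Four toy atoms (hypothetical): two close pairs, far apart from each other.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.5, 0.0, 0.0)]
pairs = build_neighbor_list(atoms, cutoff=2.0)
all_pairs = len(atoms) * (len(atoms) - 1) // 2
```

Between updates, the non-bonded energy is evaluated only over the pairs in the list, which is where the saving comes from.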
Validation
1. Reproduces Geometries
2. Reproduces Relative energies.
E is not G
$G = H - TS$
Real energies are temperature-dependent.
The entropic contribution cannot be calculated from a snapshot.
Principle of optimization
You start with a protein for which you know all coordinates.
Evaluate the energy
Find a better structure, usually with small changes
Repeat until no better structure can be found.
This task is usually NEVER straightforward, unless the system is made of a small number of atoms.
Molecular Modeling in Bioinformatics
Optimization (local minima)
Straightforward, although computationally expensive.
1 – A clear equation.
2 – A defined set of variables.
3 – “Only” three dimensions to worry about.
Steepest Descent (robust, fast)
Conjugate Gradient (improved convergence properties)
Newton-Raphson (saddle points)
Pseudo-NR (progressive Hessian estimate)
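The simplest method in this list, steepest descent, can be sketched on a toy energy surface. The quadratic function below stands in for a force field and is purely hypothetical, as are the fixed step size and the stopping tolerance; production minimizers typically adapt the step length.

```python
def steepest_descent(grad, x, step=0.1, tol=1e-8, max_iter=10000):
    """Follow the negative gradient until it (almost) vanishes."""
    for _ in range(max_iter):
        g = grad(x)
        if sum(c * c for c in g) ** 0.5 < tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def energy(x):
    """Toy two-variable 'energy' with a single minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def gradient(x):
    """Analytical derivatives of the toy energy."""
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

minimum = steepest_descent(gradient, [5.0, 5.0])
```

On a real energy function with many minima, the same loop stops at whichever local minimum lies downhill from the starting structure.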
Molecular Modeling in Bioinformatics
Optimization (global minimum)
In a simple, circular system with 17 main-chain atoms, there are 262 distinct conformations within 3 kcal/mol of the global minimum (out of ~1.6E13 conformers).
Proteins are 1-2 orders of magnitude larger.
Stochastic Methods (Monte Carlo)
Molecular Dynamics
Simulated Annealing
Genetic Algorithms
Statistical Mechanics
Molecular Modeling in Bioinformatics
Time dependent methods (Molecular Dynamics)
Make use of classical mechanics equations such as:
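$F = ma$
$r_{i+1} = r_i + v_i \Delta t + \frac{1}{2} a_i \Delta t^2 + \frac{1}{6} b_i \Delta t^3 + \ldots$
$r_{i-1} = r_i - v_i \Delta t + \frac{1}{2} a_i \Delta t^2 - \frac{1}{6} b_i \Delta t^3 + \ldots$
Adding the two expansions gives:
$r_{i+1} = 2 r_i - r_{i-1} + a_i \Delta t^2$
Verlet Algorithm
Numerical solution to Newton’s equations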
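The Verlet update can be tried on a one-dimensional harmonic oscillator, for which the exact trajectory is known. This is a minimal sketch: a single particle of unit mass and unit force constant is assumed, and the first step is taken from the truncated Taylor expansion because the update needs two previous positions.

```python
import math

def verlet(r0, v0, accel, dt, n_steps):
    """Return positions r_0 .. r_n using r_{i+1} = 2 r_i - r_{i-1} + a_i dt^2."""
    # Bootstrap r_1 from the Taylor expansion, since r_{-1} is not available.
    traj = [r0, r0 + v0 * dt + 0.5 * accel(r0) * dt * dt]
    for _ in range(n_steps - 1):
        r_prev, r = traj[-2], traj[-1]
        traj.append(2.0 * r - r_prev + accel(r) * dt * dt)
    return traj

# Harmonic oscillator with a = -r; the exact solution is r(t) = cos(t).
trajectory = verlet(r0=1.0, v0=0.0, accel=lambda r: -r, dt=0.01, n_steps=628)
final_error = abs(trajectory[-1] - math.cos(628 * 0.01))
```

Velocities never appear in the loop, which is exactly the limitation raised about this method.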
Molecular Modeling in Bioinformatics
Verlet Algorithm
Numerical solution to Newton’s equations
$r_{i+1} = 2 r_i - r_{i-1} + a_i \Delta t^2$
Problems with this method
No explicit use of velocity (which is needed to calculate the total energy):
$E_{Tot} = \sum_{i=1}^{N} \frac{1}{2} m_i v_i^2 + U(r)$
Molecular Modeling in Bioinformatics
Leapfrog Algorithm
Numerical solution to Newton’s equations
$r_{i+1} = r_i + v_{i+1/2}\,\Delta t$
$v_{i+1/2} = v_{i-1/2} + a_i\,\Delta t$
Timestep
Reasonable: femtoseconds ($10^{-15}$ s)
Scope of simulation (ideal): millisecond ($10^{-3}$ s)
Scope of simulation (practical): microsecond ($10^{-6}$ s)
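A leapfrog integrator can be sketched on the same kind of toy system, a one-dimensional harmonic oscillator with unit mass and force constant (an assumption made for illustration). Velocities are stored at half steps, so they are available explicitly. The last line gives a sense of scale: the number of 1 fs steps needed to cover one microsecond.

```python
import math

def leapfrog(r0, v0, accel, dt, n_steps):
    """Advance positions at whole steps and velocities at half steps."""
    v_half = v0 + 0.5 * accel(r0) * dt        # v_{1/2}, from a half step
    r = r0
    for _ in range(n_steps):
        r = r + v_half * dt                   # r_{i+1} = r_i + v_{i+1/2} dt
        v_half = v_half + accel(r) * dt       # v_{i+3/2} = v_{i+1/2} + a_{i+1} dt
    return r, v_half

# Harmonic oscillator with a = -r; the exact solution is r(t) = cos(t).
r_end, v_end = leapfrog(r0=1.0, v0=0.0, accel=lambda r: -r, dt=0.01, n_steps=628)
leapfrog_error = abs(r_end - math.cos(628 * 0.01))

steps_per_microsecond = round(1e-6 / 1e-15)   # femtosecond steps in one microsecond
```

A billion force evaluations per microsecond of simulated time is the reason the practical scope stays several orders of magnitude short of the ideal one.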
Molecular Modeling in Bioinformatics
Simulated Annealing
Robustness vs. initial solution
Variable contribution of the objective function.
Broader sampling.
Both help to explore around a minimum.
F (net movement) combines U (potential) and K (kinetic) through a blending function.
Protein folding from Scratch
Must be restrained to a limited scope
Two miniproteins: TC5b and TC3b
Both have reference structures for validation.
Sequences
NLYIQWLKDGGPSSGRPPPS (TC5b; 304 atoms)
NLFIEWLKNGGPSSGAPPPS (TC3b; 289 atoms)
Software: AMBER 6.0
Model: AMBER
Solvation: Generalized Born/solvent-accessible surface area
This means that the water molecules are not explicitly defined in the simulation and the effect of the solvent is treated as a macroscopic
property.
Protein folding from Scratch
Must be restrained to a limited scope
Understanding folding and design: Replica-exchange simulation of “Trp-Cage” miniproteins.
Pitera, JW., Swope, W. 2003. Proc. Natl. Acad. Sci. USA, 100: 7587-7592
Protein folding from Scratch
Algorithm
Initialization
Input: A protein sequence
Output: A starting structure for the main simulation.
1: Thread each character of the input sequence onto a corresponding 3D model (extended).
2: Minimize with 5000 steps of steepest descent.
3: for i = 1 to 50000 do
Simulate with Molecular Dynamics
if i % 1000 == 0 then readjust the temperature to 298 K.
4: Return the equilibrated model.
This equilibration is required to prevent strong “jerking” motions in the first iterations of a simulation.
Protein folding from Scratch Algorithm
Simulation (simulated annealing variant)
Input: P, An equilibrated protein model
Output: A collection of coordinate snapshots (trajectory) for analysis.
1: T = a list of 23 simulation temperatures from 250K to 603K.
2: E = {} , an empty list of experiments
3: for i = 1 to |T| do
4: Pi = Copy P
5: Set the temperature of Pi to Ti
6: Add Pi to E
7: for i = 1 to 4,000,000 (4 ns) do
8: Simulate using MD |in parallel|
9: if i % 250 == 0 then take a Snapshot of coordinates.
10: if i % 5000 == 0 then
11: Swap temperatures between processes (Metropolis-style probabilities)
12: Adjust each experiment in E to its new simulation temperature
13: Discard all but the snapshots taken in the last nanosecond of simulation.
14: Pool all 23 experiments for analysis.
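The temperature swap in step 11 can be sketched with the acceptance rule commonly used in replica-exchange simulations, min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]). Whether the paper uses exactly this expression is an assumption; the energies below are made-up values in kcal/mol and the function names are hypothetical.

```python
import math
import random

K_B = 0.0019872041   # Boltzmann constant in kcal/(mol K)

def swap_probability(e_i, t_i, e_j, t_j):
    """Metropolis-style probability of exchanging the temperatures of two replicas."""
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

def attempt_swap(e_i, t_i, e_j, t_j, rng=random):
    """Draw a random number and decide whether the swap is accepted."""
    return rng.random() < swap_probability(e_i, t_i, e_j, t_j)

# The hot replica found a lower energy than the cold one: always swap.
p_favorable = swap_probability(e_i=-100.0, t_i=300.0, e_j=-120.0, t_j=350.0)
# The cold replica already has the lower energy: swaps are rare.
p_unfavorable = swap_probability(e_i=-120.0, t_i=300.0, e_j=-100.0, t_j=350.0)
```

Low-energy structures discovered at high temperature therefore tend to migrate down to the low-temperature replicas, where they are sampled for analysis.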
Protein folding from Scratch
Computational cost
Ridiculously small protein, no initial good guess.
19 days on 23 IBM POWER3 SP2 processors at 200 MHz (R6000 series)
Which, on the campus here, approaches the mean time between power outages!
Protein folding from Scratch
Validation
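The root mean square deviation (RMSD)
$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \lVert r_i - r_i^{ref} \rVert^2}{\sum_{i=1}^{n} w_i}}$
where the sum runs over the $n$ atoms, $w_i$ is the weight of atom $i$ and $r_i^{ref}$ is its position in the reference structure.
Which is a suitable distance metric for related structures.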
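A weighted RMSD can be sketched as follows. The sketch assumes the two structures are already superimposed and that atom i in one list corresponds to atom i in the other; finding the optimal superposition is a separate step that is not shown. The function name and coordinates are hypothetical.

```python
import math

def rmsd(coords, ref, weights=None):
    """Weighted root mean square deviation between two equally long atom lists."""
    if len(coords) != len(ref):
        raise ValueError("RMSD needs a one-to-one atom correspondence")
    if weights is None:
        weights = [1.0] * len(coords)          # unweighted case
    weighted_sq = sum(
        w * sum((a[k] - b[k]) ** 2 for k in range(3))
        for w, a, b in zip(weights, coords, ref)
    )
    return math.sqrt(weighted_sq / sum(weights))

model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]   # every atom displaced by 2 units
uniform = rmsd(model, reference)

# With weights, a displaced atom of weight 1 is averaged against a perfect atom of weight 3.
weighted = rmsd([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
                [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
                weights=[1.0, 3.0])
```

The length check makes the main limitation explicit: without a pairwise equivalence between atoms, the measure is undefined.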
Comparing and aligning structures
Why?
There is a need for a distance metric to compare similar protein structures.
Simulation analysis.
Similarity quantification.
Pattern detection.
RMSD
Works well for closely similar structures.
$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \lVert r_i - r_i^{ref} \rVert^2}{\sum_{i=1}^{n} w_i}}$
Comparing and aligning structures
RMSD
Works well for closely similar structures.
$RMSD = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert r_i - r_i^{ref} \rVert^2}$
Absolutely requires some kind of pairwise equivalence between the two compared
entities.
Comparing and aligning structures
RMSD
Sequence identity falls quickly.
Hard to separate weak hits from purely random proteins.
$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \lVert r_i - r_i^{ref} \rVert^2}{\sum_{i=1}^{n} w_i}}$
Protein folding from Scratch
Validation
The root mean square deviation (RMSD)
$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \lVert r_i - r_i^{ref} \rVert^2}{\sum_{i=1}^{n} w_i}}$
≤ 2.0 Å RMSD
from any of 38 experimental structures
≤ 2.0 Å RMSD from the
average low-temperature structure.
Protein folding from Scratch
Impact
Impact of this paper
Make good use of parallelism to conduct a heuristic search.
Sampling-based method.
Promising because in many cases the folding of a large
protein can be approximated to the folding of its components.
(Remember, domains are independent units in most
cases)
Building a large machine for molecular modeling
IBM Blue Gene project
Architecture
64K FPU
20K FPU (protein folding)
FPU 64-bit @ 700 MHz (low cost, low heat)
64 compute nodes (256 MB) per I/O nodes (512MB)
MPI library
3D torus network for fast neighbor to neighbor communication.
High-performance achievement in MD: NAMD
Open source
University of Illinois, Dept. of theoretical physics
http://www.ks.uiuc.edu/Research/namd/
Benchmark system
(their big one)
High Performance achievement in MD NAMD
Open source
There is no need to use this system to study protein folding.
Instead, MD was used in this case to study the conversion of torque into energy that can be stored in a molecular battery: the ATP molecule.
Overview
Protein folding and parallel computing.
Homology modeling and statistical mechanics.
Secondary structure prediction and artificial intelligence.
Spectrum of strategies
Physics Knowledge
Quantum mechanics Molecular Mechanics Statistical Mechanics Homology Modeling
Parallel computing and Molecular dynamics
Folding a protein from an extended conformation is a difficult problem because energy
barriers must be crossed.
The following slides describe how crossing barriers can be achieved using parallel computing.
Parallel computing
It takes 1500 days to complete a thesis for one student
If the student is helped by someone, the work may go 2X as fast: 750 days.
What if 1500 students are working on the same thesis?
Overhead
Communication
Load balancing
Parallel computing
Factors that complicate parallelization:
Some work has to be executed in sequence.
Communicating the tasks and the results takes an increasingly important share of the time as the tasks become small.
Each individual process has to wait for the slowest one to finish, leading to a loss of efficiency.
Time scale in protein folding
On the order of microseconds to milliseconds.
This is not achievable by modern computers.
~10 000 days for 1 experiment (~28 years)
folding@home
Hundreds of millions of computers are idle at any time
Why not use their unspent cycles?
Create a “screen saver”
Crossing energy barrier
Most of the time is spent waiting for the thermal motion to topple a structure over a barrier.
Principle of Ensemble dynamics
M CPUs should take M times less time to go over a barrier.
$f(t) = 1 - \exp(-kt)$
k = 1/(10,000 ns), M = 10,000, t = 30 ns
M f(t) ~ 30 folding events
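The arithmetic behind the 30 folding events can be checked directly. This is just the slide's numbers plugged into f(t) = 1 - exp(-kt), with the assumption that the M simulations are independent, so that the expected number of events is M times f(t).

```python
import math

def folding_fraction(k, t):
    """Probability that a single simulation has crossed the barrier by time t."""
    return 1.0 - math.exp(-k * t)

k = 1.0 / 10_000.0    # folding rate, per nanosecond
m = 10_000            # number of independent simulations (CPUs)
t = 30.0              # length of each simulation, in nanoseconds

single_run = folding_fraction(k, t)
expected_events = m * single_run
```

A single 30 ns run has only about a 0.3% chance of folding, but ten thousand of them together are expected to produce about 30 folding events.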
Ensemble Dynamics
Start M dynamic calculations with the same initial structure.
Once one thread finds a barrier and goes over it, copy the state of this thread into all the other
replicate processes.
The communication overhead is negligible if the crossings
are rare events, which is true in this case.
Ensemble Dynamics
Detecting a barrier
Will be noticeable by a large variance in energy over the duration of the simulation.
Ensemble Dynamics
Calculation details
We simulated folding and unfolding at 300 K and pH 7.0, using the OPLS parameter set with the Generalized Born implicit solvent model.
Time step: 2 fs
Long-range interactions truncated with a 16 Å cutoff.
What are they doing with this technique?
A more complex system
Note how most of the interactions in the partially folded protein are non-native.
This means that in order to resume folding, these must be broken.
The villin headpiece is one of the fastest-folding peptides known! What about simulating anything else?
Energy Landscape
It is clear in this figure that there are:
1. One folding pathway
2. One intermediate
3. Two energy barriers
Statistical Mechanics
Practical definition for our purpose:
Statistical mechanics can be used to create predictive models in absence of theoretical models.
For example: interaction between amino-acids.
Statistical mechanics
Atom-level simulations are expensive, and empirical.
Statistical mechanics bridges frequencies of observations with physical forces for chemical systems.
The resulting model is thus used to assess the “energy” of a trial conformation and can be used as an objective function to optimize a solution.
This technique is increasingly used in bioinformatics since the information in a database can be seen as a collection of observations at equilibrium.
Statistical mechanics
In other words, if it can’t be seen in the database, the energy of a state must be high. If it’s common, the energy must be low.
Remember, everything is possible; the probability of an observation is related to its relative energy.
$E_i = -RT \ln f_i$
How does this tie in to bioinformatics?
There is a direct relationship between the energy of a state in a system at equilibrium and the probability of observing this state.
$E_i = -RT \ln f_i$
$E_i = -RT \ln \frac{n_i}{\sum_j n_j}$
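The relation between counts and energies can be sketched in a few lines. This is a minimal illustration assuming the form E_i = -RT ln(n_i / sum_j n_j); the state names and counts are hypothetical, and R is expressed in kJ/(mol K).

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol*K)


def energies_from_counts(counts, temperature=300.0):
    """Convert observation counts n_i into energies E_i = -RT ln(n_i / sum_j n_j)."""
    total = sum(counts.values())
    return {state: -R * temperature * math.log(n / total)
            for state, n in counts.items()}


# Hypothetical counts of three states observed in a database.
counts = {"common": 900, "occasional": 90, "rare": 10}
for state, energy in energies_from_counts(counts).items():
    print(f"{state:10s} E = {energy:6.2f} kJ/mol")
```

The rarer a state is in the collection, the higher its estimated energy.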
What are “states” in protein structures?
There is a lot of freedom in defining states for protein structures. Here is one example:
[Figure: contact map; both axes are sequence position.]
Contacts
In this plot, if two positions of the 1D sequence are in physical contact, the pair is marked as an orange pixel.
It is thus possible to harvest from a collection of structures a matrix of observed contacts.
What are states in protein structures?
Contacts
In this case the energy for any given pair would be:
$E_{Pair(a,b)} = -RT \ln \frac{n(a,b)}{\sum_i n(a,i)}$
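Harvesting contact counts and turning them into a pair energy can be sketched as follows. This is an illustration under assumptions: contacts are given as index pairs, counts are stored symmetrically, and the energy uses the ratio form E_Pair(a,b) = -RT ln(n(a,b) / sum_i n(a,i)). The toy sequence and contact list are hypothetical.

```python
import math
from collections import Counter

R = 8.314e-3  # gas constant in kJ/(mol*K)


def count_contacts(sequence, contacts):
    """Count contacts n(a, b) between residue types.

    `contacts` is a list of (i, j) positions in physical contact.
    Counts are stored symmetrically, so n(a, b) == n(b, a).
    """
    n = Counter()
    for i, j in contacts:
        a, b = sequence[i], sequence[j]
        n[(a, b)] += 1
        if a != b:
            n[(b, a)] += 1
    return n


def pair_energy(n, a, b, temperature=300.0):
    """E_Pair(a,b) = -RT ln( n(a,b) / sum_i n(a,i) ); fails if n(a,b) is 0."""
    row_total = sum(count for (x, _), count in n.items() if x == a)
    return -R * temperature * math.log(n[(a, b)] / row_total)


n = count_contacts("LLKDL", [(0, 1), (0, 4), (1, 4), (2, 3), (0, 2)])
print(pair_energy(n, "L", "L"), pair_energy(n, "L", "K"))
```

A pair that was never observed gives log(0) and raises an error here, which is the practical side of the equilibrium and sampling assumptions.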
What are states in protein structures?
Equilibrium
For this value to be valid, equilibrium must be assumed: the sampling would not change significantly over time.
What are states in protein structures?
Pitfall
To be accurate for rare observations, the total number of observations should be infinitely large and derived from sequences and structures in equilibrium.
In practice, there should be enough instances of the rarest entry to avoid large errors on the estimate (log(0)).
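One common remedy for the log(0) problem, which is not stated on the slide and is shown here only as an assumption, is additive (Laplace) smoothing with pseudocounts. The function name and the numbers below are hypothetical.

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol*K)


def smoothed_energy(count, row_total, n_categories, pseudocount=1.0, temperature=300.0):
    """-RT ln of a frequency estimated with additive smoothing.

    Adding `pseudocount` to every category keeps the estimate finite when
    count == 0, at the price of biasing rare entries toward the uniform value.
    """
    frequency = (count + pseudocount) / (row_total + pseudocount * n_categories)
    return -R * temperature * math.log(frequency)


# A pair never observed among 1000 contacts, with 20 possible partners.
print(smoothed_energy(0, 1000, 20))
```

The energy of an unobserved pair stays finite but high, and its exact value depends on the chosen pseudocount, so rare entries remain the least reliable.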
What are states in protein structures?
Miyazawa-Jernigan Matrix
Such a matrix has been generated:
Miyazawa, S. and Jernigan, R.L., Protein Eng. 1993 Apr;6(3):267-78
This is particularly useful for threading sequences through known structures for structure prediction purposes.
What are states in protein structures?
The implementation of a distance-based energy term is trickier… but boils down to the same thing.
Knowledge-based force-field
Need to store in 4D matrices the tuple
{ (a,b), r, k }
r: distance in Euclidean space
k: distance in sequence space
[Figure: chain of residues x1 to x9; two residues are separated by a spatial distance r and a sequence separation k = 6.]
What are states in protein structures?
The energy will be calculated with respect to all parameters considered.
Knowledge-based force-field
[Figure: chain of residues x1 to x9 with spatial distance r and sequence separation k = 6.]
$E_{Pair(a,b,r,k)} = -RT \ln \frac{n_r^k(a,b)}{\sum_i n_r^k(a,i)}$
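The bookkeeping for a distance-dependent term can be sketched as below. This is an illustration under assumptions: distances are binned with a hypothetical 2 Å width and a 16 Å cutoff, pairs are ordered along the chain (a at position i, b at position j > i), and the energy takes the same ratio form as the contact term, conditioned on the distance bin r and the sequence separation k. The toy sequence and coordinates are hypothetical.

```python
import math
from collections import defaultdict

R = 8.314e-3  # gas constant in kJ/(mol*K)


def collect_counts(sequence, coords, bin_width=2.0, max_r=16.0):
    """Fill the table n[(a, b, r_bin, k)] from one structure."""
    n = defaultdict(int)
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            r = math.dist(coords[i], coords[j])  # distance in Euclidean space
            if r >= max_r:
                continue
            k = j - i                            # distance in sequence space
            n[(sequence[i], sequence[j], int(r // bin_width), k)] += 1
    return n


def pair_energy(n, a, b, r_bin, k, temperature=300.0):
    """-RT ln( n_r^k(a,b) / sum_i n_r^k(a,i) ); fails for unobserved tuples."""
    row_total = sum(count for (x, _, rb, kk), count in n.items()
                    if x == a and rb == r_bin and kk == k)
    return -R * temperature * math.log(n.get((a, b, r_bin, k), 0) / row_total)


n = collect_counts("AGA", [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)])
print(dict(n))
```

With four indices the table fills very slowly, which is why the size of the reference collection matters for this kind of term.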
What are states in protein structures?
There are implementations of this technique, such as PROSA II:
http://www.came.sbg.ac.at/Services/prosa.html
What are “states” in protein structures?
The exposure of each site to the exterior is an important factor. This is often quantified as the solvent Accessible Surface Area (ASA).
Knowledge-based force-field
Need to store in a 2D matrix the tuple
{ a, ASA }
$E_{a,ASA} = -RT \ln \frac{n(a,ASA)}{\sum_{i \in \{ASA\ discretization\}} n(a,i)}$
What are states in protein structures?
Ultimately, the energy of seeing a given sequence adopt a given structure can be computed as follows:
Knowledge-based force-field
$E_{Tot} = E_{Pairs} + E_{Solv} + E_{other}$
Caveats
The finer the parameterization, the larger the reference collection of (appropriate) structures in the database must be, so that every possibility is observed many times.
Choosing the minimum set of terms that fully defines a structure is a design-level decision.
An example
Real life example of using Knowledge-based methods.
This enzyme is called Enolase. It is a key enzyme in the sugar breakdown metabolism.
If there are important terms that are forgotten, the energy values may be inadequate.
An example
Real life example of using Knowledge-based methods.
The function and the composition are very tightly related.
Red: negatively charged
Blue: positively charged
Tan: hydrophobic
These are the active site residues.
An example
Real life example of using Knowledge-based methods.
The critical region in this protein has radically different properties than expected in an average protein. The knowledge-based system does not account for these properties, and thus the positions shown in white were poorly estimated.
The way this assessment was done quantitatively goes well beyond the scope of this course.
Another example
Cubic lattice simulation
The dimensionality of the protein folding problem can be reduced by simplifying the geometric properties of the system.
Knowledge-based energy evaluation can be used as an objective function that is relevant to the physical world, without the need to fully define a system with the 6 degrees of freedom.
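A cubic lattice search can be sketched in a few dozen lines. This is a toy illustration, not the simulation shown on the slide: chains are grown at random as self-avoiding walks (which does not sample conformations uniformly), and a simple HP-style table (-1 for a contact between two H residues) stands in for a knowledge-based energy matrix. The sequence, function names and number of trials are hypothetical.

```python
import random

MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def grow_chain(length, rng, max_tries=10_000):
    """Grow a self-avoiding chain on the cubic lattice, restarting on dead ends."""
    for _ in range(max_tries):
        path = [(0, 0, 0)]
        occupied = {path[0]}
        while len(path) < length:
            x, y, z = path[-1]
            free = [(x + dx, y + dy, z + dz) for dx, dy, dz in MOVES
                    if (x + dx, y + dy, z + dz) not in occupied]
            if not free:
                break  # trapped: discard this chain and start over
            step = rng.choice(free)
            path.append(step)
            occupied.add(step)
        if len(path) == length:
            return path
    raise RuntimeError("could not grow a self-avoiding chain")


def hp_contact(a, b):
    """Stand-in energy table: -1 for an H-H contact, 0 otherwise."""
    return -1.0 if a == "H" and b == "H" else 0.0


def lattice_energy(sequence, path, contact_energy=hp_contact):
    """Sum contact energies over non-bonded residues on adjacent lattice sites."""
    energy = 0.0
    for i in range(len(path)):
        for j in range(i + 2, len(path)):
            if sum(abs(p - q) for p, q in zip(path[i], path[j])) == 1:
                energy += contact_energy(sequence[i], sequence[j])
    return energy


rng = random.Random(1)
sequence = "HPHPPHHPHH"
best = min((grow_chain(len(sequence), rng) for _ in range(2000)),
           key=lambda path: lattice_energy(sequence, path))
print(lattice_energy(sequence, best), best)
```

Replacing `hp_contact` with a lookup in a contact-energy matrix keeps the same search loop while making the objective function knowledge-based.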
Spectrum of strategies
From physics-based to knowledge-based:
Quantum mechanics, Molecular mechanics, Statistical mechanics, Homology modeling
Homology Modeling
Homology
Related by a common ancestor.
Sequence identity among homologous structures can be as low as 15%.
Why make models?
There is a good chance that the structural efforts will never catch up with the sequencing projects.
How?
Figure out the most probable 3D structure, given a (1D) sequence and a 3D template from a related protein.
Homology Modeling
Assumption
• Regions of alignable sequence share homologous structures.
• Loop regions (non-conserved residues) allow insertions and deletions without disrupting the overall structure of a protein.
[Flowchart: Query sequence; Sequence similarity to a solved structure?; PSI-BLAST / profile MSA; Secondary structure prediction; Fold prediction; Homology modeling; Model validation.]
Homology Modeling
Aligning a sequence and a structure
MSA (multiple sequence alignment) between the query and the sequence of the target structure.
Profile MSA: the query and an MSA of proteins homologous to the target structure.
Threading.
Homology Modeling
Principle of threading
“Pull” a sequence through a structure such that the alignment corresponds to the frame with the best energy score.
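The idea of pulling a sequence through a structure can be sketched with ungapped frames. This is a simplified illustration: real threading programs also handle gaps and richer energy terms. The template is reduced to a list of contacting positions, and a hypothetical HP-style table stands in for a knowledge-based contact energy.

```python
def hp_contact(a, b):
    """Stand-in energy table: -1 for an H-H contact, 0 otherwise."""
    return -1.0 if a == "H" and b == "H" else 0.0


def thread(sequence, template_contacts, template_length, contact_energy=hp_contact):
    """Slide `sequence` through a template and score every ungapped frame.

    `template_contacts` lists (i, j) template positions that are in contact.
    A frame `offset` places sequence[offset + i] at template position i.
    Returns (best_energy, best_offset).
    """
    best = None
    for offset in range(len(sequence) - template_length + 1):
        energy = sum(contact_energy(sequence[offset + i], sequence[offset + j])
                     for i, j in template_contacts)
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best


# Template of 4 positions in which the first and last positions touch.
print(thread("PPHPPHPP", [(0, 3)], 4))
```

The frame with the lowest energy is the one that places the two H residues on the contacting template positions.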
Homology Modeling
Energy evaluation for threading
Statistical mechanics is ideal in this case because physical models would require extensive simulation time to figure out the precise atomic conformation.
Homology Modeling
Threading to detect correct alignments
The application GenTHREADER uses threading to perform protein fold recognition from genomic sequences.
Homology Modeling
General Principle
1. Align to the sequence of a known structure.
2. Change the structure of the side chains to match the query sequence according to the sequence alignment.
3. Model loops and variable regions.
4. Minimize energy / conformational search.
5. Check models for inconsistencies.
Feasibility
Above 40% sequence identity is preferable.
25% - 40%: “Twilight Zone”.
Below 25%: insufficient similarity in most cases.
May work only for one domain out of the whole protein.
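The feasibility thresholds can be applied to a pairwise alignment as sketched below. Percent identity has several definitions; this sketch assumes identity over aligned columns where neither sequence has a gap. The function names and the example alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def feasibility(identity):
    """Map a percent identity onto the categories listed on the slide."""
    if identity > 40:
        return "preferable"
    if identity >= 25:
        return "twilight zone"
    return "insufficient similarity in most cases"


identity = percent_identity("ACDE-G", "ACDQFG")
print(identity, feasibility(identity))
```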
Neural Network
Anatomy of a NN:
[Figure: network diagram with input parameters, weights, and output parameters.]
Neural Network
Before a NN can be used, it must be trained:
Training compares the output of a NN with a known answer; the weight of each “arrow” is changed to minimize the error.
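The training loop can be illustrated with the smallest possible network, a single neuron. This sketch is an assumption-laden toy, not the network used for structure prediction: it uses a sigmoid output, plain stochastic gradient descent, and a logical OR as the known answers. The learning rate, number of epochs and seed are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Output of a single neuron: weighted sum of the inputs passed through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples, epochs=2000, rate=0.5, seed=0):
    """Adjust the weights to reduce the difference between output and known answer."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # For a sigmoid output with cross-entropy loss, the gradient with
            # respect to the weighted sum is simply (output - target).
            error = predict(weights, bias, inputs) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(samples)
for inputs, target in samples:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```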
Secondary Structure prediction
Three Generations of methods
Generation    Method   Approach                                   ACC
1 (’60-’70)   GOR1     Single-character statistical information   ~57%
2 (’80)       GOR3     Local interactions                         ~63%
3 (’90+)      PHD      Homologous protein sequences               ~72%
Secondary Structure prediction
1st Generation
Making use of compiled frequencies of the different characters for three possible classes:
Helix (H)
Strand (S)
Coil (-)
SDFDKILVSTYSPPQARILIVM
-----SSSSSSS----HHHHHH
Secondary Structure prediction
2nd Generation
Making use of compiled frequencies of the different characters for three possible classes.
Considering the periodicity and neighbors.
Sliding window analyses
SDFDKILVSTYSPPQARILIVM
-----SSSSSSS----HHHHHH
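Both generations can be sketched with the same function. This is a toy illustration, not GOR itself: the frequencies are compiled from the single example pair shown on the slide (far too little data for real use), a window of 1 mimics the single-character approach, and a wider window adds the neighbors. The function names and the equal weighting of neighbors are assumptions.

```python
from collections import Counter, defaultdict


def compile_frequencies(sequence, structure):
    """Count how often each residue type is seen in each class (H, S, -)."""
    counts = defaultdict(Counter)
    for residue, state in zip(sequence, structure):
        counts[residue][state] += 1
    return counts


def predict(sequence, counts, window=1, states="HS-"):
    """Assign to each position the class with the highest summed frequency
    over a window centered on that position."""
    half = window // 2
    prediction = []
    for center in range(len(sequence)):
        scores = Counter()
        for pos in range(max(0, center - half), min(len(sequence), center + half + 1)):
            residue_counts = counts.get(sequence[pos], {})
            total = sum(residue_counts.values())
            if total:
                for state in states:
                    scores[state] += residue_counts.get(state, 0) / total
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)


seq = "SDFDKILVSTYSPPQARILIVM"
ss = "-----SSSSSSS----HHHHHH"
counts = compile_frequencies(seq, ss)
print(predict(seq, counts, window=1))
print(predict(seq, counts, window=5))
```

With a window of 1 every occurrence of a residue gets the same class, which is the main limitation that the sliding window addresses.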
Secondary Structure prediction
3rd Generation
S D F ... M
0.1 0.01 0.0 ... 0.0
0.0 0.98 0.1 ... 0.09
... ... ... ... ...
0.02 0.0 0.05 ... 0.7
Frequency vectors obtained from multiple sequence alignments.
These MSAs can be generated using BLAST or PSI-BLAST.
Also known as profiles.
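Building such frequency vectors from an alignment can be sketched as follows. This is a minimal illustration: gaps are ignored, no sequence weighting or pseudocounts are applied, and the four-sequence alignment is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_from_msa(alignment, alphabet=AMINO_ACIDS):
    """Column-wise residue frequencies of an MSA; gaps are ignored."""
    profile = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment if seq[col] in alphabet]
        counts = Counter(column)
        total = len(column)
        profile.append({aa: (counts[aa] / total if total else 0.0) for aa in alphabet})
    return profile


# Hypothetical alignment of four homologous fragments.
alignment = ["SDF", "SDY", "SEF", "ADF"]
profile = profile_from_msa(alignment)
for position, column in enumerate(profile):
    seen = {aa: freq for aa, freq in column.items() if freq > 0}
    print(position, seen)
```

Each column of the result is one frequency vector, so a conserved position shows a single dominant residue while a variable position spreads its frequency over several.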
Secondary Structure prediction
Best done using Neural Networks (or HMM… )
3rd Generation
S D F ... M
0.1 0.01 0.0 ... 0.0
0.0 0.98 0.1 ... 0.09
... ... ... ... ...
0.02 0.0 0.05 ... 0.7
H H - … S
The NN output for the profiles gets scanned by a few distinct NNs using a sliding-window strategy.
Assignment is made on a “winner takes all” basis.
Secondary Structure prediction
Alignments grow, secondary structure prediction improves
Przybylski, Rost. 2002. Proteins, 46:197-205
Conclusions
• Using an MSA (multiple sequence alignment) significantly improves the predictions (0.72 -> 0.75).
• The larger the database used, the better. However, there is a point where the information content saturates.
• PSI-BLAST vs BLAST: BLAST may be better in some cases.
• Refining the alignment did not help.
Secondary Structure prediction
Bidirectional Dynamics for protein secondary structure prediction
Baldi et al., 2000, in Sequence learning, pp. 80-114
IOHMM model
Memory evaluated experimentally at about 15 characters
Secondary Structure prediction
Bidirectional Dynamics for protein secondary structure prediction
Baldi et al., 2000, in Sequence learning, pp. 80-114
Recurrent Neural Network implementation
Overview
Protein folding and parallel computing.
Current simulation works for modest-sized systems.
Homology modeling and statistical mechanics.
There is a clear advantage in using the information that we already have to solve new problems.
Secondary structure prediction and artificial intelligence.
Machine learning is appropriate to capture the trends leading to prediction.