Protein folding
![Page 1: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/1.jpg)
Protein folding
Process of folding
Modeling the process of folding
Evolution vs. folding
Impact of function on protein evolution
![Page 2: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/2.jpg)
Process
Local Interactions
Secondary Structure Elements (SSE)
Assembly of SSE
Equilibrium Structure
![Page 3: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/3.jpg)
Protein folding
http://www.blueprint.org/proteinfolding/trades/details/trades_movies.html
![Page 4: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/4.jpg)
Protein folding
Important thing to note
It is possible that residues that are not doing anything in the folded protein were
actually critically important to get the peptide folded in the
first place.
![Page 5: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/5.jpg)
Protein folding
Simulation studies indicate that the most common protein folds are those that can
withstand the most sequence variation over time without
changes to their topology. The prion protein is a poster-child example
of the opposite.
![Page 6: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/6.jpg)
Protein Evolution
Evolutionary meaning
Most common folds are those able to
withstand point mutations the best.
These are known as designable folds.
![Page 7: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/7.jpg)
Protein folding
Marginal stability
The most stable folds are not necessarily those
with the lowest energy,
but those that maximally penalize switching to an alternative conformation.
![Page 8: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/8.jpg)
Protein Evolution
Marginal stability
Evolutionary implication(s)
There is thus selective pressure on the residues of a
protein not only to maintain important
interactions, but also to make sure that some interactions NEVER
happen.
![Page 9: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/9.jpg)
Summary
Proteins fold into energetically stable conformations.
For one chain, however, there is a large number of possible conformations.
The biological conformation is selected during folding; it is not necessarily the “best” conformation.
![Page 10: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/10.jpg)
Role of biology on structures
A few examples using mapping of rate of evolution.
The fitness of a protein is ultimately its biological function, not its structure.
We’ll have a look at their structural requirements.
![Page 11: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/11.jpg)
Structural Biology
![Page 12: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/12.jpg)
Outline
How genetics encodes structure.
What makes a protein fold.
Role of biological function in preserving a fold.
Comparing two structures for similarities.
![Page 13: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/13.jpg)
Genetic information and proteins
3D information is encoded into (1D) sequences.
STKKKPLTQEQLEDARRLKA IYEKKKNELGLSQESVADKM GMGQSGVGALFNGINALNAY NAALLAKILKVSVEEFSPSIAREIYEMYEA
Protein structure of the λ repressor (cI) from phage Lambda, PDB: 1LMB
?
![Page 14: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/14.jpg)
Genetic information and proteins
The encoding can only be indirect
Because there is nothing in the DNA
that tells each amino acid where to go.
![Page 15: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/15.jpg)
Genetic information and proteins
However,
There are a few types of physical interactions that dominate
the process of protein folding.
![Page 16: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/16.jpg)
Amino acids
Components
Main Chain
Side Chains
Side Chains
Responsible for the “name”.
Can be clustered based on:
- chemical properties
- structure
This ultimately determines the evolutionary interchangeability.
![Page 17: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/17.jpg)
Protein folding
Van der Waals forces
The electron clouds around the nuclei are more
stable if they can interact weakly with other electron clouds.
This makes atoms sticky relative to each other.
![Page 18: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/18.jpg)
Protein folding
Electrostatic forces
Long range interactions.
Pull/Push over longer distances.
![Page 19: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/19.jpg)
Protein folding
Hydrogen bonds
Electrostatic. Short range, not flexible
Can be seen as the velcro holding proteins
together.
![Page 20: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/20.jpg)
Protein folding
Hydrophobic interactions
Water molecules in the liquid pack so as to minimize their
energies.
This implies that water molecules are, more often than not, H-bonded
to their neighbors.
![Page 21: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/21.jpg)
Protein folding
Hydrophobic interactions
If you introduce a droplet of oil in solution, many hydrogen bonds will have to be broken at the interface, at an energy cost.
This is why hydrophobic and hydrophilic groups look like they are avoiding
each other.
![Page 22: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/22.jpg)
Protein folding
During folding,
The polypeptide has to follow a strict
sequence of events in order to find the
correct conformation in a timely fashion.
![Page 23: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/23.jpg)
Protein folding
Secondary Structures
Stable because of local H-bonds.
Make larger blocks with fewer degrees of freedom of
movement.
![Page 24: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/24.jpg)
Protein folding
Geometry plays a very important role.
Because there are only a few angles that can
change along the backbone, there is a
limited number of ways a protein can
fold onto itself.
![Page 25: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/25.jpg)
Protein structures are organized in a Hierarchical fashion
Secondary structures - Geometry
Dihedral Angle
Because most main chain atoms are constrained in an “amide bond”, the entire trajectory of the chain can be defined by a pair of angles (for each AA): φ, ψ
This can be represented with a
“Ramachandran Plot”,
from which it is obvious that there is some kind of clustering going on.
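A dihedral angle is fixed by four consecutive atoms, so the backbone angles behind a Ramachandran plot can be computed directly from coordinates. The sketch below is only an illustration, assuming plain (x, y, z) tuples; the function name `dihedral` and the sample points are hypothetical and do not come from the slides.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy geometries (made-up points): eclipsed and fully extended.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

For a real residue, φ would use the atoms C(i-1), N, Cα, C and ψ the atoms N, Cα, C, N(i+1); plotting the (φ, ψ) pairs of every residue gives the Ramachandran plot.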
![Page 26: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/26.jpg)
Protein structures are organized in a Hierarchical fashion
Secondary structures – The alpha helix
The Hydrogen Bond
Again, a helix is an ideal setup to place our “velcro” H-bonds always at the right place.
Periodicity
To the delight of statisticians and computer scientists.
![Page 27: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/27.jpg)
Protein structures are organized in a Hierarchical fashion
Secondary structures – The beta strand (beta sheets)
Another periodic pattern
Responsible for super-structure rigidity and some truly amazing patterns.
![Page 28: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/28.jpg)
Protein structures are organized in a Hierarchical fashion
Secondary structures – The myth of “random” coil.
Random structures in proteins are extremely rare.
Many use the expression anyway to refer to the “rest” of the protein.
Other minor secondary structures
Turns, loops, bridges, although these don’t have the critical periodicity found in α and β structures.
![Page 29: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/29.jpg)
Protein structures are organized in a Hierarchical fashion
Tertiary structures – The reason to care about secondary structures.
Secondary structures are building blocks
Detecting and predicting secondary structures is a key process in structural biology.
Other uses
Visualization, classification…
![Page 30: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/30.jpg)
Protein Diversity
The current release of PDB contains 28,000
structure entries.
26,000 are proteins
There are an estimated 600-8000 possible
unique protein folds.
http://www.jacquesdeshaies.com/expositions/virtual/new-virtual/uppsala-invit.html
![Page 31: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/31.jpg)
PDB
Overview
Repository of structures
Proteins, nucleotides, complexes, mutants
Quality improves over time
Data validation tools are getting better. More redundant structures are available for cross-reference.
![Page 32: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/32.jpg)
Small number of folds
Does this mean that all proteins
come from a small set of ancestor
molecules?
Perhaps, but not necessarily.
![Page 43: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/43.jpg)
Fast Slow
Maximum-Likelihood Site-Rates are Biologically Relevant
Rhodopsin-like G-protein receptors
Pfam (dataset 1Tml_7) 69 taxa
![Page 44: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/44.jpg)
Maximum-Likelihood Site-Rates are Biologically Relevant
Tubulin
34 taxa 33 taxa
The constraints imposed by co-evolution far outweigh the
structural constraints.
Fast Slow
![Page 45: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/45.jpg)
Phylogenetic mapping of structures
Predicting rates of evolution
This experiment was conducted to see if we could predict the rate of evolution in
the enzyme Enolase.
![Page 46: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/46.jpg)
Phylogenetic mapping of structures
Predicting rates of evolution
The most important factor to predict
evolutionary constraints was the
presence of the active site.
Evolutionarily constrained by the active site.
![Page 47: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/47.jpg)
Summary
Structures are rigid templates to provide some biological function.
It takes a lot of structure to position a few atoms in an enzyme.
![Page 48: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/48.jpg)
Structural Homology
Because one structure is made of thousands of coherent interactions:
The probability of seeing a new structure emerge from a random sequence is close to 0.
Therefore: similar structures are likely to be homologous.
![Page 49: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/49.jpg)
Use of structural similarity in evolutionary studies
Homology can be detected via sequence identity.
Structures drift at a much slower rate. In fact, are they drifting at all?
Structural similarity can be used to detect homology, although there is evidence that
convergence is much more common in structure than in sequence.
![Page 50: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/50.jpg)
Structural Convergence
There are only so many different ways to fold a dozen secondary structure elements.
Some folds are much more likely to evolve because they are more robust to mutations.
Designability
![Page 51: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/51.jpg)
Protein Similarity
VAST
Align secondary structures only.
Consider the geometric transformation that brings as
many helices and strands together as possible.
CE
Break down each structure into
peptides of 8 residues. Find the best match against a reference
protein. The final alignment is the transformation that aligns as many continuous residues as
possible.
![Page 52: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/52.jpg)
Comparing and aligning structures
Expanding into detection methods
What about remote, yet significant, similarities?
Example on the right
There is a significant similarity between a single domain in two distinct proteins (yellow and orange).
Are they homologous?
![Page 53: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/53.jpg)
Comparing and aligning structures
Difficulties in aligning structures.
In some cases, the order of the elements that superimpose has been shuffled by circular permutation.
There are many cases of structurally similar proteins with no more than a random degree of identity at the sequence level.
![Page 54: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/54.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
Probably the most used service for protein structure alignment, since it runs off the NCBI web site and has already been run on every available structure.
1 – Given two proteins A and B.
2 – Given that each structure has a collection
of secondary structure elements (SSE), helices (H) and strands (S):
$$A_{SSE} = \{H_1, S_2, H_3, \ldots, H_n\}$$
$$B_{SSE} = \{H_1, H_2, S_3, \ldots, S_m\}$$
![Page 55: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/55.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
3 – Find the rotation and translation to apply to each helix/strand in A to align it with each element in B.
These transformations can be summarized by a matrix with one entry per pair of elements:
$$\begin{pmatrix} A_1B_1 & A_2B_1 & \cdots & A_nB_1 \\ A_1B_2 & A_2B_2 & \cdots & A_nB_2 \\ \vdots & \vdots & \ddots & \vdots \\ A_1B_m & A_2B_m & \cdots & A_nB_m \end{pmatrix}$$
$$A_{SSE} = \{H_1, S_2, H_3, \ldots, H_n\}, \qquad B_{SSE} = \{H_1, H_2, S_3, \ldots, S_m\}$$
![Page 56: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/56.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
4 – If two structures are identical, each helix/strand will be part of a pair with a common transformation, which would just be the transformation that aligns the whole proteins.
In remotely similar structures, not all helices/strands will have a match.
The best set of rotation/translation will be the one that is shared by the largest number of secondary structure pairs.
![Page 57: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/57.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
4 – Sharing has to be defined a bit more formally: two transformations i and j are considered shared when they differ by less than α, some kind of tolerance cutoff used to decide whether two transformations are identical.
![Page 58: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/58.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
4a – Every time we have a “match”, we draw a link between the two transformations. The result is a so-called graph with connections only between similar sets of rotation/translation.
[Figure: graph of transformations, with links between matching pairs i and j]
![Page 59: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/59.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
5 – Once the problem is abstracted into a ‘graph’, it is possible to use the computational bag-of-tricks to figure out which set of connected matrices forms the largest group. The average rotation/translation in this group would best superimpose proteins A and B.
[Figure: graph of transformations]
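The graph step can be sketched in a few lines. This is a toy illustration and not the actual VAST implementation: each transformation is reduced to a short tuple of made-up numbers, two transformations are linked when every component differs by less than a tolerance `alpha`, and the "largest group" is taken to be the largest connected component. The function names and the sample data are assumptions.

```python
def largest_matching_group(transforms, alpha):
    """Indices of the largest set of transformations connected by 'match' links."""
    n = len(transforms)

    def similar(t, u):
        # Two transformations match if all components agree within alpha.
        return max(abs(a - b) for a, b in zip(t, u)) < alpha

    links = {i: [j for j in range(n)
                 if j != i and similar(transforms[i], transforms[j])]
             for i in range(n)}

    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            group.append(node)
            for other in links[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        if len(group) > len(best):
            best = group
    return sorted(best)

def mean_transform(transforms, group):
    """Component-wise average over the group (only meaningful for toy data)."""
    size = len(transforms[0])
    return tuple(sum(transforms[i][d] for i in group) / len(group)
                 for d in range(size))

toy = [(0.0, 0.0), (0.1, 0.0), (0.15, 0.05), (5.0, 5.0), (5.1, 5.0)]
group = largest_matching_group(toy, alpha=0.2)
consensus = mean_transform(toy, group)
```

Averaging real rotations component by component is not generally valid; the average here only stands in for the "average rotation/translation" mentioned on the slide.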
![Page 60: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/60.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
7 – The alignment is performed irrespective of the sequence order of the structural elements. This is good because it can catch circularly permuted proteins, but it also increases the chance of finding a match by accident.
![Page 61: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/61.jpg)
Comparing and aligning structures
VAST (Vector Alignment Search Tool)
8 – Statistical validation. This is a very important step since there is only a limited number of ways in which a small number of SSE can interact. Thus, sampling in a large database of random structures would still return a distribution of “hits”.
This is second-hand information:
The p-value is the probability of observing a similar score by chance, multiplied by the number of possible alternative substructures within the comparison.
The default cutoff is 0.05, which should be regarded as a noise reduction cutoff, not a bulletproof jacket.
![Page 62: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/62.jpg)
Comparing and aligning structures
CE (Combinatorial extension)
CE doesn’t use secondary structure elements as the basic aligning unit. Instead, it seeks the optimal path amongst all possible n-mers between two query proteins.
1 – Given two proteins A and B of length nA and nB, CE will search for the longest continuous path P of aligned fragment pairs (AFP) of length m.
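To get a feel for the size of the search, the candidate AFPs can be enumerated as pairs of start positions. This is a minimal sketch, assuming fragments of m = 8 residues as mentioned earlier in the deck; the function name `afp_starts` is hypothetical.

```python
def afp_starts(n_a, n_b, m=8):
    """All (i, j) pairs where residues i..i+m-1 of A face residues j..j+m-1 of B."""
    return [(i, j)
            for i in range(n_a - m + 1)
            for j in range(n_b - m + 1)]

candidates = afp_starts(10, 12)
```

The number of candidates is (nA - m + 1)(nB - m + 1), which is why the path search later has to be pruned.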
![Page 63: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/63.jpg)
Comparing and aligning structures
CE (Combinatorial extension)
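4 – Some distance metric has to be made up to score AFP alignments. Here $d^A_{x,y}$ is the distance between residues x and y of protein A, and $p_i$ is the starting position of AFP i.
Each residue is counted once:
$$D_{ij} = \frac{1}{m} \sum_{k=0}^{m-1} \left| d^A_{p_i+k,\,p_j+k} - d^B_{p_i+k,\,p_j+k} \right|$$
Each residue is counted against all:
$$D_{ij} = \frac{1}{m^2} \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} \left| d^A_{p_i+k,\,p_j+l} - d^B_{p_i+k,\,p_j+l} \right|$$
Using RMSD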
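The first two metrics can be written down almost literally. The sketch below is an illustration under assumptions: an AFP is represented as a hypothetical `(start_in_A, start_in_B)` tuple, distances are taken between arbitrary 3D points standing in for Cα atoms, and the test coordinates are made up.

```python
import math

def distance_matrix(coords):
    """All pairwise distances within one structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def d_diagonal(dA, dB, afp_i, afp_j, m):
    """Each residue is counted once (single sum over k)."""
    (ai, bi), (aj, bj) = afp_i, afp_j
    total = sum(abs(dA[ai + k][aj + k] - dB[bi + k][bj + k]) for k in range(m))
    return total / m

def d_full(dA, dB, afp_i, afp_j, m):
    """Each residue is counted against all (double sum over k and l)."""
    (ai, bi), (aj, bj) = afp_i, afp_j
    total = sum(abs(dA[ai + k][aj + l] - dB[bi + k][bj + l])
                for k in range(m) for l in range(m))
    return total / m ** 2

# Made-up coordinates: a helix-like curve, a translated copy, and a straight line.
helix = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(20)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in helix]
line = [(float(t), 0.0, 0.0) for t in range(20)]

dA, dB, dC = (distance_matrix(s) for s in (helix, shifted, line))
same = d_full(dA, dB, (0, 0), (10, 10), 8)
different = d_full(dA, dC, (0, 0), (10, 10), 8)
```

Because both metrics compare internal distances, a translated copy scores essentially zero without any superposition, whereas the RMSD variant would first need the two fragments to be superimposed.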
![Page 64: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/64.jpg)
Comparing and aligning structures
CE (Combinatorial extension)
4 – Pathfinding
There is a substantial decrease in the size of the search space by restricting the value of G.
1 – Select all possible next AFPs under a certain (self) threshold.
2 – Consider the path to choose the best next AFP.
3 – Choose whether to pursue the extension or not.
![Page 65: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/65.jpg)
Comparing and aligning structures
CE (Combinatorial extension)
4 – Statistics use a z-score, which compares paths of similar length and score to a random sampling from a reference database.
z-score of 3.5 -> p-value of 10^-3
So, given about 2000 different protein folds, such a threshold would imply two fortuitous hits. Visual inspection must be done, and a more restrictive threshold should be used.
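The arithmetic behind the "two fortuitous hits" is simply the p-value multiplied by the number of comparisons. A small sketch follows; the function names are hypothetical, and the stricter threshold shown is the standard Bonferroni-style division, which the slides do not name explicitly.

```python
def expected_chance_hits(p_value, n_comparisons):
    """Number of hits expected by chance alone at a given p-value."""
    return p_value * n_comparisons

def stricter_threshold(target_chance_hits, n_comparisons):
    """Per-comparison p-value needed to expect only `target_chance_hits` by chance."""
    return target_chance_hits / n_comparisons

hits = expected_chance_hits(1e-3, 2000)      # about 2 chance hits
cutoff = stricter_threshold(0.05, 2000)      # 2.5e-05 per comparison
```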
![Page 66: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/66.jpg)
Comparing and aligning structures
CE (Combinatorial extension)
Structural similarity between Acetylcholinesterase and Calmodulin found using CE (Tsigelny et al, Prot Sci, 2000, 9:180)
![Page 67: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/67.jpg)
SCOP database
http://scop.berkeley.edu/
Seen as the gold standard for protein structure classification.
Query for structures given a protein sequence.
Browse protein architecture organized in a hierarchical fashion.
Keyword search for structures.
- Fold: common topology of secondary structures
- Superfamilies: probable common evolutionary origins, low sequence identity
- Families: common evolutionary origins
- Domains: individual domains
![Page 68: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/68.jpg)
CATH database
http://www.biochem.ucl.ac.uk/bsm/cath/
Involves manual inspection and classification, especially at more abstract levels such as the architecture level.
- Class: secondary structure composition
- Architecture (what would be known as fold in SCOP)
- Topology (what would be known as superfamily in SCOP)
- Sequence level
![Page 69: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/69.jpg)
Summary
Aligning protein structures can detect homologous relationships that are deeper than those found by sequence alignment, because structures are more stable over time.
VAST abstracts proteins into SSE, or secondary structure elements, and finds the set of rotation/translation that maximizes the number of paired SSE.
CE looks for the best alignment frame to superimpose one protein onto another.
Statistics are important because it is likely that small unrelated structures will resemble each other.
![Page 70: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/70.jpg)
Summary
A distribution of random protein scores can be generated by aligning unrelated proteins in the databases. An alignment score must be significantly larger than the scores expected from this distribution.
This type of analysis is used to classify protein folds and infer relationships between structural evolution and biological activity.
Try to find structural neighbors of the protein 1AZT while browsing the NCBI website ( www.ncbi.nlm.nih.gov ).
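The comparison against a random score distribution mentioned in this summary can be sketched as a z-score. The background scores below are invented for illustration, and `z_score` is a hypothetical helper, not part of VAST or CE.

```python
import statistics

def z_score(score, background):
    """How many standard deviations `score` lies above the background mean."""
    mean = statistics.mean(background)
    spread = statistics.stdev(background)   # sample standard deviation
    return (score - mean) / spread

# Made-up alignment scores between unrelated proteins.
random_scores = [10, 12, 11, 9, 13, 10, 12, 11]
z = z_score(20, random_scores)
```

A score would be retained only if its z-score clears the chosen cutoff, for example the 3.5 quoted for CE.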
![Page 71: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/71.jpg)
Molecular Modeling
Lecture 4
![Page 72: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/72.jpg)
Why model proteins
Example of applications
Modeling the binding site of the anticodon on eRF3
Modeling substrate binding in the active site of Mandelate racemase.
Solving X-ray and NMR structures.
The theory behind the calculation
Parametrizing protein models
Molecular mechanics as an optimization problem
Molecular mechanics as a time simulation
Conceptual clash between protein folding and molecular mechanics.
![Page 73: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/73.jpg)
Why model proteins
Anticodon binding site on eRF3
2 possibilities.
From phylogenetic information, a few residues were identified as players.
Use molecular mechanics to “see” whether the surface of the protein can accommodate
an anticodon.
![Page 74: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/74.jpg)
Why model proteins
Modeling a weird substrate into an active site.
Mandelate racemase can bind a substrate with two rings! Is there room for this in the wild type active site?
The answer is yes, although a bit counter-intuitive.
![Page 75: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/75.jpg)
How structures are viewed
Pre-computer days
Sir John Kendrew and his model of myoglobin, 1958
![Page 76: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/76.jpg)
How structures are resolved
X-ray diffraction
Principle
Create a lattice of protein in a crystal.
Collect thousands of diffraction patterns across all rotational degrees of freedom.
Subtract the phase between the layers in the lattice.
Compile into a 3D volume based on the density of reflective material (electrons in this case).
Thread the model into the density map, and optimize the geometry using the density map as an additional criterion.
![Page 77: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/77.jpg)
How structures are resolved
NMR spectroscopy
Principle
Use magnetic fields and “radio” frequency photons to detect shifts in nuclear states.
Assign shifts to a model along the chain.
Correlate the mutual effects of elements on each other to come up with a list of constraints (typically distances).
Optimize the trajectory of the modeled chain, given this list of constraints.
![Page 79: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/79.jpg)
Physical simulation in Molecular Modeling
Jensen, F., 1999, Introduction to Computational Chemistry, Wiley,
Chichester, UK
Why is it useful to you?
Modeling is used often by experimental biochemists and is a staple in structural biology.
The complexity of the simulation is far beyond the complexity of the interface. This inevitably conveys a false sense that the “default” settings will do fine.
![Page 80: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/80.jpg)
Physical simulation in Molecular Modeling
Limitations
True atoms and bonds are probabilistic constructs. The computation of the resulting geometries is a very involved process for which the analytical equations are not fully worked out.
Luckily, the observable behavior is much more predictable and thus can be modeled under a limiting set of assumptions.
![Page 81: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/81.jpg)
Physical simulation in Molecular Modeling
Assumptions
Newtonian physics is used to simulate molecules under a set of restrictions which for proteins would be:
1. In solution (or vacuum).
2. Near room temperature.
3. Chemically inert.
![Page 82: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/82.jpg)
Physical simulation in Molecular Modeling
Abstraction
Each atom has a fixed geometry constrained by a somewhat arbitrary energy scoring scheme.
The problem thus boils down to finding the set of coordinates for all atoms that minimizes the energy.
There is no absolute correspondence between this scoring scheme and experimentally measurable energy values.
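Energy minimization in its simplest form is a walk down the gradient of the scoring function. The sketch below is a one-dimensional toy, assuming a single bond with a harmonic stretch term; the force constant, reference length, and step size are made-up values, and real packages use far more elaborate optimizers.

```python
def stretch_energy(r, k=300.0, r0=1.5):
    """Toy harmonic bond energy, E = k (r - r0)^2."""
    return k * (r - r0) ** 2

def stretch_gradient(r, k=300.0, r0=1.5):
    """Derivative of the toy energy with respect to the bond length."""
    return 2.0 * k * (r - r0)

def minimize(r, step=1e-3, n_steps=200):
    """Steepest descent: repeatedly move against the gradient."""
    for _ in range(n_steps):
        r -= step * stretch_gradient(r)
    return r

relaxed = minimize(2.0)   # starts from a stretched bond of length 2.0
```

With many atoms the same loop runs over all coordinates at once, which is where the derivatives of the energy function come into play.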
![Page 83: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/83.jpg)
Molecular Modeling in Bioinformatics
Modeling
Although only a small subset of all possible atoms ends up in biological molecules, each atom has a set of different states in which it can exist. These states are referred to as types in molecular modeling.
![Page 84: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/84.jpg)
Molecular Modeling in Bioinformatics
Energy function
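The energy function is used to evaluate a structure and to calculate the derivatives used to optimize it.
$$E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{VdW} + E_{el} + E_{cross}$$
Scaling annotations, in the order they appear on the slide: $O(N)$, $O(N)$, $O(N)$, $O(N^2)$, $O(N^2)$, $O(N^2)$.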
![Page 85: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/85.jpg)
Molecular Modeling in Bioinformatics
Computational efficiency and limitations of the model
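The energy function is used to evaluate a structure and to calculate the derivatives used to optimize it.
$$\text{P2:}\quad E_{str}(R^{AB} - R_0^{AB}) = k^{AB}\,(R^{AB} - R_0^{AB})^2$$
$$\text{P4:}\quad E_{str}(R^{AB} - R_0^{AB}) = k_2^{AB}\,(R^{AB} - R_0^{AB})^2 + k_3^{AB}\,(R^{AB} - R_0^{AB})^3 + k_4^{AB}\,(R^{AB} - R_0^{AB})^4$$
$$\text{Morse:}\quad E_{str}(R^{AB} - R_0^{AB}) = D^{AB}\left[1 - e^{-\alpha^{AB}(R^{AB} - R_0^{AB})}\right]^2$$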
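As a minimal sketch of the stretch terms on this slide, the Python below compares the harmonic (P2) form with the Morse form. The parameter values (`R0`, `D`, `ALPHA`) are illustrative assumptions, not force-field values, and `K` is chosen as D·alpha² so that both curves agree for small displacements.

```python
import math

def e_str_harmonic(r, r0, k):
    """P2 (harmonic) stretch energy: k * (r - r0)^2."""
    return k * (r - r0) ** 2

def e_str_morse(r, r0, d, alpha):
    """Morse stretch energy: D * (1 - exp(-alpha * (r - r0)))^2."""
    return d * (1.0 - math.exp(-alpha * (r - r0))) ** 2

# Illustrative (assumed) parameters in arbitrary units.
R0, D, ALPHA = 1.5, 100.0, 2.0
K = D * ALPHA ** 2  # matches the curvature of the Morse well at r0

for r in (1.4, 1.5, 1.6, 3.0, 10.0):
    # Near r0 the two forms agree; far from r0 the harmonic form keeps
    # growing while the Morse form levels off at the dissociation energy D.
    print(r, e_str_harmonic(r, R0, K), e_str_morse(r, R0, D, ALPHA))
```

The harmonic form is cheaper to evaluate but cannot describe bond breaking, which is consistent with the "chemically inert" assumption made earlier in the deck.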
![Page 86: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/86.jpg)
Molecular Modeling in Bioinformatics
Parameterization nightmare
Can someone come up with all these numbers?
Generalization
How robust is the simulation over a range of conditions?
Computational cost
The longer it takes to perform a single task, the fewer iterations will be computed in the same amount of time.
![Page 87: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/87.jpg)
Molecular Modeling in Bioinformatics
Parameterization nightmare
Can someone come up with all these numbers?
For MM2 forcefield (71 atom types):
| Term | Params (est.) | Determined |
|---|---|---|
| E(VdW) | 142 | 142 |
| E(str) | 900 | 290 |
| E(bnd) | 27,000 | 824 |
| E(tors) | 1,215,000 | 2,466 |
| E(cross) | 10^(7-8) | ? |
hc
E
![Page 88: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/88.jpg)
Molecular Modeling in Bioinformatics
Generalization
How robust is the simulation over a range of conditions?
In the example to the left, the EXP.-6 model causes nuclear fusion at unrealistic distances.
Such unrealistic distances will be sampled in Monte Carlo, genetic algorithm, and simulated annealing experiments.
![Page 89: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/89.jpg)
Molecular Modeling in Bioinformatics
Lennard-Jones
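Is actually a computational trick: there is no need to compute R (which requires a square root), only $R^n$ where n is an even power.
$$E_{LJ}(R) = \varepsilon\left[\left(\frac{R_0}{R}\right)^{12} - \left(\frac{R_0}{R}\right)^{6}\right]$$
$$R_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$
$$E_{EXP\text{-}6}(R) = A\,e^{-BR} - \frac{C}{R^6}$$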
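The square-root trick can be sketched as follows. This is a minimal illustration that assumes the form E(R) = eps·[(R0/R)^12 − (R0/R)^6]; the function names and the values of `r0` and `eps` are hypothetical. Only the squared distance is ever computed.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 3D points (no square root)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lj_energy_from_r2(r2, r0, eps):
    """Lennard-Jones energy eps * ((r0/R)^12 - (r0/R)^6), given R^2 only."""
    s2 = (r0 * r0) / r2   # (r0/R)^2
    s6 = s2 * s2 * s2     # (r0/R)^6
    return eps * (s6 * s6 - s6)

atom_i = (0.0, 0.0, 0.0)
atom_j = (1.0, 2.0, 2.0)
r2 = squared_distance(atom_i, atom_j)  # R^2 = 9, so R = 3
print(lj_energy_from_r2(r2, r0=3.5, eps=0.2))
```

An EXP-6 term, by contrast, needs R itself for the exponential, so it pays for one square root per atom pair.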
![Page 90: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/90.jpg)
Molecular Modeling in Bioinformatics
Lennard-Jones
In practice, Lennard-Jones is optimized to reproduce validated results (and works out satisfactorily).
![Page 91: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/91.jpg)
Molecular Modeling in Bioinformatics
Electrostatic Models
… are real ugly.
Why does this matter?
Electrostatic interactions decay as 1/distance, which makes them the longest-ranged interactions.
Examples
Coulomb’s Law
$$E_{el}(R^{AB}) = \frac{Q^A Q^B}{\varepsilon\,R^{AB}}$$
![Page 92: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/92.jpg)
Molecular Modeling in Bioinformatics
Electrostatic Models
… are real ugly.
Why does this matter?
Electrostatic interactions decay as 1/distance, which makes them the longest-ranged interactions.
Examples
Dipolar moment interactions
$$E_{el}(R^{AB}) = \frac{\mu^A \mu^B}{\varepsilon\,(R^{AB})^3}\left(\cos\chi - 3\cos\alpha_A \cos\alpha_B\right)$$
![Page 93: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/93.jpg)
Molecular Modeling in Bioinformatics
Computational cost of non-bonded energy (VdW, El)
~99.88% of computation in protein-sized models. Most of these pairwise contributions are very small and do not contribute significantly to the total energy.
Computational tricks
Cutoff -> blending function -> neighbor list*
*must be updated, which costs $O(N^2)$
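A minimal sketch of the cutoff and neighbor-list idea, assuming plain Python tuples for coordinates (the function name and the example coordinates are hypothetical). Building the list is the O(N²) step; between rebuilds, the non-bonded energy only loops over the stored pairs.

```python
from itertools import combinations

def build_neighbor_list(coords, cutoff):
    """Return index pairs (i, j), i < j, closer than `cutoff` (naive O(N^2) build)."""
    cutoff2 = cutoff * cutoff  # compare squared distances, no square root
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        if d2 < cutoff2:
            pairs.append((i, j))
    return pairs

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(build_neighbor_list(coords, cutoff=2.0))
```

The list goes stale as atoms move, so it has to be rebuilt periodically; how often is a trade-off between accuracy and cost.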
Validation
1. Reproduces Geometries
2. Reproduces Relative energies.
![Page 94: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/94.jpg)
E is not G
$$G = H - TS$$
Real energies are temperature-dependent.
Entropic contribution cannot be calculated from a snapshot.
![Page 95: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/95.jpg)
Principle of optimization
You start with a protein for which you know all coordinates.
Evaluate the energy
Find a better structure, usually with small changes
Repeat until no better structure can be found.
This task is almost NEVER straightforward, unless the system is made of a small number of atoms.
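The loop described on this slide (evaluate, make a small change, repeat until nothing improves) can be sketched as a steepest-descent minimizer. This is an illustration under stated assumptions, not a production optimizer: the toy energy function, the step size, and the finite-difference gradient are all hypothetical choices.

```python
def steepest_descent(energy, coords, step=0.01, tol=1e-8, max_iter=10000, h=1e-6):
    """Lower energy(coords) by repeated small downhill moves; stop when no move helps."""
    coords = list(coords)
    e = energy(coords)
    for _ in range(max_iter):
        # Forward finite-difference estimate of the gradient.
        grad = []
        for i in range(len(coords)):
            shifted = coords[:]
            shifted[i] += h
            grad.append((energy(shifted) - e) / h)
        trial = [c - step * g for c, g in zip(coords, grad)]
        e_trial = energy(trial)
        if e_trial >= e - tol:  # no meaningfully better structure: stop
            break
        coords, e = trial, e_trial
    return coords, e

def toy_energy(x):
    """Toy 1D 'bond' between two particles with an assumed ideal length of 1.5."""
    return (x[1] - x[0] - 1.5) ** 2

final_coords, final_energy = steepest_descent(toy_energy, [0.0, 1.0])
print(final_coords, final_energy)
```

With two coordinates the search is trivial; with thousands of coupled coordinates the same loop gets trapped in the nearest local minimum, which is why the task is rarely straightforward.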
![Page 96: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/96.jpg)
Molecular Modeling in Bioinformatics
Optimization (local minima)
Straightforward, although computationally expensive.
1 – A clear equation.
2 – A defined set of variables.
3 – “Only” three dimensions to worry about.
Steepest Descent (robust, fast)
Conjugate Gradient (improved convergence properties)
Newton-Raphson (saddle points)
Pseudo-NR (progressive Hessian estimate)
![Page 97: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/97.jpg)
Molecular Modeling in Bioinformatics
Optimization (Global minimum)
In a simple cyclic system with 17 main-chain atoms, there are 262 distinct conformations within 3 kcal/mol of the global minimum (out of ~1.6E13 conformers).
Proteins are 1-2 orders of magnitude larger.
Stochastic Methods (Monte-Carlo)
Molecular Dynamics
Simulated Annealing
Genetic Algorithms
Statistical Mechanics
![Page 98: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/98.jpg)
Molecular Modeling in Bioinformatics
Time dependent methods (Molecular Dynamics)
Make use of classical mechanics equations such as:
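$$F = ma$$
$$r_{i+1} = r_i + v_i\,\Delta t + \tfrac{1}{2} a_i\,\Delta t^2 + \tfrac{1}{6} b_i\,\Delta t^3 + \dots$$
$$r_{i-1} = r_i - v_i\,\Delta t + \tfrac{1}{2} a_i\,\Delta t^2 - \tfrac{1}{6} b_i\,\Delta t^3 + \dots$$
$$r_{i+1} = 2 r_i - r_{i-1} + a_i\,\Delta t^2$$
Verlet Algorithm
Numerical solution to Newton’s equations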
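A minimal one-dimensional sketch of the Verlet update r(i+1) = 2 r(i) − r(i−1) + a(i) Δt². The harmonic test system (unit mass and force constant, so a = −r) and the function name are assumptions chosen because the exact answer, cos(t), is known.

```python
import math

def verlet_trajectory(r0, v0, accel, dt, n_steps):
    """Integrate one coordinate with the position-Verlet recurrence."""
    # The recurrence needs r(-1); get it from a backward Taylor step.
    r_prev = r0 - v0 * dt + 0.5 * accel(r0) * dt ** 2
    r = r0
    traj = [r0]
    for _ in range(n_steps):
        r_next = 2.0 * r - r_prev + accel(r) * dt ** 2
        r_prev, r = r, r_next
        traj.append(r)
    return traj

# Assumed test system: harmonic oscillator with m = k = 1, so a(r) = -r.
traj = verlet_trajectory(r0=1.0, v0=0.0, accel=lambda r: -r, dt=0.01, n_steps=628)
print(traj[-1], math.cos(628 * 0.01))  # numerical vs. exact position
```

Note that the velocity never appears in the update itself; it only enters through the starting step.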
![Page 99: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/99.jpg)
Molecular Modeling in Bioinformatics
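$$r_{i+1} = r_i + v_i\,\Delta t + \tfrac{1}{2} a_i\,\Delta t^2 + \tfrac{1}{6} b_i\,\Delta t^3 + \dots$$
$$r_{i-1} = r_i - v_i\,\Delta t + \tfrac{1}{2} a_i\,\Delta t^2 - \tfrac{1}{6} b_i\,\Delta t^3 + \dots$$
$$r_{i+1} = 2 r_i - r_{i-1} + a_i\,\Delta t^2$$
Verlet Algorithm
Numerical solution to Newton’s equations
Problems with this method
No explicit use of speed (which is needed to calculate the total energy):
$$E_{Tot} = \sum_{i=1}^{N} \tfrac{1}{2} m_i v_i^2 + U(r)$$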
![Page 100: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/100.jpg)
Molecular Modeling in Bioinformatics
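$$r_{i+1} = r_i + v_i\,\Delta t + \tfrac{1}{2} a_i\,\Delta t^2 + \tfrac{1}{6} b_i\,\Delta t^3 + \dots$$
$$r_{i-1} = r_i - v_i\,\Delta t + \tfrac{1}{2} a_i\,\Delta t^2 - \tfrac{1}{6} b_i\,\Delta t^3 + \dots$$
$$r_{i+1} = r_i + v_{i+1/2}\,\Delta t$$
$$v_{i+1/2} = v_{i-1/2} + a_i\,\Delta t$$
Leapfrog Algorithm
Numerical solution to Newton’s equations
Timestep
Reasonable: femtoseconds ($10^{-15}$ s)
Scope of simulation (ideal): millisecond ($10^{-3}$ s)
(practical): microsecond ($10^{-6}$ s)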
![Page 101: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/101.jpg)
Molecular Modeling in Bioinformatics
Simulated Annealing
Robustness vs. initial solution
Variable contribution of the objective function.
Broader sampling.
Both help to explore around a minimum.
[Equation residue: F (net movement) combines U (potential) and K (kinetic) through a blending function.]
![Page 102: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/102.jpg)
Protein folding from Scratch
Must be restrained to a limited scope
Two miniproteins: TC5b and TC3b
Both have reference structures for validation.
Sequences
NLYIQWLKDGGPSSGRPPPS (TC5b; 304 atoms)
NLFIEWLKNGGPSSGAPPPS (TC3b; 289 atoms)
Software: AMBER 6.0
Model: AMBER
Solvation: Generalized Born/solvent-accessible surface area
This means that the water molecules are not explicitly defined in the simulation and the effect of the solvent is treated as a bulk property.
![Page 103: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/103.jpg)
Protein folding from Scratch
Must be restrained to a limited scope
Understanding folding and design: Replica-exchange simulation of “Trp-Cage” miniproteins.
Pitera, JW., Swope, W. 2003. Proc. Natl. Acad. Sci. USA, 100: 7587-7592
![Page 104: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/104.jpg)
Protein folding from Scratch
Algorithm
Initialization
Input: A protein sequence
Output: A starting structure for the main simulation.
1: Thread each residue of the input sequence onto a corresponding 3D model (extended conformation).
2: Minimize with 5000 steps of steepest descent.
3: for i = 1 to 50000 do
      Simulate with Molecular Dynamics
      if i % 1000 == 0 then readjust the temperature to 298 K.
4: Return the equilibrated model.
The equilibration is required to prevent strong “jerking” motions in the first iterations of a simulation.
![Page 105: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/105.jpg)
Protein folding from Scratch Algorithm
Simulation (simulated annealing variant)
Input: P, An equilibrated protein model
Output: A collection of coordinate snapshots (trajectory) for analysis.
1: T = a list of 23 simulation temperatures from 250K to 603K.
2: E = {} , an empty list of experiments
3: for i = 1 to |T| do
4: Pi = Copy P
5: Set the temperature of Pi to Ti
6: Add Pi to E
7: for i = 1 to 4,000,000 (4 ns) do
8:    Simulate using MD (all replicas in parallel)
9:    if i % 250 == 0 then take a snapshot of the coordinates.
10:   if i % 5000 == 0 then
11:      Swap temperatures between processes (Metropolis-style probabilities)
12:      Adjust each experiment in E to its new simulation temperature
13: Discard all but the snapshots taken in the last nanosecond of simulation.
14: Pool all 23 experiments for analysis.
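The "Metropolis-style probabilities" of step 11 can be sketched with the standard replica-exchange acceptance rule, p = min(1, exp[(1/kT_i − 1/kT_j)(E_i − E_j)]). The slides do not give the exact criterion used, so this form, the function names, and the example energies are assumptions.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def swap_probability(e_i, t_i, e_j, t_j):
    """Probability of exchanging temperatures between replicas i and j."""
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def attempt_swap(e_i, t_i, e_j, t_j, rng=random):
    """Draw a random number and decide whether the swap is accepted."""
    return rng.random() < swap_probability(e_i, t_i, e_j, t_j)

# Hypothetical potential energies (kcal/mol) for two neighboring temperatures.
print(swap_probability(-90.0, 300.0, -100.0, 320.0))  # colder replica is higher in energy
print(swap_probability(-100.0, 300.0, -90.0, 320.0))  # colder replica is already lower
```

A swap that hands the lower-energy structure to the colder temperature is always accepted; the reverse is accepted only occasionally, which keeps each temperature sampling its own ensemble.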
![Page 106: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/106.jpg)
Protein folding from Scratch
Computational cost
Ridiculously small protein, no good initial guess.
19 days on 23 IBM POWER3 SP2 processors at 200 MHz (RS/6000 series)
Which, on the campus here, approaches the mean time between power outages!
![Page 107: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/107.jpg)
Protein folding from Scratch
Validation
The root mean square deviation (RMSD)
$$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \left\| x_i - x_i^{ref} \right\|^2}{\sum_{i=1}^{n} w_i}}$$
Which is a suitable distance metric for related structures.
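A minimal sketch of a weighted RMSD, assuming the two structures are already superimposed and that atom i in one list corresponds to atom i in the other. The function name and the coordinates are hypothetical; with all weights equal it reduces to the plain RMSD.

```python
import math

def rmsd(coords, ref, weights=None):
    """Weighted root mean square deviation between two matched coordinate lists."""
    if len(coords) != len(ref):
        raise ValueError("structures must have the same number of atoms")
    if weights is None:
        weights = [1.0] * len(coords)
    total = 0.0
    for (x, y, z), (xr, yr, zr), w in zip(coords, ref, weights):
        total += w * ((x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2)
    return math.sqrt(total / sum(weights))

model = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
reference = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
print(rmsd(model, reference))                      # unweighted
print(rmsd(model, reference, weights=[1.0, 0.0]))  # only the first atom counts
```

In practice an optimal superposition is computed first; that step is left out here.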
![Page 108: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/108.jpg)
Comparing and aligning structures
Why?
There is a need for a distance metric to compare similar protein structures.
Simulation analysis.
Similarity quantification.
Pattern detection.
RMSD
Works well for closely similar structures.
$$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \left\| x_i - x_i^{ref} \right\|^2}{\sum_{i=1}^{n} w_i}}$$
![Page 109: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/109.jpg)
Comparing and aligning structures
RMSD
Works well for closely similar structures.
$$RMSD = \sqrt{\frac{\sum_{i=1}^{n} \left\| x_i - x_i^{ref} \right\|^2}{n}}$$
Absolutely requires some kind of pairwise equivalence between the two compared entities.
![Page 110: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/110.jpg)
Comparing and aligning structures
RMSD
Sequence identity falls quickly.
Hard to separate weak hits from purely random proteins.
$$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \left\| x_i - x_i^{ref} \right\|^2}{\sum_{i=1}^{n} w_i}}$$
![Page 111: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/111.jpg)
Protein folding from Scratch
Validation
The root mean square deviation (RMSD)
$$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \left\| x_i - x_i^{ref} \right\|^2}{\sum_{i=1}^{n} w_i}}$$
≤ 2.0 Å RMSD from any of 38 experimental structures
≤ 2.0 Å RMSD from the average low-temperature structure.
![Page 112: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/112.jpg)
Protein folding from Scratch
Impact
Impact of this paper
Makes good use of parallelism to conduct a heuristic search.
Sampling-based method.
Promising because in many cases the folding of a large protein can be approximated by the folding of its components.
(Remember, domains are independent units in most cases.)
![Page 113: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/113.jpg)
Building a large machine for molecular modeling
IBM Blue Gene project
Architecture
64K FPU
20K FPU (protein folding)
FPU 64-bit @ 700 MHz (low cost, low heat)
64 compute nodes (256 MB) per I/O node (512 MB)
MPI library
3D torus network for fast neighbor to neighbor communication.
![Page 114: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/114.jpg)
High Performance achievement in MD NAMD
Open source
University of Illinois, Dept. of theoretical physics
http://www.ks.uiuc.edu/Research/namd/
Benchmark system
(their big one)
![Page 115: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/115.jpg)
High Performance achievement in MD NAMD
Open source
There is no need to use this system to study protein folding.
Instead, MD was used in this case to study the conversion of torque into energy that can be stored in molecular batteries: the ATP molecule.
![Page 116: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/116.jpg)
Overview
Protein folding and parallel computing.
Homology modeling and statistical mechanics.
Secondary structure prediction and artificial intelligence.
![Page 117: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/117.jpg)
Spectrum of strategies
Physics Knowledge
Quantum mechanics Molecular Mechanics Statistical Mechanics Homology Modeling
![Page 118: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/118.jpg)
Parallel computing and Molecular dynamics
Folding a protein from an extended conformation is a difficult problem because energy barriers have to be crossed.
The following slides describe how barrier crossing can be accelerated using a technique called parallelization.
![Page 119: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/119.jpg)
Parallel computing
It takes 1500 days to complete a thesis for one student
If the student is helped by someone, the work may go 2X as fast: 750 days.
What if 1500 students are working on the same thesis?
Overhead
Communication
Load balancing
![Page 120: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/120.jpg)
Parallel computing
Factors that complicate parallelization:
Some work has to be executed in sequence.
Communicating the tasks and the results takes an increasingly large share of the time as the tasks become smaller.
Each individual process has to wait for the slowest one to finish, leading to a loss of efficiency.
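The first factor, work that must run in sequence, can be put into numbers with Amdahl's law, a standard model that the slides do not name. The sketch below applies it to the thesis analogy; the 10% serial fraction is purely an assumption for illustration, and communication and load-balancing costs are ignored.

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Ideal speedup when a fixed fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

THESIS_DAYS = 1500.0
SERIAL_FRACTION = 0.10  # assumed share of the work that must be done in sequence

for students in (1, 2, 1500):
    speedup = amdahl_speedup(SERIAL_FRACTION, students)
    print(students, round(speedup, 2), round(THESIS_DAYS / speedup, 1))
```

Under this assumption even 1500 students cannot finish more than ten times faster than one, because the serial part alone takes 150 days.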
![Page 121: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/121.jpg)
Time scale in protein folding
On the order of microseconds to milliseconds
This is not achievable by modern computers.
~10 000 days for 1 experiment (~28 years)
folding@home
Hundreds of millions of computers are idle at any time
Why not use their unspent cycles?
Create a “screen saver”
![Page 122: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/122.jpg)
Crossing energy barrier
Most of the time is spent waiting for the thermal motion to topple a structure over a barrier.
Principle of Ensemble dynamics
M CPUs should take M times less time to go over a barrier.
$$f(t) = 1 - \exp(-kt)$$
k = 1/(10,000 ns), M = 10,000, t = 30 ns
M · f(t) ≈ 30 folding events
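The arithmetic behind the "~30 folding events" figure can be checked directly, reading f(t) = 1 − exp(−kt) as the probability that a single trajectory has crossed the barrier by time t and multiplying by the number of independent trajectories M. The variable names are illustrative.

```python
import math

def fraction_folded(k, t):
    """Probability that one trajectory has folded by time t: 1 - exp(-k t)."""
    return 1.0 - math.exp(-k * t)

k = 1.0 / 10_000.0  # folding rate, per ns
M = 10_000          # number of independent simulations
t = 30.0            # length of each simulation, ns

expected_events = M * fraction_folded(k, t)
print(expected_events)  # close to 30
```

Because kt is small, f(t) is approximately kt = 0.003, so the expected count is roughly M·k·t = 30.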
![Page 123: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/123.jpg)
Ensemble Dynamics
Start M dynamic calculations with the same initial structure.
Once one thread finds a barrier and goes over it, copy the state of this thread into all the other replica processes.
The communication overhead is negligible if the crossings are rare events, which is true in this case.
![Page 124: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/124.jpg)
Ensemble Dynamics
Detecting a barrier
A barrier crossing is noticeable as a large variance in energy over the duration of the simulation.
![Page 125: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/125.jpg)
Ensemble Dynamics
Calculation details
We simulated folding and unfolding at 300 K and pH 7.0, using the OPLS parameter set with a Generalized Born implicit solvent model.
Time step: 2 fs
Long-range interactions truncated with a 16 Å cutoff.
![Page 126: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/126.jpg)
What are they doing with this technique?
![Page 127: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/127.jpg)
A more complex system
Note how most of the interactions in the partially folded protein are non-native.
This means that in order to resume folding, these must be broken.
The villin headpiece is one of the fastest-folding peptides known! What about simulating anything else?
![Page 128: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/128.jpg)
Energy Landscape
It is clear in this figure that there are:
1. One folding pathway
2. One intermediate
3. Two energy barriers
![Page 129: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/129.jpg)
![Page 130: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/130.jpg)
Statistical Mechanics
Practical definition for our purpose:
Statistical mechanics can be used to create predictive models in the absence of theoretical models.
For example: interaction between amino-acids.
![Page 131: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/131.jpg)
Statistical mechanics
Atom-level simulations are expensive and empirical.
Statistical mechanics bridges frequencies of observations with physical forces for chemical systems.
The resulting model is thus used to assess the “energy” of a trial conformation and can be used as an objective function to optimize a solution.
This technique is increasingly used in bioinformatics since the information in the database can be seen as a collection of observations at equilibrium.
![Page 132: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/132.jpg)
Statistical mechanics
In other words, if it can’t be seen in the database, the energy state of an observation must be high. If it’s common, the energy must be low.
Remember, everything is possible; the probability of an observation is related to its relative energy.
$$E_i = -RT \ln f_i$$
![Page 133: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/133.jpg)
How does this tie in to bioinformatics?
There is a direct relationship between the energy of a state in a system at equilibrium and the probability of observing this state.
$$E_i = -RT \ln f_i$$
$$E_i = -RT \ln \frac{n_i}{\sum_i n_i}$$
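A minimal sketch of this count-to-energy conversion, E = −RT ln(n / Σn). The state names, the counts, and the 298 K temperature are hypothetical; only the formula comes from the slide.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def energies_from_counts(counts, temperature=298.0):
    """Convert observation counts per state into pseudo-energies: -RT ln(n / total)."""
    total = sum(counts.values())
    return {state: -R_KCAL * temperature * math.log(n / total)
            for state, n in counts.items()}

# Hypothetical database counts for two states of the same system.
counts = {"common_state": 90, "rare_state": 10}
energies = energies_from_counts(counts)
print(energies)
```

The rare state comes out higher in energy than the common one, and only differences between states carry meaning.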
![Page 134: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/134.jpg)
What are “states” in protein structures?
There is a lot of freedom in defining states for protein structures. Here is one example:
[Figure: contact map, with sequence position on both axes]
Contacts
In this plot, if two positions of the 1D sequence are in physical contact, the pair is marked as an orange pixel.
It is thus possible to harvest a matrix of observed contacts from a collection of structures.
![Page 135: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/135.jpg)
What are states in protein structures?
There is a lot of freedom in defining states for protein structures. Here is one example:
[Figure: contact map, with sequence position on both axes]
Contacts
In this case the energy for any given pair would be:
$$E_{Pair}(a,b) = -RT \ln \frac{n(a,b)}{\sum_i n(a,i)}$$
![Page 136: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/136.jpg)
What are states in protein structures?
There is a lot of freedom in defining states for protein structures. Here is one example:
[Figure: contact map, with sequence position on both axes]
In order for this value to be valid, there is an assumption of equilibrium.
Equilibrium:
The sampling would not change significantly over time.
![Page 137: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/137.jpg)
What are states in protein structures?
There is a lot of freedom in defining states for protein structures. Here is one example:
[Figure: contact map, with sequence position on both axes]
Pitfall
In order to be accurate for rare observations, the total number of observations should be infinitely large and derived from sequence-structure pairs in equilibrium.
Practically, there should be enough instances of the rarest entry to avoid large errors on the estimate (log(0)).
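One common way to keep the log(0) case from blowing up is to add a small pseudocount to every entry before taking the logarithm. The slides do not prescribe this, so the sketch below is an assumption, as are the function name, the default of 20 partner types, and the RT value of about 0.592 kcal/mol near room temperature.

```python
import math

def pair_energy(n_ab, n_a_total, n_partners=20, pseudocount=1.0, rt=0.592):
    """-RT ln of a smoothed contact frequency; stays finite when n_ab is 0."""
    smoothed = (n_ab + pseudocount) / (n_a_total + pseudocount * n_partners)
    return -rt * math.log(smoothed)

print(pair_energy(0, 100))   # never-observed pair: high but finite energy
print(pair_energy(50, 100))  # frequently observed pair: lower energy
```

The pseudocount trades a small bias on well-sampled pairs for a bounded estimate on rare ones; it does not replace a larger reference database.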
![Page 138: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/138.jpg)
What are states in protein structures?
There is a lot of freedom in defining states for protein structures. Here is one example:
[Figure: contact map, with sequence position on both axes]
Miyazawa-Jernigan Matrix
Such a matrix has been generated:
Miyazawa, S., Protein Eng. 1993 Apr;6(3):267-78
This is particularly useful for threading sequences through known structures for structure prediction purposes.
![Page 139: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/139.jpg)
What are states in protein structures?
The implementation of a distance-based energy term is trickier… but boils down to the same thing.
Knowledge-based force-field
Need to store in 4D matrices the tuple
{ (a,b), r, k }
r: distance in Euclidean space
k: distance in sequence space
[Figure: chain of residues x1 to x9, showing a pair of residues separated by a distance r in space and by k = 6 positions in sequence]
![Page 140: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/140.jpg)
What are states in protein structures?
The energy will be calculated with respect to all parameters considered.
Knowledge-based force-field
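[Figure: chain of residues x1 to x9, showing a pair of residues separated by a distance r in space and by k = 6 positions in sequence]
$$E_{Pair}(a,b,r,k) = -RT \ln \left[ \frac{n_{rk}(a,b) \,/\, \sum_i n_{rk}(a,i)}{n_{rk} \,/\, n_k} \right]$$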
![Page 141: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/141.jpg)
What are states in protein structures?
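There are implementations of this technique, such as PROSA II
http://www.came.sbg.ac.at/Services/prosa.html
Knowledge-based force-field
[Figure: chain of residues x1 to x9, showing a pair of residues separated by a distance r in space and by k = 6 positions in sequence]
$$E_{Pair}(a,b,r,k) = -RT \ln \left[ \frac{n_{rk}(a,b) \,/\, \sum_i n_{rk}(a,i)}{n_{rk} \,/\, n_k} \right]$$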
![Page 142: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/142.jpg)
What are “states” in protein structures?
The exposure of each site to the exterior is an important factor. This is often quantified as the solvent-accessible surface area (ASA).
Knowledge-based force-field
Need to store in a 2D matrix the tuple
{ a, ASA }
$$E_{a,ASA} = -RT \ln \frac{n(a, ASA)}{\sum_{i \in \{ASA\ discretization\}} n(a, i)}$$
![Page 143: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/143.jpg)
What are states in protein structures?
Ultimately, the energy of seeing a given sequence adopt a given structure can be computed as follows:
Knowledge-based force-field
$$E_{Tot} = E_{Pairs} + E_{Solv} + E_{other}$$
Caveats
The finer the parameterization, the larger the reference collection of (appropriate) structures in the database must be in order to observe every possibility many times.
The choice of the minimum set of terms needed to fully define a structure is a design-level decision.
![Page 144: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/144.jpg)
An example
Real life example of using Knowledge-based methods.
This enzyme is called Enolase. It is a key enzyme in the sugar breakdown metabolism.
If there are important terms that are forgotten, the energy values may be inadequate.
![Page 145: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/145.jpg)
An example
Real life example of using Knowledge-based methods.
The function and the composition are very tightly related.
Red: negatively charged
Blue: positively charged
Tan: hydrophobic
These are the active site residues.
![Page 146: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/146.jpg)
An example
Real life example of using Knowledge-based methods.
The critical region in this protein has radically different properties than expected in an average protein. The knowledge-based system does not account for these properties and thus the positions shown in white were poorly estimated.
The way this assessment was done quantitatively goes well beyond the scope of this course.
![Page 147: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/147.jpg)
Another example
Cubic lattice simulation
The dimensionality of the protein folding problem can be reduced by simplifying the geometric properties of the system.
Knowledge-based energy evaluation can be used as an objective function that is relevant to the physical world, without the need to fully define a system with the 6 degrees of freedom.
![Page 148: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/148.jpg)
Spectrum of strategies
Physics Knowledge
Quantum mechanics Molecular Mechanics Statistical Mechanics Homology Modeling
![Page 149: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/149.jpg)
Homology Modeling
Homology
Related by a common ancestor.
Sequence identity amongst homologous structures can be as low as 15%.
Why make models?
There is a good chance that the structural efforts will never catch up with the sequencing projects.
How?
Figure out the most probable 3D structure, given a (1D) sequence and a 3D template from a related protein.
![Page 150: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/150.jpg)
Homology Modeling
Assumption
•Regions of alignable sequence share homologous structures
•Loop regions (non-conserved residues) allow insertions and deletions without disrupting the overall structure of a protein.
[Flowchart boxes: Query sequence; Sequence similarity to solved structure?; PSI-BLAST / profile MSA; Secondary structure prediction; Fold prediction; Homology modeling; Model validation]
![Page 151: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/151.jpg)
Homology Modeling
Aligning a sequence and a structure
MSA (multiple sequence alignment) between the query and the sequence of the target structure.
Profile MSA – The query and an MSA of proteins homologous to the target structure.
Threading.
![Page 152: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/152.jpg)
Homology Modeling
Principle of threading
“Pull” a sequence through a structure such that the alignment corresponds to the frame with the best energy score.
![Page 153: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/153.jpg)
Homology Modeling
Energy evaluation for threading
Statistical mechanics is ideal in this case because physical models would require extensive simulation time to figure out the precise atomic conformation.
![Page 154: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/154.jpg)
Homology Modeling
Threading to detect correct alignments
The application GenTHREADER uses threading to perform protein fold recognition from genomic sequences.
![Page 155: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/155.jpg)
Homology Modeling
General Principle
1. Align to the sequence of a known structure.
2. Change the structure of the side-chains to match the query sequence according to the sequence alignment.
3. Model loops and variable regions.
4. Minimize energy / conformational search.
5. Check models for inconsistencies.
Feasibility
> 40% sequence identity is preferable.
25% - 40% “Twilight Zone”
< 25% Insufficient similarity in most cases.
May work only for one domain out of the whole protein.
![Page 156: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/156.jpg)
Neural Network
Anatomy of a NN:
[Figure: network diagram with input parameters, weights, and output parameters]
![Page 157: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/157.jpg)
Neural Network
Before a NN can be used, it must be trained:
Training compares the output of the NN with a known answer; the weight of each “arrow” is changed to minimize the error.
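The training idea can be sketched with the smallest possible network: a single sigmoid unit whose weights are nudged after every example to reduce the squared error. The logical-OR training set, the learning rate, and the number of epochs are assumptions chosen only to keep the example small; real secondary structure predictors use larger networks and far more data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Output of one sigmoid unit for a list of inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, n_inputs, rate=0.5, epochs=2000):
    """Gradient descent on the squared error, one example at a time."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = predict(weights, bias, inputs)
            # Derivative of 0.5 * (out - target)^2 w.r.t. the unit's input sum.
            delta = (out - target) * out * (1.0 - out)
            weights = [w - rate * delta * x for w, x in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias

# Assumed toy problem: learn the logical OR of two inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples, n_inputs=2)
for inputs, target in samples:
    print(inputs, target, round(predict(weights, bias, inputs), 2))
```

Each update moves the weights a little in the direction that lowers the error on the current example, which is the "changing the arrows" step described above.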
![Page 158: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/158.jpg)
Secondary Structure prediction
Three Generations of methods
| Generation | Method | Approach | Accuracy (ACC) |
|---|---|---|---|
| 1 (’60-’70) | GOR1 | Single character statistical information | ~57% |
| 2 (’80) | GOR3 | Local interactions | ~63% |
| 3 (’90+) | PHD | Homologous protein sequences | ~72% |
![Page 159: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/159.jpg)
Secondary Structure prediction
1st Generation
Making use of compiled frequencies of the different characters for three possible classes:
Helix (H)
Strand (S)
Coil (-)
SDFDKILVSTYSPPQARILIVM
-----SSSSSSS----HHHHHH
![Page 160: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/160.jpg)
Secondary Structure prediction
2nd Generation
Making use of compiled frequencies of the different characters for three possible classes.
Considering the periodicity and neighbors.
Sliding window analyses
SDFDKILVSTYSPPQARILIVM
-----SSSSSSS----HHHHHH
![Page 161: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/161.jpg)
Secondary Structure prediction
3rd Generation
S D F ... M
0.1 0.01 0.0 ... 0.0
0.0 0.98 0.1 ... 0.09
... ... ... ... ...
0.02 0.0 0.05 ... 0.7
Frequency vectors obtained from multiple sequence alignments.
These MSAs can be generated using BLAST or PSI-BLAST.
Also known as profiles
![Page 162: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/162.jpg)
Secondary Structure prediction
Best done using Neural Networks (or HMM… )
3rd Generation
S D F ... M
0.1 0.01 0.0 ... 0.0
0.0 0.98 0.1 ... 0.09
... ... ... ... ...
0.02 0.0 0.05 ... 0.7
H H - … S
The NN output for the profiles is scanned by a few distinct NNs using a sliding-window strategy.
Assignment is made on a “winner takes all” basis.
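The “winner takes all” step can be sketched as picking, at each position, the class with the largest output. Averaging the outputs of several networks before the assignment is an assumption made here for illustration, and the output values are made up.

```python
STATES = ("H", "S", "-")

def average_networks(network_outputs):
    """Average, position by position, the class outputs of several networks."""
    n = len(network_outputs)
    return [[sum(values) / n for values in zip(*position)]
            for position in zip(*network_outputs)]

def winner_takes_all(per_residue_outputs):
    """Assign each residue the class with the largest output value."""
    assignment = []
    for outputs in per_residue_outputs:
        best = max(range(len(STATES)), key=lambda k: outputs[k])
        assignment.append(STATES[best])
    return "".join(assignment)

# Hypothetical outputs of two networks for three residues (order: H, S, -).
net_a = [[0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]
net_b = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
assignment = winner_takes_all(average_networks([net_a, net_b]))
```

On its own, `net_a` would call the second residue a strand; after averaging with `net_b`, helix wins at that position and the assignment becomes `HH-`.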
![Page 163: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/163.jpg)
Secondary Structure prediction
Alignments grow, secondary structure prediction improves
Przybylski, Rost. 2002. Proteins, 46:197-205
Conclusions
• Using MSAs (multiple sequence alignments) significantly improves the predictions (0.72 -> 0.75).
• The larger the database used, the better. However, there is a point where the information content saturates.
• Psi-BLAST vs BLAST: BLAST may be better in some cases.
• Refining the alignment did not help.
![Page 164: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/164.jpg)
Secondary Structure prediction
Bidirectional Dynamics for protein secondary structure prediction
Baldi et al., 2000, in Sequence learning, pp. 80-114
IOHMM model
Memory evaluated experimentally at about 15 characters
![Page 165: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/165.jpg)
Secondary Structure prediction
Bidirectional Dynamics for protein secondary structure prediction
Baldi et al., 2000, in Sequence learning, pp. 80-114
Recurrent Neural Network implementation
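The bidirectional idea can be sketched with two recurrent passes over the sequence, one left to right and one right to left, so that every position has a summary of both its upstream and its downstream context. The scalar inputs, the fixed weights, and the function name are illustrative assumptions; this is not the architecture or the parameters used by Baldi et al.

```python
import math

def bidirectional_states(inputs, w_in=1.0, w_rec=0.5):
    """Return forward and backward hidden states for a sequence of scalar inputs."""
    n = len(inputs)
    forward = [0.0] * n
    backward = [0.0] * n
    state = 0.0
    for t in range(n):                 # left-to-right pass: summarizes inputs[0..t]
        state = math.tanh(w_in * inputs[t] + w_rec * state)
        forward[t] = state
    state = 0.0
    for t in reversed(range(n)):       # right-to-left pass: summarizes inputs[t..n-1]
        state = math.tanh(w_in * inputs[t] + w_rec * state)
        backward[t] = state
    return forward, backward

# Hypothetical input: a signal that appears only at the last position.
forward, backward = bidirectional_states([0.0, 0.0, 0.0, 1.0])
```

In this example the forward state at the first position is zero, while the backward state there is already non-zero: information from downstream residues reaches position 0 only through the backward pass. A prediction at position `t` would then combine `forward[t]`, `backward[t]`, and the input at `t`.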
![Page 166: Protein folding](https://reader035.fdocuments.us/reader035/viewer/2022062409/568148ad550346895db5c053/html5/thumbnails/166.jpg)
Overview
Protein folding and parallel computing.
Current simulations work for modest-sized systems.
Homology modeling and statistical mechanics.
There is a clear advantage in using the information that we already have to solve new problems.
Secondary structure prediction and artificial intelligence.
Machine learning is well suited to capturing the trends that lead to a prediction.