COMP 564: Protein Secondary Structure Prediction
Jérôme Waldispühl
School of Computer Science
McGill University
Protein Secondary Structure
Protein Secondary Structure Prediction Using Statistical Models
• Sequences determine structures
• Proteins fold into minimum energy state.
• Structures are more conserved than sequences. Two proteins with 30% identity likely share the same fold.
How to evaluate a prediction?
In 2D: the Q3 score.

Q3 = (number of correctly predicted residues / total number of residues) × 100
In 3D: The Root Mean Square Deviation (RMSD)
• First generation – single-residue statistics, Chou & Fasman (1974): some residues have strong secondary-structure preferences. Examples: Glu → α-helix, Val → β-strand.
Old methods
Difficulties
Poor accuracy – below 66% (Q3).
Q3 for strands (E): 28%–48%.
Predicted structures were too short.
Methods Accuracy Comparison
3rd generation methods
• Third generation methods reached 77% accuracy.
• They combine two new ideas:
1. A biological idea – using evolutionary information.
2. A technological idea – using neural networks.
How can evolutionary information help us?
Homologues have similar structures.
But sequences can diverge by up to 85%.
Sequences vary differently depending on the structure.
How can evolutionary information help us?
Where can we find high sequence conservation?
• In defined secondary structures.
• In segments of the protein core (more hydrophobic).
• In amphipathic helices (a periodic pattern of hydrophobic and hydrophilic residues).
Some examples:
How can evolutionary information help us?
• Predictions based on multiple alignments were made manually.
Problem:
• There isn't any well-defined algorithm!
Solution:
• Use neural networks.
Artificial Neural Network
The neural network's basic structure:
• A large number of processors – "neurons".
• Highly connected.
• Working together.
Artificial Neural Network
What does a neuron do?
• Receives "signals" from its neighbors.
• When a certain threshold is reached, it sends signals.
• Each signal has a different weight.

[Figure: a neuron receiving input signals s1, s2, s3 with weights W1, W2, W3]
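The threshold behavior described above can be sketched in a few lines; the signal values, weights, and threshold below are illustrative, not from the slides.

```python
# A minimal sketch of the threshold unit described above: the neuron sums
# its weighted input signals and fires only when the sum reaches a threshold.
# Weights and threshold are invented, illustrative values.

def neuron_output(signals, weights, threshold=1.0):
    """Return 1 if the weighted sum of input signals reaches the threshold."""
    total = sum(s * w for s, w in zip(signals, weights))
    return 1 if total >= threshold else 0

# Three input signals s1, s2, s3 with weights W1, W2, W3, as in the figure.
print(neuron_output([1.0, 0.5, 0.0], [0.8, 0.6, 0.3]))  # fires: 0.8 + 0.3 = 1.1 >= 1.0
print(neuron_output([0.2, 0.1, 0.0], [0.8, 0.6, 0.3]))  # silent: 0.22 < 1.0
```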
Artificial Neural Network General structure of ANN :
• One input layer.
• Some hidden layers.
• One output layer.
• Our ANN has a one-directional (feed-forward) flow!
Artificial Neural Network
[Figure: training set and test set are fed to the neural network; outputs are marked correct or incorrect, and errors drive back-propagation]

Network training and testing:
• Training set – inputs for which the desired output is known.
• Back-propagation – the algorithm for adjusting the strength of the neurons' signals.
• Test set – inputs used for the final test of network performance.
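The training loop above (feed training inputs, compare to the known output, back-propagate) can be illustrated on a single sigmoid neuron; the task (learning the AND function), the learning rate, and the epoch count are invented for illustration.

```python
import math
import random

# Toy illustration of the training loop: forward pass on a training set with
# known targets, comparison of output to target, and a weight update
# ("back-propagation") on a single sigmoid neuron. All settings are invented.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, targets, lr=2.0, epochs=10000):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in samples[0]]
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            delta = (y - t) * y * (1 - y)          # gradient of squared error
            w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
            b -= lr * delta
    return w, b

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
T = [0, 0, 0, 1]                                   # the AND truth table
w, b = train(X, T)
preds = [round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)) for x in X]
print(preds)  # the learned network reproduces the AND truth table
```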
Artificial Neural Network
The network is a "black box":
• Even when it succeeds, it is hard to understand how.
• It is difficult to derive an algorithm from the network.
• It is hard to deduce new scientific principles.
Structure of 3rd generation methods
Find homologues in large databases.
Create a profile representing the entire protein family.
Feed the sequence and the profile to the ANN.
Output of the ANN: secondary structure prediction.
Structure of 3rd generation methods
The ANN learning process:
Training & testing sets: proteins with known sequence & structure.
Training:
- Feed the training set to the ANN as input.
- Compare the output to the known structure.
- Apply back-propagation.
3rd generation methods - difficulties
Main problem - unwise selection of training & test sets for ANN.
• First problem – unbalanced training
Overall protein composition: • Helices - 32% • Strands - 21% • Coils – 47%
What will happen if we train the ANN with random segments ?
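A sketch of the balancing fix implied by the question above: with classes at roughly 32% / 21% / 47%, random segments under-represent strands, so one can resample equal numbers per class. The labels below are synthetic.

```python
import random

# Sketch of balanced training: helices (H), strands (E) and coils (C) occur
# at about 32% / 21% / 47%, so random segments give the network far fewer
# strand examples. Drawing equal numbers per class balances the training set.
# The labels are synthetic, generated to mimic the slide's composition.

random.seed(1)
labels = random.choices(["H", "E", "C"], weights=[32, 21, 47], k=10000)

by_class = {c: [i for i, l in enumerate(labels) if l == c] for c in "HEC"}
n = min(len(v) for v in by_class.values())            # size of rarest class
balanced = [i for c in "HEC" for i in random.sample(by_class[c], n)]

counts = {c: sum(1 for i in balanced if labels[i] == c) for c in "HEC"}
print(counts)  # each class now contributes the same number of examples
```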
3rd generation methods - difficulties
• Second problem – unwise separation between training & test proteins
What will happen if homology / correlation exists between test & training proteins?
Above 80% accuracy in testing – over-optimism!
• Third problem – similarity among test proteins.
Protein Secondary Structure Prediction Based on Position – specific Scoring Matrices
David T. Jones
PSI-PRED: a 3rd generation method based on the iterated PSI-BLAST algorithm.
PSI - BLAST
Sequence
Distant homologues
PSSM - position specific scoring matrix
• PSI-BLAST finds distant homologues. (Alternatives now exist, such as HMMER 3.0 or HHblits.)
• The PSSM is the input for PSI-PRED.
PSI-PRED ANN's architecture:
• Two ANNs working together.

[Figure: sequence + PSSM → 1st ANN → prediction → 2nd ANN → final prediction]
PSI-PRED Step 1:
• Create a PSSM from the sequence – 3 iterations of PSI-BLAST.
Step 2: 1st ANN
• Sequence + PSSM form the 1st ANN's input.

[Figure: a 15-residue window (e.g. A D C Q E I L H T S T T W Y V) slides along the sequence; output: the secondary-structure state (E/H/C) of the central amino acid]
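The windowed input above can be sketched as follows; the one-hot encoding over the 20 amino acids plus a padding symbol is an assumed stand-in for the real profile-based input.

```python
# Sketch of the window encoding implied by the slide: a 15-residue window
# slides along the sequence (plus profile columns in the real method), and
# the network predicts the H/E/C state of the central residue. Each position
# is one-hot encoded over the 20 amino acids plus a padding symbol; this
# encoding is an illustrative assumption, not the published input format.

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(seq, center, width=15):
    half = width // 2
    vec = []
    for i in range(center - half, center + half + 1):
        col = [0] * (len(AA) + 1)
        if 0 <= i < len(seq) and seq[i] in AA:
            col[AA.index(seq[i])] = 1
        else:
            col[-1] = 1                  # padding beyond the chain ends
        vec.extend(col)
    return vec

v = encode_window("ADCQEILHTSTTWYV", center=7)
print(len(v))  # 15 positions x 21 features = 315 network inputs
```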
PSI-PRED: using PSI-BLAST brings along PSI-BLAST's difficulties:
• Each iteration extends the protein family.
• The PSSM is updated accordingly.
• Non-homologues may be included.
• The result: a "misleading" PSSM.
PSI-PRED Step 3: 2nd ANN
• Why do we need a second ANN? A possible output of the 1st ANN:

seq:  A A P P L L L L M M M G I M M R R I M
pred: E E E E E C C C C C H C C C C C E E E

What's wrong with that? A one-amino-acid helix doesn't exist.
Solution: an ANN that "looks" at the whole context!
Input: output of 1st ANN. Output: final prediction.
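As a stand-in for the structure-to-structure network, the sketch below applies a simple majority filter over a window of first-stage predictions; the real 2nd ANN learns this smoothing from data, so the filter is only an illustration.

```python
# The slides motivate the 2nd ANN with a prediction like EEEEECCCCCHCCCCCEEE,
# where an isolated one-residue helix is physically implausible. As an
# illustrative stand-in (not the actual network), this applies a majority
# filter over a window of first-stage predictions.

def smooth(pred, width=5):
    half = width // 2
    out = []
    for i in range(len(pred)):
        window = pred[max(0, i - half): i + half + 1]
        out.append(max("HEC", key=window.count))
    return "".join(out)

print(smooth("EEEEECCCCCHCCCCCEEE"))  # the lone H is removed
```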
PSI-PRED training & testing:
Testing: 187 proteins with highly resolved structures.
• No structural similarities among them.
• PSI-BLAST was used to remove homologues.
Training: balanced training.
PSI-PRED – Jones's reported results: Q3 of 76%–77%.
PSI-PRED reliability numbers:
• The way the ANN tells us how confident it is in an assignment.
• Correlate with accuracy.
• Used by many methods.
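A reliability number of this kind can be sketched as the gap between the two strongest network outputs, scaled to 0–9; the exact scaling below is an assumption for illustration, not the published formula.

```python
# Sketch of a reliability number: the network's three outputs (H/E/C scores)
# are compared, and the gap between the best and second-best output is scaled
# to 0-9. The scaling is an illustrative assumption; the scores are invented.

def reliability(scores):
    top, second = sorted(scores.values(), reverse=True)[:2]
    return min(9, int(round(9 * (top - second))))

print(reliability({"H": 0.90, "E": 0.06, "C": 0.04}))  # confident assignment: 8
print(reliability({"H": 0.40, "E": 0.35, "C": 0.25}))  # uncertain assignment: 0
```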
Performance Evaluation
• Many 3rd generation methods exist today.
Which method is the best one? How do we recognize "over-optimism"?
• With 3rd generation methods, accuracy jumped by ~10%.
Performance Evaluation
Conclusion: PSI-PRED seems to be one of the most reliable methods today.
Reasons:
• Strict training & testing criteria for the ANN.
• The widest evolutionary information (PSI-BLAST profiles).
Improvements
Best results of 3rd generation methods: ~77% Q3.
The first 3rd generation method, PHD: ~72% Q3.
Sources of improvement:
• Larger protein databases.
• PSI-BLAST.
PSI-PRED broke through; many followed...
Improvements: how can we do better than that?
• Combine methods.
Example: combining the 4 best methods gives a Q3 of ~78%!
• Larger databases (?).
• Find out why certain proteins are predicted poorly.
Bibliography
• Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195-202.
• Rost B. Rising accuracy of protein secondary structure prediction. In: 'Protein Structure Determination, Analysis, and Modeling for Drug Discovery' (ed. D Chasman), New York: Dekker, pp. 207-249.
Residue Interaction Graph
• Each residue as a vertex
• Two residues interact if there is a potential clash between their rotamer atoms.
• Add an edge between each pair of interacting residues.
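The construction above can be sketched directly; the residue center coordinates and the cutoff D are invented for illustration.

```python
import math

# Sketch of the residue interaction graph: each residue is a vertex, and an
# edge is added when two residues are close enough that their rotamers could
# clash. Positions and the cutoff D are invented, illustrative values.

def interaction_graph(centers, D=6.0):
    edges = set()
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= D:
                edges.add((i, j))
    return edges

centers = [(0, 0, 0), (4, 0, 0), (9, 0, 0), (4, 5, 0)]
print(sorted(interaction_graph(centers)))  # only nearby pairs are connected
```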
Residue Interaction Graph
[Figure: an example residue interaction graph on vertices a–m]
Key Observations
• A residue interaction graph is a geometric neighborhood graph:
– Each rotamer is bound to its backbone position within a constant distance.
– There is no interaction edge between two residues if their distance exceeds D, a constant depending on the rotamer diameter.
• Residue interaction graphs are sparse!
– Any two residue centers cannot be too close: their distance is at least a constant C.
No previous algorithms exploit these features!
Tree Decomposition [Robertson & Seymour, 1986]
Greedy: minimum degree heuristic

[Figure: the example graph on vertices a–m; eliminating a minimum-degree vertex produces the bag {a, b, d}]
Tree Decomposition (Cont'd)
Tree Decomposition
Tree width is the maximal bag size minus 1.

[Figure: a tree decomposition of the example graph with bags such as {a,b,d}, {a,c,d}, {c,d,e,m}, {d,e,f,m}, {f,g,h}, {e,i,j}, {c,l,k}; adjacent bags share vertices]
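The greedy minimum-degree heuristic named above can be sketched as follows; the example graph is invented, and the resulting maximal bag size minus 1 upper-bounds the treewidth.

```python
# Sketch of the greedy minimum-degree heuristic: repeatedly pick a vertex of
# smallest degree, take it plus its neighbourhood as one bag, make the
# neighbourhood a clique (fill-in), then remove the vertex. The graph below
# is invented for illustration.

def min_degree_bags(adj):
    adj = {v: set(ns) for v, ns in adj.items()}
    bags = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
        bags.append({v} | adj[v])
        for a in adj[v]:                          # fill-in: make a clique
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return bags

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
bags = min_degree_bags(graph)
print(max(len(bag) for bag in bags) - 1)  # treewidth upper bound
```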
Side-Chain Packing Algorithm
1. Bottom-to-top: calculate the minimal energy function.
2. Top-to-bottom: extract the optimal assignment.
3. Time complexity: exponential in the tree width, linear in the graph size.
F(X_i, A(X_ir)) = min over A(X_i − X_ir) of { Score(X_i, A(X_i)) + F(X_j, A(X_ji)) + F(X_l, A(X_li)) }

where:
• F(X_i, A(X_ir)) – the score of the subtree rooted at X_i;
• Score(X_i, A(X_i)) – the score of component X_i;
• F(X_j, A(X_ji)), F(X_l, A(X_li)) – the scores of the subtrees rooted at X_j and X_l.

[Figure: a tree decomposition rooted at X_r, with node X_i, parent X_p, children X_j and X_l, and separators X_ir, X_ji, X_li]
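The bottom-to-top recursion can be illustrated in the simplest special case, where the interaction graph is itself a tree (every bag is an edge); the rotamer sets and energies below are invented.

```python
# Sketch of the bottom-to-top recursion in the special case where the residue
# interaction graph is a tree. F(i)[r] = best energy of the subtree rooted at
# residue i, given that i uses rotamer r. All energies are invented.

def tree_dp(children, self_e, pair_e, root=0):
    def F(i):
        child_tables = [F(j) for j in children.get(i, [])]
        table = {}
        for r in self_e[i]:
            total = self_e[i][r]
            for j, tab in zip(children.get(i, []), child_tables):
                # best choice of child rotamer s, given parent rotamer r
                total += min(pair_e[(i, j)][(r, s)] + tab[s] for s in tab)
            table[r] = total
        return table
    return min(F(root).values())

children = {0: [1, 2]}
self_e = {0: {"r0": 1.0, "r1": 0.0}, 1: {"r0": 0.5}, 2: {"r0": 0.0, "r1": 2.0}}
pair_e = {(0, 1): {("r0", "r0"): 0.0, ("r1", "r0"): 1.0},
          (0, 2): {("r0", "r0"): 0.0, ("r0", "r1"): 0.0,
                   ("r1", "r0"): 3.0, ("r1", "r1"): 0.0}}
print(tree_dp(children, self_e, pair_e))  # minimal total energy: 1.5
```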
Theoretical Treewidth Bounds
• For a general graph, it is NP-hard to determine the optimal treewidth.
• A residue interaction graph has a treewidth upper bound, which can be found by a low-degree polynomial-time algorithm based on the Sphere Separator Theorem [G.L. Miller et al., 1997], a generalization of the Planar Separator Theorem.
• It also has a treewidth lower bound: consider a residue interaction graph that forms a cube, with each residue at a grid point.
Result (2): An optimization problem admits a PTAS if, given an error ε (0 < ε < 1), there is a polynomial-time algorithm that obtains a solution within a factor of (1 ± ε) of the optimum.
• The problem has a PTAS if one of the following conditions is satisfied:
– All the energy terms are non-positive.
– All the pairwise energy terms have the same sign, and the lowest system energy is bounded away from 0 by a certain amount.
Chazelle et al. have proved that it is NP-complete to approximate this problem within a factor of O(N) without considering the geometric characteristics of a protein structure.
Linear & Integer Programs
• Linear programs can be solved in polynomial time.
• No polynomial-time algorithm is known for integer programs:
– Relax to a linear program and solve the linear version.
– Use branch-and-bound or branch-and-cut (may cost exponential time).
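The gap between a linear relaxation and the integer optimum can be seen on a tiny invented 0/1 program; with only two variables, exhaustive search stands in for branch-and-bound.

```python
from itertools import product

# Tiny invented 0/1 program: maximize 5x + 4y subject to 6x + 4y <= 9 with
# x, y in {0, 1}. The LP relaxation (x, y allowed anywhere in [0, 1]) has a
# higher, fractional optimum, so the integer optimum must be searched for,
# e.g. by branch-and-bound; here two variables allow exhaustive enumeration.

best = max((5 * x + 4 * y for x, y in product([0, 1], repeat=2)
            if 6 * x + 4 * y <= 9), default=None)
print(best)  # integer optimum: 5, at (x, y) = (1, 0)

# LP relaxation optimum for comparison: x = 5/6, y = 1 gives 49/6 ~ 8.17,
# so the relaxed value only upper-bounds the integer optimum.
```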
Protein Threading
• Make a structure prediction by finding an optimal alignment (placement) of a protein sequence onto each known structure (structural template).
– Alignment quality is measured by a statistics-based scoring function.
– The best overall alignment among all templates may give a structure prediction.
[Figure: the target sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE is threaded onto each structure in the template library]
Threading Model
• Each template is parsed as a chain of cores; two adjacent cores are connected by a loop. Cores are the most conserved segments of a protein.
• No gaps are allowed within a core.
• Only pairwise contacts between core residues are considered, because contacts involving loop residues are not well conserved.
• A global alignment is employed.
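The model above (ordered, ungapped cores separated by variable-length loops) can be sketched as a small dynamic program; the hydrophobicity-count score is a made-up stand-in for a statistical potential.

```python
from functools import lru_cache

# Sketch of the threading model: cores are ungapped blocks that must be
# placed on the target sequence in order, separated by loops of flexible
# length; the best placement maximizes a scoring function. The hydrophobic
# match score below is an invented stand-in for a statistical potential.

def best_threading(seq, core_lengths, score):
    @lru_cache(None)
    def dp(core_idx, pos):
        if core_idx == len(core_lengths):
            return 0.0
        length = core_lengths[core_idx]
        # leave room for the remaining cores after this placement
        last_start = len(seq) - sum(core_lengths[core_idx:])
        return max(score(seq, start, length) + dp(core_idx + 1, start + length)
                   for start in range(pos, last_start + 1))
    return dp(0, 0)

HYDROPHOBIC = set("VLIMFAW")

def score(seq, start, length):
    return sum(1.0 for c in seq[start:start + length] if c in HYDROPHOBIC)

print(best_threading("MTYKLILNGKTKG", [3, 2], score))  # best placement scores 3.0
```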
CASP5/CAFASP3 targets

Prediction difficulty (hard → easy):
CASP5:   NF | NF/FR | FR(A) | FR(H) | CM/FR | CM
CAFASP3: FR (30 targets) | HM hard (12 targets) | HM easy (20 targets)

CM: Comparative Modelling, HM: Homology Modelling, FR: Fold Recognition, NF: New Fold
RAPTOR's sensitivity on FR targets:

Category:  NF  NF/FR  FR(A)  FR(H)  CM/FR
# Targets:  5    5      6      7      7
# Correct:  0    1      2      4      6

1. RAPTOR is weak at recognizing FR(A) targets (needs improvement).
2. RAPTOR could not deal with NF targets at all (normal).
Support Vector Machine (SVM) Regression (A.J. Smola et al.)

Notation: x = (x_1, ..., x_d), a = (a_1, ..., a_d), y = (y_1, ..., y_m), b = (b_1, ..., b_m)

Linear regression: f = a·x + ε

If the relationship between f and x is not linear: y = φ(x)

SVM regression is linear regression in a high-dimensional space: f = b·y + ε′, with m >> d

⟨y_1, y_2⟩ = ⟨φ(x_1), φ(x_2)⟩ = K(x_1, x_2)   (1)

Data: (x_1, f_1), (x_2, f_2), ..., (x_n, f_n)

Condition:
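The kernel identity (1) can be checked concretely for the quadratic kernel K(x1, x2) = (x1·x2)², whose feature map φ lists all degree-2 monomials; the vectors below are invented.

```python
import itertools

# Concrete check of the kernel identity: for the quadratic kernel
# K(x1, x2) = (x1 . x2)^2, the feature map phi lists all degree-2 monomials,
# and the inner product <phi(x1), phi(x2)> equals K(x1, x2) -- so the
# high-dimensional inner product is computed without forming phi explicitly.

def phi(x):
    # explicit degree-2 feature map: one coordinate per ordered pair (i, j)
    return [x[i] * x[j] for i, j in itertools.product(range(len(x)), repeat=2)]

def K(x, z):
    return sum(a * b for a, b in zip(x, z)) ** 2

x1, x2 = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = sum(a * b for a, b in zip(phi(x1), phi(x2)))
print(lhs, K(x1, x2))  # identical values: the kernel trick in miniature
```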
Side Chain Properties

The amino acid names are colored according to their type: positively charged, negatively charged, polar but not charged, aliphatic (nonpolar), and aromatic. Amino acids that are essential to mammals are marked with an asterisk (*).

Nonpolar: A, G, I, L, M, P, V
In-between: G, A, S, T, Y, W, C, P
Positively charged: R, H, K
Aromatic: F, W, Y
Polar but not charged: N, Q, S, T
Negatively charged: D, E
Hydrophilic: N, E, Q, H, K, R, D
Hydrophobic: V, L, I, M, F

• Hydrophobic residues stay inside, while hydrophilic residues stay close to water.
• Oppositely charged amino acids can form salt bridges.
• Polar amino acids can participate in hydrogen bonding.
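The property table above supports simple lookups, e.g. testing whether two residues could form a salt bridge; this sketch encodes only the charged classes.

```python
# Sketch of a lookup based on the property table: a salt bridge requires
# oppositely charged side chains (positive R/H/K against negative D/E).

POSITIVE = set("RHK")
NEGATIVE = set("DE")

def can_salt_bridge(a, b):
    return (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE)

print(can_salt_bridge("K", "D"))  # Lys-Asp: True
print(can_salt_bridge("K", "R"))  # two positives: False
```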