COMP 564: Protein Secondary Structure Prediction
Jérôme Waldispühl
School of Computer Science
McGill University
Protein Secondary Structure
Protein Secondary Structure Prediction Using Statistical Models
• Sequences determine structures
• Proteins fold into minimum energy state.
• Structures are more conserved than sequences. Two proteins with 30% identity likely share the same fold.
How to evaluate a prediction?
In 2D: the Q3 score.

Q3 = (number of correctly predicted residues / total number of residues) × 100
In 3D: The Root Mean Square Deviation (RMSD)
• First generation – single-residue statistics, Chou & Fasman (1974): some residues have strong secondary-structure preferences. Examples: Glu → α-helix, Val → β-strand.
Old methods
Difficulties
Poor accuracy – below 66% (Q3).
Q3 for strands (E): 28%–48%.
Predicted structures were too short.
Methods Accuracy Comparison
3rd generation methods
• Third generation methods reached 77% accuracy.
• They combine two new ideas:
1. A biological idea – using evolutionary information.
2. A technological idea – using neural networks.
How can evolutionary information help us?
Homologues have similar structures.
But sequences can diverge by up to 85%.
Sequences vary differently depending on the structure.
How can evolutionary information help us?
Where can we find high sequence conservation?
• In defined secondary structures.
• In segments of the protein core (more hydrophobic).
• In amphipathic helices (a periodic pattern of hydrophobic and hydrophilic residues).
Some examples:
How can evolutionary information help us?
• Predictions based on multiple alignments were made manually.
Problem:
• There isn't any well-defined algorithm!
Solution:
• Use neural networks.
Artificial Neural Network
The neural network's basic structure:
• A large number of processors – "neurons".
• Highly connected.
• Working together.
Artificial Neural Network
What does a neuron do?
• Receives "signals" from its neighbors.
• When a certain threshold is reached, it sends signals.
• Each signal has a different weight.

[Figure: a neuron receiving input signals s1, s2, s3 with weights W1, W2, W3]
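The threshold behavior described above can be sketched in a few lines; the signal values, weights, and threshold below are illustrative, not from the slides.

```python
# A minimal sketch of the threshold unit described above: the neuron sums
# its weighted input signals and fires only when the sum reaches a threshold.
# Weights and threshold are invented, illustrative values.

def neuron_output(signals, weights, threshold=1.0):
    """Return 1 if the weighted sum of input signals reaches the threshold."""
    total = sum(s * w for s, w in zip(signals, weights))
    return 1 if total >= threshold else 0

# Three input signals s1, s2, s3 with weights W1, W2, W3, as in the figure.
print(neuron_output([1.0, 0.5, 0.0], [0.8, 0.6, 0.3]))  # fires: 0.8 + 0.3 = 1.1 >= 1.0
print(neuron_output([0.2, 0.1, 0.0], [0.8, 0.6, 0.3]))  # silent: 0.22 < 1.0
```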
Artificial Neural Network General structure of ANN :
• One input layer.
• Some hidden layers.
• One output layer.
• Our ANN has a one-directional (feed-forward) flow!
Artificial Neural Network
[Figure: training set and test set are fed to the neural network; outputs are marked correct or incorrect, and errors drive back-propagation]

Network training and testing:
• Training set – inputs for which the desired output is known.
• Back-propagation – the algorithm for adjusting the strength of the neurons' signals.
• Test set – inputs used for the final test of network performance.
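The training loop above (feed training inputs, compare to the known output, back-propagate) can be illustrated on a single sigmoid neuron; the task (learning the AND function), the learning rate, and the epoch count are invented for illustration.

```python
import math
import random

# Toy illustration of the training loop: forward pass on a training set with
# known targets, comparison of output to target, and a weight update
# ("back-propagation") on a single sigmoid neuron. All settings are invented.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, targets, lr=2.0, epochs=10000):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in samples[0]]
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            delta = (y - t) * y * (1 - y)          # gradient of squared error
            w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
            b -= lr * delta
    return w, b

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
T = [0, 0, 0, 1]                                   # the AND truth table
w, b = train(X, T)
preds = [round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)) for x in X]
print(preds)  # the learned network reproduces the AND truth table
```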
Artificial Neural Network
The network is a "black box":
• Even when it succeeds, it is hard to understand how.
• It is difficult to derive an algorithm from the network.
• It is hard to deduce new scientific principles.
Structure of 3rd generation methods
Find homologues in large databases.
Create a profile representing the entire protein family.
Feed the sequence and the profile to the ANN.
Output of the ANN: secondary structure prediction.
Structure of 3rd generation methods
The ANN learning process:
Training & testing sets: proteins with known sequence & structure.
Training:
- Feed the training set to the ANN as input.
- Compare the output to the known structure.
- Apply back-propagation.
3rd generation methods - difficulties
Main problem - unwise selection of training & test sets for ANN.
• First problem – unbalanced training
Overall protein composition: • Helices - 32% • Strands - 21% • Coils – 47%
What will happen if we train the ANN with random segments ?
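A sketch of the balancing fix implied by the question above: with classes at roughly 32% / 21% / 47%, random segments under-represent strands, so one can resample equal numbers per class. The labels below are synthetic.

```python
import random

# Sketch of balanced training: helices (H), strands (E) and coils (C) occur
# at about 32% / 21% / 47%, so random segments give the network far fewer
# strand examples. Drawing equal numbers per class balances the training set.
# The labels are synthetic, generated to mimic the slide's composition.

random.seed(1)
labels = random.choices(["H", "E", "C"], weights=[32, 21, 47], k=10000)

by_class = {c: [i for i, l in enumerate(labels) if l == c] for c in "HEC"}
n = min(len(v) for v in by_class.values())            # size of rarest class
balanced = [i for c in "HEC" for i in random.sample(by_class[c], n)]

counts = {c: sum(1 for i in balanced if labels[i] == c) for c in "HEC"}
print(counts)  # each class now contributes the same number of examples
```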
3rd generation methods - difficulties
• Second problem – unwise separation between training & test proteins
What will happen if homology / correlation exists between test & training proteins?
Above 80% accuracy in testing – over-optimism!
• Third problem – similarity among test proteins.
Protein Secondary Structure Prediction Based on Position – specific Scoring Matrices
David T. Jones
PSI-PRED: a 3rd generation method based on the iterated PSI-BLAST algorithm.
PSI - BLAST
Sequence
Distant homologues
PSSM - position specific scoring matrix
• PSI-BLAST finds distant homologues. (Alternatives now exist, such as HMMER 3.0 or HHblits.)
• The PSSM is the input for PSI-PRED.
PSI-PRED ANN's architecture:
• Two ANNs working together.

[Figure: sequence + PSSM → 1st ANN → prediction → 2nd ANN → final prediction]
PSI-PRED Step 1:
• Create a PSSM from the sequence – 3 iterations of PSI-BLAST.
Step 2: 1st ANN
• Sequence + PSSM form the 1st ANN's input.

[Figure: a 15-residue window (e.g. A D C Q E I L H T S T T W Y V) slides along the sequence; output: the secondary-structure state (E/H/C) of the central amino acid]
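The windowed input above can be sketched as follows; the one-hot encoding over the 20 amino acids plus a padding symbol is an assumed stand-in for the real profile-based input.

```python
# Sketch of the window encoding implied by the slide: a 15-residue window
# slides along the sequence (plus profile columns in the real method), and
# the network predicts the H/E/C state of the central residue. Each position
# is one-hot encoded over the 20 amino acids plus a padding symbol; this
# encoding is an illustrative assumption, not the published input format.

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(seq, center, width=15):
    half = width // 2
    vec = []
    for i in range(center - half, center + half + 1):
        col = [0] * (len(AA) + 1)
        if 0 <= i < len(seq) and seq[i] in AA:
            col[AA.index(seq[i])] = 1
        else:
            col[-1] = 1                  # padding beyond the chain ends
        vec.extend(col)
    return vec

v = encode_window("ADCQEILHTSTTWYV", center=7)
print(len(v))  # 15 positions x 21 features = 315 network inputs
```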
PSI-PRED: using PSI-BLAST brings along PSI-BLAST's difficulties:
• Each iteration extends the protein family.
• The PSSM is updated accordingly.
• Non-homologues may be included.
• The result: a "misleading" PSSM.
PSI-PRED Step 3: 2nd ANN
• Why do we need a second ANN? A possible output of the 1st ANN:

seq:  A A P P L L L L M M M G I M M R R I M
pred: E E E E E C C C C C H C C C C C E E E

What's wrong with that? A one-amino-acid helix doesn't exist.
Solution: an ANN that "looks" at the whole context!
Input: output of 1st ANN. Output: final prediction.
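As a stand-in for the structure-to-structure network, the sketch below applies a simple majority filter over a window of first-stage predictions; the real 2nd ANN learns this smoothing from data, so the filter is only an illustration.

```python
# The slides motivate the 2nd ANN with a prediction like EEEEECCCCCHCCCCCEEE,
# where an isolated one-residue helix is physically implausible. As an
# illustrative stand-in (not the actual network), this applies a majority
# filter over a window of first-stage predictions.

def smooth(pred, width=5):
    half = width // 2
    out = []
    for i in range(len(pred)):
        window = pred[max(0, i - half): i + half + 1]
        out.append(max("HEC", key=window.count))
    return "".join(out)

print(smooth("EEEEECCCCCHCCCCCEEE"))  # the lone H is removed
```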
PSI-PRED training & testing:
Testing: 187 proteins with highly resolved structures.
• No structural similarities among them.
• PSI-BLAST was used to remove homologues.
Training: balanced training.
PSI-PRED – Jones's reported results: Q3 of 76%–77%.
PSI-PRED reliability numbers:
• The way the ANN tells us how confident it is in an assignment.
• Correlate with accuracy.
• Used by many methods.
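A reliability number of this kind can be sketched as the gap between the two strongest network outputs, scaled to 0–9; the exact scaling below is an assumption for illustration, not the published formula.

```python
# Sketch of a reliability number: the network's three outputs (H/E/C scores)
# are compared, and the gap between the best and second-best output is scaled
# to 0-9. The scaling is an illustrative assumption; the scores are invented.

def reliability(scores):
    top, second = sorted(scores.values(), reverse=True)[:2]
    return min(9, int(round(9 * (top - second))))

print(reliability({"H": 0.90, "E": 0.06, "C": 0.04}))  # confident assignment: 8
print(reliability({"H": 0.40, "E": 0.35, "C": 0.25}))  # uncertain assignment: 0
```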
Performance Evaluation
• Many 3rd generation methods exist today.
Which method is the best one? How do we recognize "over-optimism"?
• With 3rd generation methods, accuracy jumped by ~10%.
Performance Evaluation
Conclusion: PSI-PRED seems to be one of the most reliable methods today.
Reasons:
• Strict training & testing criteria for the ANN.
• The widest evolutionary information (PSI-BLAST profiles).
Improvements
Best results of 3rd generation methods: ~77% Q3.
The first 3rd generation method, PHD: ~72% Q3.
Sources of improvement:
• Larger protein databases.
• PSI-BLAST.
PSI-PRED broke through; many followed...
Improvements: how can we do better than that?
• Combine methods.
Example: combining the 4 best methods gives a Q3 of ~78%!
• Larger databases (?).
• Find out why certain proteins are predicted poorly.
Bibliography
• Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195-202.
• Rost B. Rising accuracy of protein secondary structure prediction. In: 'Protein Structure Determination, Analysis, and Modeling for Drug Discovery' (ed. D Chasman), New York: Dekker, pp. 207-249.
Residue Interaction Graph
• Each residue as a vertex
• Two residues interact if there is a potential clash between their rotamer atoms.
• Add an edge between each pair of interacting residues.
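The construction above can be sketched directly; the residue center coordinates and the cutoff D are invented for illustration.

```python
import math

# Sketch of the residue interaction graph: each residue is a vertex, and an
# edge is added when two residues are close enough that their rotamers could
# clash. Positions and the cutoff D are invented, illustrative values.

def interaction_graph(centers, D=6.0):
    edges = set()
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= D:
                edges.add((i, j))
    return edges

centers = [(0, 0, 0), (4, 0, 0), (9, 0, 0), (4, 5, 0)]
print(sorted(interaction_graph(centers)))  # only nearby pairs are connected
```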
Residue Interaction Graph
[Figure: an example residue interaction graph on vertices a–m]
Key Observations
• A residue interaction graph is a geometric neighborhood graph:
– Each rotamer is bound to its backbone position within a constant distance.
– There is no interaction edge between two residues if their distance exceeds D, a constant depending on the rotamer diameter.
• Residue interaction graphs are sparse!
– Any two residue centers cannot be too close: their distance is at least a constant C.
No previous algorithms exploit these features!
Tree Decomposition [Robertson & Seymour, 1986]
Greedy: minimum degree heuristic

[Figure: the example graph on vertices a–m; eliminating a minimum-degree vertex produces the bag {a, b, d}]
Tree Decomposition (Cont'd)
Tree Decomposition
Tree width is the maximal bag size minus 1.

[Figure: a tree decomposition of the example graph with bags such as {a,b,d}, {a,c,d}, {c,d,e,m}, {d,e,f,m}, {f,g,h}, {e,i,j}, {c,l,k}; adjacent bags share vertices]
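The greedy minimum-degree heuristic named above can be sketched as follows; the example graph is invented, and the resulting maximal bag size minus 1 upper-bounds the treewidth.

```python
# Sketch of the greedy minimum-degree heuristic: repeatedly pick a vertex of
# smallest degree, take it plus its neighbourhood as one bag, make the
# neighbourhood a clique (fill-in), then remove the vertex. The graph below
# is invented for illustration.

def min_degree_bags(adj):
    adj = {v: set(ns) for v, ns in adj.items()}
    bags = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
        bags.append({v} | adj[v])
        for a in adj[v]:                          # fill-in: make a clique
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return bags

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
bags = min_degree_bags(graph)
print(max(len(bag) for bag in bags) - 1)  # treewidth upper bound
```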
Side-Chain Packing Algorithm
1. Bottom-to-top: calculate the minimal energy function.
2. Top-to-bottom: extract the optimal assignment.
3. Time complexity: exponential in the tree width, linear in the graph size.
F(X_i, A(X_ir)) = min over A(X_i − X_ir) of { Score(X_i, A(X_i)) + F(X_j, A(X_ji)) + F(X_l, A(X_li)) }

where:
• F(X_i, A(X_ir)) – the score of the subtree rooted at X_i;
• Score(X_i, A(X_i)) – the score of component X_i;
• F(X_j, A(X_ji)), F(X_l, A(X_li)) – the scores of the subtrees rooted at X_j and X_l.

[Figure: a tree decomposition rooted at X_r, with node X_i, parent X_p, children X_j and X_l, and separators X_ir, X_ji, X_li]
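The bottom-to-top recursion can be illustrated in the simplest special case, where the interaction graph is itself a tree (every bag is an edge); the rotamer sets and energies below are invented.

```python
# Sketch of the bottom-to-top recursion in the special case where the residue
# interaction graph is a tree. F(i)[r] = best energy of the subtree rooted at
# residue i, given that i uses rotamer r. All energies are invented.

def tree_dp(children, self_e, pair_e, root=0):
    def F(i):
        child_tables = [F(j) for j in children.get(i, [])]
        table = {}
        for r in self_e[i]:
            total = self_e[i][r]
            for j, tab in zip(children.get(i, []), child_tables):
                # best choice of child rotamer s, given parent rotamer r
                total += min(pair_e[(i, j)][(r, s)] + tab[s] for s in tab)
            table[r] = total
        return table
    return min(F(root).values())

children = {0: [1, 2]}
self_e = {0: {"r0": 1.0, "r1": 0.0}, 1: {"r0": 0.5}, 2: {"r0": 0.0, "r1": 2.0}}
pair_e = {(0, 1): {("r0", "r0"): 0.0, ("r1", "r0"): 1.0},
          (0, 2): {("r0", "r0"): 0.0, ("r0", "r1"): 0.0,
                   ("r1", "r0"): 3.0, ("r1", "r1"): 0.0}}
print(tree_dp(children, self_e, pair_e))  # minimal total energy: 1.5
```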
Theoretical Treewidth Bounds
• For a general graph, it is NP-hard to determine the optimal treewidth.
• A residue interaction graph has a treewidth upper bound, which can be found by a low-degree polynomial-time algorithm based on the Sphere Separator Theorem [G.L. Miller et al., 1997], a generalization of the Planar Separator Theorem.
• It also has a treewidth lower bound: consider a residue interaction graph that forms a cube, with each residue at a grid point.
Result (2): An optimization problem admits a PTAS if, given an error ε (0 < ε < 1), there is a polynomial-time algorithm that obtains a solution within a factor of (1 ± ε) of the optimum.
• The problem has a PTAS if one of the following conditions is satisfied:
– All the energy terms are non-positive.
– All the pairwise energy terms have the same sign, and the lowest system energy is bounded away from 0 by a certain amount.
Chazelle et al. have proved that it is NP-complete to approximate this problem within a factor of O(N) without considering the geometric characteristics of a protein structure.
Linear & Integer Programs
• Linear programs can be solved in polynomial time.
• No polynomial-time algorithm is known for integer programs:
– Relax to a linear program and solve the linear version.
– Use branch-and-bound or branch-and-cut (may cost exponential time).
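The gap between a linear relaxation and the integer optimum can be seen on a tiny invented 0/1 program; with only two variables, exhaustive search stands in for branch-and-bound.

```python
from itertools import product

# Tiny invented 0/1 program: maximize 5x + 4y subject to 6x + 4y <= 9 with
# x, y in {0, 1}. The LP relaxation (x, y allowed anywhere in [0, 1]) has a
# higher, fractional optimum, so the integer optimum must be searched for,
# e.g. by branch-and-bound; here two variables allow exhaustive enumeration.

best = max((5 * x + 4 * y for x, y in product([0, 1], repeat=2)
            if 6 * x + 4 * y <= 9), default=None)
print(best)  # integer optimum: 5, at (x, y) = (1, 0)

# LP relaxation optimum for comparison: x = 5/6, y = 1 gives 49/6 ~ 8.17,
# so the relaxed value only upper-bounds the integer optimum.
```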
Protein Threading
• Make a structure prediction by finding an optimal alignment (placement) of a protein sequence onto each known structure (structural template).
– Alignment quality is measured by a statistics-based scoring function.
– The best overall alignment among all templates may give a structure prediction.
[Figure: the target sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE is threaded onto each structure in the template library]
Threading Model
• Each template is parsed as a chain of cores; two adjacent cores are connected by a loop. Cores are the most conserved segments of a protein.
• No gaps are allowed within a core.
• Only pairwise contacts between core residues are considered, because contacts involving loop residues are not well conserved.
• A global alignment is employed.
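The model above (ordered, ungapped cores separated by variable-length loops) can be sketched as a small dynamic program; the hydrophobicity-count score is a made-up stand-in for a statistical potential.

```python
from functools import lru_cache

# Sketch of the threading model: cores are ungapped blocks that must be
# placed on the target sequence in order, separated by loops of flexible
# length; the best placement maximizes a scoring function. The hydrophobic
# match score below is an invented stand-in for a statistical potential.

def best_threading(seq, core_lengths, score):
    @lru_cache(None)
    def dp(core_idx, pos):
        if core_idx == len(core_lengths):
            return 0.0
        length = core_lengths[core_idx]
        # leave room for the remaining cores after this placement
        last_start = len(seq) - sum(core_lengths[core_idx:])
        return max(score(seq, start, length) + dp(core_idx + 1, start + length)
                   for start in range(pos, last_start + 1))
    return dp(0, 0)

HYDROPHOBIC = set("VLIMFAW")

def score(seq, start, length):
    return sum(1.0 for c in seq[start:start + length] if c in HYDROPHOBIC)

print(best_threading("MTYKLILNGKTKG", [3, 2], score))  # best placement scores 3.0
```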
CASP5/CAFASP3 targets

Prediction difficulty (hard → easy):
CASP5:   NF | NF/FR | FR(A) | FR(H) | CM/FR | CM
CAFASP3: FR (30 targets) | HM hard (12 targets) | HM easy (20 targets)

CM: Comparative Modelling, HM: Homology Modelling, FR: Fold Recognition, NF: New Fold
RAPTOR's sensitivity on FR targets:

Category:  NF  NF/FR  FR(A)  FR(H)  CM/FR
# Targets:  5    5      6      7      7
# Correct:  0    1      2      4      6

1. RAPTOR is weak at recognizing FR(A) targets (needs improvement).
2. RAPTOR could not deal with NF targets at all (normal).
Support Vector Machine (SVM) Regression (A.J. Smola et al.)

Notation: x = (x_1, ..., x_d), a = (a_1, ..., a_d), y = (y_1, ..., y_m), b = (b_1, ..., b_m)

Linear regression: f = a·x + ε

If the relationship between f and x is not linear: y = φ(x)

SVM regression is linear regression in a high-dimensional space: f = b·y + ε′, with m >> d

⟨y_1, y_2⟩ = ⟨φ(x_1), φ(x_2)⟩ = K(x_1, x_2)   (1)

Data: (x_1, f_1), (x_2, f_2), ..., (x_n, f_n)

Condition:
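The kernel identity (1) can be checked concretely for the quadratic kernel K(x1, x2) = (x1·x2)², whose feature map φ lists all degree-2 monomials; the vectors below are invented.

```python
import itertools

# Concrete check of the kernel identity: for the quadratic kernel
# K(x1, x2) = (x1 . x2)^2, the feature map phi lists all degree-2 monomials,
# and the inner product <phi(x1), phi(x2)> equals K(x1, x2) -- so the
# high-dimensional inner product is computed without forming phi explicitly.

def phi(x):
    # explicit degree-2 feature map: one coordinate per ordered pair (i, j)
    return [x[i] * x[j] for i, j in itertools.product(range(len(x)), repeat=2)]

def K(x, z):
    return sum(a * b for a, b in zip(x, z)) ** 2

x1, x2 = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = sum(a * b for a, b in zip(phi(x1), phi(x2)))
print(lhs, K(x1, x2))  # identical values: the kernel trick in miniature
```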
Side Chain Properties

The amino acid names are colored according to their type: positively charged, negatively charged, polar but not charged, aliphatic (nonpolar), and aromatic. Amino acids that are essential to mammals are marked with an asterisk (*).

Nonpolar: A, G, I, L, M, P, V
In-between: G, A, S, T, Y, W, C, P
Positively charged: R, H, K
Aromatic: F, W, Y
Polar but not charged: N, Q, S, T
Negatively charged: D, E
Hydrophilic: N, E, Q, H, K, R, D
Hydrophobic: V, L, I, M, F

• Hydrophobic residues stay inside, while hydrophilic residues stay close to water.
• Oppositely charged amino acids can form salt bridges.
• Polar amino acids can participate in hydrogen bonding.
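The property table above supports simple lookups, e.g. testing whether two residues could form a salt bridge; this sketch encodes only the charged classes.

```python
# Sketch of a lookup based on the property table: a salt bridge requires
# oppositely charged side chains (positive R/H/K against negative D/E).

POSITIVE = set("RHK")
NEGATIVE = set("DE")

def can_salt_bridge(a, b):
    return (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE)

print(can_salt_bridge("K", "D"))  # Lys-Asp: True
print(can_salt_bridge("K", "R"))  # two positives: False
```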