Eugene Krissinel, CCP4, STFC Research Complex at Harwell, Didcot, United Kingdom
EMBL-EBI
MSDfold (SSM)
A web service for protein structure comparison and structure searches
Eugene Krissinel
http://www.ebi.ac.uk/msd-srv/ssm/ssmstart.html
Structure alignment
Structure alignment may be defined as the identification of residues occupying “equivalent” geometrical positions.
Unlike in sequence alignment, residue type is neglected.
Used for: measuring structural similarity; protein classification and functional analysis; database searches.
Methods
Many methods are known:
  Distance matrix alignment (DALI, Holm & Sander, EBI)
  Vector alignment (VAST, Bryant et al., NCBI)
  Depth-first recursive search on SSEs (DEJAVU, Madsen & Kleywegt, Uppsala)
  Combinatorial extension (CE, Shindyalov & Bourne, SDSC)
  Dynamic programming on Cα (Gerstein & Levitt)
  Dynamic programming on SSEs (SSA, Singh & Brutlag, Stanford University)
  many others
SSM employs a 2-step procedure:
  A. Initial structure alignment and superposition using SSE graph matching
  B. Cα alignment
Graph representation of SSEs
E. M. Mitchell et al. (1990) J. Mol. Biol. 212:151
SSE graphs differ from conventional chemical graphs only in that they are labelled by vectors of properties. In graph matching, the labels are compared with tolerances chosen empirically.
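Comparing vertex labels with a tolerance can be sketched as follows. This is only an illustration: the label fields (SSE type and length in residues), the relative tolerance of 0.3 and the names SSE and labels_match are assumptions made for the example, not the labels or tolerances used by SSM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSE:
    """Hypothetical vertex label: SSE type and length in residues."""
    kind: str    # "H" for helix, "S" for strand
    length: int


def labels_match(a: SSE, b: SSE, rel_tol: float = 0.3) -> bool:
    """Return True if two labels agree in type and, within rel_tol, in length."""
    if a.kind != b.kind:
        return False
    longer = max(a.length, b.length)
    return abs(a.length - b.length) <= rel_tol * longer


print(labels_match(SSE("H", 12), SSE("H", 10)))  # same type, similar length
print(labels_match(SSE("H", 10), SSE("S", 10)))  # helix vs strand
```

In a full graph matcher the same kind of test would also be applied to edge labels (for example distances and angles between SSEs).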
SSE graph matching
[Figure: SSE graphs of structure A (helices H1-H2, strands S1-S4) and structure B (helices H1-H6, strands S1-S7), shown separately and matched]
Matching the SSE graphs yields a correspondence between secondary structure elements, that is, groups of residues. The correspondence may be used as an initial guess for structure superposition and alignment of individual residues.
Cα alignment
[Figure: chains A and B with matched helices and matched strands]
SSE alignment is used as an initial guess for Cα alignment.
Cα alignment is an iterative procedure based on the expansion of shortest contacts at the best superposition of structures.
Cα alignment is a compromise between the alignment length N_align and the r.m.s.d. Longest contacts are unmapped in order to maximise the Q-score:
$$Q = \frac{N_{align}^2}{\left(1 + \left(\mathrm{r.m.s.d.}/R_0\right)^2\right) N_A N_B}$$
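The trade-off between alignment length and r.m.s.d. can be illustrated with a small sketch. It assumes a fixed superposition (no re-superposition after contacts are dropped), a list of contact distances given in ångströms, and R0 = 3 Å. The function names and the toy distances are hypothetical, and the real procedure is iterative.

```python
import math


def q_score(n_align, rmsd, n_a, n_b, r0=3.0):
    """Q = N_align^2 / ((1 + (rmsd / R0)^2) * N_A * N_B); r0 = 3 A is an assumption."""
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n_a * n_b)


def best_contact_subset(distances, n_a, n_b, r0=3.0):
    """Keep the k shortest contacts that maximise Q; return (k, rmsd, q)."""
    best = (0, 0.0, 0.0)
    sum_sq = 0.0
    for k, d in enumerate(sorted(distances), start=1):
        sum_sq += d * d
        rmsd = math.sqrt(sum_sq / k)
        q = q_score(k, rmsd, n_a, n_b, r0)
        if q > best[2]:
            best = (k, rmsd, q)
    return best


# Toy contacts: four short ones and one long outlier, for two 5-residue chains
k, rmsd, q = best_contact_subset([0.5, 0.8, 1.0, 1.2, 9.0], n_a=5, n_b=5)
print(k, round(rmsd, 3), round(q, 3))
```

With these toy distances the long 9 Å contact is left unmapped, because keeping it would raise the r.m.s.d. more than the extra aligned pair is worth.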
Multiple structure alignment
More than 2 structures are aligned simultaneously.
Multiple alignment is not equal to the set of all-to-all pairwise alignments.
Helps to identify common structure motifs for a whole family of structures.
Iterative removal of non-aligning SSEs
[Figure: best pairwise alignments between structures A, B and C]
Helices may be multiply aligned from pairwise relations.
Strands do not multiply align, but one can still try to align them by probing alternative (not best) alignments.
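Iterative removal of non-aligning SSEs
[Figure: structures A, B and C connected by alternative pairwise alignments labelled 1 and 2]
4 alternative pairwise alignments make up to 4 multiple alignments:
  A1 - B1 - C1
  A1 - B2 - C1
  A2 - B1 - C1
  A2 - B2 - C1
Complexity $O\left(\prod_{i=1}^{N} n_i\right)$, with $n_i$ alternatives for structure $i$, is prohibitive for $N \gtrsim 15$ to $20$ structures.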
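The combinatorial growth can be seen by enumerating the alternatives directly. The sketch below uses the A1/A2, B1/B2, C1 labels from the slide. The dictionary layout and the 20-structure extrapolation (2 alternatives each) are assumptions made for illustration.

```python
from itertools import product
from math import prod

# Alternative alignments per structure, as labelled on the slide
alternatives = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1"]}

# Every multiple alignment picks one alternative per structure
combos = [" - ".join(choice) for choice in product(*alternatives.values())]
for combo in combos:
    print(combo)

# The number of candidates is the product of the alternatives per structure
n_candidates = prod(len(v) for v in alternatives.values())
print(n_candidates)

# Hypothetical larger case: 20 structures with 2 alternatives each
print(2 ** 20)
```

Exhaustive enumeration is fine for three structures but grows multiplicatively, which is why a heuristic is needed for larger families.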
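Iterative removal of non-aligning SSEs
Heuristics: remove the non-aligning SSE with the lowest alignment score (expressed through Q-scores $Q_{ij}$) and reiterate all alignments.
[Figure: structures A, B and C connected by alternative pairwise alignments labelled 1 and 2]
Flowchart:
  1. Start.
  2. Calculate all-to-all pairwise alignments.
  3. Are there non-aligning SSEs? If no, quit.
  4. If yes, remove the one non-aligning SSE with the lowest score and return to step 2.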
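The loop in the flowchart can be sketched as below. The alignment step is replaced by a toy stand-in: an SSE label counts as "aligning" only if it occurs in every structure, and the scores are invented. The names iterative_removal, toy_non_aligning and TOY_SCORES are hypothetical and do not come from SSM.

```python
def iterative_removal(structures, find_non_aligning):
    """Drop non-aligning SSEs one at a time until none remain.

    structures: dict mapping structure name -> list of SSE labels.
    find_non_aligning: callback that recomputes all pairwise alignments and
        returns {(name, sse): score} for the SSEs that fail to align.
    """
    current = {name: list(sses) for name, sses in structures.items()}
    while True:
        non_aligning = find_non_aligning(current)
        if not non_aligning:
            return current
        # Remove only the single worst-scoring SSE, then realign everything
        name, sse = min(non_aligning, key=non_aligning.get)
        current[name].remove(sse)


# Toy stand-in for the alignment step (invented scores)
TOY_SCORES = {"S2": 0.1, "H9": 0.4}


def toy_non_aligning(structures):
    common = set.intersection(*(set(s) for s in structures.values()))
    return {(name, sse): TOY_SCORES.get(sse, 0.0)
            for name, sses in structures.items()
            for sse in sses if sse not in common}


family = {"A": ["H1", "S1", "S2"], "B": ["H1", "S1", "H9"], "C": ["H1", "S1"]}
result = iterative_removal(family, toy_non_aligning)
print(result)
```

Removing one SSE per pass is slower than removing all offenders at once, but it follows the flowchart: every removal can change which of the remaining SSEs align.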
Multiple Cα refinement: central star & consensus
[Figure: structures A, B and C around the consensus structure X]
Flowchart:
  1. Multiple SSE alignment.
  2. Initial Cα alignment.
  3. Superpose structures and calculate the consensus structure X.
  4. Choose the structure closest to X as the central star and align all the rest to it.
  5. Unmap groups of atoms with the highest distance score D in order to maximise the score
     $$Q = \frac{N_{align}^2}{\left(1 + (D/R_0)^2\right) N_1 N_2}$$
  6. Score improved? If yes, repeat; if no, quit.
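Choosing the central star can be illustrated with a minimal sketch. It assumes the structures are already superposed and reduced to the same number of equivalent atoms, takes the consensus as the position-wise average, and measures closeness by r.m.s.d. These are simplifying assumptions, and the function names are hypothetical.

```python
import math


def consensus(structures):
    """Average the coordinates of equivalent atoms, position by position."""
    n = len(structures)
    return [tuple(sum(s[i][k] for s in structures) / n for k in range(3))
            for i in range(len(structures[0]))]


def rmsd(a, b):
    """Root-mean-square distance between two equally long coordinate lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def central_star(structures):
    """Index of the structure closest to the consensus."""
    x = consensus(structures)
    return min(range(len(structures)), key=lambda i: rmsd(structures[i], x))


# Toy coordinates: three two-atom "structures" shifted along y
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.3, 0.0), (1.0, 0.3, 0.0)]
c = [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0)]
print(central_star([a, b, c]))
```

In the toy example the middle structure b is picked, since it lies nearest to the averaged coordinates.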
Pairwise Alignment vs. Multiple Alignment
Best pairwise alignment of 1SAR:A and 1D1F:B includes only the β-sheet.
Addition of 1MGW:A (a close neighbour of 1SAR:A) reveals a common motif of β-sheet and α-helix.
SSM server map
http://www.ebi.ac.uk/msd-srv/ssm
SSM output
  Table of matched Secondary Structure Elements
  Table of matched backbone Cα atoms with distances between them at best structure superposition
  Rotation-translation matrix of best structure superposition
  Visualisation in Jmol and Rasmol
  r.m.s.d. of Cα alignment
  Length of Cα alignment, N_align
  Number of gaps in Cα alignment
  Quality score Q
  Statistical significance scores P(S), Z
  Sequence identity
Statistical significance of alignments
P-value is estimated using Q-scores of SSE deviations.
P(S) is the probability of getting a score equal to S or higher when picking structures from the PDB at random.
P(S) is calibrated on SCOP folds.
P(S) is often expressed through the Z-score.
[Figure: SSE deviations x_1 ... x_i ... x_n and y_0]
[Equation: P(S) expressed through Z]
[Equation: definition of the score S]
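The definition of P(S) as the probability of reaching a score of S or higher at random can be illustrated with an empirical estimate over a background sample of scores. This is not the calibrated model used by the server. The function name and the background values are invented for the example.

```python
def empirical_p_value(score, background):
    """Fraction of background scores that are equal to or higher than score."""
    if not background:
        raise ValueError("background sample must not be empty")
    hits = sum(1 for s in background if s >= score)
    return hits / len(background)


# Invented background: scores of five random structure pairs
background = [0.1, 0.2, 0.3, 0.4, 0.5]
print(empirical_p_value(0.4, background))
```

A small P means that few random pairs score as well as the alignment in question.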
Scoring at low structural similarity: 1KNO:A vs SCOP 1.61
  Criterion        Hit      Size     Q-score  RMSD  Nalign  P
  Maximal Q-score  d1di2a_  69 res   0.213    2.43  67/184  0.55
  Lowest RMSD      d1emn_1  43 res   0.019    0.9   13/184  0.075
  Highest Nalign   d1elxb_  449 res  0.02     5.82  89/184  ~1
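The three hits can be recomputed from their RMSD and Nalign values as a worked check of the Q-score. Two things are assumed here rather than stated on the slide: R0 = 3 Å, and 184 residues as the length of the query chain 1KNO:A (taken from the Nalign denominators). Under these assumptions the tabulated Q-scores are reproduced to within rounding.

```python
def q_score(n_align, rmsd, n1, n2, r0=3.0):
    """Q = N_align^2 / ((1 + (rmsd / R0)^2) * N1 * N2); r0 = 3 A is an assumption."""
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n1 * n2)


QUERY_LENGTH = 184  # assumed length of 1KNO:A, from the Nalign column

# hit -> (Nalign, RMSD, hit size in residues), copied from the table
hits = {
    "d1di2a_": (67, 2.43, 69),
    "d1emn_1": (13, 0.9, 43),
    "d1elxb_": (89, 5.82, 449),
}

scores = {name: q_score(n_align, rmsd, size, QUERY_LENGTH)
          for name, (n_align, rmsd, size) in hits.items()}

for name, q in scores.items():
    print(name, round(q, 3))

best = max(scores, key=scores.get)
print(best)
```

The hit with the lowest RMSD and the hit with the highest Nalign both score about ten times lower than the maximal-Q hit, which is the point of the slide: neither criterion alone is a good measure at low similarity.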
Performance data
[Figure: server usage by day from June 17, 2002 (0 to 1200), logarithmic vertical axes: N queries; CPU used [secs]. Curves: total queries, jobs per query, total CPU, delivery time per query, CPU/delivery. Plot annotations: 4, 1, 50 s]
Sequence and Structure Alignments
Sequence alignment
  Based on residue identity, sometimes with a modified alphabet
    --AARNEDDDGKMPSTF-L
    E-AARNFG-DGK--STFIL
  Used for: evolution studies; protein function analysis; guessing on structure similarity
  Algorithms: dynamic programming + heuristics
  Applications: BLAST, FASTA, FLASH and others
Structure alignment
  Based on geometrical equivalence of residue positions, residue type disregarded
  Used for: protein function analysis; some aspects of evolution studies
  Algorithms: dynamic programming, graph theory, MC, geometric hashing and others
  Applications: DALI, VAST, CE, MASS, SSM and others
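Residue identity in a gapped pairwise alignment can be counted as sketched below. The example assumes that the alignment string on the slide is two rows of 19 columns each, and it takes identity over columns where both rows hold a residue. Other normalisations exist, and the function name is hypothetical.

```python
def pairwise_identity(row_a, row_b, gap="-"):
    """Return (identical, aligned) counts for two rows of a gapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    identical = sum(1 for x, y in aligned if x == y)
    return identical, len(aligned)


# The slide's alignment, assumed to split into two rows of 19 columns
row_a = "--AARNEDDDGKMPSTF-L"
row_b = "E-AARNFG-DGK--STFIL"

identical, aligned = pairwise_identity(row_a, row_b)
print(identical, aligned, round(100 * identical / aligned, 1))
```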
Sequence and Structure Identity
[Figure: sequence identity (0.0 to 1.0) plotted against Q-score (0.0 to 1.0), RMSD [Å] (0.0 to 6.0) and N_norm (0.0 to 1.0); the region of good structure similarity is marked]
E. Krissinel & K. Henrick (2004), Acta Cryst. D60, 2256-2268
20% of identical residues are very often sufficient for chains to be structurally similar.
$$Q = \frac{N_{align}^2}{\left(1 + (\mathrm{Rmsd}/R_0)^2\right) N_1 N_2}, \qquad N_{norm} = \frac{N_{align}}{\max(N_1, N_2)}$$
Sequence identity within structure families
Given that A ~ B at 20% and B ~ C at 20%, is A ~ C at 20% or more?
[Figure: triangle A, B, C with 20% identity on edges A-B and B-C, and "20% ?" on edge A-C]
Naively, $\frac{1}{5} \cdot \frac{1}{5} = \frac{1}{25}$.
OK, 20% sequence identity is not a necessary condition for structural similarity. How distant may the sequences within a structure family be?
Sequence identity within structure families: case A
[Figure: structures A, B and C with residue labels HIS, CYS and TRP]
Aligned residues are structurally conserved through the family. This is a typical assumption for multiple sequence alignment.
Implications:
  Protein folds are controlled by certain residue types and/or subsequences.
  Protein structure and therefore function are clearly sequence-related.
Sequence identity within structure families: case B
[Figure: structures A, B and C with residue labels HIS, CYS and TRP]
Aligned residues are not conserved through the family.
Implications:
  Protein folds are not controlled by any particular residue types and/or subsequences.
  Many different sequences may fold into similar structures.
  Protein structure and therefore function are not clearly sequence-related.
Sequence identity within structure families: case B
[Figure: structures A, B and C with residue labels HIS, CYS and TRP]
This case may be identified by multiple structure alignment only.
Multiple sequence alignment will always find and superpose short fragments:
  -----AFRNEDDDGGKPSTFKL
  EAARNAF-------GKKSTFIL
  EAARNAFDGKMTBIGK------
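The short conserved fragments can be located by scanning the alignment column by column. The sketch assumes the alignment on the slide consists of three rows of 22 columns each. The function name is hypothetical.

```python
def conserved_columns(rows, gap="-"):
    """0-based indices of columns where every row has the same residue."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("alignment rows must have equal length")
    return [i for i, column in enumerate(zip(*rows))
            if gap not in column and len(set(column)) == 1]


# The slide's multiple alignment, assumed to split into three rows of 22 columns
rows = [
    "-----AFRNEDDDGGKPSTFKL",
    "EAARNAF-------GKKSTFIL",
    "EAARNAFDGKMTBIGK------",
]

columns = conserved_columns(rows)
print(columns)
print([rows[0][i] for i in columns])
```

Only two-residue fragments are conserved across all three rows here, even though each pair of rows shares longer identical stretches.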
Multiple alignment of SCOP folds
SCOP database:
  11 classes
  945 folds
  1539 superfamilies
  2845 families
  70859 domains
SCOP: structure-related hierarchy; manually curated.
Multiple structure alignment of domains in SCOP folds: sound structure resemblance within folds; wide sequence variations; sequence redundancy cut-off at 50%.
Sequence identity in SCOP folds
[Figure: probability density vs. sequence identity, % (0 to 50), with curves for multiple sequence conservation (case A) and pairwise sequence conservation (case B)]
Average multiple sequence identity (A): 12%
Average pairwise sequence identity (B): 19%
Residue conservation
[Figure: odds (0 to 24) per residue type, in the order ARG, LYS, GLN, GLU, ASN, ASP, HIS, PRO, SER, THR, GLY, ALA, CYS, TYR, TRP, MET, PHE, LEU, VAL, ILE]
Odds are calculated as the ratio of observed to expected probabilities of identical-residue substitutions: Henikoff, S. and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. 89, p. 10915.
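The odds for identical-residue substitutions can be sketched in the style of the cited Henikoff & Henikoff procedure: observed pair frequency divided by the frequency expected from the residue background. The two-letter alphabet and the pair counts below are invented, and the function name is hypothetical. The slide does not say how the SCOP counts were collected.

```python
def identity_odds(pair_counts, alphabet):
    """Odds q_aa / p_a^2 for each residue type a.

    pair_counts: {(a, b): count} of aligned residue pairs, each unordered
        pair stored once.
    """
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency: p_a = q_aa + sum over b != a of q_ab / 2
    background = {a: 0.0 for a in alphabet}
    for (a, b), q in observed.items():
        if a == b:
            background[a] += q
        else:
            background[a] += q / 2
            background[b] += q / 2

    return {a: observed.get((a, a), 0.0) / background[a] ** 2
            for a in alphabet if background[a] > 0}


# Invented counts for a two-letter alphabet
toy_counts = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}
odds = identity_odds(toy_counts, "AB")
print({a: round(v, 3) for a, v in odds.items()})
```

Odds above 1 mean that a residue type is paired with itself more often than chance would predict.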
Residue conservation
[Figure: the odds plot per residue type (0 to 24) with reference data; residues TRP, PRO, LYS, GLU and HIS are labelled]
Reference data from Naor, D. et al. (1996). J. Mol. Biol. 256, p. 924.
Log odds matrix for SCOP folds
       ARG  LYS  GLN  GLU  ASN  ASP  HIS  PRO  SER  THR  GLY  ALA  CYS  TYR  TRP  MET  PHE  LEU  VAL  ILE
ARG      3    1    0    0    0    0    0    0    0    0   -1   -1   -1    0   -1   -1   -1   -1   -1   -1
LYS      1    3    1    0    0    0    0    0    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
GLN      0    1    3    0    0    0    0    0    0    0   -1    0   -1    0    0    0   -1   -1   -1   -1
GLU      0    0    0    2    0    0    0   -1    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
ASN      0    0    0    0    3    0    0   -1    0    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1
ASP      0    0    0    0    0    3    0   -1    0    0   -1   -1   -1   -1   -1   -1   -1   -2   -2   -2
HIS      0    0    0    0    0    0    4   -1    0    0   -1   -1   -1    0    0   -1   -1   -1   -1   -1
PRO      0    0    0   -1   -1   -1   -1    4    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
SER      0    0    0    0    0    0    0    0    2    0    0    0    0    0   -1    0   -1   -1   -1   -1
THR      0    0    0    0    0    0    0    0    0    2   -1    0    0    0   -1    0   -1   -1    0   -1
GLY     -1   -1   -1   -1    0   -1   -1   -1    0   -1    3    0   -1   -1   -1   -1   -1   -2   -1   -1
ALA     -1   -1    0   -1   -1   -1   -1   -1    0    0    0    2    0   -1   -1    0   -1   -1    0   -1
CYS     -1   -1   -1   -1   -1   -1   -1   -1    0    0   -1    0    4    0   -1    0    0    0    0    0
TYR      0   -1    0   -1   -1   -1    0   -1    0    0   -1   -1    0    3    1    0    1    0    0    0
TRP     -1   -1    0   -1   -1   -1    0   -1   -1   -1   -1   -1   -1    1    5    0    0   -1   -1   -1
MET     -1   -1    0   -1   -1   -1   -1   -1    0    0   -1    0    0    0    0    3    0    1    0    0
PHE     -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1    0    1    0    0    3    0    0    0
LEU     -1   -1   -1   -1   -1   -2   -1   -1   -1   -1   -2   -1    0    0   -1    1    0    2    0    0
VAL     -1   -1   -1   -1   -1   -2   -1   -1   -1    0   -1    0    0    0   -1    0    0    0    2    1
ILE     -1   -1   -1   -1   -1   -2   -1   -1   -1   -1   -1   -1    0    0   -1    0    0    0    1    2
Hydropathy index, hydrophilic (left) to hydrophobic (right):
       ARG  LYS  GLN  GLU  ASN  ASP  HIS  PRO  SER  THR  GLY  ALA  CYS  TYR  TRP  MET  PHE  LEU  VAL  ILE
      -4.5 -3.9 -3.5 -3.5 -3.5 -3.5 -3.2 -1.6 -0.8 -0.7 -0.4  1.8  2.5 -1.3 -0.9  1.9  2.8  3.8  4.2  4.5
Hydropathy index by Kyte, J. and Doolittle, R. F. (1982). J. Mol. Biol. 157, p. 105.
Sequence vs “hydropathy” identity in SCOP folds
[Figure: probability density vs. identity, % (0 to 100), with curves for hydropathy conservation, pairwise sequence conservation (case B) and multiple sequence conservation (case A)]
Average pairwise sequence identity: 19%
Average multiple sequence identity: 12%
Average “hydropathy” identity: 68%
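The difference between sequence identity and “hydropathy” identity can be sketched by counting aligned pairs that share a hydropathy class instead of a residue type. The two-class split used here (ARG to THR as hydrophilic, GLY to ILE as hydrophobic, following the residue ordering used in the talk) is an assumption. The function names and the toy sequences are invented.

```python
# Assumed two-class split, one-letter codes
HYDROPHILIC = set("RKQENDHPST")   # ARG LYS GLN GLU ASN ASP HIS PRO SER THR
HYDROPHOBIC = set("GACYWMFLVI")   # GLY ALA CYS TYR TRP MET PHE LEU VAL ILE


def hydropathy_class(residue):
    """Return the assumed class of a residue, or None if it is unknown."""
    if residue in HYDROPHILIC:
        return "hydrophilic"
    if residue in HYDROPHOBIC:
        return "hydrophobic"
    return None


def identities(row_a, row_b, gap="-"):
    """Return (sequence identity, hydropathy identity) over aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    same_residue = sum(1 for x, y in pairs if x == y)
    same_class = sum(1 for x, y in pairs
                     if hydropathy_class(x) is not None
                     and hydropathy_class(x) == hydropathy_class(y))
    return same_residue / len(pairs), same_class / len(pairs)


seq_id, hyd_id = identities("RKAL", "KEAV")
print(seq_id, hyd_id)
```

In the toy pair only one residue in four is identical, yet every aligned pair stays within its hydropathy class.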
What is 20% sequence identity?
Consider an idealized model with 10 hydrophilic and 10 hydrophobic residues, where all residues are indiscriminately substituted by like-hydropathic residues only.
Count matrix:
$$\hat{F} = a \begin{pmatrix} \mathbf{1}_{10 \times 10} & 0 \\ 0 & \mathbf{1}_{10 \times 10} \end{pmatrix}$$
Total counts (in upper triangle):
$$N = 2 \cdot \frac{10 \cdot 11}{2} \cdot a = 110a$$
Expected sequence identity:
$$SI = \frac{N_{diag}}{N} = \frac{20a}{110a} \approx 18\%$$
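The arithmetic of the idealized model can be checked numerically by building the block count matrix and summing its upper triangle. The function name and the choice a = 1 are arbitrary, since the result does not depend on a.

```python
def expected_identity(n_hydrophilic=10, n_hydrophobic=10, a=1.0):
    """Return (diagonal counts, total upper-triangle counts, expected identity)."""
    n = n_hydrophilic + n_hydrophobic

    def same_class(i, j):
        # Residues 0..n_hydrophilic-1 are hydrophilic, the rest hydrophobic
        return (i < n_hydrophilic) == (j < n_hydrophilic)

    # Every like-hydropathic substitution gets the same count a, others get 0
    counts = [[a if same_class(i, j) else 0.0 for j in range(n)]
              for i in range(n)]

    total = sum(counts[i][j] for i in range(n) for j in range(i, n))
    diagonal = sum(counts[i][i] for i in range(n))
    return diagonal, total, diagonal / total


diagonal, total, identity = expected_identity()
print(diagonal, total, round(100 * identity, 1))
```

So even when substitutions are completely indiscriminate within each hydropathy class, about 18% identity is expected, which is close to the 20% level discussed in the talk.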
Conclusion
It is quite possible that residue identity plays a much less significant role in protein structure than is often believed.
As a consequence, the role of residue identity in protein function may often be overestimated.
Using sequence identity for the assessment of structural or functional features may give more false negatives than expected.
Physical-chemical properties of residues should be given preference over residue identity in structure and function analysis.
Modern methods for structure alignment are efficient; there is little sense in using sequence alignment in structure-related studies.
Acknowledgement. This work has been supported by research grant No. 721/B19544 from the Biotechnology and Biological Sciences Research Council (BBSRC) UK.