
EMBL-EBI

MSDfold (SSM)

A web service for protein structure comparison and structure searches

Eugene Krissinel

http://www.ebi.ac.uk/msd-srv/ssm/ssmstart.html


Structure alignment

Structure alignment may be defined as identification of residues occupying “equivalent” geometrical positions

Unlike in sequence alignment, residue type is neglected

Used for:
  measuring structural similarity
  protein classification and functional analysis
  database searches


Methods

Many methods are known:
  Distance matrix alignment (DALI, Holm & Sander, EBI)
  Vector alignment (VAST, Bryant et al., NCBI)
  Depth-first recursive search on SSEs (DEJAVU, Madsen & Kleywegt, Uppsala)
  Combinatorial extension (CE, Shindyalov & Bourne, SDSC)
  Dynamic programming on Cα (Gerstein & Levitt)
  Dynamic programming on SSEs (SSA, Singh & Brutlag, Stanford University)
  many others

SSM employs a 2-step procedure:
  A. Initial structure alignment and superposition using SSE graph matching
  B. Cα-alignment


E. M. Mitchell et al. (1990) J. Mol. Biol. 212:151


SSE graphs differ from conventional chemical graphs only in that they are labelled by vectors of properties. In graph matching, the labels are compared with tolerances chosen empirically.

Graph representation of SSEs
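The tolerance-based label comparison described above can be illustrated with a minimal sketch. The property names (an edge length in Å and an angle in degrees) and the tolerance values are hypothetical and chosen only for illustration; the slide does not list the actual properties or tolerances used by SSM.

```python
def labels_match(label_a, label_b, tolerances):
    """Return True if two property vectors agree within per-property tolerances.

    label_a, label_b: sequences of numeric properties (hypothetical here).
    tolerances: one absolute tolerance per property (assumed, not SSM's values).
    """
    if not (len(label_a) == len(label_b) == len(tolerances)):
        raise ValueError("labels and tolerances must have the same length")
    return all(abs(a - b) <= tol
               for a, b, tol in zip(label_a, label_b, tolerances))


# Hypothetical edge labels: (distance between SSE centres in angstroms, angle in degrees)
edge_in_a = (12.1, 35.0)
edge_in_b = (12.9, 41.0)
tolerances = (1.0, 10.0)  # illustrative values only

print(labels_match(edge_in_a, edge_in_b, tolerances))
```

Unlike exact label equality in chemical graph matching, this comparison is not transitive, which is one reason the tolerances need empirical tuning.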

SSE graph matching

[Figure: SSE graph of structure A (helices H1-H2, strands S1-S4) matched against the SSE graph of structure B (helices H1-H6, strands S1-S7).]

Matching the SSE graphs yields a correspondence between secondary structure elements, that is, groups of residues. The correspondence may be used as initial guess for structure superposition and alignment of individual residues.

[Figure: chain A and chain B with matched helices and matched strands highlighted.]

SSE-alignment is used as an initial guess for Cα-alignment

Cα-alignment is an iterative procedure based on the expansion of shortest contacts at best superposition of structures

Cα-alignment is a compromise between the alignment length Nalign and r.m.s.d. Longest contacts are unmapped in order to maximise the Q-score:

$$Q = \frac{N_{align}^2}{\left(1 + (\mathrm{r.m.s.d.}/R_0)^2\right) N_A N_B}$$

Cα-alignment
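The trade-off between alignment length and r.m.s.d. can be sketched as follows. This is a simplified illustration, not the SSM implementation: it assumes a fixed superposition (the real procedure re-superposes the structures iteratively), and it assumes R0 = 3 Å, a value that is not stated on the slide. The function names are hypothetical.

```python
import math

R0 = 3.0  # assumed normalising distance in angstroms; not stated on the slide


def q_score(n_align, rmsd, n_a, n_b, r0=R0):
    """Q = N_align^2 / ((1 + (rmsd/R0)^2) * N_A * N_B)."""
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n_a * n_b)


def best_unmapping(distances, n_a, n_b, r0=R0):
    """Keep the k shortest contacts that maximise Q at a fixed superposition.

    distances: distances between mapped C-alpha pairs.
    Returns (number of contacts kept, Q-score of that selection).
    """
    best_k, best_q = 0, 0.0
    sum_sq = 0.0
    for k, d in enumerate(sorted(distances), start=1):
        sum_sq += d * d
        rmsd = math.sqrt(sum_sq / k)
        q = q_score(k, rmsd, n_a, n_b, r0)
        if q > best_q:
            best_k, best_q = k, q
    return best_k, best_q


# Four close contacts and one long contact between two 5-residue chains.
kept, q = best_unmapping([0.5, 0.6, 0.7, 0.8, 9.0], n_a=5, n_b=5)
print(kept, round(q, 3))
```

In this toy input, dropping the single 9 Å contact raises Q because the gain from a lower r.m.s.d. outweighs the loss of one aligned residue.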


More than 2 structures are aligned simultaneously

Multiple alignment is not equal to the set of all-to-all pairwise alignments

Helps to identify common structure motifs for a whole family of structures

Multiple structure alignment


Iterative removal of non-aligning SSEs

[Figure: best pairwise alignments between structures A, B and C.]

Helices may be multiply aligned from pairwise relations.

Strands do not multiply align, but one can still try to align them by probing alternative (not best) alignments.

4 alternative pairwise alignments

[Figure: structures A, B and C, with alternative alignments numbered 1 and 2 for A and for B, and 1 for C.]

make up to 4 multiple alignments:

A1 - B1 - C1

A1 - B2 - C1

A2 - B1 - C1

A2 - B2 - C1

Complexity $O\left(\prod_{i=1}^{N} n_i\right)$ is prohibitive for N ≥ 15-20 structures
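The combinatorial growth can be made concrete with a short sketch. It assumes that each structure contributes a number of alternative alignments and that a multiple alignment picks one alternative per structure; the dictionary layout and function name are illustrative only.

```python
from itertools import product
from math import prod


def enumerate_multiple_alignments(alternatives):
    """List every combination of one alternative alignment per structure.

    alternatives: dict mapping structure name -> number of alternatives.
    """
    choices = [[f"{name}{k}" for k in range(1, count + 1)]
               for name, count in alternatives.items()]
    return [" - ".join(combo) for combo in product(*choices)]


# The case on the slide: two alternatives for A and B, one for C.
for line in enumerate_multiple_alignments({"A": 2, "B": 2, "C": 1}):
    print(line)

# The number of combinations is the product of the per-structure counts,
# so it grows exponentially with the number of structures.
print(prod([4] * 15))  # e.g. 4 alternatives for each of 15 structures
```

Exhaustive enumeration is therefore only practical for a handful of structures, which motivates a heuristic.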


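Heuristics: remove the non-aligning SSE with the lowest alignment score and reiterate all alignments

[Figure: structures A, B and C with alternative alignments numbered 1 and 2.]

[Equation involving Q_ij and Q_ji; its form is not recoverable from the transcript.]

Start
Calculate all-to-all pairwise alignments
Are there non-aligning SSEs?
  Yes: remove one non-aligning SSE with lowest score and recalculate the alignments
  No: quit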
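The control flow of this heuristic can be sketched with a toy model. The stand-in for "all-to-all pairwise alignment" below is a deliberate simplification: an SSE counts as aligning only if its label occurs in every structure. The real procedure uses SSE graph matching, and the scores here are made-up numbers.

```python
def iterative_removal(structures, score):
    """Repeatedly drop the lowest-scoring non-aligning SSE until none remain.

    structures: dict name -> list of SSE labels (toy representation).
    score: dict (name, sse) -> alignment score (hypothetical values).
    Returns the pruned structures and the removal order.
    """
    structures = {name: list(sses) for name, sses in structures.items()}
    removed = []
    while True:
        # Toy stand-in for recalculating all-to-all pairwise alignments.
        common = set.intersection(*(set(s) for s in structures.values()))
        non_aligning = [(name, sse)
                        for name, sses in structures.items()
                        for sse in sses if sse not in common]
        if not non_aligning:          # "No" branch: quit
            return structures, removed
        worst = min(non_aligning, key=lambda item: score[item])
        structures[worst[0]].remove(worst[1])   # "Yes" branch: remove one SSE
        removed.append(worst)


structures = {"A": ["H1", "S1", "S2"], "B": ["H1", "S1"], "C": ["H1", "S1", "S3"]}
score = {("A", "S2"): 0.3, ("C", "S3"): 0.1}
pruned, order = iterative_removal(structures, score)
print(order)
print(pruned)
```

Removing one SSE per iteration keeps the cost linear in the number of removed elements instead of exploring every combination of alternatives.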


Multiple Cα refinement

Central star & consensus

[Figure: structures A, B and C arranged around the consensus structure X.]

Multiple SSE alignment
Initial Cα alignment
Superpose structures and calculate consensus structure X
Choose the structure closest to X as central star and align all the rest to it
Unmap groups of atoms with highest distance score D in order to maximise the score

$$Q = \frac{N_{align}^2}{\left(1 + (D/R_0)^2\right) N_1 N_2}$$

Score improved?
  Yes: iterate again
  No: quit
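The "central star" choice can be sketched as below. The sketch assumes the structures are already superposed and aligned position by position, takes the consensus X as the per-position mean of coordinates, and measures closeness by r.m.s.d.; these are simplifying assumptions, and the function names are hypothetical.

```python
import math


def consensus(structures):
    """Per-position mean coordinates over all (already superposed) structures."""
    coords = list(structures.values())
    n_pos = len(coords[0])
    return [tuple(sum(c[i][k] for c in coords) / len(coords) for k in range(3))
            for i in range(n_pos)]


def rmsd(a, b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def central_star(structures):
    """Name of the structure closest to the consensus structure X."""
    x = consensus(structures)
    return min(structures, key=lambda name: rmsd(structures[name], x))


# Three toy two-atom "structures", shifted along y.
structures = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(0.0, 0.3, 0.0), (1.0, 0.3, 0.0)],
    "C": [(0.0, 0.9, 0.0), (1.0, 0.9, 0.0)],
}
print(central_star(structures))
```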


Pairwise Alignment vs. Multiple Alignment

Best pairwise alignment of 1SAR:A and 1D1F:B includes only the β-sheet

Addition of 1MGW:A (a close neighbour of 1SAR:A) reveals a common motif of β-sheet and α-helix


http://www.ebi.ac.uk/msd-srv/ssm

SSM server map





Table of matched Secondary Structure Elements
Table of matched backbone Cα-atoms with distances between them at best structure superposition
Rotation-translation matrix of best structure superposition
Visualisation in Jmol and Rasmol
r.m.s.d. of Cα-alignment
Length of Cα-alignment Nalign
Number of gaps in Cα-alignment
Quality score Q
Statistical significance scores P(S), Z
Sequence identity

SSM output

Statistical significance of alignments

P-value is estimated using Q-scores of SSE deviations

P(S) is the probability of getting a score equal to S or higher when picking structures from the PDB at random

P(S) is calibrated on SCOP folds

P(S) is often expressed through Z-score

[Figure and equations: SSE deviations x1 ... xi ... xn; the formulas for the score S and for the Z-score are not recoverable from the transcript.]

Criterion        Hit                 Q-score  RMSD  Nalign  P
Maximal Q-score  d1di2a_ (69 res)    0.213    2.43  67/184  0.55
Lowest RMSD      d1emn_1 (43 res)    0.019    0.9   13/184  0.075
Highest Nalign   d1elxb_ (449 res)   0.02     5.82  89/184  ~1

Scoring at low structural similarity - 1KNO:A vs SCOP 1.61
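As a rough consistency check, the Q-scores listed for these three hits can be recomputed from the other columns. The sketch assumes R0 = 3 Å (not stated on the slides) and reads the "184" in Nalign as the length of the query chain 1KNO:A; under those assumptions the recomputed values agree with the listed ones to within about 0.001.

```python
R0 = 3.0            # assumed normalising distance in angstroms
QUERY_LENGTH = 184  # assumed length of 1KNO:A, read from "Nalign x/184"

# (hit, hit length, Nalign, RMSD, Q-score as listed on the slide)
hits = [
    ("d1di2a_", 69, 67, 2.43, 0.213),
    ("d1emn_1", 43, 13, 0.9, 0.019),
    ("d1elxb_", 449, 89, 5.82, 0.02),
]


def q_score(n_align, rmsd, n_a, n_b, r0=R0):
    """Q = N_align^2 / ((1 + (rmsd/R0)^2) * N_A * N_B)."""
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n_a * n_b)


for name, length, n_align, rmsd, listed in hits:
    q = q_score(n_align, rmsd, QUERY_LENGTH, length)
    print(f"{name}: recomputed Q = {q:.4f}, listed Q = {listed}")
```

The hit with the lowest RMSD and the hit with the highest Nalign both score about ten times lower than the best-Q hit, which is the point of the slide: neither criterion alone identifies the most similar structure.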

Performance data

[Figure: server statistics by day from June 17, 2002 (0 to 1200 days) on logarithmic axes "Nqueries" and "CPU used [secs]"; legend entries: total queries, jobs per query, total CPU, delivery time per query, CPU/delivery.]


Sequence alignment

Based on residue identity, sometimes with a modified alphabet

--AARNEDDDGKMPSTF-L
E-AARNFG-DGK--STFIL

Used for:
  evolution studies
  protein function analysis
  guessing structure similarity

Algorithms: Dynamic programming + heuristics

Applications: BLAST, FASTA, FLASH and others

Structure alignment

Based on geometrical equivalence of residue positions, residue type disregarded

Used for:
  protein function analysis
  some aspects of evolution studies

Algorithms: Dynamic programming, graph theory, MC, geometric hashing and others

Applications: DALI, VAST, CE, MASS, SSM and others

Sequence and Structure Alignments

[Figure: sequence identity (0.0 to 1.0) plotted against Q-score (0.0 to 1.0), RMSD (0.0 to 6.0 Å) and Nnorm (0.0 to 1.0), with the region of good structure similarity marked.]

E. Krissinel & K. Henrick (2004), Acta Cryst. D60, 2256-2268

20% of identical residues are very often sufficient for chains to be structurally similar

$$Q = \frac{N_{align}^2}{\left(1 + (\mathrm{Rmsd}/R_0)^2\right) N_1 N_2}$$

$$N_{norm} = \frac{N_{align}}{\max(N_1, N_2)}$$

Sequence and Structure Identity


Sequence identity within structure families

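Given that A and B are 20% identical and B and C are 20% identical, are A and C 20% identical or more?

[Figure: triangle A, B, C with 20% on the A-B and B-C edges and "20% ?" on the A-C edge.]

Naively, $\frac{1}{5} \cdot \frac{1}{5} = \frac{1}{25}$

OK, 20% sequence identity is not a necessary condition for structural similarity. How distant may the sequences within a structure family be?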

Sequence identity within structure families: case A

[Figure: structures A, B and C with aligned residues labelled HIS, CYS and TRP.]

Aligned residues are structurally conserved through the family. This is a typical assumption for multiple sequence alignment.

Implications:
  Protein folds are controlled by certain residue types and/or subsequences.
  Protein structure and therefore function are clearly sequence-related.

Sequence identity within structure families: case B

[Figure: structures A, B and C with aligned residues labelled HIS, CYS and TRP.]

Aligned residues are not conserved through the family.

Implications:
  Protein folds are not controlled by any particular residue types and/or subsequences.
  Many different sequences may fold into similar structures.
  Protein structure and therefore function are not clearly sequence-related.

[Figure: structures A, B and C with aligned residues labelled HIS, CYS and TRP.]

This case may be identified by multiple structure alignment only.

Multiple sequence alignment will always find and superpose short fragments:

-----AFRNEDDDGGKPSTFKL
EAARNAF-------GKKSTFIL
EAARNAFDGKMTBIGK------

Sequence identity within structure families: case B


Multiple alignment of SCOP folds

SCOP database

11 classes

945 folds

1539 superfamilies

2845 families

70859 domains

SCOP
  Structure-related hierarchy
  Manually curated

Multiple structure alignment of domains in SCOP folds
  Sound structure resemblance within folds
  Wide sequence variations
  Sequence redundancy cut-off at 50%

Sequence identity in SCOP folds

[Figure: probability density against sequence identity (0 to 50%), with one curve for pairwise sequence conservation (case B) and one for multiple sequence conservation (case A).]

Average multiple sequence identity (A) 12%
Average pairwise sequence identity (B) 19%

[Figure: bar chart of odds (0 to 24) for the residue types ARG, LYS, GLN, GLU, ASN, ASP, HIS, PRO, SER, THR, GLY, ALA, CYS, TYR, TRP, MET, PHE, LEU, VAL, ILE.]

Residue conservation

Odds are calculated as the ratio of observed to expected probabilities of identical-residue substitutions: Henikoff, S. and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. 89, p. 10915.

[Figure: bar chart of odds (0 to 24) for the same 20 residue types, shown with reference data; TRP, PRO, LYS, GLU and HIS are labelled.]

Reference data from Naor D. et.al. (1996). J. Mol. Biol. 256, p. 924.

Residue conservation


Log odds matrix for SCOP folds

     ARG  LYS  GLN  GLU  ASN  ASP  HIS  PRO  SER  THR  GLY  ALA  CYS  TYR  TRP  MET  PHE  LEU  VAL  ILE
ARG    3    1    0    0    0    0    0    0    0    0   -1   -1   -1    0   -1   -1   -1   -1   -1   -1
LYS    1    3    1    0    0    0    0    0    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
GLN    0    1    3    0    0    0    0    0    0    0   -1    0   -1    0    0    0   -1   -1   -1   -1
GLU    0    0    0    2    0    0    0   -1    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
ASN    0    0    0    0    3    0    0   -1    0    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1
ASP    0    0    0    0    0    3    0   -1    0    0   -1   -1   -1   -1   -1   -1   -1   -2   -2   -2
HIS    0    0    0    0    0    0    4   -1    0    0   -1   -1   -1    0    0   -1   -1   -1   -1   -1
PRO    0    0    0   -1   -1   -1   -1    4    0    0   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
SER    0    0    0    0    0    0    0    0    2    0    0    0    0    0   -1    0   -1   -1   -1   -1
THR    0    0    0    0    0    0    0    0    0    2   -1    0    0    0   -1    0   -1   -1    0   -1
GLY   -1   -1   -1   -1    0   -1   -1   -1    0   -1    3    0   -1   -1   -1   -1   -1   -2   -1   -1
ALA   -1   -1    0   -1   -1   -1   -1   -1    0    0    0    2    0   -1   -1    0   -1   -1    0   -1
CYS   -1   -1   -1   -1   -1   -1   -1   -1    0    0   -1    0    4    0   -1    0    0    0    0    0
TYR    0   -1    0   -1   -1   -1    0   -1    0    0   -1   -1    0    3    1    0    1    0    0    0
TRP   -1   -1    0   -1   -1   -1    0   -1   -1   -1   -1   -1   -1    1    5    0    0   -1   -1   -1
MET   -1   -1    0   -1   -1   -1   -1   -1    0    0   -1    0    0    0    0    3    0    1    0    0
PHE   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1    0    1    0    0    3    0    0    0
LEU   -1   -1   -1   -1   -1   -2   -1   -1   -1   -1   -2   -1    0    0   -1    1    0    2    0    0
VAL   -1   -1   -1   -1   -1   -2   -1   -1   -1    0   -1    0    0    0   -1    0    0    0    2    1
ILE   -1   -1   -1   -1   -1   -2   -1   -1   -1   -1   -1   -1    0    0   -1    0    0    0    1    2

Hydropathy index per column, from hydrophilic (left) to hydrophobic (right):
    -4.5 -3.9 -3.5 -3.5 -3.5 -3.5 -3.2 -1.6 -0.8 -0.7 -0.4  1.8  2.5 -1.3 -0.9  1.9  2.8  3.8  4.2  4.5

Hydropathy index by Kyte, J. and Doolittle, R. F. (1982). J. Mol. Biol. 157, p. 105.
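The difference between sequence identity and "hydropathy" identity can be illustrated with a minimal sketch. It assumes a two-class split that follows the column order of the matrix (the first ten residue types taken as hydrophilic, the last ten as hydrophobic); the exact class boundary used in the original analysis is not given on the slide, and the function names and example sequences are hypothetical.

```python
# Assumed two-class split, following the column order of the matrix:
# ARG LYS GLN GLU ASN ASP HIS PRO SER THR | GLY ALA CYS TYR TRP MET PHE LEU VAL ILE
HYDROPHILIC = set("RKQENDHPST")
HYDROPHOBIC = set("GACYWMFLVI")


def aligned_pairs(seq_a, seq_b):
    """Residue pairs from two aligned sequences, skipping gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]


def sequence_identity(seq_a, seq_b):
    """Fraction of aligned pairs with identical residues."""
    pairs = aligned_pairs(seq_a, seq_b)
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0


def hydropathy_identity(seq_a, seq_b):
    """Fraction of aligned pairs whose residues fall in the same class."""
    pairs = aligned_pairs(seq_a, seq_b)
    same = sum((a in HYDROPHILIC) == (b in HYDROPHILIC) for a, b in pairs)
    return same / len(pairs) if pairs else 0.0


# No identical residues, yet every substitution keeps the hydropathy class.
print(sequence_identity("RKAV", "KRVL"), hydropathy_identity("RKAV", "KRVL"))
```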

Sequence vs “hydropathy” identity in SCOP folds

[Figure: probability density against identity (0 to 100%), with curves for hydropathy conservation, pairwise sequence conservation (case B) and multiple sequence conservation (case A).]

Average pairwise sequence identity 19%
Average multiple sequence identity 12%
Average “hydropathy” identity 68%

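What is 20% sequence identity?

Consider an idealized model, where all residues are indiscriminately substituted by like-hydropathic residues only:

Count matrix (10 hydrophilic residues, 10 hydrophobic residues), where $\mathbf{1}_{10 \times 10}$ is a 10 × 10 block of ones:

$$\hat{F} = a \begin{pmatrix} \mathbf{1}_{10 \times 10} & 0 \\ 0 & \mathbf{1}_{10 \times 10} \end{pmatrix}$$

Total counts (in upper triangle):

$$N = 2 \cdot \frac{10 \cdot 11}{2}\, a = 110a$$

Expected sequence identity:

$$SI = \frac{N_{diag}}{N} = \frac{20a}{110a} \approx 18\%$$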
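The arithmetic of this idealized model can be checked with a few lines of code. The sketch builds the block count matrix explicitly and counts the diagonal and the upper triangle (diagonal included); the function name and the default `a = 1` are illustrative choices, since the result does not depend on `a`.

```python
from fractions import Fraction


def expected_identity(n_hydrophilic=10, n_hydrophobic=10, a=1):
    """Expected sequence identity when substitutions stay within a class.

    Builds the count matrix with `a` counts for every same-class pair and 0
    otherwise, then divides the diagonal sum by the upper-triangle sum.
    """
    n = n_hydrophilic + n_hydrophobic

    def same_class(i, j):
        return (i < n_hydrophilic) == (j < n_hydrophilic)

    counts = [[a if same_class(i, j) else 0 for j in range(n)] for i in range(n)]
    diagonal = sum(counts[i][i] for i in range(n))
    upper = sum(counts[i][j] for i in range(n) for j in range(i, n))
    return Fraction(diagonal, upper)


si = expected_identity()
print(si, f"{float(si):.1%}")
```

With two classes of ten residues each, the diagonal holds 20a counts out of 110a in the upper triangle, so an identity near 20% is what purely hydropathy-preserving substitution would produce.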


Conclusion

it is quite possible that residue identity plays a much less significant role in protein structure than often believed

as a consequence, the role of residue identity in protein function may be often overestimated

using sequence identity for the assessment of structural or functional features may give more false negatives than expected

physical-chemical properties of residues should be given preference over residue identity in structure and function analysis

modern methods for structure alignment are efficient; there is little sense in using sequence alignment in structure-related studies

Acknowledgement. This work has been supported by research grant No. 721/B19544 from the Biotechnology and Biological Sciences Research Council (BBSRC) UK.
