Methods for comparing protein structures

Post on 17-Jan-2016


Protein structural classifications

How do structures and functions diverge in protein superfamilies?

What proportion of genome sequences can be predicted to belong to superfamilies of known structure?

Comparing and Classifying Domain Structures

Protein Domain Family Classifications

Known domain structures: Alexey Murzin, LMB, Cambridge

Predicted domain structures: Julian Gough, Bristol University

Known and predicted domain structures: Christine Orengo, UCL

Domain sequences: Alex Bateman, Sanger

60-80% of genes in genomes code for multidomain proteins

domains are important evolutionary units

Evolution gives rise to families of proteins (homologues)

[Figure: a domain superfamily with relatives from Th. thermophilus, human, yeast and M. tuberculosis; orthologues and paralogues are labelled]

Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Structural diversity in the CATH Domain Family P-loop hydrolases

[Figure: cutinase, cocaine esterase and acetylcholinesterase structures]

Challenges in comparing protein structures

• residue substitutions due to single base mutations
• insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops

Usually the structural cores are highly conserved

Although structure is much more conserved than sequence, there can still be considerable structural differences between relatives outside the core

• residue insertions usually occur in the loops connecting secondary structures
• substitutions can cause shifts in the orientations of secondary structures

Superposition of OB fold Structures

Related structures: RMSD usually < 3.5 Å
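The RMSD quoted above is the root mean square deviation between equivalent atoms once two structures have been superposed. A minimal sketch, assuming the structures are already superposed and the residue equivalences are known; the coordinates are hypothetical and the function name `rmsd` is chosen only for illustration:

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between equivalent atoms of two already-superposed structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha coordinates (in Angstroms) for three equivalent residues.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(a, b))  # 1.0
```

The sketch leaves out the superposition step itself, which in practice finds the rotation and translation that minimise this value.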

Coping with Insertions and Deletions

• ignore the variable loop regions and only compare the secondary structures
• use algorithms which can explicitly handle insertions/deletions e.g. dynamic programming, simulated annealing

Fast structure comparison by secondary structures

[Figure: two structures drawn as graphs whose nodes are secondary structure elements, helices (H) and strands (E)]

Graphs can be compared using the Bron-Kerbosch algorithm to find the largest common subgraph

In this example the common graph contains 5 nodes.

Generally ~1000 times faster than residue-based methods
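The Bron-Kerbosch algorithm enumerates the maximal cliques of a graph. One usual way to apply it to common-subgraph detection is to search for cliques in a correspondence graph whose nodes are compatible pairs of secondary structure elements. A minimal sketch of the basic recursion (without pivoting); the small graph and the function names are hypothetical and are not taken from any of the programs named in these slides:

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Collect every maximal clique of the graph described by adj."""
    if not p and not x:
        cliques.append(set(r))  # r cannot be extended, so it is maximal
        return
    for v in list(p):
        # Grow the clique with v, keeping only candidates adjacent to v.
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}   # v has been fully explored
        x = x | {v}   # exclude v from later cliques at this level

def largest_clique(adj):
    """Return one largest clique among all maximal cliques."""
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return max(cliques, key=len)

# Hypothetical graph given as node -> set of neighbouring nodes.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
print(sorted(largest_clique(adj)))  # ['A', 'B', 'C', 'D']
```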

STRUCTAL

Align sequences

Superpose structures

Score distances between superposed residues in path matrix

Use dynamic programming to find best path

Use equivalences given by the best path to re-superpose the structures
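The scoring and dynamic programming steps of this cycle can be sketched as below. This is an illustration under stated assumptions, not the STRUCTAL code: the score form 20 / (1 + (d / 2.24)^2) and the gap penalty of 10 are assumptions loosely based on published descriptions of the method, the coordinates are hypothetical, and the superposition and re-superposition steps are left out.

```python
import math

def structal_like_alignment(coords_a, coords_b, gap=10.0):
    """Global alignment of two superposed coordinate lists by dynamic programming."""
    n, m = len(coords_a), len(coords_b)
    # Path matrix scores: high when two superposed residues are close in space.
    score = [[20.0 / (1.0 + (math.dist(a, b) / 2.24) ** 2) for b in coords_b]
             for a in coords_a]
    # best[i][j]: best score for the first i residues of A and first j of B.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)
    # Trace back the best path to recover the residue equivalences.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Hypothetical superposed C-alpha traces; B carries one inserted loop residue.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.7, 6.0, 0.0),
     (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
total, pairs = structal_like_alignment(a, b)
print(pairs)  # [(0, 0), (1, 1), (2, 3), (3, 4)]
```

The inserted residue in B is left unpaired, which is how dynamic programming copes with an indel in a loop.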

Structure Comparison Algorithms

Secondary structure based:

Method     Authors               Resource
SSM        Henrick               PDB
GRATH      Harrison & Orengo     CATH

Residue based:

Method     Authors               Resource
SSAP       Taylor and Orengo     CATH
DALI       Holm and Sander       SCOP
Comparer   Sali and Blundell     HOMSTRAD
FatCat     Adam Godzik           PDB
Structal   Levitt                PDB

Structural Bioinformatics, Ed: Phil Bourne, Wiley 2003

Bioinformatics: Genes, Proteins and Computers, Bios, 2003

Structure classification

CATH domain structure database (Orengo & Thornton 1993)

~200,000 domains

2600 domain superfamilies

C  Class
A  Architecture
T  Topology or Fold Group
H  Homologous Superfamily

Class: 3

Architecture: ~40

Topology or Fold: ~1200
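CATH superfamilies are commonly written as four dot-separated numbers, one per level of the hierarchy (Class.Architecture.Topology.Homologous superfamily). A minimal sketch of reading such a code and finding the deepest level two domains share; the helper names are hypothetical and the two codes are used only as illustrative inputs:

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    superfamily: int

LEVELS = ["none", "Class", "Architecture", "Topology", "Homologous Superfamily"]

def parse_cath(code):
    """Split a C.A.T.H code such as '3.40.50.300' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: " + code)
    return CathCode(*(int(p) for p in parts))

def shared_level(code_a, code_b):
    """Name of the deepest hierarchy level at which two codes still agree."""
    depth = 0
    for x, y in zip(parse_cath(code_a), parse_cath(code_b)):
        if x != y:
            break
        depth += 1
    return LEVELS[depth]

print(shared_level("3.40.50.300", "3.40.50.720"))  # Topology
```

Two domains that agree down to Topology share a fold but are placed in different homologous superfamilies.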

CATH Architectures

Orthogonal bundle, Up-down bundle, α-horseshoe, α-solenoid, αα-barrel, β-ribbon, β-sheet, β-roll, β-barrel, Clam, 2-layer β-sandwich, Trefoil, 3-layer β-sandwich, β-propeller, β-solenoid, Orthogonal β-prism, Parallel β-prism, αβ-roll

CATH Architectures

2-layer (αβ) sandwich, αβ-barrel, 3-layer (αβα) sandwich, 3-layer (ββα) sandwich, 3-layer (βαβ) sandwich, 4-layer sandwich, αβ-prism, αβ-horseshoe, αβ-box

Topology or Fold Group: ~1200

Homologous Superfamily: ~2600

Sequence Family (30%)

40,000 domain entries

~200,000 domain entries

C A T H

Divergent Evolution

Convergent Evolution

..VILST… ..KLST… ...SLTRF...


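Homologous Structures

[Figure: cholera toxin, pertussis toxin and heat labile enterotoxin compared by SSAP score and sequence identity. Value pairs shown: SSAP score 97 with 79% sequence identity; SSAP score 81 with 12% sequence identity]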

• high structure similarity score, often < 4 Å
• may have detectable sequence similarity e.g. by HMMs
• related functions
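Sequence identity, as quoted for homologues, is the percentage of aligned residue pairs that are identical. A minimal sketch, assuming the two sequences are already aligned with `-` marking gaps; the alignment below is hypothetical, and other conventions for the denominator exist (for example the length of the shorter sequence):

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the aligned (non-gap) positions of two sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

# Hypothetical alignment of two short fragments.
print(percent_identity("VILST", "-KLST"))  # 75.0
```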

Evolutionary Ancestry Uncertain

• structural similarity
• no sequence similarity
• no functional similarity

How do proteins evolve new functions?

Evolution of Protein Functions in Domain Superfamilies

domain duplication

domain fusion, change in domain partner

residue mutations and domain structure embellishments

oligomerisation

Mutation of Residues: TIM barrel glycosyl hydrolases

chitinase A: Glu general acid

narbonin: Glu incorporated in a salt-bridge and this blocks substrate access

changes in the domain structure can modify the binding site or domain surface

Changes in domain function in paralogous relatives

Pantetheine-phosphate adenylyltransferase (EC code: 2.7.7.3)

Glycerol-3-phosphate cytidylyltransferase (EC code: 2.7.7.39)

[Figure: binding sites of the two relatives]

changes in the domain partnerships can modify the binding site

[Figure: domains 1od6A00 and 1f7uA01 with their binding sites. Labels: Pantetheine-phosphate adenylyltransferase, Arginyl-tRNA synthetase, Asparagine synthetase B]

Change in Oligomerisation

Thioredoxin superfamily

[Figure: calsequestrin and peroxidase]

60-80% of proteins are multi-domain

few thousand domain superfamilies (< 10,000 CATH, SCOP and Pfam)

> Two million domain combinations (multi-domain architectures)

The Mosaic Theory of Protein Evolution (Teichmann et al. 2001, 2003; Gerstein et al. 2001)

Similarity in Chemistry

conserved          19%
semiconserved      67%
poorly conserved    7%
unconserved         7%

nearly 90% of families show full or partial conservation of functions

chemistry is conserved or semi-conserved across the family but the substrate can change
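One simple way to see "same chemistry, different substrate" is through EC numbers: the first three levels describe the type of reaction, and the fourth typically distinguishes the substrate. A minimal sketch that counts how many leading EC levels two enzymes share; the function name is hypothetical, and this is not necessarily the measure behind the percentages above:

```python
def shared_ec_levels(ec_a, ec_b):
    """Number of leading EC levels (0 to 4) that two EC numbers have in common."""
    a, b = ec_a.split("."), ec_b.split(".")
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected EC numbers with four levels")
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth

# Two nucleotidyltransferases: same reaction type, different substrate.
print(shared_ec_levels("2.7.7.3", "2.7.7.39"))  # 3
```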

[Figure: chemical structures of the substrates acted on across a family. Labels: thioredoxin disulphide bond, H2O2, Hg2+, cytochrome P450s, FAD/NAD(P)(H)-dependent disulphide oxidoreductases, hexapeptide repeat proteins]

[Figure labels: blade domain, fulcrum domain, handle domain]

How representative are these structural superfamilies (ie in CATH, SCOP) of all proteins in nature?

Domain structure predictions in genome sequences

protein sequences from UniProt

~6000 annotated genomes

scan against library of sequence patterns (HMM models) for CATH

~26 million domain sequences assigned to CATH superfamilies

CATH and Pfam coverage of genomes

[Figure: coverage chart. Legend: CATH, Pfam, Unassigned-region (NewFam?); Pfam-A, Pfam-B, Other. Values shown: 51%, 33%, 16%]

Protein Family Databases

Each family is represented by a sequence profile or HMM
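To make the idea of a sequence profile concrete, the sketch below builds per-column log-odds scores from a small ungapped alignment and scores a query against them. It is a deliberate simplification: a real profile HMM also models insertions and deletions, and the family, the uniform background and the pseudocount are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (assumed)."""
    background = 1.0 / len(ALPHABET)
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        # Pseudocounts keep unseen residues from scoring minus infinity.
        profile.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                        for aa in ALPHABET})
    return profile

def score(profile, seq):
    """Sum of column scores for an ungapped query of the same length."""
    return sum(column[aa] for column, aa in zip(profile, seq))

# Hypothetical family of three aligned fragments.
family = ["VILST", "VLLST", "KILST"]
profile = build_profile(family)
print(score(profile, "VILST") > score(profile, "GGGGG"))  # True
```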