Classification: understanding the diversity and principles of protein structure and function

MCSG 2001 structures

Protein structure classification

Main reference: Robert B. Russell (2002) Classification of Protein Folds. Molecular Biotechnology 20:17-28.

Importance: central to studies of protein structure, function, and evolution

Philosophy: phyletic vs. phenetic

Method: structure comparison + human knowledge

Philosophy of classification

Phyletic: based on phylogenetic relationship

Phenetic: based on study of phenomena (phenomenological)

Classification Unit: Domain, a LEGO piece

Ranganathan

From domain to assembly

Domains are shuffled, duplicated and fused to make proteins

On average, a domain is 173 a.a. long, compared with 466 a.a. for a yeast protein

Most of the natural domain sequences assume one of a few thousand folds, of which ~1000 are already known

There is no satisfactory estimate yet for the number of macromolecular complexes

On average, a yeast complex may consist of 7.5 proteins

Sali et al. 2003
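
The averages quoted above suggest some back-of-the-envelope ratios. The following is a minimal sketch of that arithmetic. It assumes the averages can simply be divided, which ignores linkers, disordered regions, and the skew of the real size distributions, so the results are rough indications only. The variable names are illustrative.

```python
# Averages quoted on the slide (attributed there to Sali et al. 2003).
AVG_DOMAIN_LENGTH = 173          # residues per domain
AVG_YEAST_PROTEIN_LENGTH = 466   # residues per yeast protein
AVG_PROTEINS_PER_COMPLEX = 7.5   # proteins per yeast complex

# Naive estimate: assumes a protein is fully tiled by average-size domains.
domains_per_protein = AVG_YEAST_PROTEIN_LENGTH / AVG_DOMAIN_LENGTH
domains_per_complex = domains_per_protein * AVG_PROTEINS_PER_COMPLEX

print(f"domains per protein ~ {domains_per_protein:.1f}")
print(f"domains per complex ~ {domains_per_complex:.1f}")
```

Under these assumptions an average yeast protein holds between two and three domains, and an average complex roughly twenty.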

Distribution of Protein size

Swiss-prot

Structural vs. functional domain

Russian doll: a conceptual problem

Singh

Approaches

Hierarchical

Based on the types and arrangements of secondary structures

Unit (level): domain

Domain assignment
- structural vs. functional (fold or function in isolation)
- automated assignment methods (structure vs. sequence)

A. P. Singh

Assignment of Class

All α or all β (could be subjective)

α/β (βαβ unit) or α+β

Other classes

Class assignment could be subjective
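
One way to see why class assignment can be subjective is to try to automate it. The sketch below assigns a coarse class from a DSSP-like secondary-structure string ('H' for helix, 'E' for strand, anything else treated as coil). The function name and both cutoffs (`min_frac`, `max_minor`) are hypothetical choices made for illustration, not values taken from SCOP, CATH, or the slides. Changing them moves borderline domains from one class to another.

```python
def assign_class(ss: str, min_frac: float = 0.15, max_minor: float = 0.05) -> str:
    """Assign a coarse structural class from a secondary-structure string.

    ss        -- per-residue states: 'H' helix, 'E' strand, other = coil
    min_frac  -- illustrative minimum fraction for an SSE type to count
    max_minor -- illustrative maximum fraction for the other type to be ignored
    """
    if not ss:
        raise ValueError("empty secondary-structure string")
    helix = ss.count("H") / len(ss)
    strand = ss.count("E") / len(ss)
    if helix >= min_frac and strand <= max_minor:
        return "all-alpha"
    if strand >= min_frac and helix <= max_minor:
        return "all-beta"
    if helix >= min_frac and strand >= min_frac:
        # Separating alpha/beta from alpha+beta needs strand order and
        # connectivity, which composition alone cannot provide.
        return "mixed alpha-beta"
    return "other"


print(assign_class("HHHHHHHHCCHHHHHHHH"))
print(assign_class("EEEECHHHHHHCEEEECHHHHHH"))
```

Composition alone cannot separate the two mixed classes, which is one reason the final call usually involves inspection of the strand arrangement.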

All-alpha structures

All-beta structures

Superoxide dismutase

Alpha/beta structures

Closed barrel Open twisted sheet

β-α-β motif

(barrel) (sheet)

α/β vs. α+β

Assignment of Fold

Defined by the number, type, and arrangement of SSEs

Connectivity (e.g. circular permutation, scrambled proteins)

Assignment of Superfamily

Homologous even in the absence of significant sequence similarity
- certain level of structural similarity
- unusual structural features
- low but significant sequence similarity from structural alignment
- key active site residues
- sequence similarity bridges

Divergence vs. convergence

Divergent vs. convergent evolution

Divergent evolution: descent from a common ancestor; become variant due to mutation

Convergent evolution: no common ancestor; become similar due to functional or physical constraint

Anti-freeze protein: convergent evolution

crystal.biochem.queensu.ca

Homologous fold

Ranganathan

Analogous fold

Ranganathan

Scallop Myosin Regulatory Domain C chain

Aldehyde Oxidoreductase A chain

Analogous or homologous?

Assignment of Family

significant sequence similarity
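
Sequence similarity at the family level is commonly summarized as percent identity over an alignment. The sketch below computes it for two pre-aligned sequences with '-' as the gap character. Counting only columns where both sequences have a residue is an assumption made here. Other denominators (shorter sequence length, full alignment length) are also in use and give different numbers. The function name and example sequences are hypothetical.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns without a gap in either sequence.
    columns = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Toy alignment: 7 gap-free columns, 6 of them identical.
print(round(percent_identity("MKT-AYIAK", "MRTQAY-AK"), 1))
```

Percent identity alone does not establish significance. That also depends on alignment length and on the scoring statistics of the method used.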

Classification databases

SCOP - careful assignment of evolutionary relationships; homologous vs. analogous

CATH - A: architecture

FSSP - a list of structural neighbors

CATH

Singh

Class: SSE composition & packing

Architecture: overall shape of domain, ignore SSE connectivity

Topology (Fold): consider connectivity

Homologous superfamily: a common ancestor
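
The four CATH levels map naturally onto a four-part identifier. The sketch below parses a dot-separated code and reports the deepest level two domains share. The class and method names are hypothetical, and the identifiers in the usage lines are used purely as illustrative inputs. Nothing here queries the CATH database.

```python
from dataclasses import dataclass, astuple

LEVELS = ("class", "architecture", "topology", "homologous superfamily")


@dataclass(frozen=True)
class CathCode:
    c: int  # Class
    a: int  # Architecture
    t: int  # Topology (fold)
    h: int  # Homologous superfamily

    @classmethod
    def parse(cls, code: str) -> "CathCode":
        parts = code.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated numbers, got {code!r}")
        return cls(*(int(p) for p in parts))

    def deepest_shared_level(self, other: "CathCode") -> str:
        """Walk down the hierarchy until the two codes diverge."""
        deepest = "none"
        for name, mine, theirs in zip(LEVELS, astuple(self), astuple(other)):
            if mine != theirs:
                break
            deepest = name
        return deepest


x = CathCode.parse("3.40.50.300")
y = CathCode.parse("3.40.50.720")
print(x.deepest_shared_level(y))
```

Because the hierarchy is strict, two domains that differ at one level are treated as unrelated at every level below it, even if a lower number happens to coincide.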

Classification databases

CATH: Class, Architecture, Topology, and Homologous superfamily, a hierarchical classification of protein domain structures
http://www.biochem.ucl.ac.uk/bsm/cath_new/

SCOP: Structural Classification Of Proteins: augmented manual classification
http://scop.mrc-lmb.cam.ac.uk/scop/

FSSP: Fold classification based on Structure-Structure alignment of Proteins
http://www2.ebi.ac.uk/dali/fssp/

Genome-scale structure analysis

Curr. Opin. Str. Biol., 2003

genome-scale structure annotation

Some statistics

80% of sequence families belong to 400 folds (top 10 folds account for 40% of sequence families)

>60% of genes encode multi-domain proteins (80% for eukaryotes)

~50,000 protein families and ~150,000 singletons

structural superfamilies ~1800 (+/-50) and ~10,000 unifolds

50-60% of distant homologs (<25% seq. id.) can be recognized by profile-based sequence comparison methods (e.g. PSI-BLAST, HMM, etc.)

50-60% of the enzymes in yeast and E. coli are common, and >80% of pathways are shared

superfolds, superfamilies, supersites

TIM barrel, Rossmann-like, ferredoxin-like, β-propellers, 4-helix bundle, Ig-like, β-jelly rolls, oligonucleotide/oligosaccharide binding (OB) fold, SH3-like.

Structure -> function (only 50% correct)

Structure implicates function?

Assessing the Progress of Structural Genomics Projects

1 Nov. 2002, Science

Target Tracking by PDB (Sep 2002)

PDB content growth (May 2005)

Some statistics

11 SG consortia contributed 316 non-redundant PDB entries comprising 459 CATH and 393 SCOP domains.

14% of the targets have a homolog (>30% sequence identity) solved by another consortium

67% of SG domains in CATH are unique vs. 21% of non-SG domains.

19% and 11% contributed new superfamilies and new folds, respectively.

The new structures allow new and reliable homology models for 9287 non-redundant gene sequences in 208 completely sequenced genomes.

PSI Structure Statistics 2002-2003

Unique structures (30% seq. ID): PSI 70%, PDB 10%

New folds: PSI 12%, PDB 3%

NIGMS Protein Structure Initiative

Average total cost per structure

PSI Pilot phase
01 $650 K (7 centers)
02 $400 K (9 centers)
03 $240 K
04 ?
05 $100 K (goal)

PSI-2 Production phase
06-10 $50 K (goal)

Comparison ~$250-300 K


PSI Pilot Phase -- Lessons Learned

1. Structural genomics pipelines can be constructed and scaled-up

2. High throughput operation works for many proteins

3. Genomic approach works for structures

4. Bottlenecks remain for some proteins

5. A coordinated, 5-year target selection policy must be developed

6. Homology modeling methods need improvement

PSI-2 Production Phase (2005)

Interacting network for high throughput protein structure determination with three components

- Large-scale centers for protein structure production of selected targets
- Specialized centers for technology development leading to high throughput structure determination of difficult proteins
- Specialized centers for protein structures relevant to disease (other NIH Institutes and Centers)

Included in NIH Structural Biology Roadmap plans


Computational structural genomics

Summary table

Fold occurrence matrix

Common Folds

Unique Folds
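
A fold occurrence matrix of this kind can be sketched as a presence/absence table of folds against genomes, from which common folds (present in every genome) and unique folds (present in exactly one) follow directly. The genome labels and fold assignments below are hypothetical toy data, not the values from the study summarized on these slides.

```python
# Hypothetical toy data: folds assigned in each genome.
fold_assignments = {
    "genome_A": {"TIM barrel", "ferredoxin-like", "P-loop NTP hydrolase"},
    "genome_B": {"TIM barrel", "P-loop NTP hydrolase", "Ig-like"},
    "genome_C": {"TIM barrel", "P-loop NTP hydrolase", "ferredoxin-like", "SH3-like"},
}

genomes = sorted(fold_assignments)
folds = sorted(set().union(*fold_assignments.values()))

# Rows are folds, columns are genomes; 1 = fold observed in that genome.
matrix = {
    fold: [int(fold in fold_assignments[g]) for g in genomes]
    for fold in folds
}

common_folds = [f for f, row in matrix.items() if sum(row) == len(genomes)]
unique_folds = [f for f, row in matrix.items() if sum(row) == 1]

for fold, row in matrix.items():
    print(f"{fold:22s} {row}")
print("common:", common_folds)
print("unique:", unique_folds)
```

In a real analysis the assignments would come from matching ORFs against a fold library, so the matrix reflects detection sensitivity as well as true presence.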

Main findings

Folds can be assigned to ~25% of ORFs and ~20% of amino acids for the 20 genomes

>80% of SCOP folds are identified in at least one of the 20 organisms

Worm and E. coli have the most distinct folds

Level of gene duplication (2.4 folds in MG, 32 in worm) higher than observed based on sequence only

Top three most common folds: P-loop NTP hydrolase, the ferredoxin fold, TIM-barrel

Unique folds tend to be those involved in cell defense (e.g. toxins)

Common folds tend to be more “symmetrical”

Fold evolution

Insertion, deletion, substitution

α-helix & β-sheet substitution in Rossmann-fold like proteins

A path from all-β to all-α proteins

[Figure: segment order ..A..B..C..D.. compared with ..C..D..A..B..]

Circular Permutation (CP)

1nls (Concanavalin) 1led (Lectin)

Circular permutation example
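
At the level of segment order, a circular permutation is a rotation: ..A..B..C..D.. becomes ..C..D..A..B.. when the original termini are joined and new ones are opened elsewhere. The sketch below tests that relationship on ordered segment labels. It is a toy check on labels only, and the function name is hypothetical. Detecting circular permutation between real structures requires a structure alignment that tolerates the changed connectivity.

```python
from typing import Sequence


def is_circular_permutation(a: Sequence[str], b: Sequence[str]) -> bool:
    """Return True if segment order b is a rotation of segment order a."""
    a, b = list(a), list(b)
    if len(a) != len(b):
        return False
    if not a:
        return True
    doubled = a + a  # every rotation of a appears as a window of a + a
    n = len(a)
    return any(doubled[i:i + n] == b for i in range(n))


print(is_circular_permutation("ABCD", "CDAB"))
print(is_circular_permutation("ABCD", "ACBD"))
```

The second call represents a scrambled order rather than a rotation, so it is not a circular permutation under this definition.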

Strand invasion/withdrawal

Hairpin flips/swaps

Sickle-cell hemoglobin confers resistance to malaria

Hemoglobin & sickle cell anemia

Lethal legos as killer clumps

The inherited form of Lou Gehrig's disease--familial amyotrophic lateral sclerosis (FALS)--causes a decay of the motor neurons in the spinal cord and brain, a devastating loss of bodily control, and death within 2 to 5 years.

Elam et al. Nat. Str. Biol., 2003
