Diversity of Life: Introduction to Biological Classification
Classification: understanding the diversity and principles of protein structure and function
MCSG 2001 structures
Protein structure classification
Main reference: Robert B. Russell (2002) Classification of Protein Folds. Molecular Biotechnology 20:17-28.
Importance: central to studies of protein structure, function, and evolution
Philosophy: phyletic vs. phenetic
Method: structure comparison + human knowledge
Philosophy of classification
Phyletic: based on phylogenetic relationship
Phenetic: based on study of phenomena (phenomenological)
Classification Unit: Domain, a LEGO piece
Ranganathan
From domain to assembly
Domains are shuffled, duplicated and fused to make proteins
On average, a domain is 173 a.a. long, compared to 466 a.a. for a yeast protein
Most of the natural domain sequences assume one of a few thousand folds, of which ~1000 are already known
No satisfactory estimate exists yet for the number of macromolecular complexes
On average, a yeast complex may consist of 7.5 proteins
Sali et al. 2003
Distribution of Protein size
Swiss-prot
Structural vs. functional domain
Russian doll: a conceptual problem
Singh
Approaches
Hierarchical
Based on the types and arrangements of secondary structures
Unit (level): domain
Domain assignment
 - structural vs. functional (fold or function in isolation)
 - automated assignment methods (structure vs. sequence)
A. P. Singh
Assignment of Class
All-alpha or all-beta (could be subjective)
alpha/beta (beta-alpha-beta unit) or alpha+beta
Other classes
Class assignment could be subjective
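A minimal sketch of why class assignment can be subjective, assuming a per-residue secondary structure string is available ('H' for helix, 'E' for strand). The function name and the min_frac cutoff are hypothetical choices for illustration, not values taken from SCOP or CATH.

```python
def assign_class(ss, min_frac=0.15):
    """Assign a coarse structural class from a secondary structure string.

    'H' = helix, 'E' = strand, any other character = coil.
    min_frac is a hypothetical cutoff chosen for illustration only.
    """
    n = len(ss)
    if n == 0:
        raise ValueError("empty secondary structure string")
    helix = ss.count("H") / n
    strand = ss.count("E") / n
    if helix >= min_frac and strand < min_frac:
        return "all-alpha"
    if strand >= min_frac and helix < min_frac:
        return "all-beta"
    if helix >= min_frac and strand >= min_frac:
        # Separating a/b from a+b needs the arrangement of the SSEs,
        # which a composition count alone cannot provide.
        return "alpha/beta or alpha+beta"
    return "other"


# A borderline domain: 40% helix, 12% strand, 48% coil.
borderline = "H" * 40 + "E" * 12 + "-" * 48
print(assign_class(borderline))                 # all-alpha
print(assign_class(borderline, min_frac=0.10))  # alpha/beta or alpha+beta
```

The same domain changes class when the cutoff moves from 15% to 10%, which is the kind of judgment call a curator has to make.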
All-alpha structures
All-beta structures
Superoxide dismutase
Alpha/beta structures
Closed barrel Open twisted sheet
B-a-b motif
(barrel) (sheet)
a/b vs. a+b
Assignment of Fold
Defined by the number, type, and arrangement of SSEs
Connectivity (e.g. circular permutation, scrambled proteins)
Assignment of Superfamily
Homologous even in the absence of significant sequence similarity
 - certain level of structural similarity
 - unusual structural features
 - low but significant sequence similarity from structural alignment
 - key active site residues
 - sequence similarity bridges
Divergence vs. convergence
Divergent vs. convergent evolution
Divergent evolution: descent from a common ancestor; become variant due to mutation
Convergent evolution: no common ancestor; become similar due to functional or physical constraint
Anti-freeze protein: convergent evolution
crystal.biochem.queensu.ca
Homologous fold
Ranganathan
Analogous fold
Ranganathan
Scallop Myosin Regulatory Domain C chain
Aldehyde Oxidoreductase A chain
Analogous or homologous?
Assignment of Family
significant sequence similarity
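Sequence similarity at the family level is usually reported as percent identity over a pairwise alignment. The sketch below shows one way to compute it, assuming two pre-aligned strings with '-' for gaps. The toy sequences are made up, and the choice to count only ungapped columns is one of several conventions in use.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy alignment, invented for illustration only.
seq_a = "MKT-AYIAKQR"
seq_b = "MKSLAYLA-QR"
print(round(percent_identity(seq_a, seq_b), 1))  # 77.8
```

Dividing by the full alignment length or by the shorter sequence instead gives lower values, so a cutoff such as 25% or 30% identity only means something once the convention is stated.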
Classification databases
SCOP - careful assignment of evolutionary relationships; homologous vs. analogous
CATH - A: architecture
FSSP - a list of structural neighbors
CATH
Singh
Class: SSE composition & packing
Architecture: overall shape of domain, ignore SSE connectivity
Topology (Fold): consider connectivity
Homologous superfamily: a common ancestor
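The four CATH levels are written as a dotted numeric code, one number per level. The sketch below parses such a code and reports the deepest level two domains share. The helper names are hypothetical, the class labels follow CATH's usual numbering, and the codes are used only as illustrative inputs.

```python
from typing import NamedTuple

# Class labels for the first number of a CATH code.
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ["Class", "Architecture", "Topology", "Homologous superfamily"]


class CathCode(NamedTuple):
    klass: int         # C: SSE composition and packing
    architecture: int  # A: overall shape, connectivity ignored
    topology: int      # T: fold, connectivity considered
    superfamily: int   # H: common ancestor


def parse_cath(code):
    """Parse a four-part code such as '3.40.50.300'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: " + code)
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a, b):
    """Return the name of the deepest level two codes share, or None."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared


d1 = parse_cath("3.40.50.300")
d2 = parse_cath("3.40.50.720")
print(CLASS_NAMES[d1.klass])         # Alpha Beta
print(deepest_shared_level(d1, d2))  # Topology
```

Two domains that agree down to Topology but differ at the last number share a fold without being assigned a common ancestor, which is the homologous vs. analogous distinction.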
Classification databases
CATH: Class, Architecture, Topology, and Homologous superfamily, a hierarchical classification of protein domain structures
http://www.biochem.ucl.ac.uk/bsm/cath_new/
SCOP: Structural Classification Of Proteins: augmented manual classification
http://scop.mrc-lmb.cam.ac.uk/scop/
FSSP: Fold classification based on Structure-Structure alignment of Proteins
http://www2.ebi.ac.uk/dali/fssp/
Genome-scale structure analysis
Curr. Opin. Str. Biol., 2003
genome-scale structure annotation
Some statistics
 - 80% of sequence families belong to 400 folds (top 10 folds account for 40% of sequence families)
 - >60% of genes encode multi-domain proteins (80% for eukaryotes)
 - ~50,000 protein families and ~150,000 singletons
 - structural superfamilies ~1800 (+/-50) and ~10,000 unifolds
 - 50-60% of distant homologs (<25% seq. id.) can be recognized by profile-based sequence comparison methods (e.g. PSI-BLAST, HMM, etc.)
 - 50-60% of the enzymes in yeast and E. coli are common, and >80% of pathways are shared
superfolds, superfamilies, supersites
TIM barrel, Rossmann-like, ferredoxin-like, b-propellers, 4-helix bundle, Ig-like, b-jelly rolls, Oligonucleotide/oligosaccharide binding (OB) fold, SH3-like.
Structure -> function (only 50% correct)
Structure implicates function?
Assessing the Progress of Structural Genomics Projects
1 Nov. 2002, Science
Target Tracking by PDB (Sep 2002)
PDB content growth (May 2005)
Some statistics
11 SG consortia contributed 316 non-redundant PDB entries comprising 459 CATH and 393 SCOP domains.
14% of the targets have a homolog (>30% sequence identity) solved by another consortium
67% of SG domains in CATH are unique vs. 21% of non-SG domains.
19% and 11% contributed new superfamilies and new folds, respectively.
These structures allow new and reliable homology models for 9287 non-redundant gene sequences in 208 completely sequenced genomes.
PSI Structure Statistics 2002-2003
Unique structures (30% seq. ID): PSI 70%, PDB 10%
New folds: PSI 12%, PDB 3%
NIGMS Protein Structure Initiative
Average total cost per structure
PSI Pilot phase
 - 01: $650 K (7 centers)
 - 02: $400 K (9 centers)
 - 03: $240 K
 - 04: ?
 - 05: $100 K (goal)
PSI-2 Production phase
 - 06-10: $50 K (goal)
Comparison: ~$250-300 K
PSI Pilot Phase -- Lessons Learned
1. Structural genomics pipelines can be constructed and scaled-up
2. High throughput operation works for many proteins
3. Genomic approach works for structures
4. Bottlenecks remain for some proteins
5. A coordinated, 5-year target selection policy must be developed
6. Homology modeling methods need improvement
PSI-2 Production Phase (2005)
Interacting network for high throughput protein structure determination with three components
 - Large-scale centers for protein structure production of selected targets
 - Specialized centers for technology development leading to high throughput structure determination of difficult proteins
 - Specialized centers for protein structures relevant to disease (other NIH Institutes and Centers)
Included in NIH Structural Biology Roadmap plans
Computational structural genomics
Summary table
Fold occurrence matrix
Common Folds
Unique Folds
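A fold occurrence matrix can be sketched as one row per fold and one column per genome, with 1 where the fold is assigned in that genome. The genome names and fold assignments below are placeholders invented for illustration, not data from the study. Here "common" means present in every genome and "unique" means present in exactly one.

```python
# Hypothetical fold assignments; genome names and contents are placeholders.
genome_folds = {
    "genome_A": {"TIM barrel", "ferredoxin-like", "P-loop NTP hydrolase"},
    "genome_B": {"TIM barrel", "P-loop NTP hydrolase", "Ig-like"},
    "genome_C": {"TIM barrel", "P-loop NTP hydrolase", "OB fold"},
}


def occurrence_matrix(genome_folds):
    """Return (genomes, matrix) where matrix maps fold -> list of 0/1 flags."""
    genomes = sorted(genome_folds)
    folds = sorted(set().union(*genome_folds.values()))
    matrix = {
        fold: [int(fold in genome_folds[g]) for g in genomes]
        for fold in folds
    }
    return genomes, matrix


genomes, matrix = occurrence_matrix(genome_folds)

# Folds present in every genome vs. folds present in exactly one.
common = {fold for fold, row in matrix.items() if all(row)}
unique = {fold for fold, row in matrix.items() if sum(row) == 1}

print("columns:", genomes)
for fold, row in matrix.items():
    print(fold, row)
print("common:", sorted(common))
print("unique:", sorted(unique))
```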
Main findings
 - Folds can be assigned to ~25% of ORFs and ~20% of amino acids for the 20 genomes
 - >80% of SCOP folds are identified in at least one of the 20 organisms
 - Worm and E. coli have the most distinct folds
 - Level of gene duplication (2.4 folds in MG, 32 in worm) is higher than observed based on sequence only
 - Top three most common folds: P-loop NTP hydrolase, the ferredoxin fold, TIM-barrel
 - Unique folds tend to be those involved in cell defense (e.g. toxins)
 - Common folds tend to be more “symmetrical”
Fold evolution
Insertion, deletion, substitution
a-helix & b-sheet substitution in Rossmann-fold like proteins
A path from all-b to all-a proteins
..A..B..C..D.. ..C..D..A..B..
Circular Permutation (CP)
1nls (Concanavalin) 1led (Lectin)
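In the ..A..B..C..D.. notation, a circular permutation keeps the cyclic order of the segments and only moves the position of the termini. A minimal sketch of that test on segment labels follows. It works on exact label strings only; detecting circular permutation between real proteins needs a structure or sequence alignment that tolerates divergence. The function names are hypothetical.

```python
def is_circular_permutation(a, b):
    """True if segment order b is a rotation of segment order a."""
    # Every rotation of a appears as a substring of a + a.
    return len(a) == len(b) and b in a + a


def cut_point(a, b):
    """Index in a where the new N-terminus of b starts, or None."""
    if not is_circular_permutation(a, b):
        return None
    return (a + a).find(b)


print(is_circular_permutation("ABCD", "CDAB"))  # True
print(cut_point("ABCD", "CDAB"))                # 2
print(is_circular_permutation("ABCD", "ACBD"))  # False
```

The last case, where the segments are reordered rather than rotated, corresponds to a scrambled connectivity and not to a circular permutation.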
Circular permutation example
Strand invasion/withdrawal
Hairpin flips/swaps
Sickle-cell hemoglobin confers resistance to malaria
Hemoglobin & sickle cell anemia
Lethal legos as killer clumps
The inherited form of Lou Gehrig's disease--familial amyotrophic lateral sclerosis (FALS)--causes a decay of the motor neurons in the spinal cord and brain, a devastating loss of bodily control, and death within 2 to 5 years.
Elam et al. Nat. Str. Biol., 2003