Protein structure classification
description
Transcript of Protein structure classification
![Page 1: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/1.jpg)
CATH and SCOP
Topic 8Chapters 17 & 18, Gu and Bourne “ Structural Bioinformatics”
![Page 2: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/2.jpg)
SCOP vs. CATH
The basic classification unit is domain. -- A distinct structural unit and may fold as an independent, compact unit. -- Considered as the basic unit of protein folding, function and evolution. -- Domain partition, either manual or automatic, is not trivial.
CATH is semi-automatic SCOP is mainly manual (human expertise)
Similarity level
high
relationship
clear
lowclass
fold
superfamilyfamily
![Page 3: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/3.jpg)
Protein domains
Pyruvate kinase Elongation faction EF-Tu
![Page 4: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/4.jpg)
SCOP vs. CATH
“CATH – A hierarchic classification of protein structure domains” Oregngo et al., Structure, 5: 1093-1108, 1997.
![Page 5: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/5.jpg)
Protein Structure Classification
Why bother? Provides structural and evolutionary relationship Provides current fold space Assists protein structure prediction (details later)
Two popular protein classification databases: SCOP (Structural Classification Of Proteins ) http://scop.mrc-lmb.cam.ac.uk/scop/ Latest release: v1.75 (June 2009)
110,800 domains Murzin et al. J. Mol. Biol. 247, 536-540, 1995
CATH: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). http://www.cathdb.info/ Recent release: v3.5 (Sept 2011) 173,536 domains Orengo et al. Structure, 5, 1093-1108, 1997
![Page 6: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/6.jpg)
Hierarchical Structure Classification
“CATH – A hierarchic classification of protein structure domains” Oregngo et al., Structure, 5: 1093-1108, 1997.
![Page 7: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/7.jpg)
Hierarchical Structure Classification
SCOP• Class ()• Fold (TIM beta/alpha-barrel)• Superfamily (Triosephosphate isomerase)• Family (Triosephosphate isomerase)
CATH• Class ()• Architecture (-barrel)• Topology (TIM barrel)• Homologous Superfamily (Aldolase class 1)• Sequence Family (Isomerase)
![Page 8: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/8.jpg)
Hierarchical Structure Classification
SCOP C F S F
CATH C A T H S
Conservation:
2o str
uctu
re c
onte
nt
High
-leve
l str
uctu
re si
mila
rity
Typi
cally
ort
holo
gs
Low
er-le
vel s
truc
ture
sim
ilarit
y
![Page 9: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/9.jpg)
This seems trivial.Why are we wasting valuable class time discussing it?
• While a hierarchical description of protein structure is conceptually straightforward, as you will see, automating it is not. Moreover, the domain boundary problem is actually quite difficult.
• Also, this discussion is nice in the sense that it ties together a lot of different bioinformatics concepts into one unified effort. Some of these concepts are structural; however, many are not.
![Page 10: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/10.jpg)
Flavodoxin (toplogy = Rossman fold)
Domain 1 of -lactamase – Same architecture, but different topology
Topology vs. Architecture
Caution: Due to how secondary structures are interconnected, varying topologies can converge on the same overall architecture.
![Page 11: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/11.jpg)
An even trickier example
“CATH – A hierarchic classification of protein structure domains” Oregngo et al., Structure, 5: 1093-1108, 1997.
![Page 12: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/12.jpg)
![Page 13: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/13.jpg)
The CATH Classification Strategy
(1.) Close relatives are identified via sequence comparisons.
(2.) Sequence profiles and structure comparison protocols are used to detect more distant homologies.
(3.) Structures unclassified at this stage are then examined using both automatic and manual procedures to determine domain boundaries.
(4.) Unclassified domain structures are recomputed using the methods employed in steps 2 and 3.
(5.) Finally, structure(s) remaining unclassified are manually assigned to existing or new architectures within CATH.
![Page 14: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/14.jpg)
The CATH Classification Strategy
“CATH – A hierarchic classification of protein structure domains” Oregngo et al., Structure, 5: 1093-1108, 1997.
Automatic procedure“If a given domain has sufficiently high sequence and structural similarity (ie. 35% sequence identity, SSAP score >= 80) with a domain that has been previously classified in CATH, the classification is automatically inherited from the other domain”.
![Page 15: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/15.jpg)
CATH Classification-Domain Assignment
http://www.cathdb.info/
Since the classification is performed on individual domains, therefore the very first step is to assign domains (find domain boundaries)
Use both automatic and manual techniques
If it has high sequence identity (80%) and structural similarity (SSAP score >= 80) with a protein chain X that has been classified in CATH, use the boundaries of X.
Otherwise, apply several domain partition programs (CATHEDRAL, DETECTIVE (Swindells, 1995), PUU (Holm & Sander, 1994), DOMAK (Siddiqui and Barton, 1995).
Consensus assign automatically No consensus assign manually.
Domain Partition Problem
SSAP (Sequential Structure Alignment Program)
Structure Comparison Problem
![Page 16: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/16.jpg)
The CATH Hierarchy and Classification
http://www.cathdb.info/
Class, C-level Based on the secondary structure content of the domain There are four classes:
1. mainly-alpha, 2. mainly-beta,3. alpha-beta, (a combination of / and + in SCOP)4. low secondary structure content
![Page 17: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/17.jpg)
The CATH Hierarchy and Classification
http://www.cathdb.info/
Architecture, A-level This level describes the overall shape of the domain structure as determined by the orientations of the secondary structures but ignores the connectivity between the secondary structures.
It is assigned manually using a simple description of the secondary structure arrangement .
![Page 18: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/18.jpg)
The CATH Hierarchy and Classification
http://www.cathdb.info/
Topology (Fold family), T-level Members in the fold family share the same overall shape and connectivity of the secondary structures in the domain core.
Domains in the same fold group may have different structural decorations to the common core.
![Page 19: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/19.jpg)
The CATH Hierarchy and Classification
Topology (Fold family), T-level. 3.40.30.xx
![Page 20: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/20.jpg)
The CATH Hierarchy and Classification
Homologous Superfamily, H-level Protein domains in each H-group are thought to share a common ancestor and are homologous.
Similarities are identified either by high sequence identity or structure similarity using SSAP. Domains are classified in the same homologous superfamily if they satisfy one of the following criteria:
1. Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
2. SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to smaller.
3. SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which have related functions, which is informed by the literature and Pfam protein family database (Bateman et al., 2004).
4. Significant similarity from HMM-sequence searches and HMM-HMM comparisons using SAM (Hughey &Krogh, 1996), HMMER (http://hmmer.wustl.edu) and PRC (http://supfam.org/PRC).
![Page 21: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/21.jpg)
The CATH Hierarchy and Classification
Sequence Family Levels: (S,O,L,I,D) Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the following levels:
***The D-level is assigned as a counter within each S100 family to ensure that each domain in CATH has a unique CATHSOLID classification
http://www.cathdb.info/
![Page 22: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/22.jpg)
![Page 23: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/23.jpg)
Not completely manual: SCOP Workflow
Andreeva, et al, NAR, 2008
![Page 24: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/24.jpg)
SCOP: Structural Classification of Proteins
Family: Clear evolutionarily relationship (1) pairwise residue identities between the proteins are 30% and greater. (2) Proteins with low sequence similarity but very similar functions and structures; for
example, many globins have sequence identities of only 15%.
Superfamily: Probable common evolutionary origin Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexakinase together form a superfamily.
Fold: Major structural similarity (1) have same major secondary structures in same arrangement and with the same topological connections.
(2) Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
Class: secondary structure content and organization
Murzin et al. J. Mol. Biol. 247, 536-540, 1995
![Page 25: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/25.jpg)
![Page 26: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/26.jpg)
SCOP Parsable Files-very useful!!
![Page 27: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/27.jpg)
![Page 28: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/28.jpg)
![Page 29: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/29.jpg)
Common Folds
Immunoglobulin fold • All-β protein sandwich• Consists of 2 layers • ~7 antiparallel β-strands
arranged in two β-sheets• 28 SCOP superfamilies
Tim barrel fold • /β protein fold • Named after glycolytic
enzyme triosephosphate isomerase• Eight α-helices and eight
parallel β-strands• 33 SCOP superfamilies
Rossman fold • /β protein fold • Named after Michael
Rossman• Parallel β-strands
connected by -helices• 12 SCOP families
![Page 30: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/30.jpg)
Common Folds
“CATH – A hierarchic classification of protein structure domains” Oregngo et al., Structure, 5: 1093-1108, 1997.
![Page 31: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/31.jpg)
Immunoglobulin-like beta-sandwich
VL
CLVH
CH1
CH2
CH3
Antibody domains CuZn Superoxide Dismutase
![Page 32: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/32.jpg)
Red = Rossman fold domain within the enzyme alcohol dehydrogenase
![Page 33: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/33.jpg)
Rossman fold
TIM-Barrel fold
![Page 34: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/34.jpg)
TIM-barrel
33 different superfamilies that share the same fold
![Page 35: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/35.jpg)
Aldolase
Enolase
![Page 36: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/36.jpg)
TIM-barrel
![Page 37: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/37.jpg)
SC
OP
CAT
H
![Page 38: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/38.jpg)
“A systematic comparison of protein structure classifications: SCOP, CATH and FSSP” Caroline Hadley and David T Jones, Structure, 7(9): 1099-1112 , 1999
SCOP vs. CATH
The number of domains into which each chain is separated in S C O P and CATH is compared. For the most part, the two classification schemes agree on the number of domains per chain (5681 of 6875 chains is 82% ∼agreement). However, in the case of chains split into two domains in CATH, almost half are considered as only one domain within S C O P.
![Page 39: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/39.jpg)
SCOP vs. CATH
“A systematic comparison of protein structure classifications: SCOP, CATH and FSSP” Caroline Hadley and David T Jones, Structure, 7(9): 1099-1112 , 1999
SCOP: small proteinCATH: mainly
CATH ignores the presence of small β strands in the lysozyme superfamily and considers the protein mainly α
SCOP takes into account the functional and evolutionary importance of these strands, and classifies the lysozymes α/β.
![Page 40: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/40.jpg)
The Russian doll effect
The recurrence of common motifs within many of the superfolds and major architectures gives rise to an overlapof structures in these regions of fold space.
This means that it becomes harder to distinguish between structural families for these architectures and it is perhaps more appropriate to consider a continuum of protein folds.
This is particularly apparent in the layer-based sandwich architectures of the mainly β and α−β classes. For example, within the α−β three-layer doubly wound architectures, it is possible to generate a very large family of structures.
Each new structure added to a family will be related to the last by a simple extension of one or more βαβ motifs and the structures are then embedded within each other in a ‘Russian doll’ like effect.
“CATH – A hierarchic classification of protein structure domains” Oregngo et al., Structure, 5: 1093-1108, 1997.
![Page 41: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/41.jpg)
Meaning, the designations implied below aren’t so definitive
![Page 42: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/42.jpg)
Structural redundancy
(563)(423)
(283)
AN ASIDE: Commonly, SCOP/CATH classifications are used to remove structural redundancy from a dataset. For example, the plots are above are from a paper that my lab published characterizing a catalytic site prediction algorithm.
![Page 43: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/43.jpg)
Other ways of classifying proteins: EC #
The EC (enzyme classification) system creates a controlled vocabulary vis-à-vis enzyme function.
Ribose-phosphate diphosphokinase
![Page 44: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/44.jpg)
G6P Phosphatase vs. Hexokinase
Glucose-6-Phosphate PhosphataseEC 3.1.3.9
EC 2.7.1.1
![Page 45: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/45.jpg)
Other ways of classifying proteins: EC #
![Page 46: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/46.jpg)
From a former QE question
Given the following information about two proteins, Protein A and B, what can you tell about the structural and functional similarity/dissimilarity of the two proteins? Comment on the evolution of the two proteins.
EC Number CATH Classification
Protein A 3.4.21.1 2.40.10.10
Protein B 3.4.21.14 3.40.50.200
![Page 47: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/47.jpg)
From Wikipedia: Gene ontology, or GO, is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to:
1. Maintain and develop its controlled vocabulary of gene and gene product attributes;
2. Annotate genes and gene products, and assimilate and disseminate annotation data;
3. Provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis.
GO is part of a larger classification effort, the Open Biomedical Ontologies (OBO).
Other ways of classifying proteins: GO
![Page 48: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/48.jpg)
Other ways of classifying proteins: GO
![Page 49: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/49.jpg)
![Page 50: Protein structure classification](https://reader034.fdocuments.us/reader034/viewer/2022051317/5681672c550346895ddbce0a/html5/thumbnails/50.jpg)
Other ways of classifying proteins: KEGG
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies