Post on 27-Mar-2015
Genome Annotation of Protein Function using Structural Data:
Catalytic Residue Information
Janet Thornton
European Bioinformatics Institute
ISMB/ECCB 2004
Glasgow
From Structure to Functional Annotation
From Structure To Biochemical Function
Gene → Protein → 3D Structure → Function
Given a protein structure:
- Where is the functional site?
- What is the multimeric state of the protein?
- Which ligands bind to the protein?
- What is the biochemical function?
Automated Structure Comparison
The most powerful method for assigning function from structure is global or partial 3D structure comparison (e.g. Dali, SSAP, SSM)
Hidden Markov Models derived from structural domains can often recognise distant relatives from sequence
Predicting Binding Site
Binding-site analysis: cutA
Most likely binding site
Surface clefts
Residue conservation
Conserved surface patches
Identifying Binding Site Function Using Motifs
- 3D enzyme active site structural motifs (Craig Porter)
- Catalytic Site Atlas - Identification of catalytic residues (Gail Bartlett, Alex Gutteridge)
- Metal binding sites (Malcolm MacArthur)
- Binding site features (Gareth Stockwell)
- Automatically generated templates of ligand-binding and
- DNA binding motifs (Sue Jones, Hugh Shanahan)
- “Reverse” templates (Roman Laskowski)
JESS – fast template search algorithm (Jonathan Barker)
Using information on Catalytic Residues derived from Structures
Catalytic Site Atlas
Using info for annotation of enzymes in genomes
3D Templates
The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data.
Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton Nucl. Acids. Res. 2004 32: D129-D133.
http://www.ebi.ac.uk/thornton-srv/databases/CSA
Catalytic Site Information
Enzyme reports from primary literature information
β-lactamase Class A
EC: 3.5.2.6
PDB: 1btl
Reaction: β-lactam + H2O → β-amino acid
Active site residues: S70, K73, S130, E166
Plausible mechanism:
[Figure: reaction mechanism scheme drawn with the Ser, Lys, Ser and Glu active-site residues]
Annotates catalytic residues in the PDB
Based on a dataset of 514 enzyme families
Representative catalytic site for each family
Homologues assigned by PSI-BLAST
Limited substitution allowed
Homologues updated monthly
Literature references
Data also available via MSDsite
http://www.ebi.ac.uk/thornton-srv/databases/CSA
http://www.ebi.ac.uk/msd-srv/msdsite
CSA Coverage
512 Representative Sites
9075 PDB Files
20001 Catalytic Sites
Class                           In CSA / In PDB
E.C. 1.-.-.- Oxidoreductases    194 / 271
E.C. 2.-.-.- Transferases       151 / 280
E.C. 3.-.-.- Hydrolases         221 / 421
E.C. 4.-.-.- Lyases              96 / 122
E.C. 5.-.-.- Isomerases          44 / 63
E.C. 6.-.-.- Ligases             33 / 58
Total                           739 / 1215
(Current 512 Enzyme Dataset)
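The coverage figures on this slide can be turned into percentages with a few lines of Python. This is only a sketch: the counts are copied from the slide, and the names `COVERAGE`, `percent_covered` and `totals` are hypothetical helpers introduced here for illustration.

```python
# (in CSA, in PDB) counts per top-level E.C. class, copied from the slide.
COVERAGE = {
    "1 Oxidoreductases": (194, 271),
    "2 Transferases": (151, 280),
    "3 Hydrolases": (221, 421),
    "4 Lyases": (96, 122),
    "5 Isomerases": (44, 63),
    "6 Ligases": (33, 58),
}


def percent_covered(in_csa, in_pdb):
    """Percentage of the PDB count that is also present in the CSA."""
    return 100.0 * in_csa / in_pdb


def totals(table):
    """Sum the CSA and PDB columns over all classes."""
    total_csa = sum(in_csa for in_csa, _ in table.values())
    total_pdb = sum(in_pdb for _, in_pdb in table.values())
    return total_csa, total_pdb


if __name__ == "__main__":
    for name, (in_csa, in_pdb) in COVERAGE.items():
        print(f"E.C. {name}: {percent_covered(in_csa, in_pdb):.1f}%")
    total_csa, total_pdb = totals(COVERAGE)
    print(f"Total: {total_csa} / {total_pdb}")
```

The column sums reproduce the slide's total of 739 / 1215, which works out to roughly 61% overall.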
Metal Site Atlas
Annotates Metal Sites in PDB
Similar to CSA database
Searchable by: PDB code, Swiss-Prot code, Homologues.
Dataset includes: Copper, Zinc, Calcium, Iron (excl. hemes),
Cobalt, Magnesium, Manganese, Molybdenum, Nickel and Tungsten.
Metal Site Atlas Contents
Templates: 46 Cu, 195 Zn, 270 Ca, 83 Fe, 6 Co, 86 Mg, 45 Mn, 10 Mo, 7 Ni, 4 W
752 Total Templates
Sites in MSA: 6301 PDB Files, 25374 Metal Binding Sites
Comparison of CSA v1.0 with Swiss-Prot and PDB Site Annotations
CSA v1.0 - Literature
EC Wheels
CSA v1.0 – plus homologues
iCSA: Using Functional Residue Conservation to Improve Function Annotation
Starting with over 500 enzymes from the CSA, with EC numbers and high-quality catalytic site information
Retrieve homologues from Biopendium™
Align homologues with query enzyme, using:
- PSI-BLAST profiles
- CLUSTAL W multiple alignments
- Smith and Waterman pairwise alignments
Check for conservation of catalytic residues
If all residues are conserved, assign EC from annotated enzyme to homologue
Also deals with mutation, etc. if necessary
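The conservation check and EC transfer steps above can be sketched in Python as follows. This is a minimal illustration, not the iCSA implementation: it assumes a pairwise alignment is already available as two equal-length gapped strings, it requires strict identity at every catalytic position (the slides mention that limited substitution and mutations are handled, which is not modelled here), and the function names are hypothetical.

```python
def residues_conserved(query_aln, homolog_aln, catalytic_positions):
    """Return True if every catalytic residue of the query is aligned to an
    identical residue in the homologue.

    query_aln, homolog_aln: aligned sequences of equal length, '-' marks a gap.
    catalytic_positions: 1-based residue numbers in the ungapped query.
    """
    if len(query_aln) != len(homolog_aln):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(catalytic_positions)
    seen = set()
    query_pos = 0
    for q, h in zip(query_aln, homolog_aln):
        if q == "-":
            continue  # insertion in the homologue, no query residue here
        query_pos += 1
        if query_pos in wanted:
            seen.add(query_pos)
            if h != q:  # a gap or a different residue breaks conservation
                return False
    # Every requested position must actually exist in the query.
    return seen == wanted


def transfer_ec(query_ec, query_aln, homolog_aln, catalytic_positions):
    """Assign the query EC to the homologue only if the site is conserved."""
    if residues_conserved(query_aln, homolog_aln, catalytic_positions):
        return query_ec
    return None
```

With a toy alignment, `transfer_ec("3.5.2.6", "ACDS-K", "ACDSGK", [4, 5])` returns the EC number, while replacing the aligned serine in the homologue returns `None`.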
Testing the iCSA Method
Searches with 517 CSA sites retrieved over 30700 Swiss-Prot sequences within four iterations of PSI-BLAST
These were assigned three-digit EC numbers using the iCSA method
The assigned EC numbers were then compared with the EC annotation given in the Swiss-Prot database
The accuracy of EC assignment was compared with the accuracy achieved using sequence homology (i.e. PSI-BLAST)
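A sketch of how such a test could be scored at the three-digit EC level. It assumes accuracy is the number of correct ECs assigned divided by the number of ECs assigned, and coverage is the number of correct ECs assigned divided by the number of homologues whose reference EC matches the query, which is how the chart legends in this deck appear to define them. The record layout and function names are hypothetical.

```python
def ec_prefix(ec, digits=3):
    """Truncate an EC number such as '3.4.21.4' to its first `digits` fields."""
    return ".".join(ec.split(".")[:digits])


def evaluate(records, digits=3):
    """Score EC transfer.

    records: iterable of (query_ec, reference_ec, passed_filter) tuples, where
    passed_filter says whether the homologue was assigned the query EC.
    Returns (accuracy, coverage) as fractions, or None where undefined.
    """
    assigned = correct = with_correct_ec = 0
    for query_ec, reference_ec, passed_filter in records:
        match = ec_prefix(query_ec, digits) == ec_prefix(reference_ec, digits)
        if match:
            with_correct_ec += 1
        if passed_filter:
            assigned += 1
            if match:
                correct += 1
    accuracy = correct / assigned if assigned else None
    coverage = correct / with_correct_ec if with_correct_ec else None
    return accuracy, coverage
```

Setting `passed_filter` to `True` for every record gives the homology-only baseline, since every retrieved homologue then inherits the query EC.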
[Figure: workflow diagram. Labels: CSA query enzyme; Homology search; Swiss-Prot homologues; Function assignment by homology; iCSA filter; iCSA filtered homologues; Function assignment using CSA]
EC Assignment Accuracy
[Figure: bar chart. x-axis: PSI-BLAST iteration (1, 2, 3, 4, all); y-axis: % ECs assigned correctly (0 to 100). Legend: Sequence homology; Described method; CSA]
Accuracy = Correct EC assigned / An EC assigned
Improvement in EC Assignment Accuracy, Compared with Homology Alone
[Figure: bar chart. x-axis: PSI-BLAST iteration (1, 2, 3, 4); y-axis: % improvement (0 to 100). Annotation: 48% overall]
% improvement = (Accuracy_iCSA - Accuracy_Homology) / Accuracy_Homology
iCSA vs. Sequence Homology Alone
The accuracy of EC assignment is improved by using iCSA
The improvement in accuracy is more pronounced with more distant homologues: from 7% at iteration 1 to 88% at iteration 4
Overall, EC assignment accuracy is improved by 48%
Overall, EC assignment accuracy using iCSA is 86% (vs. 58% using sequence homology alone)
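The 48% figure is consistent with a relative improvement over the homology-only accuracy. A short worked check in Python, assuming improvement is defined as the accuracy difference divided by the baseline accuracy (the function name is hypothetical):

```python
def relative_improvement(accuracy_method, accuracy_baseline):
    """Percentage improvement of a method over a baseline accuracy."""
    return 100.0 * (accuracy_method - accuracy_baseline) / accuracy_baseline


if __name__ == "__main__":
    # Overall accuracies quoted on the slide: 86% (iCSA) vs. 58% (homology).
    print(f"{relative_improvement(86.0, 58.0):.0f}%")  # prints "48%"
```

Note that this is a relative gain: the absolute difference between the two accuracies is 28 percentage points.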
iCSA EC Coverage
[Figure: bar chart. x-axis: PSI-BLAST iteration (1, 2, 3, 4, all); y-axis: % coverage (0 to 100)]
Coverage = Correct EC assigned / Homologues with correct EC
iCSA vs. Sequence Homology Alone
iCSA coverage is 78% overall
The iCSA is right to reject many of these homologues even though they have the same EC as the CSA site used as the query:
- EC covered by more than one specific catalytic site
- Incorrect EC assignment in Swiss-Prot
But misaligned sequences are also possible, especially with more distant homologues
iCSA Correctly Rejects Homologues
The iCSA accuracy with the CSA trypsin site is 100%
The benefits of the iCSA method can be seen in the homologues not assigned the trypsin EC
Trypsin homologues that do not pass the catalytic residue checks in iCSA include several haptoglobin proteins
Haptoglobin is closely related to trypsin, but is a known non-enzyme
Sequence homology alone would assign these haptoglobin sequences the trypsin EC, but iCSA can correctly identify that the residues for catalysis are not present
Human Genome Annotation
We applied iCSA to the human ENSEMBL sequence database
The iCSA directly annotated 2064 sequences with an EC
Only 64% of these have an equivalent Swiss-Prot protein (at least 90% pairwise sequence identity and a difference in length of less than 10% of the shorter sequence)
So 743 sequence annotations have been efficiently expanded
A further 2257 homologues did not have a conserved site and an EC was not assigned
73% of the equivalent Swiss-Prot sequences had an alternative EC number to the iCSA query
Homology-based functional assignments in these cases could prove incorrect
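The equivalence criterion quoted on this slide (at least 90% pairwise identity, length difference under 10% of the shorter sequence) can be written as a small predicate. This is a sketch only: the function name and parameter names are hypothetical, and how the pairwise identity itself was computed is not stated in the slides.

```python
def is_equivalent(percent_identity, len_a, len_b,
                  min_identity=90.0, max_len_diff_fraction=0.10):
    """True if two sequences count as equivalent under the slide's criterion."""
    shorter = min(len_a, len_b)
    length_ok = abs(len_a - len_b) < max_len_diff_fraction * shorter
    return percent_identity >= min_identity and length_ok


if __name__ == "__main__":
    annotated = 2064
    percent_with_equivalent = 64
    # Sequences annotated by iCSA that have no equivalent Swiss-Prot protein.
    print(round(annotated * (100 - percent_with_equivalent) / 100))  # prints 743
```

The arithmetic matches the slide: 36% of 2064 annotated sequences is about 743.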
Summary
iCSA methodology developed
Database currently contains:
- 7013 PDBs (11710 chains)
- 18033 Swiss-Prot sequences
- 4321 Human ENSEMBL sequences
- 4227 Mouse ENSEMBL sequences
Poster E-37, Session 1 (Sunday)
3D Templates to Characterise Functional Sites
Template searches
(189 enzyme active site templates)
(~600 Metal binding site templates)
GARTfase
Cholesterol oxidase
IIAglc histidine kinase
Carbamoylsarcosine amidohydrolase
Dihydrofolate reductase
Ser-His-Asp catalytic triad
…
Database of enzyme active site templates
189 templates
MCSG structure
BioH – unknown function, involved in biotin synthesis in E. coli
An example
Structure: Rossmann fold, hence many structural homologues
Expected to be an enzyme
Sequence contains two Gly-X-Ser-X-Gly motifs typical of acyltransferases and thioesterases
CSA template search: one very strong hit
Ser-His-Asp catalytic triad of the lipases with rmsd=0.28Å (template cut-off is 1.2Å)
Experimentally confirmed by hydrolase assays
Novel carboxylesterase acting on short acyl chain substrates
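Scanning a sequence for the Gly-X-Ser-X-Gly motif mentioned in this example can be sketched with a regular expression. The function name and the test sequence below are made up for illustration and are not the BioH sequence; a lookahead is used so that overlapping occurrences are also reported.

```python
import re

# Gly, any residue, Ser, any residue, Gly. The lookahead allows overlaps.
GXSXG = re.compile(r"(?=(G.S.G))")


def find_gxsxg(sequence):
    """Return (1-based start position, matched pentapeptide) for each motif."""
    return [(m.start() + 1, m.group(1)) for m in GXSXG.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MAGHSLGAAAGFSMGK"  # hypothetical sequence, not BioH
    for position, motif in find_gxsxg(toy_sequence):
        print(position, motif)
```

A sequence motif like this is only suggestive; in the example above the assignment rests on the 3D template hit and the experimental assay.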
Generation of 3D Active Site Templates for Enzymes in the Catalytic Site Atlas
Gail J Bartlett*, James W Torrance, Craig T Porter, Jonathan A Barker, Alex Gutteridge, Malcolm W MacArthur, Janet M Thornton
EMBL Outstation - European Bioinformatics Institute (EBI), Hinxton, Cambridge CB10 1SD, UK
* Centre For Bioinformatics, Biochemistry Building, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
1. Introduction
Structural templates can be used to search protein structures for particular patterns of residues, such as catalytic sites, which makes them a tool for predicting protein function. Many methods employ structural templates, but there are no reliable template libraries. The Catalytic Site Atlas [1] is a database of catalytic residues within proteins of known structure, and this information can be used to create a template library. We hope to use this library to uncover cases of convergent evolution and to predict function from structure.
2. Objectives
• To use the Catalytic Site Atlas to create a library of structural templates representing catalytic sites
• To assess the effectiveness of these templates for identifying proteins with a particular catalytic function
4. Results
• No correlation was found between the RMSD of template atoms and percentage pairwise sequence identity within homologous enzyme families
• The majority of RMSD values between templates from homologous family members were below 1Å
• Templates distinguish related enzymes well in most families, with > 75% of relatives having RMSDs better than that of any random match.
• Some families showed wide variation of catalytic residue geometry, making prediction difficult.
• Templates based on Cα / Cβ atoms performed slightly better than those which used functional atoms.
3. Methods
Template generation and analysis of active site geometry
Two types of template were created (atoms used are highlighted in ball form):
Templates within the same homologous enzyme family were superposed and the distribution of RMSDs examined.
Assessing template effectiveness
The Jess template-matching method [2] was used to query all the templates against a non-redundant subset of the PDB. Hits were scored using both RMSD and a statistical significance measure. The effectiveness of the templates was measured by comparing the scores of hits between relatives with the scores of random hits identified in the PDB.
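One way to express the comparison of relative hits against random hits is the fraction of relatives whose RMSD is better (lower) than the best random match. The sketch below assumes the RMSD values have already been produced by a template search; the function name and the example values are hypothetical, and the statistical significance measure mentioned above is not modelled.

```python
def fraction_better_than_random(relative_rmsds, random_rmsds):
    """Fraction of relatives with an RMSD lower than every random hit."""
    if not relative_rmsds:
        raise ValueError("need at least one relative hit")
    if not random_rmsds:
        return 1.0  # nothing to be confused with
    best_random = min(random_rmsds)
    better = sum(1 for rmsd in relative_rmsds if rmsd < best_random)
    return better / len(relative_rmsds)


if __name__ == "__main__":
    relatives = [0.2, 0.3, 0.5, 1.4]   # made-up RMSDs in Å
    random_hits = [1.1, 1.6, 2.0]      # made-up RMSDs in Å
    print(fraction_better_than_random(relatives, random_hits))  # prints 0.75
```

A value close to 1.0 corresponds to a clear separation between relatives and random hits, while a low value signals a template that is hard to use for prediction.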
Cα and Cβ atoms
Three “functional” atoms
6. A “bad” template - fructose 1,6-bisphosphatase
It is difficult to construct a sensitive template for fructose 1,6-bisphosphatase because one catalytic residue is on a flexible loop that moves when AMP binds at an allosteric site.
5. A “good” template - aldolase A
Aldolase A relatives superpose well (below right) and there is a clear separation between these and random hits to PDB (below left).
[Figure: superposition of homologous family templates]
[Figure: fructose 1,6-bisphosphatase structures of open form and closed form. Labels: catalytic residues; flexible loop; AMP; loop closed]
[Figure: distribution of RMSDs of hits to aldolase template (based on PDB 1ald)]
Poster Number I76 - Monday
Template databases
HAND CURATED
Enzyme active sites (PROCAT) – 189 templates
Currently being extended
Metal-binding sites – 600 templates
AUTOMATED
Ligand-binding sites – 10,000 templates
DNA-binding sites – 800 templates
ProFunc – function from 3D structure
Homologous sequences of known function
Binding site identification and analysis
Homologous structures of known function
Functional sequence motifs, e.g. Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]
Enzyme active site 3D-templates
HTH-motifs Electrostatics Surface comparison
… etc
DNA-, ligand- binding and “reverse” templates
Residue conservation analysis
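The functional sequence motif shown on this slide is written in PROSITE pattern syntax. A minimal sketch of converting such a pattern to a Python regular expression is given below. It covers only the common elements (single residues, `x`, `[...]`, `{...}`, repeat counts and the `<` / `>` anchors), the function name is hypothetical, and it is not how ProFunc itself scans motifs.

```python
import re

# One PROSITE element: a residue, x, [allowed] or {forbidden}, optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        core = element[1:] if at_start else element
        core = core[:-1] if at_end else core
        match = ELEMENT.fullmatch(core)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        residue, low, high = match.groups()
        if residue in ("x", "X"):
            token = "."                          # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"   # forbidden residues
        else:
            token = residue                      # literal residue or [allowed set]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + token + ("$" if at_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]")
    print(regex)
    print(re.search(regex, "AAQAAAGACYAASAA"))  # hypothetical test sequence
```

For the pattern on the slide this yields `Q.{3}[GE].C[YW].{2}[STAGC]`, which can then be applied with `re.search` or `re.finditer`.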
Acknowledgements
CSA: Craig Porter, Gail Bartlett, Alex Gutteridge, Malcolm MacArthur (EBI), Neera Borkakoti
Genome Annotation: Ruth Spriggs, Richard George, Mark Swindells, B. Al-Lazikhani (Inpharmatica)
ProFunc: Roman Laskowski; James Watson (EBI)