David Jones AFP/CAFA2011
Combining large-scale evolutionary analyses with multiple biological data sources to predict human protein function
David Jones UCL Depts. of Computer Science and Structural and Molecular Biology
Background
[Figure: annotation coverage of human proteins for the three GO aspects: MF, BP, CC]
• In UniProt, 30% of human proteins still have no functional annotations at all
• … and only 0.5% have completely specific ones for all aspects
Main approaches for function annotation
• Annotation transfers by homology, e.g. BLAST, HMMER
  – Only applicable to a subset of the data
  – Has reached a plateau in terms of novel function annotation but provides highest quality information
• Model-classifier based using sequence features
  – Limited to common and broad functions for which there are many examples
FFPRED - Function Prediction Pipeline
[Pipeline diagram: novel amino acid sequence → characteristics (aa, transmem, structure, disorder, motifs, localisation) → classification (GO term SVM) → posterior probability estimate]
Going further – computing gene function from multiple data sources
• FFPRED is a currently available server for human (and vertebrate) proteins
• It works well but is limited to predicting only the functional classes that it was trained to recognize
• Extending the library requires time-consuming training of new SVM models
• It also cannot be applied to rare functional classes due to limited training sets
Desirable features of a new approach
• Able to annotate all sequences
• Able to predict rare functions
• Able to offer something more than simple homology-based approaches
• Amenable to easy and quick updating
FunctionSpace Data Sources for H. sapiens
• Sequence similarity
• Signal peptides and other local features
• Predicted secondary structure
• Transmembrane segments
• Predicted disordered regions
• Domain architecture patterns
• Gene fusion information
• Gene co-expression
• Protein-protein interactions
For each sequence 49,231 features were derived
Aim
To estimate the functional similarity (a.k.a. semantic distance) between two human proteins from their sequence features plus available high-throughput data.
[Diagram: Protein A and Protein B → Functional Similarity Score]
Large-scale (domain-based) evolutionary features
• Patterns of domain occurrence can provide valuable functional clues
• “Deeper” homology detection allows greater coverage
• We make use of our in-house fold/domain recognition method and several public domain libraries
pDomTHREADER Domain Coverage
[Bar chart: number of domain annotations, public domain libraries vs threading; axis from 0 to 7,000,000]
• 37.56% increase in domain annotations across 5.5M sequences
• ~1.7 million novel domain assignments over public domain data
CATH Domain annotations
[Charts: coverage by residues and by sequences; values shown are Gene3D 59.4% and 35.7%, threading 64.8% and 81.6%]
Computational Practicalities
[Workflow diagram: 5.5M query sequences → PSI-BLAST against the sequence database (5.5M seqs) on Legion nodes → find matches & generate alignments → store & post-process. Other labels: 2Gb; 1 min – 3 hours]
“Embarrassingly parallel” application: one sequence = one job.
Ideal capacity-filling task for a modern supercomputer like Legion.
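The "one sequence = one job" pattern can be sketched as below. This is only an illustration under assumptions: `search_one` is a hypothetical stand-in for a real per-sequence search, and a local thread pool replaces the cluster scheduler that a machine like Legion would actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def search_one(seq_id, sequence):
    """Hypothetical stand-in for one independent job (e.g. one search per query)."""
    # A real job would launch the search and parse its output; here we
    # just return the sequence length so the sketch stays self-contained.
    return seq_id, len(sequence)


def run_all(queries, workers=4):
    """Run one job per sequence. Jobs share no state, so they can run in any order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(search_one, sid, seq) for sid, seq in queries.items()]
        return dict(f.result() for f in futures)


results = run_all({"seqA": "MKTAYIAK", "seqB": "MSLLTEVETPIR"})
```

Because no job depends on another, the same structure maps directly onto a batch queue with one submission per sequence.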
Gene Fusion Events can Predict Protein-Protein Interactions from Sequence Data
[Diagram, labels H1 and H2: Mycobacterium avium, Mycobacterium tuberculosis and Mycobacterium paratuberculosis. CATH domains 3.90.850.10 (fumarylacetoacetase) and 3.60.15.10 (beta-lactamase) are shown separately and as a fused pair. Hydrolase activity: hydrolysis of C-N bonds, hydrolysis of C-C bonds. Bi-functional enzyme.]
A Novel Gene Fusion Discovered using CATH domain fusion analysis
[Diagram: Saccharopolyspora erythraea and Syntrophomonas wolfei. CATH domains 3.40.120.10 (Alpha-D-Glucose-1,6-Bisphosphate) and 3.40.50.300 (P-loop nucleotide triphosphate hydrolases). Functions shown: oxidative stress, D-glucose metabolism, DNA repair. Proteins shown: transcription coupling repair factor, DNA repair (RAD50), phosphoglyceromutase.]
Novel Gene Fusion Discovery
[Diagram: domain architectures 3.40.50.300 + 3.40.50.300, 3.40.50.300 + 3.40.120.10, and 3.40.120.10 in Saccharopolyspora erythraea and Syntrophomonas wolfei]
Novel annotations
• Rice PGM1 gene annotated as GO:0006950 response to stress
• PGM3 has relationship with DNA repair sequence
Kanazawa K, Ashida H (1991) Relationship between oxidative stress and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
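The basic idea behind domain fusion analysis can be sketched as follows: look for a pair of domains that sit on separate proteins in one genome but on a single protein in another. This is a simplified illustration, not the analysis used in the talk; the genome and protein names are hypothetical, and only the CATH codes come from the slides.

```python
from itertools import combinations


def find_fusions(genomes):
    """genomes maps genome -> {protein: set of domain codes}.

    Returns {domain pair: (genomes where fused, genomes where split)} for
    pairs seen both fused on one protein and split across proteins.
    """
    fused, split = {}, {}
    for genome, proteins in genomes.items():
        # Pairs found together on at least one protein in this genome.
        together = set()
        for domains in proteins.values():
            together.update(frozenset(p) for p in combinations(sorted(domains), 2))
        # All pairs present anywhere in this genome.
        all_domains = set().union(*proteins.values())
        for pair in (frozenset(p) for p in combinations(sorted(all_domains), 2)):
            target = fused if pair in together else split
            target.setdefault(pair, set()).add(genome)
    return {pair: (fused[pair], split[pair]) for pair in fused if pair in split}


toy_genomes = {  # hypothetical genomes and proteins
    "genome_X": {"protA": {"3.40.50.300", "3.40.120.10"}},
    "genome_Y": {"protB": {"3.40.50.300"}, "protC": {"3.40.120.10"}},
}
fusions = find_fusions(toy_genomes)
```

A fused pair found this way suggests that the two separate proteins in the other genome may be functionally linked.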
Domain based features
Score complexes
Score architectures
7960 features 11210 features
Fusion scoring
Each domain is a feature; the score has 2 components:
1. Prediction quality (logistic transform of feature)
2. Promiscuity weight related to the number of times the sequence occurs as part of a fused product: $w_i = \log(\mathit{fus}_i)$
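A minimal sketch of the two-component fusion score follows. The slide does not say how the components are combined, so the product used here is an assumption, as are the logistic parameters `slope` and `midpoint`.

```python
import math


def logistic(x, slope=1.0, midpoint=0.0):
    """Logistic transform mapping a raw feature value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def fusion_feature_score(raw_score, fusion_count):
    """Prediction quality times promiscuity weight w_i = log(fus_i).

    Multiplying the two components is an assumption, not taken from the slides.
    """
    if fusion_count < 1:
        raise ValueError("fusion_count must be at least 1")
    quality = logistic(raw_score)      # component 1: prediction quality
    weight = math.log(fusion_count)    # component 2: promiscuity weight
    return quality * weight
```

Note that with this weight a domain seen in only one fused product contributes zero, since log(1) = 0.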
Integration of “External” Features: Microarray Expression Data
[Plot: normalised microarray datasets. Probe signal (log2), axis 0 to 14, for Gene A and Gene B across 16 experiments (conditions); profiles compared by Pearson correlation (R)]
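For reference, the Pearson correlation between two expression profiles can be computed as in the sketch below. The function name and the toy profiles in the usage note are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` is 1 (perfectly co-expressed toy profiles) and `pearson_r([1, 2, 3], [6, 4, 2])` is -1.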
Biclustering Microarray Expression Data
[Figure: two example biclusters: zinc binding sequences (global correlation 0.42) and a set of transcription factors (global correlation 0.48)]
23912 features generated from biclustering of 2346 publicly available microarrays (81 experiments) using BIMAX algorithm
FunctionSpace: Two-stage Integration of Data
[Diagram: feature vectors for Protein A and Protein B feed 11 first-stage SVMs (SVMsw, SVMss, SVMdis, SVMloc, SVMge, SVMppi, SVMtm, SVMdpc, SVMdpp, SVMgfc, SVMgfp), followed by a second-stage SVM (SVMfsc) that gives the Functional Similarity Score]
A 3-D Projection of Annotated Human Proteins
• 49,231 dimensions first reduced to 11 dimensions by SVM regression with 11 different groups of features
• Each protein is here represented as a point in this derived 11-D feature space projected into 3-D
• Colouring is according to functional similarity which shows that proteins with similar functions (warmer colours) cluster strongly in this space
• 75% of nearest neighbour pairs share common GO terms
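A statistic like the nearest-neighbour figure above could be computed along the following lines. This is a sketch under assumptions: the points are toy 2-D coordinates rather than the 11-D space, the protein names are hypothetical, and it counts per protein, which may differ from how the slide counts pairs.

```python
import math


def nearest_neighbour(points, name):
    """Name of the closest other protein by Euclidean distance."""
    others = (o for o in points if o != name)
    return min(others, key=lambda o: math.dist(points[name], points[o]))


def fraction_sharing_terms(points, go_terms):
    """Fraction of proteins sharing at least one GO term with their nearest neighbour."""
    shared = sum(
        1 for name in points
        if go_terms[name] & go_terms[nearest_neighbour(points, name)]
    )
    return shared / len(points)


toy_points = {"P1": (0.0, 0.0), "P2": (0.1, 0.0), "P3": (5.0, 5.0), "P4": (5.2, 5.0)}
toy_terms = {
    "P1": {"GO:0003677"},
    "P2": {"GO:0003677", "GO:0005515"},
    "P3": {"GO:0016787"},
    "P4": {"GO:0005515"},
}
fraction = fraction_sharing_terms(toy_points, toy_terms)
```

In the toy data P1 and P2 are mutual neighbours sharing a term, while P3 and P4 are neighbours with no term in common.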
Individual Feature Contributions
[Chart: Matthews Correlation Coefficient for the individual feature groups]
Each sequence is classed “Easy”, “Medium” or “Hard” depending on degree of homology to functionally annotated proteins in UNIPROT.
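For readers unfamiliar with the Matthews correlation coefficient used to score the feature contributions, a minimal sketch of the standard formula is given here. Returning 0 when the denominator vanishes is a common convention and an assumption of this sketch.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention used when any marginal total is zero
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).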
Function Annotation Results for 20674 Unannotated IPI Human Sequences
Preliminary Results
In 2009 FunctionSpace produced GO term predictions for 19678 IPI uncharacterized human sequences. 2746 have been annotated since.
[Charts for MF and BP: predictions ranging from less specific to more specific]
Measure                   MF      BP
% Exact matches           16%     9%
Mean semantic distance    -1.3    -1.7
Initial considerations for CAFA
• 50,000 sequences
• 11 eukaryotic & 7 prokaryotic species
• High specificity annotations needed
• Partial descriptive text already in Swiss-Prot/UniProt for some entries
• FFPRED/FunctionSpace would not be enough
• Need to incorporate textual information from databases and comprehensive homology(orthology)-derived labels
• Need to get all this working in a few months!
Best Laid Plans for CAFA
• Plan A
  – Build separate annotation pipelines for missing data
  – Calibrate each pipeline according to precision values derived from benchmark on 500 highly annotated Swiss-Prot entries
  – Combine pipeline annotations using high-level classifier (SVM or Naive Bayes)
• Plan B
  – No time to build high-level classifier!
  – Combine annotation sources using heuristic graphical approach
• Hope for the best! (and expect the worst...)
GO term prediction from Swiss-Prot text-mining
• For targets which already had descriptive text, keywords or comments in Swiss-Prot, GO terms were assigned using a naive Bayes text-classification approach
• Single words and groups of 2 and 3 words were counted
• Words occurring in different Swiss-Prot record types were distinguished in the analysis, and some simple pre-parsing of feature (FT) records was carried out in addition.
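The text-mining step can be illustrated with a small multinomial naive Bayes classifier over word 1-, 2- and 3-grams. This is a sketch, not the classifier used for CAFA: the class name, the Laplace smoothing and the toy training texts are assumptions, and the distinction between Swiss-Prot record types is not modelled.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """All word 1-, 2- and 3-grams of a text, as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


class NaiveBayesGO:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.docs_per_term = Counter()
        self.counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, text, go_term):
        feats = ngrams(text)
        self.docs_per_term[go_term] += 1
        self.counts[go_term].update(feats)
        self.vocab.update(feats)

    def predict(self, text):
        # Ignore n-grams never seen in training.
        feats = [f for f in ngrams(text) if f in self.vocab]
        total_docs = sum(self.docs_per_term.values())
        scores = {}
        for term, n_docs in self.docs_per_term.items():
            denom = sum(self.counts[term].values()) + len(self.vocab)
            log_p = math.log(n_docs / total_docs)
            for f in feats:
                log_p += math.log((self.counts[term][f] + 1) / denom)
            scores[term] = log_p
        return max(scores, key=scores.get)


nb = NaiveBayesGO()
nb.train("zinc finger transcription factor", "GO:0003677")
nb.train("dna binding transcription regulator", "GO:0003677")
nb.train("serine protease hydrolase", "GO:0016787")
nb.train("peptidase hydrolase activity", "GO:0016787")
```

With this toy training set, `nb.predict("putative zinc finger transcription factor")` selects GO:0003677 because every known n-gram in the query was seen only with that term.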
Homology-based annotation sources
• PSI-BLAST searches against UniProt
  – Low E-value threshold to ensure close homologues used for annotation transfer
  – Alignment length threshold to avoid domain problem
• Transfer of annotations from orthologues
  – EggNOG 2.0
  – More reliable GO term transfer than for PSI-BLAST but lower coverage
• Profile-profile searches against Swiss-Prot
  – Low reliability transfer from very distant homologues
  – Improves coverage where needed (at expense of specificity)
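The two PSI-BLAST filters (E-value and alignment length) can be sketched as below. The `Hit` record, the threshold values and the coverage definition are all assumptions for illustration; the slides do not give the actual cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one database hit."""
    subject_id: str
    evalue: float
    align_len: int
    query_len: int
    subject_len: int
    go_terms: frozenset


def transferable_terms(hits, max_evalue=1e-6, min_coverage=0.8):
    """Collect GO terms from hits passing both filters.

    Coverage is measured against the longer sequence, so a match to a
    single shared domain of a longer protein is rejected.
    """
    terms = set()
    for hit in hits:
        coverage = hit.align_len / max(hit.query_len, hit.subject_len)
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            terms |= hit.go_terms
    return terms


toy_hits = [
    Hit("close", 1e-50, 290, 300, 310, frozenset({"GO:0016787"})),
    Hit("domain_only", 1e-20, 90, 300, 800, frozenset({"GO:0003677"})),
    Hit("weak", 0.5, 280, 300, 300, frozenset({"GO:0005515"})),
]
kept_terms = transferable_terms(toy_hits)
```

Only the first toy hit survives: the second covers too little of the longer protein and the third has too high an E-value.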
$P' = 1 - (1 - P)(1 - Q)$
Back-propagation of precision estimates
Heuristic back-propagation of precision estimates
Back-propagation repeated for each annotation source to define a consensus for each node
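One possible reading of this back-propagation is sketched below: each predicted term passes its precision estimate to all of its ancestors in the GO graph, and estimates arriving at the same node are merged with P' = 1 - (1 - P)(1 - Q). The traversal details and the toy graph are assumptions, not the procedure actually used.

```python
def combine(p, q):
    """Merge two precision estimates: P' = 1 - (1 - P)(1 - Q)."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def back_propagate(parents, predictions):
    """parents maps term -> list of parent terms; predictions maps term -> precision.

    Each prediction contributes once to itself and to every ancestor, even
    when an ancestor is reachable by several paths.
    """
    scores = {}
    for term, precision in predictions.items():
        seen, stack = set(), [term]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, ()))
        for node in seen:
            scores[node] = combine(scores.get(node, 0.0), precision)
    return scores


toy_parents = {"child_a": ["parent"], "child_b": ["parent"], "parent": ["root"]}
toy_scores = back_propagate(toy_parents, {"child_a": 0.5, "child_b": 0.4})
```

In the toy graph the shared parent ends up with 1 - 0.5 × 0.6 = 0.7, higher than either child, which reflects that a more general term is supported by both predictions.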
Final steps
• After back-propagation, all referenced GO terms are ranked according to final confidence scores
• To reduce conflicting annotations, pairs of terms with zero observed co-occurrence frequency in GOA are subjected to pairwise tournament selection.
• Results submitted to server using the mouse-window-cut-paste-click-submit algorithm
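The conflict-reduction step might look like the sketch below, where a term is kept only if it has been observed together with every higher-scoring term already kept. This greedy pass is one interpretation of "pairwise tournament selection"; the co-occurrence table and the term names are toy assumptions.

```python
def resolve_conflicts(scores, cooccurrence):
    """scores maps term -> confidence; cooccurrence maps frozenset pair -> count.

    Terms are visited from highest to lowest score. A term loses its
    tournament if it has zero co-occurrence with a term already kept.
    """
    kept = []
    for term in sorted(scores, key=scores.get, reverse=True):
        if all(cooccurrence.get(frozenset((term, k)), 0) > 0 for k in kept):
            kept.append(term)
    return kept


toy_conf = {"A": 0.9, "B": 0.8, "C": 0.6}
toy_cooc = {
    frozenset(("A", "B")): 0,   # never annotated together: conflict
    frozenset(("A", "C")): 12,
    frozenset(("B", "C")): 3,
}
survivors = resolve_conflicts(toy_conf, toy_cooc)
```

Here B is dropped because it conflicts with the higher-scoring A, while C is kept.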
CASP vs CAFA from a Predictor’s Point of View
• Number of targets
  – Manual vs automated approaches
• Difficulty of targets
  – A major limit in driving CASP forwards
• Assessment
  – Hard to pre-judge impact of decisions made during prediction season
• Tools for the community
  – Standards and methods in CASP have been very useful
• Getting the word out to the wider community
Acknowledgements
Anna Lobley, Domenico Cozzetto, Daniel Buchan, Kevin Bryson, Christine Orengo