Improving the Sensitivity of Peptide Identification for Genome Annotation
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
2
Why Tandem Mass Spectrometry?
MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
Key concepts:
• Spectrum acquisition is unbiased
• Direct observation of amino-acid sequence
• Sensitive to small sequence variations
3
Mass Spectrometer
Sample → Ionizer → Mass Analyzer → Detector
Ionizer:
• MALDI
• Electro-Spray Ionization (ESI)
Mass Analyzer:
• Time-Of-Flight (TOF)
• Quadrupole
• Ion-Trap
Detector:
• Electron Multiplier (EM)
4
Mass Spectrum
5
Mass is fundamental
6
Sample Preparation for MS/MS
Enzymatic Digest and Fractionation
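The enzymatic digest step can be mimicked in software. The sketch below assumes trypsin (cleave after K or R, but not before P), which the slide does not name; the function names and the `missed_cleavages` and `min_length` parameters are illustrative choices, not part of the original talk.

```python
def cleavage_pieces(protein):
    """Split a protein after K or R, except when the next residue is P."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece with no K/R at its end
    return pieces


def tryptic_digest(protein, missed_cleavages=0, min_length=6):
    """Return peptides with up to `missed_cleavages` skipped cleavage sites."""
    pieces = cleavage_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(tryptic_digest("AAKRPCCKDDD", missed_cleavages=1, min_length=1))
```

Allowing one or two missed cleavages is a common search setting, since real digests are rarely complete.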
7
Single Stage MS
MS
8
Tandem Mass Spectrometry (MS/MS)
Precursor selection
9
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID)
MS/MS
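To make the link between a CID spectrum and an amino-acid sequence concrete, here is a minimal sketch that computes singly charged b- and y-ion m/z values for a peptide from standard monoisotopic residue masses. The function names are illustrative, and the sketch ignores modifications, neutral losses, and higher fragment charge states.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def precursor_mz(peptide, charge):
    """m/z of the intact peptide ion at the given charge state."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def fragment_ions(peptide):
    """Singly charged b/y ion pairs for each backbone cleavage position."""
    masses = [MONO[aa] for aa in peptide]
    total = sum(masses)
    ions, prefix = [], 0.0
    for i in range(1, len(peptide)):
        prefix += masses[i - 1]
        b = prefix + PROTON                    # N-terminal fragment
        y = (total - prefix) + WATER + PROTON  # C-terminal fragment
        ions.append((f"b{i}", b, f"y{len(peptide) - i}", y))
    return ions


if __name__ == "__main__":
    for b_name, b_mz, y_name, y_mz in fragment_ions("PEPTIDEK"):
        print(f"{b_name} {b_mz:9.4f}   {y_name} {y_mz:9.4f}")
```

Because leucine and isoleucine share a residue mass, this kind of calculation cannot tell them apart, while most other single-residue substitutions shift at least one fragment peak.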
10
Unannotated Splice Isoform
• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
13
Splice Isoform Anomaly
• Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.
  • Peptide Atlas A8_IP
• SULT1A2 gene:
  • Sulfotransferase family, cytosolic, 1A
• 2 ESTs, 1 mRNA
  • mRNA from lung, small-cell carcinoma sample
• Single (significant) peptide identification
  • Five agreeing search engines
  • PepArML FDR < 1%.
  • All source engines have non-significant E-values
16
Translation start-site correction
• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon; insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
• Multiple significant peptide identifications
  • Observed start is consistent with Glimmer 3.0 prediction(s)
17
Halobacterium sp. NRC-1
ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651
[Figure: horizontal axis from 0 to 440 in steps of 40]
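One way to spot peptides that are inconsistent with an annotated translation start is to translate the ORF from an upstream position and flag any identified peptide that begins before the annotated start. The sketch below is a hypothetical illustration: the function names, the toy sequence, and the peptide list are invented for the example and are not the GdhA1 data.

```python
def peptides_before_start(upstream_translation, annotated_start, peptides):
    """Return (peptide, position) pairs that begin upstream of the annotation.

    upstream_translation: in-frame translation that begins upstream of the
        annotated start codon.
    annotated_start: 0-based index of the annotated first residue within
        upstream_translation.
    """
    flagged = []
    for peptide in peptides:
        position = upstream_translation.find(peptide)
        if position != -1 and position < annotated_start:
            flagged.append((peptide, position))
    return sorted(flagged, key=lambda hit: hit[1])


def earliest_supported_position(flagged):
    """Most upstream residue covered by peptide evidence, or None."""
    return flagged[0][1] if flagged else None


if __name__ == "__main__":
    # Toy data: the annotated start is the second M (index 11).
    translation = "MSEQLKVTRAAMGDKPLVERTT"
    observed = ["VTRAAMGDK", "PLVER"]
    hits = peptides_before_start(translation, 11, observed)
    print(hits, earliest_supported_position(hits))
```

A real correction would then look for a plausible start codon at or upstream of the earliest supported position, which is where an independent gene-finder prediction becomes useful corroboration.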
19
Phyloproteomics
Tandem mass-spectra of proteins (top-down)
High-accuracy instrument (Orbitrap, UMD Core)
Proteins from unsequenced bacteria matching identical proteins in related organisms
Demonstration using Y. rohdei.
21
Phyloproteomics
phylogeny.fr – "One-Click"
Panels: Protein Sequence; 16S-rRNA Sequence
22
Shared "Biomarker" Proteins
23
Phyloproteomics
• Recent extension to highly homologous proteins in related organisms
  • Merely require N- and/or C-terminus in common
  • Broadens applicability considerably
• Phyloproteomic trees for E. herbicola and E. cloacae, neither sequenced.
• New paradigm for phylogenetic analysis?
24
Lost peptide identifications
Missing from the sequence database
Search engine strengths, weaknesses, quirks
Poor score or statistical significance
Thorough search takes too long
25
Searching under the street-light…
Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!
26
Peptide Sequence Databases
• All amino-acid 30-mers, no redundancy
  • From ESTs, Proteins, mRNAs
• 30-40 fold size, search time reduction
• Formatted as a FASTA sequence database
• One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490
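The "all 30-mers, no redundancy" idea can be illustrated with a small sketch that collects the distinct k-mers across a set of overlapping sequences and reports how much repetition was removed. This only shows the counting step; how the distinct 30-mers are then packed into compact FASTA entries is not described on the slide, and the function names here are illustrative.

```python
def distinct_kmers(sequences, k=30):
    """Set of all distinct length-k substrings across the input sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def redundancy_ratio(sequences, k=30):
    """Total k-mer occurrences divided by the number of distinct k-mers."""
    total = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    distinct = len(distinct_kmers(sequences, k))
    return total / distinct if distinct else 0.0


if __name__ == "__main__":
    # Toy example with k=3; translated ESTs of one gene would overlap similarly.
    toy = ["ABCDE", "BCDEF"]
    print(sorted(distinct_kmers(toy, k=3)), redundancy_ratio(toy, k=3))
```

A search engine restricted to peptides of at most k residues sees the same candidate peptides in the deduplicated set as in the original sequences, which is why removing repeated k-mers shrinks search time without losing candidates of that length.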
27
We can observe evidence for…
• Known coding SNPs
• Unannotated coding mutations
• Alternate splicing isoforms
• Alternate/Incorrect translation start-sites
• Microexons
• Alternate/Incorrect translation frames
…though it must be treated thoughtfully.
28
PeptideMapper Web Service
I’m Feeling Lucky
31
PeptideMapper Web Service
• Suffix-tree index on peptide sequence database
  • Fast peptide to gene/cluster mapping
  • “Compression” makes this feasible
• Peptide alignment with cluster evidence
  • Amino-acid or nucleotide; exact & near-exact
• Genomic-loci mapping via
  • UCSC “known-gene” transcripts, and
  • Predetermined, embedded genomic coordinates
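The peptide to gene/cluster lookup can be sketched with a suffix array, a simpler relative of the suffix tree named on the slide. This is a toy stand-in, not the PeptideMapper implementation: the class name, the `$` separator, and the naive construction (fine for small inputs only) are all assumptions made for illustration.

```python
import bisect


class PeptideIndex:
    """Exact-match peptide lookup over named sequences via a suffix array."""

    def __init__(self, entries):
        # entries: dict mapping entry name -> amino-acid sequence
        self.names, self.offsets, parts = [], [], []
        position = 0
        for name, seq in entries.items():
            self.names.append(name)
            self.offsets.append(position)
            parts.append(seq + "$")  # separator keeps matches inside one entry
            position += len(seq) + 1
        self.text = "".join(parts)
        # Naive construction: sort suffix start positions by suffix text.
        self.suffixes = sorted(range(len(self.text)), key=lambda i: self.text[i:])

    def _lower_bound(self, peptide):
        lo, hi = 0, len(self.suffixes)
        while lo < hi:
            mid = (lo + hi) // 2
            start = self.suffixes[mid]
            if self.text[start:start + len(peptide)] < peptide:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def lookup(self, peptide):
        """Return sorted (entry name, 0-based position) pairs for exact hits."""
        hits = []
        i = self._lower_bound(peptide)
        while i < len(self.suffixes) and self.text.startswith(peptide, self.suffixes[i]):
            pos = self.suffixes[i]
            entry = bisect.bisect_right(self.offsets, pos) - 1
            hits.append((self.names[entry], pos - self.offsets[entry]))
            i += 1
        return sorted(hits)


if __name__ == "__main__":
    index = PeptideIndex({"geneA": "MKTAYIAK", "geneB": "QQTAYW"})
    print(index.lookup("TAY"))
```

Each lookup costs a binary search plus the number of hits, independent of how many entries the database holds, which is the property that makes interactive peptide mapping practical.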
32
Comparison of search engine results
No single score is comprehensive
Search engines disagree
Many spectra lack confident peptide assignment
Searle et al. JPR 7(1), 2008
[Figure: overlap of identifications among X! Tandem, SEQUEST, and Mascot; region labels 38%, 14%, 28%, 14%, 3%, 2%, 1%]
33
Combining search engine results – harder than it looks!
• Consensus boosts confidence, but...
  • How to assess statistical significance?
  • Gain specificity, but lose sensitivity!
  • Incorrect identifications are correlated too!
• How to handle weak identifications?
  • Consensus vs disagreement vs abstention
  • Threshold at some significance?
• We apply unsupervised machine-learning....
  • Lots of related work unified in a single framework.
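The trade-off between specificity and sensitivity under simple consensus can be seen in a small voting sketch. This is plain agreement counting, not the unsupervised machine-learning combiner the talk describes; the engine labels, spectrum identifiers, and peptides are invented placeholders.

```python
from collections import Counter


def consensus_ids(results, min_engines):
    """Keep a spectrum's top-voted peptide if enough engines agree on it.

    results: dict mapping engine name -> {spectrum id: peptide}
    """
    votes = {}
    for identifications in results.values():
        for spectrum, peptide in identifications.items():
            votes.setdefault(spectrum, Counter())[peptide] += 1
    accepted = {}
    for spectrum, counter in votes.items():
        peptide, count = counter.most_common(1)[0]
        if count >= min_engines:
            accepted[spectrum] = peptide
    return accepted


if __name__ == "__main__":
    toy = {
        "engine_a": {"s1": "PEPTIDEK", "s2": "AAAK", "s3": "LLLK"},
        "engine_b": {"s1": "PEPTIDEK", "s2": "AAAK"},
        "engine_c": {"s1": "PEPTIDEK", "s2": "CCCK"},
    }
    for k in (1, 2, 3):
        print(k, sorted(consensus_ids(toy, k)))
```

Raising `min_engines` shrinks the accepted set, which is the loss of sensitivity noted above, and the count alone says nothing about statistical significance or about correlated errors among engines.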
34
Supervised Learning
35
Unsupervised Learning
36
Peptide Atlas A8_IP LTQ Dataset
37
Running many search engines
Search engine configuration can be difficult:
• Correct spectral format
• Search parameter files and command-line
• Pre-processed sequence databases.
• Tracking spectrum identifiers
• Extracting peptide identifications, especially modifications and protein identifiers
38
Peptide Identification Meta-Search
• Simple unified search interface for:
  • Mascot, X!Tandem, K-Score, OMSSA, MyriMatch
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, Multi-Processor, Cluster, Grid
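Decoy searches are typically used to estimate how many accepted identifications are wrong. The sketch below assumes reversed-sequence decoys and the simple decoys-over-targets estimate; the slide does not say which decoy construction or estimator the meta-search uses, and the `DECOY_` prefix and function names are hypothetical.

```python
def reversed_decoys(proteins):
    """Build a decoy entry for each protein by reversing its sequence."""
    return {"DECOY_" + name: seq[::-1] for name, seq in proteins.items()}


def fdr_at_threshold(psms, evalue_cutoff):
    """Estimate FDR among target matches passing an E-value cutoff.

    psms: list of (evalue, is_decoy) pairs, one per peptide-spectrum match.
    """
    targets = sum(1 for evalue, is_decoy in psms
                  if evalue <= evalue_cutoff and not is_decoy)
    decoys = sum(1 for evalue, is_decoy in psms
                 if evalue <= evalue_cutoff and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Toy scores, not real search results.
    toy_psms = [(1e-5, False), (1e-4, False), (1e-3, True), (0.01, False), (0.1, True)]
    print(fdr_at_threshold(toy_psms, 0.01))
    print(reversed_decoys({"geneA": "MKTAYIAK"}))
```

Running the same decoy database through every engine gives the scores a common yardstick, which matters when results from several engines are later combined.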
39
PepArML Meta-Search Engine
• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: Scheduler & 48+ CPUs
Search engine sets shown in the diagram:
• X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core).
• X!Tandem, KScore, OMSSA, MyriMatch.
Diagram annotations:
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches
40
PepArML Meta-Search Engine
• NSF TeraGrid: 1000+ CPUs
• Edwards Lab: Scheduler & 80+ CPUs
Search engine sets shown in the diagram:
• X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core).
• X!Tandem, KScore, OMSSA, MyriMatch.
Diagram annotations:
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches
41
PepArML Meta-Search Engine
• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: Scheduler & 48+ CPUs
Diagram annotations:
• Secure communication
• Heterogeneous compute resources
• Simple search request
43
Peptide Identification Grid-Enabled Meta-Search
• Access to high-performance computing resources for the proteomics community
  • NSF TeraGrid Community Portal
  • University/Institute HPC clusters
  • Individual lab compute resources
  • Contribute cycles to the community and get access to others’ cycles in return.
• Centralized scheduler
  • Compute capacity can still be exclusive, or prioritized.
  • Compute client plays well with HPC grid schedulers.
44
PepArML Meta-Search Engine
• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: Scheduler & 80+ CPUs
Search engine sets shown in the diagram:
• X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core).
• X!Tandem, KScore, OMSSA, MyriMatch.
• X!Tandem, KScore, OMSSA.
46
PepArML Meta-Search Engine
• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: Scheduler & 80+ CPUs
• UMD Proteomics Core: Scheduler & 2 CPUs
Search engine sets shown in the diagram:
• X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core).
• X!Tandem, KScore, OMSSA, MyriMatch.
• X!Tandem, KScore, OMSSA.
48
Conclusions
Improve the scope and sensitivity of peptide identification for genome annotation, using
• Exhaustive peptide sequence databases
• Machine-learning for combining search engine results
• Meta-search tools to maximize consensus
• Grid-computing for thorough search
http://edwardslab.bmcb.georgetown.edu
49
Acknowledgements
• Dr. Catherine Fenselau & students, University of Maryland Biochemistry
• Dr. Yan Wang, University of Maryland Proteomics Core
• Dr. Art Delcher, University of Maryland CBCB
• Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science
Funding: NIH/NCI