Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University...
Novel Empirical FDR Estimation in PepArML
David Retz and Nathan Edwards
Georgetown University Medical Center
What is PepArML?
Meta-search using seven search engines: Mascot; X!Tandem Native, K-Score, S-Score; OMSSA; Myrimatch; InsPecT + MSGF
  Automatic target + decoy searches
  Automatic construction of search configuration
  Automatic spectra and sequence (re-)formatting
Heterogeneous cluster, grid, cloud computing
  Centralized scheduler
  Shared and private computational resources
  Integration with NSF TeraGrid and AWS
What is PepArML?
A peptide identification result combiner
  Selects the best identification, per spectrum
  Model-free, auto-train machine-learning
  Estimates false-discovery-rates
  Formats output as pepXML and protXML
In use: more than 23M spectra, 1.4M search jobs, and 1 TB in spectra and results.
PepArML identifies significantly more spectra than single search engines.
  Recovers more proteins with fewer replicates
PepArML Performance
[Figure: results for the LCQ, QSTAR, and LTQ-FT datasets. Standard Protein Mix Database, 18 Standard Proteins – Mix1]
PepArML Advantages
Can accommodate new search engines or spectrum and peptide features easily
Learns the specific characteristics of each dataset from scratch!
Provides a platform for comparing single search engine results under a common FDR estimation procedure.
Search Engine Info. Gain
Precursor & Digest Info. Gain
Retention Time & Proteotypic Peptide Properties Info. Gain
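The slides do not say how information gain is computed for these feature groups. As a point of reference, the textbook definition for a discrete feature against a correct/incorrect label can be sketched as below. The function names are illustrative, and continuous features such as search-engine scores would first need to be binned, which is an assumption not covered here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature.

    feature_values and labels are parallel sequences; the feature is
    assumed to be discrete (hypothetical input layout).
    """
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that separates correct from incorrect identifications perfectly has a gain equal to the label entropy; a feature unrelated to correctness has a gain of zero.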
Search Engine Independent FDR Estimation
Comparing search engines is difficult due to different FDR estimation techniques
  Implicit assumption: Spectra scores can be thresholded
Competitive vs Global
  Competitive controls some spectral variation
Reversed vs Shuffled Decoy Sequence
  Reversed models target redundancy accurately
Charge-state partition or Unified
  Mitigates effect of peptide length dependent scores
  What about peptide property partitions?
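The competitive vs global distinction above can be illustrated with a minimal sketch. This is not PepArML's implementation: the function names, the input layout, and the decoys/targets estimator are assumptions chosen for illustration (other conventions for the estimator exist). In the competitive variant, target and decoy hits compete for each spectrum and only the better one is kept; in the global variant, the two score lists are thresholded independently.

```python
def competitive_fdr(psms, threshold):
    """Estimate FDR when target and decoy hits compete per spectrum.

    psms: iterable of (spectrum_id, score, is_decoy) tuples, higher
    score is better (hypothetical input layout).
    """
    best = {}
    for spectrum_id, score, is_decoy in psms:
        # Keep only the best-scoring hit for each spectrum.
        if spectrum_id not in best or score > best[spectrum_id][0]:
            best[spectrum_id] = (score, is_decoy)
    targets = sum(1 for s, d in best.values() if s >= threshold and not d)
    decoys = sum(1 for s, d in best.values() if s >= threshold and d)
    return decoys / targets if targets else 0.0


def global_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR from separately thresholded target and decoy scores."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    return decoys / targets if targets else 0.0
```

On the same data the two estimates can differ, because a decoy hit that loses the per-spectrum competition still counts in the global estimate.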
PepArML Disadvantages
Training heuristic can fail to “get started”
Works best on large datasets
Iterative training can be time-consuming
Machine-learning “confidence” is uninterpretable for peptide identification
Requires two decoy searches to “calibrate” confidence as FDR
Each spectrum is searched ~21 times (seven search engines × three searches: target plus two decoys)
Can we eliminate the internal decoy?
  Reduce search phase by 33% (from ~21 to ~14 searches per spectrum)
PepArML Workflow
[Flowchart. Boxes: Spectra; Mascot, Tandem, OMSSA, ...; Extract Peptides & Features; Select High-Quality IDs (D0); Select "True" Proteins; Assign Training Labels; Train Classifier & Predict Correct IDs; Recalibrate Confidence as FDR (D1); Select "True" Proteins; Stable? (No / Yes); Output Peptide Spectrum Assignments]
Select high-quality IDs
Guess true proteins from search results
Label spectra & train
Calibrate confidence
Guess true proteins from ML results
Iterate!
Estimate FDR using (external) decoy
Select High-Quality Unanimous Peptide Identifications
Requires a fast and easy, but comparable, search-engine metric.
[Figure; labels: min decoy hits, min z-score]
Simulate Decoy Results by Sampling Target Results
[Figures; legend: Target, Decoy, Sampled Target]
Sampled Target Approximates Decoy Calibration
Sample 75% non-training “false” target results
Rescale to # of spectra
Approximates FDR well enough to replace the internal decoy
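One possible reading of the sampling steps above is sketched below. It is an interpretation, not the authors' code: the function name, the inputs, the way "rescale to # of spectra" is applied (scaling the sampled count up to the total number of spectra), and the cap at 1.0 are all assumptions. The idea is that confidences of target results labeled "false" and held out of training stand in for the confidences a decoy search would have produced.

```python
import random


def sampled_target_fdr(target_conf, false_conf, threshold,
                       fraction=0.75, seed=0):
    """Estimate FDR using sampled "false" target results as pseudo-decoys.

    target_conf: one confidence per spectrum (best target result).
    false_conf:  confidences of non-training target results labeled
                 "false" (hypothetical input; must be non-empty).
    fraction:    share of false_conf to sample (75% on the slide).
    """
    rng = random.Random(seed)
    k = max(1, round(fraction * len(false_conf)))
    sample = rng.sample(false_conf, k)
    # Assumed rescaling: treat the sample as if it covered every spectrum.
    scale = len(target_conf) / len(sample)
    pseudo_decoys = scale * sum(1 for c in sample if c >= threshold)
    accepted = sum(1 for c in target_conf if c >= threshold)
    if accepted == 0:
        return 0.0
    return min(1.0, pseudo_decoys / accepted)
```

The fixed seed only makes the sketch repeatable; whether sampling is repeated or averaged in practice is not stated on the slides.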
Decoy-free PepArML results
[Figure: results for the LCQ, QSTAR, and LTQ-FT datasets. Standard Protein Mix Database, 18 Standard Proteins – Mix1]
Conclusions
PepArML can significantly boost the number of spectra, peptides, and proteins identified
  Give it a try – free! Nothing to install!
A common FDR framework facilitates head-to-head comparison of search engines and FDR estimation techniques
Sampled target results can substitute for decoy results (internally)
  Reduces search time by 33%
Acknowledgements
Growing list of PepArML users
  Fenselau lab (Maryland)
  Graham lab (JHU)
  Genovese lab (Bologna University, Italy)
Dr. Brian Balgley, Bioproximity
Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science
Funding: NIH/NCI