Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University...


Novel Empirical FDR Estimation in PepArML

David Retz and Nathan Edwards
Georgetown University Medical Center


What is PepArML?

Meta-search using seven search engines:
  Mascot; X!Tandem Native, K-Score, S-Score; OMSSA; Myrimatch; InsPecT + MSGF
  Automatic target + decoy searches
  Automatic construction of search configuration
  Automatic spectra and sequence (re-)formatting

Heterogeneous cluster, grid, cloud computing
  Centralized scheduler
  Shared and private computational resources
  Integration with NSF TeraGrid and AWS


What is PepArML?

A peptide identification result combiner
  Selects best identification, per spectrum
  Model-free, auto-train machine-learning
  Estimates false-discovery-rates
  Formats output as pepXML and protXML

In use: more than 23M spectra, 1.4M search jobs, and 1TB in spectra and results.

PepArML identifies significantly more spectra than single search engines.
  Recovers more proteins with fewer replicates


PepArML Performance

[Figure: panels LCQ, QSTAR, LTQ-FT. Data: Standard Protein Mix Database, 18 Standard Proteins – Mix1]


PepArML Advantages

Can accommodate new search engines or spectrum and peptide features easily

Learns the specific characteristics of each dataset from scratch!

Provides a platform for comparing single search engine results under a common FDR estimation procedure.


Search Engine Info. Gain


Precursor & Digest Info. Gain


Retention Time & Proteotypic Peptide Properties Info. Gain


Search Engine Independent FDR Estimation

Comparing search engines is difficult due to different FDR estimation techniques
  Implicit assumption: Spectra scores can be thresholded

Competitive vs Global
  Competitive controls some spectral variation

Reversed vs Shuffled Decoy Sequence
  Reversed models target redundancy accurately

Charge-state partition or Unified
  Mitigates effect of peptide length dependent scores
  What about peptide property partitions?
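The first two choices above can be illustrated with a minimal Python sketch. It is not PepArML's code. It assumes one top-scoring target hit and one top-scoring decoy hit per spectrum, higher scores being better, and the simple decoy/target ratio as the FDR estimate; all function names and the toy scores are hypothetical. The reversed decoy is deterministic, so a repeated target sequence always yields the same decoy, which is presumably why reversal preserves target redundancy while shuffling does not.

```python
import random


def global_fdr(target_scores, decoy_scores, threshold):
    """Global estimate: decoy hits over target hits at or above the threshold."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return min(1.0, decoys / targets) if targets else 0.0


def competitive_fdr(target_scores, decoy_scores, threshold):
    """Competitive estimate: per spectrum, only the better of the two hits counts."""
    targets = decoys = 0
    for t, d in zip(target_scores, decoy_scores):
        if max(t, d) < threshold:
            continue
        if t >= d:  # assumption: ties go to the target hit
            targets += 1
        else:
            decoys += 1
    return min(1.0, decoys / targets) if targets else 0.0


def reversed_decoy(sequence):
    """Same input always gives the same decoy."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Residue composition is kept, order depends on the random generator."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


# Toy scores, one entry per spectrum (made up for illustration).
target = [9.0, 8.0, 7.0, 6.0, 3.2, 2.5]
decoy = [2.0, 3.1, 2.2, 6.5, 3.0, 1.0]
print(global_fdr(target, decoy, 3.0))       # 3 decoy hits / 5 target hits
print(competitive_fdr(target, decoy, 3.0))  # 1 decoy winner / 4 target winners
print(reversed_decoy("PEPTIDEK"))
print(shuffled_decoy("PEPTIDEK", random.Random(0)))
```

In the toy data, two decoy hits that pass the threshold lose to a better target hit on the same spectrum, so the competitive estimate is lower than the global one.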


PepArML Disadvantages

Training heuristic can fail to “get started”
Works best on large datasets
Iterative training can be time-consuming
Machine-learning “confidence” is uninterpretable for peptide identification
Requires two decoy searches to “calibrate” confidence as FDR
Each spectrum searched ~21 times!


PepArML Disadvantages

Training heuristic can fail to “get started”
Works best on large datasets
Iterative training can be time-consuming
Machine-learning “confidence” is uninterpretable for peptide identification
Requires two decoy searches to “calibrate” confidence as FDR
Can we eliminate the internal decoy?
Reduce search phase by 33%
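The figures of ~21 searches per spectrum and a 33% saving are consistent with a simple count, sketched here in Python. The breakdown is an assumption: seven search engines, each run once against the target database and once against each of two decoy databases. The variable names are illustrative only.

```python
ENGINES = 7             # seven search engines in the meta-search
PASSES_WITH_DECOYS = 3  # assumed: target + two decoy searches per engine
PASSES_WITHOUT = 2      # assumed: target + one (external) decoy search per engine

with_internal = ENGINES * PASSES_WITH_DECOYS
without_internal = ENGINES * PASSES_WITHOUT
saving = 1 - without_internal / with_internal

print(with_internal)        # searches per spectrum with both decoys
print(without_internal)     # searches per spectrum without the internal decoy
print(round(saving * 100))  # percent of the search phase saved
```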


PepArML Workflow

[Figure: workflow flowchart. Box labels: Spectra; Mascot, OMSSA, Tandem, ...; Extract Peptides & Features; Select High-Quality IDs (D0); Select "True" Proteins; Assign Training Labels; Train Classifier & Predict Correct IDs; Recalibrate Confidence as FDR (D1); Select "True" Proteins; Stable? (No / Yes); Output Peptide Spectrum Assignments]

Select high-quality IDs
Guess true proteins from search results
Label spectra & train
Calibrate confidence
Guess true proteins from ML results
Iterate!
Estimate FDR using (external) decoy
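The control flow of the loop can be sketched in Python as follows. This is only a skeleton under stated assumptions: the stability test is taken to be "the guessed protein set did not change", and the classifier and protein selection are passed in as callables. The toy stand-ins below (a single fixed feature per spectrum and a cutoff rule) are hypothetical and bear no relation to the real classifier.

```python
def iterate_until_stable(psm_proteins, initial_proteins, train_and_predict,
                         select_proteins, max_rounds=20):
    """Relabel, retrain, and reselect proteins until the protein set stops changing."""
    proteins = set(initial_proteins)
    for round_no in range(1, max_rounds + 1):
        # Label a spectrum "true" if its identification maps to a guessed-true protein.
        labels = {sid: prot in proteins for sid, prot in psm_proteins.items()}
        confidence = train_and_predict(labels)
        new_proteins = set(select_proteins(confidence))
        if new_proteins == proteins:  # assumed stability criterion
            return proteins, round_no
        proteins = new_proteins
    return proteins, max_rounds


# Toy stand-ins (hypothetical): spectrum -> protein, and one feature per spectrum.
psm_proteins = {"s1": "P1", "s2": "P1", "s3": "P2", "s4": "P3"}
feature = {"s1": 0.9, "s2": 0.8, "s3": 0.7, "s4": 0.1}


def toy_train_and_predict(labels):
    """Fake 'classifier': accept anything close to the weakest positive example."""
    positives = [feature[s] for s, is_true in labels.items() if is_true]
    cutoff = min(positives) - 0.15
    return {s: 1.0 if f >= cutoff else 0.0 for s, f in feature.items()}


def toy_select_proteins(confidence):
    """Guess 'true' proteins: any protein with a confidently identified spectrum."""
    return {psm_proteins[s] for s, c in confidence.items() if c >= 0.5}


final_proteins, rounds = iterate_until_stable(
    psm_proteins, {"P1"}, toy_train_and_predict, toy_select_proteins)
print(sorted(final_proteins), rounds)
```

In the toy run the protein set grows from P1 to P1 and P2 in the first round and is unchanged in the second, so the loop stops there.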


Select High-Quality Unanimous Peptide Identifications

Requires a fast and easy, but comparable, search-engine metric.

[Figure labels: min decoy hits, min z-score]
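One plausible reading of the "min z-score" metric is sketched below in Python. The slide does not define the metric, so everything here is an assumption: each engine's raw scores are standardized across spectra, higher scores are taken to be better, a spectrum qualifies only if all engines report the same peptide, and it is ranked by its worst (minimum) z-score over the engines. Function names and data are hypothetical.

```python
from statistics import mean, stdev


def to_z_scores(scores):
    """Standardize one engine's scores: {spectrum_id: score} -> {spectrum_id: z}."""
    values = list(scores.values())
    mu, sigma = mean(values), stdev(values)
    return {sid: (s - mu) / sigma if sigma else 0.0 for sid, s in scores.items()}


def select_unanimous(engine_results, min_z):
    """engine_results: {engine: {spectrum_id: (peptide, score)}}, higher score = better."""
    z_by_engine = {
        engine: to_z_scores({sid: score for sid, (_, score) in hits.items()})
        for engine, hits in engine_results.items()
    }
    shared = set.intersection(*(set(hits) for hits in engine_results.values()))
    selected = {}
    for sid in sorted(shared):
        peptides = {hits[sid][0] for hits in engine_results.values()}
        if len(peptides) != 1:
            continue  # engines disagree on the peptide
        worst_z = min(z[sid] for z in z_by_engine.values())
        if worst_z >= min_z:
            selected[sid] = (peptides.pop(), worst_z)
    return selected


# Toy results from two engines with very different score scales.
engine_results = {
    "engineA": {"s1": ("PEPK", 50.0), "s2": ("AAAK", 10.0),
                "s3": ("CCCK", 12.0), "s4": ("DDDK", 8.0)},
    "engineB": {"s1": ("PEPK", 5.0), "s2": ("AAAK", 1.0),
                "s3": ("XXXK", 1.2), "s4": ("DDDK", 0.8)},
}
selected = select_unanimous(engine_results, min_z=1.0)
print(selected)
```

Standardizing per engine is what makes the scores comparable across engines here; taking the minimum means every engine must find the identification convincing.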


Simulate Decoy Results by Sampling Target Results

[Figure legend: Target, Decoy, Sampled Target]


Sampled Target Approximates Decoy Calibration

Sample 75% of the non-training “false” target results

Rescale to # of spectra

Approximates FDR well enough to replace the internal decoy
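The sampling and rescaling steps can be sketched in Python as follows. This is one reading of the slide, not the actual implementation: the classifier confidences of “false” target results that were not used in training are sampled at 75%, the count above a threshold is scaled up as if there were one pseudo-decoy result per spectrum, and the result is divided by the number of accepted target results. The function name, parameters, and toy confidences are hypothetical.

```python
import random


def sampled_target_fdr(target_conf, false_pool_conf, threshold, n_spectra,
                       fraction=0.75, seed=0):
    """Estimate FDR at a confidence threshold using sampled 'false' target results.

    target_conf:     confidences of all target results
    false_pool_conf: confidences of non-training target results labeled false
    """
    rng = random.Random(seed)
    k = round(fraction * len(false_pool_conf))
    sampled = rng.sample(false_pool_conf, k)
    accepted = sum(c >= threshold for c in target_conf)
    if not sampled or not accepted:
        return 0.0
    # Rescale the sampled count as if one pseudo-decoy result existed per spectrum.
    pseudo_decoys = n_spectra * sum(c >= threshold for c in sampled) / len(sampled)
    return min(1.0, pseudo_decoys / accepted)


# Toy confidences (made up): eight spectra, four of them in the false pool.
target_conf = [0.99, 0.95, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05]
false_pool = [0.3, 0.2, 0.1, 0.05]
for threshold in (0.5, 0.15, 0.0):
    print(threshold, sampled_target_fdr(target_conf, false_pool, threshold, n_spectra=8))
```

A fixed seed keeps the sample reproducible; in practice the estimate would vary slightly from sample to sample.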


Decoy-free PepArML results

[Figure: panels LCQ, QSTAR, LTQ-FT. Data: Standard Protein Mix Database, 18 Standard Proteins – Mix1]


Conclusions

PepArML can significantly boost the number of spectra, peptides, and proteins identified
  Give it a try – free! Nothing to install!

A common FDR framework facilitates head-to-head comparison of search engines and FDR estimation techniques

Sampled target results can substitute for decoy results (internally)
  Reduces search time by 33%


Acknowledgements

Growing list of PepArML users
  Fenselau lab (Maryland)
  Graham lab (JHU)
  Genovese lab (Bologna University, Italy)

Dr. Brian Balgley
  Bioproximity

Dr. Chau-Wen Tseng & Dr. Xue Wu
  University of Maryland Computer Science

Funding: NIH/NCI