Using Prior Knowledge to Improve Scoring in High-Throughput Top-Down Proteomics Experiments

• Rich LeDuc
• Le-Shin Wu

The “Scoring” Problem

• Proteoforms are hypotheses about what was in MS.
• The model “knows” the process.
• Output is a ranked list of hypotheses.
• Science builds on prior knowledge.

[Figure: Competing Hypotheses (1, 2, 3, 4) enter a Process Model, which produces a ranked list of hypotheses (2, 3, 1, 4), with a measure of confidence, under a given model.]

Meng-Kelleher p-score

$$\text{‘P score’} = P_{f,n} = \frac{(xf)^{n}\, e^{-xf}}{n!}$$

where f is the number of observed fragment ions, n is the number of matches, and $M_a$ is the mass accuracy, with

$$x = \frac{1}{111.1} \times M_a \times 2 \times 2$$

$$p_{\text{crude}} = 1 - \sum_{i=0}^{n-1} P_{f,i}$$

F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.
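The p-score above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes x = (1/111.1) × M_a × 2 × 2 with the mass accuracy given in Da, and it assumes the crude p-score is the probability of n or more chance matches. The function and parameter names are hypothetical.

```python
import math


def poisson_match_probability(f: int, n: int, mass_accuracy_da: float) -> float:
    """P_{f,n}: Poisson probability of exactly n chance matches among f fragment ions."""
    # Assumed per-fragment chance of a random match (mass accuracy in Da).
    x = (1.0 / 111.1) * mass_accuracy_da * 2 * 2
    mean = x * f
    return mean ** n * math.exp(-mean) / math.factorial(n)


def crude_p_score(f: int, n: int, mass_accuracy_da: float) -> float:
    """Assumed tail form: probability of n or more chance matches."""
    below = sum(poisson_match_probability(f, i, mass_accuracy_da) for i in range(n))
    # Clamp at zero to absorb floating-point round-off in the subtraction.
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    # Hypothetical inputs: 50 observed fragments, 0.1 Da mass accuracy.
    for matches in (1, 3, 5, 10):
        print(matches, crude_p_score(50, matches, 0.1))
```

With zero required matches the score is 1, and it falls as the number of matches grows, which is why a smaller p-score indicates a less likely chance identification.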

Meng-Kelleher p-score

[Figure: Probability (0 to 1) plotted against Number Matching (1 to 30).]


Specific Example

Bayesian Approach

$$p(\text{proteoform} \mid \text{data}) = \frac{p(\text{proteoform})\; p(\text{data} \mid \text{proteoform})}{p(\text{data})}$$

• p(proteoform): prior probability of the proteoform
• p(data | proteoform): likelihood of the proteoform given the observed data
• p(data): probability of the data
• p(proteoform | data): posterior probability of the proteoform after making the observations

The Scoring Model

From Bayes Theorem we have:

$$\Pr(q \mid M_O, \{m_i\}) = \frac{\Pr(q)\,\Pr(M_O, \{m_i\} \mid q)}{\Pr(M_O, \{m_i\})}$$

From independence we can write:

$$\Pr(M_O, \{m_i\} \mid q) = \Pr(M_O \mid q) \prod_{i=1}^{n} \Pr(m_i \mid q)$$

Which gives our final scoring function (the prior cancels when every candidate $q_j$ has the same prior probability):

$$\Pr(q \mid M_O, \{m_i\}) = \frac{\Pr(q)\,\Pr(M_O, \{m_i\} \mid q)}{\sum_j \Pr(q_j)\,\Pr(M_O, \{m_i\} \mid q_j)} = \frac{\Pr(M_O \mid q) \prod_i \Pr(m_i \mid q)}{\sum_j \Pr(M_O \mid q_j) \prod_i \Pr(m_i \mid q_j)}$$
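The final scoring function divides each candidate's likelihood, Pr(M_O | q) multiplied by the product of the Pr(m_i | q), by the sum of the same quantity over all candidates. The following Python sketch shows one way that normalization could be computed, assuming equal priors and working in log space to avoid underflow when many fragment probabilities are multiplied. The function names and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def log_likelihood(log_p_precursor: float, log_p_fragments: Sequence[float]) -> float:
    """log Pr(M_O | q) + sum_i log Pr(m_i | q), using the independence assumption."""
    return log_p_precursor + sum(log_p_fragments)


def posteriors(log_likelihoods: Sequence[float]) -> List[float]:
    """Posterior for each candidate under equal priors (log-sum-exp normalization)."""
    peak = max(log_likelihoods)
    # Subtracting the largest value keeps every exponent at or below zero.
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two hypothetical candidates with made-up precursor and fragment probabilities.
    candidate_a = log_likelihood(math.log(0.5), [math.log(0.2), math.log(0.1)])
    candidate_b = log_likelihood(math.log(0.5), [math.log(0.1), math.log(0.1)])
    print(posteriors([candidate_a, candidate_b]))
```

The sketch assumes at least one candidate has a nonzero likelihood; if every candidate had likelihood zero the normalization would be undefined.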

MS1 Generative Model

Given a certain theoretical proteoform, what is the probability of seeing the observed precursor mass?

Likelihood Fun Facts

• Area does not equal one.
• Need some level for “wrong precursor mass”.

MS2 Generative Model

[Figure: Probability plotted against Fragment Mass over the range 0 to I, with labels w_i, m_i, and Noise = k.]

MS2 Generative Model

$$p(m_i \mid q) = \begin{cases} \dfrac{w_i}{t} & \text{for } m_i \text{ in a permissible region} \\[1ex] \dfrac{k}{t} & \text{for } m_i \text{ in } I \text{, but not in a permissible region} \\[1ex] 0 & \text{otherwise} \end{cases}$$

with t given as a function of w, k, I, and l.
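The MS2 model assigns a fragment mass one of three levels: w_i / t inside a permissible region, k / t elsewhere within the mass range I, and 0 otherwise. A minimal Python sketch of that three-way rule follows. It makes several assumptions that are not in the slides: permissible regions are represented as closed intervals, each carrying its own weight, and t is supplied by the caller rather than derived. All names are hypothetical.

```python
from typing import Sequence, Tuple

# A permissible region as (low mass, high mass, weight w_i); the interval
# representation is an assumption made for this illustration.
Region = Tuple[float, float, float]


def fragment_likelihood(
    mass: float,
    regions: Sequence[Region],
    noise_level: float,
    t: float,
    mass_limit: float,
) -> float:
    """Piecewise likelihood of one observed fragment mass under one proteoform."""
    if not 0.0 <= mass <= mass_limit:
        return 0.0  # outside the modeled mass range 0..I
    for low, high, weight in regions:
        if low <= mass <= high:
            return weight / t  # inside a permissible region
    return noise_level / t  # in range, but explained only by noise


if __name__ == "__main__":
    # Hypothetical regions around two theoretical fragment masses.
    example_regions = [(999.9, 1000.1, 1.0), (1499.9, 1500.1, 1.0)]
    for observed in (1000.05, 1200.0, 6000.0):
        print(observed, fragment_likelihood(observed, example_regions, 0.01, 2.0, 5000.0))
```

Keeping a nonzero noise level means a single unexplained fragment lowers a proteoform's likelihood without driving the whole product to zero.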

Lambda Scores

Assume that prior to scoring, each sequence had an equal probability of being the correct sequence. This means that if we are considering k sequences, then our prior probability is just:

$$\text{prior} = \frac{1}{k}$$

So then, the ratio of the posterior over the prior is:

$$\frac{\text{post}}{\text{prior}} = \frac{\text{post}}{1/k} = k \cdot \text{post}$$

Take the log of this ratio:

$$\hat{\lambda} = \ln(k \cdot \text{post})$$
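The lambda score is the natural log of the posterior divided by the uniform prior 1/k, which is ln(k · post). A short Python sketch of that arithmetic follows; the function name and the handling of a zero posterior are assumptions made for illustration.

```python
import math


def lambda_score(posterior: float, k: int) -> float:
    """ln(posterior / prior) with a uniform prior of 1/k over k candidate sequences."""
    if posterior <= 0.0:
        # Assumed convention: a candidate with zero posterior scores negative infinity.
        return float("-inf")
    return math.log(k * posterior)


if __name__ == "__main__":
    # A candidate whose posterior equals the prior scores 0.
    print(lambda_score(0.25, 4))
    # A candidate with posterior 0.5 among 4 sequences scores ln(2).
    print(lambda_score(0.5, 4))
```

A positive score means the data made the candidate more probable than it was before scoring, and a negative score means the data made it less probable.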

Lambda Spread

[Figure: lambda (-200 to 50) plotted against p-score (0.000 to 0.120).]

The lambda score spreads hits with the same number of matching fragment ions.

Room for Improvement

[Figure: two panels, “Initial Version” and “I to Max of all proteoforms”, each plotted against Theoretical Mass.]

One set of real observations scored against 890,000 random “theoretical” proteoforms.

Scoring Models Compared

[Figure: TPR (Sensitivity), 0.00 to 1.00, plotted against FDR (1-Specificity), 0.00 to 0.10, for four scoring models: Original Model, LS + Fragment Matching, LS + Intensity Correlation, and P Scores.]

Ahlf, D.R., Compton, P.D., Tran, J.C., Early, B.P., Thomas, P.M., Kelleher, N.L. “Evaluation of the Compact High-Field Orbitrap for Top-Down Proteomics of Human Cells”, J. Proteome Res., 2012, 11, 4308-4314. PMCID: PMC3437942.


Future Directions

Add oxidation for MS1.

Improve modeling of various processes.

Incorporate into a search engine.


Conclusions

Include prior knowledge: Science builds on itself.

There is a system that gives a framework for including prior knowledge in models.

This particular implementation is better than older scoring systems, and it can improve!

Acknowledgements and Questions

Kelleher group for providing the data.

All of the many colleagues I have worked with on this project over the years.

Of course all the related funding agencies, but specifically NSF ABI-1062432.