Basic Statistical Concepts

Jesús Vázquez
Laboratorio de Química de Proteínas y Proteómica
Centro de Biología Molecular Severo Ochoa-CSIC (CBMSO)

Córdoba, January 2009

[Figure: scores of database search engines. Labels: SCOPE, hypergeometrical, OMSSA, MASCOT, ???, OLAV (Phenyx); axis label: SCORES]

Scenario:

• We search one MS/MS spectrum against a protein database

• We want to identify in the database the peptide that has produced the MS/MS spectrum

• The first approach is to “score” the peptide candidates and select the peptide yielding the “best score”

Correlation scoring: How does SEQUEST work?

[Figure: observed spectrum and theoretical spectrum; axes: % relative intensity vs. m/z]

SEQUEST measures the degree of correlation between the observed and the theoretical spectrum
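As a rough illustration of correlation scoring, the sketch below bins two spectra on a common m/z grid and subtracts the mean correlation over nearby offsets from the correlation at zero offset. It is a simplified stand-in, not SEQUEST's actual implementation: the function name `xcorr_like`, the 75-bin window and the toy peak positions are all assumptions made for this example.

```python
import numpy as np

def xcorr_like(observed, theoretical, max_lag=75):
    """Zero-offset correlation minus the mean correlation over nearby offsets.

    Both inputs are intensity vectors on the same m/z grid (same length,
    longer than max_lag). Simplified sketch, not the SEQUEST code.
    """
    observed = np.asarray(observed, dtype=float)
    theoretical = np.asarray(theoretical, dtype=float)
    n = len(observed)
    # Correlation at every offset from -(n-1) to +(n-1)
    full = np.correlate(observed, theoretical, mode="full")
    zero = n - 1  # index of the zero-offset term
    window = full[zero - max_lag: zero + max_lag + 1]
    return full[zero] - window.mean()

# Toy spectra (hypothetical peak positions, unit intensities)
observed = np.zeros(200)
observed[[20, 55, 90, 130]] = 1.0
good_candidate = observed.copy()          # all peaks coincide
bad_candidate = np.zeros(200)
bad_candidate[[10, 40, 75, 160]] = 1.0    # no peak coincides

score_good = xcorr_like(observed, good_candidate)
score_bad = xcorr_like(observed, bad_candidate)
print(score_good > score_bad)
```

Subtracting the background term means that a candidate only scores well when its peaks line up at zero offset, rather than merely having many peaks.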

Scenario:

• We search one MS/MS spectrum against a protein database

• We want to determine the sequence of the peptide that has produced the MS/MS spectrum

• The first approach is to “score” the peptide candidates and select the peptide yielding the “best score”

• From previous experience with such a score we can approximately infer whether the peptide candidate is correct or not (e.g. a very high Xcorr score)


Scenario:

• We search one MS/MS spectrum against a protein database; we want to determine the sequence of the peptide that has produced the MS/MS spectrum

• If we want to determine the statistical confidence associated with such a peptide identification, we have to use the “probability distributions” corresponding to the “null hypothesis”

• These distributions can be constructed empirically or on the basis of an “MS/MS-sequence matching model” (theoretically)

• The outcome is a p-value and an E-value for the peptide match


SEQUEST: best score (Xcorr) and delta score (ΔCn)

XCorr scores of the candidate peptides for each spectrum (one column per spectrum, sorted by decreasing score):

Spectrum #   1      2      3      4      5      6      7      8      …      10,000
Scores       6.71   6.01   5.64   5.31   3.20   3.18   3.13   1.65   1.60   1.35
             2.58   2.37   2.52   2.14   1.43   1.56   3.09   1.49   1.44   1.32
             2.51   2.19   2.39   2.07   1.34   1.47   2.92   1.41   1.38   1.22
             2.41   2.15   2.37   2.04   1.27   1.31   2.81   1.37   1.37   1.12
             2.33   2.09   2.30   1.96   1.25   1.30   2.80   1.33   1.33   1.11
             2.31   2.00   2.28   1.95   1.22   1.30   2.79   1.31   1.33   1.10
             2.29   2.00   2.24   1.94   1.20   1.28   2.78   1.28   1.26   1.07
             2.23   1.98   2.23   1.92   1.17   1.27   2.76   1.26   1.17   1.06
             2.21   1.97   2.12   1.90   1.17   1.27   2.74   1.25   1.12   1.03
             2.18   1.95   2.11   1.85   1.12   1.25   2.73   1.21   1.09   0.97

ΔCn = (x1 − x2) / x1, where x1 and x2 are the best and second-best XCorr scores of a spectrum
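A minimal sketch of the delta score, assuming ΔCn = (x1 − x2) / x1 with x1 and x2 the two best XCorr scores of one spectrum. The function name `delta_cn` is made up for this example, and the two score lists are the values read from the columns for spectra 1 and 7 of the table.

```python
def delta_cn(scores):
    """Delta score: relative gap between the best and second-best score."""
    best, second = sorted(scores, reverse=True)[:2]
    return (best - second) / best

# Scores as read from the table (spectrum 1 and spectrum 7)
spectrum_1 = [6.71, 2.58, 2.51, 2.41, 2.33, 2.31, 2.29, 2.23, 2.21, 2.18]
spectrum_7 = [3.13, 3.09, 2.92, 2.81, 2.80, 2.79, 2.78, 2.76, 2.74, 2.73]

print(delta_cn(spectrum_1))  # large gap: the best match stands out
print(delta_cn(spectrum_7))  # small gap: the best match looks like the rest
```

Spectrum 1 gives a delta score of roughly 0.62 and spectrum 7 roughly 0.01, although the best XCorr of spectrum 7 is not small in absolute terms. This is why the delta score carries information that the best score alone does not.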

Information provided by the best score (Xcorr) and the delta score (ΔCn)

[Figure: score vs. ranking of peptide sequences (1 to 9), with the random matching behaviour indicated]

• Best score: evaluates how good the best match is

• Delta score: evaluates how much the best score deviates from random behaviour

• The second-best score also evaluates the deviation of the best score


Search the MS/MS spectrum against a “random”(1) database

XCorr scores of the candidate peptides for each spectrum (one column per spectrum, sorted by decreasing score):

Spectrum #   1      2      3      4      5      6      7      8      …      35,000
Scores       2.65   2.48   2.63   2.21   1.52   1.65   3.11   1.52   1.56   1.35
             2.58   2.37   2.52   2.14   1.43   1.56   3.09   1.49   1.44   1.32
             2.51   2.19   2.39   2.07   1.34   1.47   2.92   1.41   1.38   1.22
             2.41   2.15   2.37   2.04   1.27   1.31   2.81   1.37   1.37   1.12
             2.33   2.09   2.30   1.96   1.25   1.30   2.80   1.33   1.33   1.11
             2.31   2.00   2.28   1.95   1.22   1.30   2.79   1.31   1.33   1.10
             2.29   2.00   2.24   1.94   1.20   1.28   2.78   1.28   1.26   1.07
             2.23   1.98   2.23   1.92   1.17   1.27   2.76   1.26   1.17   1.06
             2.21   1.97   2.12   1.90   1.17   1.27   2.74   1.25   1.12   1.03
             2.18   1.95   2.11   1.85   1.12   1.25   2.73   1.21   1.09   0.97

ΔCn = (x1 − x2) / x1

(1) See later

Construction of probability-score distributions for a single MS/MS spectrum

Take all scores from the search against the inverted database; sort by decreasing score and calculate the normalized rank (rank/N), with N = 10,000:

rank      score    rank/N
1         6.71     0.0001
2         6.01     0.0002
3         5.64     0.0003
4         5.31     0.0004
5         3.2      0.0005
6         3.18     0.0006
7         3.13     0.0007
…         …        …
10,000    1.35     1

[Figure: score distribution (rank, 0 to 10000, vs. Xcorr, 0 to 4)]
[Figure: accumulated frequency distribution (probability, rank/N, 0 to 1, vs. Xcorr, 0 to 4)]

Single-spectrum distributions(1) and definition of p-value

• Make a database search of the MS/MS spectrum against a very large collection of random sequence candidates

• Sort the matches according to the particular score and determine the normalized rank

• The p-value is the probability that the spectrum matches by chance a peptide sequence with a score equal to or better than x

[Figure: probability (rank/N) vs. Xcorr; the p-value is read off the curve at score x]

(1) Also called survival functions by Fenyo & Beavis, 2003
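The definition above can be sketched as a normalized-rank calculation: the empirical p-value of a score x is the fraction of random-match scores that are equal to or better than x. The function name `empirical_p_value` and the small list of random scores are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def empirical_p_value(random_scores, x):
    """Normalized rank of x: fraction of random-match scores >= x."""
    random_scores = np.asarray(random_scores, dtype=float)
    return np.count_nonzero(random_scores >= x) / random_scores.size

# Hypothetical scores of one spectrum against random sequence candidates
random_scores = [0.9, 1.1, 1.2, 1.2, 1.4, 1.5, 1.7, 1.9, 2.3, 3.1]

for x in (1.0, 2.0, 3.5):
    print(x, empirical_p_value(random_scores, x))
```

With only N random candidates the smallest non-zero p-value that can be measured this way is 1/N, which is one motivation for fitting and extrapolating the tail of the distribution.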

Fitting of score distributions to calculate p-values empirically from database search results

• Construct the frequency distribution of all peptide scores obtained when an MS/MS spectrum is searched against a database

• Plot the distribution on a semi-log scale

• Fit the high-scoring portion of the curve to a linear function

• Use the fitting parameters to calculate p-values and E-values for the scores

Original reference of TANDEM, using SONAR: Fenyo & Beavis, 2003
This fitting and extrapolation procedure is justified by the properties of extreme-value (Gumbel) distributions

[Figure: rank/N (log scale, 0.0000001 to 0.1) vs. score (0 to 5)]
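The fitting steps can be sketched as follows, assuming a straight-line fit of log10(rank/N) against score on the top-scoring fraction of the matches. The function names, the 10% tail fraction and the synthetic scores are assumptions made for this example, not the parameters used by any particular search engine.

```python
import numpy as np

def fit_tail(scores, tail_fraction=0.1):
    """Fit log10(rank/N) = intercept + slope * score on the high-scoring tail."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # decreasing score
    n = s.size
    survival = np.arange(1, n + 1) / n                   # normalized rank
    k = max(2, int(n * tail_fraction))                   # points used in the fit
    slope, intercept = np.polyfit(s[:k], np.log10(survival[:k]), 1)
    return slope, intercept

def p_and_e_value(score, slope, intercept, n_candidates):
    """Extrapolate the fitted line to a score; E-value = N * p."""
    p = min(1.0, 10 ** (intercept + slope * score))
    return p, n_candidates * p

# Synthetic scores whose survival function is exactly 10 ** (-2 * score)
n = 1000
ranks = np.arange(1, n + 1)
scores = -np.log10(ranks / n) / 2.0

slope, intercept = fit_tail(scores)
p, e = p_and_e_value(4.0, slope, intercept, n)
print(slope, intercept, p, e)
```

The extrapolated score (4.0) lies beyond every score in the synthetic data (the largest is 1.5), so the p-value comes from the fitted line rather than from a measured rank.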

p-value and expectation or E-value (I)

• Suppose that we get a peptide match with p = 0.001 after searching against N = 1,000 sequence candidates: is that significant?

• The probability of getting one or more matches with p or lower when the event is repeated N times is one minus the probability that none of the N candidates reaches it, (1 − p)^N:

P(one or more matches with p or lower) = 1 − (1 − p)^N ≈ N·p   (the approximation holds when N·p is much smaller than 1)

• In this case Np = 1, and therefore the match is not significant! In other words, we would expect about one random peptide sequence to give a match with p = 0.001

• We need to take into account the number N of sequence candidates to calculate statistical significance, i.e. the expectation value

p-value and expectation or E-value (II)

• Expectation value or E-value: defined as the expected number of matches with a given p-value (or lower) that would be obtained when N sequence candidates are searched:

E = N · p

The E-value is a very common statistical confidence parameter in bioinformatics, and is used by database-searching programs like BLAST
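A short numerical check of the relation between p, N and E, using the example of p = 0.001 and N = 1,000 candidates. The function names are made up for this sketch; the second case (p = 0.000001) is an extra, hypothetical value added to show when N·p is a good approximation.

```python
def prob_at_least_one(p, n):
    """Probability that at least one of n random candidates reaches p or lower."""
    return 1.0 - (1.0 - p) ** n

def e_value(p, n):
    """Expected number of random candidates reaching p or lower."""
    return n * p

for p in (0.001, 0.000001):
    n = 1000
    print(p, e_value(p, n), prob_at_least_one(p, n))
```

For p = 0.001 the E-value is 1 and the exact probability of at least one random match is about 0.63, so the match is clearly not significant. For p = 0.000001 the E-value (0.001) and the exact probability agree closely.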

p-value and expectation or E-value (III): conceptual difference

• The p-value is the probability that the MS/MS spectrum gets a given score (or better) against a single random peptide sequence

• The E-value refers to the best score that the MS/MS spectrum gets against a whole collection of random peptide sequence candidates: it is the expected number of random candidates reaching that score (for E much smaller than 1, approximately the probability that at least one does)

Conclusions on probability models

• Probabilities associated with individual peptide matches are computed by statistical, theoretical or mixed models

• The final outcome is the E-value

Scenario:

• We search a large collection of MS/MS spectra against a protein database

• We want a compromise between the number of identified peptides and statistical confidence

• A) We need to control the sensitivity, accuracy, specificity… of the set of identified peptides

• B) We need to control and maintain at a reasonable level the proportion of false peptide assignations

Performance parameters related to a “2-class prediction problem” (also called “binary classification”): the “contingency table” or “confusion matrix”

                 Actual condition
Test result      Present                                      Absent
Positive         True Positive                                False Positive (Type I error)
                 (condition present + positive result)        (condition absent + positive result)
Negative         False (invalid) Negative (Type II error)     True (accurate) Negative
                 (condition present + negative result)        (condition absent + negative result)

We are searching for a compromise between Type I and Type II errors

Statistical concepts: sensitivity and FPR

• True positive rate (TPR), equivalent to hit rate, recall, sensitivity: TPR = TP / P = TP / (TP + FN)

• False positive rate (FPR), equivalent to false alarm rate, fall-out: FPR = FP / N = FP / (FP + TN)

• Accuracy (ACC): ACC = (TP + TN) / (P + N)

• Specificity (SPC): SPC = TN / (FP + TN) = 1 − FPR

• Positive predictive value (PPV), equivalent to precision: PPV = TP / (TP + FP)

• False negative rate (FNR): FNR = FN / (TP + FN) = 1 − TPR

[Figure: score distributions of true and false assignations, split by a threshold into TP, FN, FP and TN]

Note that the FPR equals the statistical significance (the probability of having a false positive by chance)!
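The performance parameters can be computed directly from the four cells of the confusion matrix. The sketch below uses hypothetical counts; the function name `rates` is made up for this example, and the FDR entry uses the usual definition FP / (TP + FP), i.e. 1 − PPV.

```python
def rates(tp, fp, tn, fn):
    """Performance parameters of a binary classification."""
    return {
        "TPR": tp / (tp + fn),                   # sensitivity, recall
        "FPR": fp / (fp + tn),                   # false alarm rate
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "SPC": tn / (fp + tn),                   # specificity = 1 - FPR
        "PPV": tp / (tp + fp),                   # precision
        "FNR": fn / (tp + fn),                   # 1 - TPR
        "FDR": fp / (tp + fp),                   # 1 - PPV
    }

# Hypothetical counts: 100 true and 100 false assignations
result = rates(tp=90, fp=5, tn=95, fn=10)
for name, value in result.items():
    print(name, round(value, 4))
```

With these counts the FPR is 5% while the FDR is about 5.3%. The two differ because the FPR is normalized by all false assignations and the FDR by all accepted ones.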

Statistical concepts: FDR

• False Discovery Rate (FDR) (1), or simply error rate,
  – is the estimated proportion of false assignations among the set of identified peptides

FDR = FP / (FP + TP)

[Figure: score distributions of true and false assignations, split by a threshold into TP, FN, FP and TN]

(1) Incorrectly called FPR in some works (Elias & Gygi, 2007)

[Figure: score distributions of true and false assignations, split by a threshold into TP, FN, FP and TN, annotated with the related quantities: FDR; FPR, SPC; FNR, TPR]

To calculate FNR or TPR we need to know FN and TP in advance (not possible in real-world experiments)

In practice we establish a threshold with a given FPR (statistical significance) and calculate the associated FDR.

If it is not satisfactory, the significance is varied until a good FDR is reached

ROC* and identification/FDR curves

• Test dataset: ROC curve (TPR vs. FPR, both from 0 to 1), used to test the performance of a discriminatory algorithm; the point of optimum performance is marked on the plot

• Real-world experiment: number of identified peptides (0 to 1,000) vs. FDR (0 to 1, with 1% marked), a graphical display of identification performance in an experiment

*ROC: Receiver Operating Characteristic

Statistical concepts: FDR common questions

• What is the FDR if we identify 1,000 peptides and statistically expect that 50 are false?

• How many false positives are expected if we identify 100 peptides with an FDR of 0.1%?

• Less than 1 false peptide?? Is this statistically correct? Would it not be better to use a 1% FDR in this case?

• And if we only identify 10 peptides? What FDR would we use?
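The arithmetic behind these questions is simple enough to write down. The helper names below are made up for this sketch; the numbers are the ones in the questions.

```python
def fdr(expected_false, identified):
    """Proportion of expected false assignations among the identified peptides."""
    return expected_false / identified

def expected_false(identified, fdr_level):
    """Expected number of false assignations at a given FDR."""
    return identified * fdr_level

print(fdr(50, 1000))              # 1,000 peptides, 50 expected false
print(expected_false(100, 0.001)) # 100 peptides at 0.1% FDR
print(expected_false(100, 0.01))  # 100 peptides at 1% FDR
print(expected_false(10, 0.01))   # 10 peptides at 1% FDR
```

The first case gives an FDR of 5%. With 100 peptides, a 0.1% FDR corresponds to 0.1 expected false peptides and a 1% FDR to about one; with only 10 peptides even a 1% FDR corresponds to a fraction of a peptide, which is where a criterion based on the expected number of false positives becomes more natural.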

Statistical concepts: FDR and FWR

• FWR (family-wise error rate) is controlled by establishing that we have less than ONE expected false positive in an experiment.

• A good criterion is to use FWR when we have low numbers of identified peptides and FDR when the numbers are large

• Caution: low numbers of identified peptides are difficult to control with a proper statistical model.

Statistical concepts: FDR and multiple-hypothesis testing

• FDR is a parameter developed in the framework of multiple-hypothesis testing

• To identify 100 peptides is to assume that 100 individual and independent hypotheses are true

• When we say that we are performing multiple-hypothesis testing, we mean that we are using the FDR to control statistical significance

How to calculate FDR (I)

• To calculate FDR it is useful to determine the distribution of best scores in the experiment when all matches are random (i.e., false). [These were called “average-score distributions” to differentiate them from the “single-spectrum distributions” used to calculate the p-value]

• The simplest case is a search engine with only ONE score parameter

Average vs Single-Spectrum distributions (1)

XCorr scores of the candidate peptides for each spectrum (one column per spectrum, sorted by decreasing score):

Spectrum #   1      2      3      4      5      6      7      8      …      1,000
Scores       6.71   6.01   5.64   5.31   3.20   3.18   3.13   1.65   1.60   1.35
             2.58   2.37   2.52   2.14   1.43   1.56   3.09   1.49   1.44   1.32
             2.51   2.19   2.39   2.07   1.34   1.47   2.92   1.41   1.38   1.22
             2.41   2.15   2.37   2.04   1.27   1.31   2.81   1.37   1.37   1.12
             2.33   2.09   2.30   1.96   1.25   1.30   2.80   1.33   1.33   1.11
             2.31   2.00   2.28   1.95   1.22   1.30   2.79   1.31   1.33   1.10
             2.29   2.00   2.24   1.94   1.20   1.28   2.78   1.28   1.26   1.07
             2.23   1.98   2.23   1.92   1.17   1.27   2.76   1.26   1.17   1.06
             2.21   1.97   2.12   1.90   1.17   1.27   2.74   1.25   1.12   1.03
             2.18   1.95   2.11   1.85   1.12   1.25   2.73   1.21   1.09   0.97

• Single-Spectrum: the distribution of all candidate scores obtained for one spectrum (one column of the table)

• Average: the distribution of the best scores obtained across all spectra (the top row of the table)

ΔCn = (x1 − x2) / x1

(1) As defined by Martín-Maroto (Martínez-Bartolomé et al., Mol. Cell Proteomics 2008)

Construction of SEQUEST average score distributions

XCorr scores of the candidate peptides for each spectrum (one column per spectrum, sorted by decreasing score):

Spectrum #   1      2      3      4      5      6      7      8      …      10,000
Scores       6.71   6.01   5.64   5.31   3.20   3.18   3.13   1.65   1.60   1.35
             2.58   2.37   2.52   2.14   1.43   1.56   3.09   1.49   1.44   1.32
             2.51   2.19   2.39   2.07   1.34   1.47   2.92   1.41   1.38   1.22
             2.41   2.15   2.37   2.04   1.27   1.31   2.81   1.37   1.37   1.12
             2.33   2.09   2.30   1.96   1.25   1.30   2.80   1.33   1.33   1.11
             2.31   2.00   2.28   1.95   1.22   1.30   2.79   1.31   1.33   1.10
             2.29   2.00   2.24   1.94   1.20   1.28   2.78   1.28   1.26   1.07
             2.23   1.98   2.23   1.92   1.17   1.27   2.76   1.26   1.17   1.06
             2.21   1.97   2.12   1.90   1.17   1.27   2.74   1.25   1.12   1.03
             2.18   1.95   2.11   1.85   1.12   1.25   2.73   1.21   1.09   0.97

Take the set of best scores (the top row)

Construction of average score distributions

Take all (best) scores from the search against the inverted database; sort by decreasing score and calculate the normalized rank (rank/N), with N = 10,000:

rank      score    rank/N
1         6.71     0.0001
2         6.01     0.0002
3         5.64     0.0003
4         5.31     0.0004
5         3.2      0.0005
6         3.18     0.0006
7         3.13     0.0007
…         …        …
10,000    1.35     1

[Figure: score distribution (rank, 0 to 10000, vs. Xcorr, 0 to 4)]
[Figure: accumulated frequency distribution (probability, rank/N, 0 to 1, vs. Xcorr, 0 to 4)]

Construction of average score distributions

[Table: rank, score and rank/N of the best scores, repeated from the previous slide]

• Classify scores in categories (bins)
• Construct the histogram
• Obtain the curve

[Figure: probability density distribution (probability density, 0 to 8, vs. Xcorr, 0 to 4), shown as histogram and as fitted curve]

Statistical interpretation of average score distributions

[Figure: accumulated frequency distribution, probability (rank/N) vs. Xcorr; the probability* is read off the curve at score x]
[Figure: probability density vs. Xcorr; the probability* is the area under the curve at score x and above; total area under the curve = 1]

* Probability:
• Probability of observing a score equal to or better than x in the experiment
• This is not the p-value!!!

Average score distributions: Probability and False Discovery Rate

[Figure: probability density vs. score for false and true assignations; a p-threshold (FP/F) is drawn and the areas of false positives (FP) and true positives (TP) are marked]

FDR = FP / (TP + FP)

How to calculate FDR (II)

• When there is more than one score parameter, the situation is more complex; we have to integrate all the parameters into a single one

• If the scores were truly independent, we could simply multiply the probabilities

• SEQUEST yields several score parameters (Xcorr, ΔCn, RSp,…) that are not independent

• Existing methods only differ in the way they integrate the scores

SEQUEST: the best score (Xcorr) and the delta score (ΔCn)

[Figure: score vs. ranking of peptide sequences (1 to 9), with the random matching behaviour indicated]

• Best score: evaluates how good the best match is

• Delta score: evaluates how much the best score deviates from random behaviour

• The second-best score also evaluates the deviation of the best score


Average score distributions: Probability and False Discovery Rate

[Figure: probability density vs. integrated score for false and true assignations; a p-threshold (FP/F) is drawn and the areas of false positives (FP) and true positives (TP) are marked]

FDR = FP / (TP + FP)

Use of random (decoy) databases to calculate FDR

• Not all methods use decoy databases. However, this approach is very robust and widely accepted, and hence is the method of choice

• Construct a decoy database of identical size by reversing the amino acid sequence of each protein (the C-terminus becomes the N-terminus and vice versa), or by reversing the amino acid sequence of each peptide while maintaining the basic residues at the peptide ends (pseudoreversal)

• Search against the target and decoy databases

• Use the results from the decoy database to estimate the number of false positives at a given score threshold (or set of thresholds)

Peng et al., 2003; Strittmatter et al., 2004; Cargile et al., 2004; Qian et al., 2005; Elias and Gygi, 2007

Inverted or Randomized databases?

• Randomized databases do not behave identically to the target database and overestimate error rates (repetition of domains and/or motifs in the target database reduces the total number of unique sequences)

Elias and Gygi, 2007

The two methods to estimate FDR using decoy databases

• Using separate database searches:
  – Search against the target and decoy databases and select tentative identifications by using the same criteria:
    • D = number of peptides identified in the decoy database (false)
    • T = number of peptides identified in the target database (total)
  – Calculate the proportion of false identifications: FDR = D / T

• Using a concatenated database:
  – Construct a composite database by joining the target and decoy databases
  – Search against the composite database and count how many matches were made against the decoy database
  – The number of false positive assignations is estimated by doubling the matches against the decoy database:
    • FDR = 2*D / (D+T)

• The first method was demonstrated to overestimate the number of false positives (good MS/MS spectra produce high scores in the decoy database). The second method avoids this effect by allowing target and decoy sequences to compete for best scores.
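The two estimates can be written as two one-line functions. The function names and the counts below are hypothetical; D and T are the numbers of decoy and target identifications that pass the chosen criteria in each kind of search.

```python
def fdr_separate(decoy_hits, target_hits):
    """Separate searches: FDR = D / T."""
    return decoy_hits / target_hits

def fdr_concatenated(decoy_hits, target_hits):
    """Concatenated search: FDR = 2*D / (D + T)."""
    return 2 * decoy_hits / (decoy_hits + target_hits)

# Hypothetical counts at one score threshold
print(fdr_separate(decoy_hits=30, target_hits=1000))     # D / T
print(fdr_concatenated(decoy_hits=20, target_hits=980))  # 2D / (D + T)
```

In the concatenated search the decoy matches are doubled because, for a random match, the target half of the composite database is assumed to win the competition as often as the decoy half.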

A refined (integrated) strategy to calculate FDR from decoy databases

[Figure: scatter plot of Xcorr (decoy), 0 to 4, vs. Xcorr (target), 0 to 6, divided into regions labelled to, tb, db, do, tu and du; counts shown on the plot: 202 (171), 33 (21), 122 (96), 1666 (1613)]

FDR_SD = (do + tb + db) / (to + tb + db)           (1)

FDR_CD = 2 (do + db) / (to + tb + db + do)         (2)

FDR_TD = (do + 2 db) / (to + tb + db)              (3)

Navarro and Vázquez (J. Proteome Res. 2009, in press)
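A sketch of the three estimates, under an assumed reading of the region labels: `to` and `do` count spectra for which only the target or only the decoy match passes the threshold, and `tb` and `db` count spectra for which both pass and the target or the decoy match scores better. With that reading the separate-search, concatenated-search and refined estimates are FDR_SD = (do + tb + db) / (to + tb + db), FDR_CD = 2 (do + db) / (to + tb + db + do) and FDR_TD = (do + 2 db) / (to + tb + db). The counts used here are hypothetical and are not the numbers shown in the figure.

```python
def fdr_sd(to, tb, db, do):
    """Separate databases: every decoy match above threshold counts as false."""
    return (do + tb + db) / (to + tb + db)

def fdr_cd(to, tb, db, do):
    """Concatenated database: decoy winners doubled, over all winners."""
    return 2 * (do + db) / (to + tb + db + do)

def fdr_td(to, tb, db, do):
    """Refined estimate combining both kinds of information."""
    return (do + 2 * db) / (to + tb + db)

# Hypothetical region counts (assumed meaning of the labels, see above)
counts = dict(to=1500, tb=100, db=20, do=60)
print(fdr_sd(**counts), fdr_cd(**counts), fdr_td(**counts))
```

With these counts the refined estimate is the lowest of the three, so more peptides would be accepted at the same nominal FDR.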

The refined strategy is more sensitive

FDR_SD = (do + tb + db) / (to + tb + db)           (1)

FDR_CD = 2 (do + db) / (to + tb + db + do)         (2)

FDR_TD = (do + 2 db) / (to + tb + db)              (3)

[Figure: three panels plotting FDR_SD or FDR_CD (0 to 0.2) against FDR_TD (0 to 0.2); panel labels: SEQUEST Xcorr, Mascot, Prob. Ratio]

Statistics and peptide identification: the key points

• The statistical significance of a peptide identification from a single MS/MS spectrum is given by the E-value

• E-values are calculated from single-spectrum score distributions. These distributions are determined empirically or using theoretical models

• In large-scale peptide identification experiments we must use the FDR criterion

• FDR can be calculated from the average score distributions and/or by searching against a decoy (inverted) database

Statistics and Quantification

• All relative quantification approaches (arrays, 2DE, DIGE, isotope labeling, label-free…) use the same statistical background:
  – A null hypothesis (NH) model (no expression changes) is established
  – An expression change is considered significant on the basis of the NH probability:
    • The probability that the change can be explained by the null hypothesis

The three major problems in quantitative experiments

1.- Establishing a NH model adequate for the experimental approach

2.- Checking that the NH model can be applied to our particular experiment

3.- Determining statistical significance on the basis of a multiple-hypothesis testing approach

Gaussian distribution is the most common NH model

• The Central Limit Theorem establishes that the mean of a large number of determinations has an approximately normal distribution

• Expression ratios are affected by multiplicative factors

• Ratios in a log scale make the factors additive

• Hence the most probable model for log(Ratios) is a Gaussian distribution

• Caution: a Gaussian model implies a constant variance!!

1.- Validity of the NH model for the experimental approach

• It is usually assumed a priori (use of commercial software packages). A normal distribution is almost always used

• The NH model has been validated in only a very limited number of cases

• NH models from arrays or other experimental approaches are often extrapolated to other situations

2.- Checking that the NH is valid for our own experiment

• Only a very limited number of papers test the validity of the NH model employed

• The validity of the NH model may be tested in the same experiment or, preferably, by making a test experiment

3.- Multiple hypothesis testing

• We detect several expression changes (not only one); hence we are making a multiple check of the NH

• Example: we detect 100 expression changes among 1,000 quantified proteins with p<0.05. How many are expected to belong to the NH (i.e. to be false)? (taken from a recent MCP paper)

• When we study more than one expression change we must use the FDR instead of the probability!!

3.- Multiple hypothesis testing: calculation of the FDR

1.- Establish the null hypothesis (NH) (e.g. a normal distribution)
2.- Calculate the p(NH) for each one of the proteins
3.- Establish a p-threshold and calculate the expected number of false positives = Np
4.- Count how many positives we observe below the p-threshold (O)
5.- Calculate FDR = Np/O
6.- Change p, recalculate FDR and iterate until the desired FDR is obtained
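The steps above can be sketched as a small loop, assuming the p(NH) values of all N quantified proteins are already available. The function names, the starting threshold, the shrink factor and the synthetic p-values are assumptions made for this example; the synthetic data mimic the case of 100 changes detected among 1,000 proteins with p<0.05.

```python
def fdr_at_threshold(p_values, p_threshold):
    """FDR = N*p / O, with O the number of proteins at or below the threshold."""
    n = len(p_values)
    observed = sum(1 for p in p_values if p <= p_threshold)
    if observed == 0:
        return 0.0, 0
    return min(1.0, n * p_threshold / observed), observed

def find_threshold(p_values, target_fdr, start=0.05, shrink=0.9, min_p=1e-12):
    """Lower the p-threshold step by step until the FDR target is met."""
    p = start
    while p > min_p:
        fdr, observed = fdr_at_threshold(p_values, p)
        if observed > 0 and fdr <= target_fdr:
            return p, fdr, observed
        p *= shrink
    return None  # no threshold reaches the target

# Synthetic example: 1,000 proteins, 100 of them with very low p(NH)
p_values = [0.0001] * 100 + [0.5] * 900

print(fdr_at_threshold(p_values, 0.05))         # expected false = 50 of 100
print(find_threshold(p_values, target_fdr=0.05))
```

At p<0.05 the expected number of false positives is 1,000 × 0.05 = 50, so half of the 100 detected changes are expected to be false. In this synthetic case the threshold has to drop to about 0.005 before the FDR falls to 5%.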