Basic Statistical Concepts - Unidad de...

Basic Statistical Concepts. Jesús Vázquez. CBMSO

Basic Statistical Concepts

Jesús Vázquez
Laboratorio de Química de Proteínas y Proteómica
Centro de Biología Molecular Severo Ochoa-CSIC

Córdoba, January 2009


[Figure: overview of search-engine scoring. Labels on the slide: SCOPE; hypergeometrical; OMSSA; MASCOT; ???; OLAV (Phenyx). Axis label: SCORES]


Scenario:

• We search one MS/MS spectrum against a protein database

• We want to identify the database peptide that produced the MS/MS spectrum

• The first approach is to “score” the peptide candidates and select the peptide yielding the “best score”


Correlation scoring: How does SEQUEST work?

[Figure: observed spectrum and theoretical spectrum, each plotted as % relative intensity vs m/z]

SEQUEST measures the degree of correlation between the observed and the theoretical spectrum


Scenario:

• We search one MS/MS spectrum against a protein database

• We want to determine the sequence of the peptide that produced the MS/MS spectrum

• The first approach is to “score” the peptide candidates and select the peptide yielding the “best score”

• From previous experience with such a score we can approximately infer whether the peptide candidate is correct or not (e.g. from a very high Xcorr score)






Scenario:

• We search one MS/MS spectrum against a protein database; we want to determine the sequence of the peptide that produced the MS/MS spectrum

• To determine the statistical confidence associated with such a peptide identification, we have to use the “probability distributions” corresponding to the “null hypothesis”

• These distributions can be constructed empirically or theoretically, on the basis of an “MS/MS-sequence matching model”

• The outcome is a p-value and an E-value for the peptide match






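SEQUEST: best score (Xcorr) and delta score (ΔCn)

Scores of the ranked peptide candidates for each spectrum (one column per spectrum, best score first; the first row holds the Xcorr values):

Spectrum #     1     2     3     4     5     6     7     8     …   10.000
            6.71  6.01  5.64  5.31  3.20  3.18  3.13  1.65  1.60   1.35
            2.58  2.37  2.52  2.14  1.43  1.56  3.09  1.49  1.44   1.32
            2.51  2.19  2.39  2.07  1.34  1.47  2.92  1.41  1.38   1.22
            2.41  2.15  2.37  2.04  1.27  1.31  2.81  1.37  1.37   1.12
            2.33  2.09  2.30  1.96  1.25  1.30  2.80  1.33  1.33   1.11
            2.31  2.00  2.28  1.95  1.22  1.30  2.79  1.31  1.33   1.10
            2.29  2.00  2.24  1.94  1.20  1.28  2.78  1.28  1.26   1.07
            2.23  1.98  2.23  1.92  1.17  1.27  2.76  1.26  1.17   1.06
            2.21  1.97  2.12  1.90  1.17  1.27  2.74  1.25  1.12   1.03
            2.18  1.95  2.11  1.85  1.12  1.25  2.73  1.21  1.09   0.97

$\Delta C_n = \frac{x_1 - x_2}{x_1}$, where $x_1$ and $x_2$ are the best and second-best scores of the spectrum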
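As a minimal sketch of the delta score, the snippet below computes Xcorr and ΔCn for one spectrum. The function name `delta_cn` is illustrative, and the input is the set of ten scores that the example table lists for spectrum 1 (assuming that column is read correctly from the slide).

```python
def delta_cn(scores):
    """Return (best score, delta Cn) for the candidate scores of one spectrum."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        raise ValueError("need at least two scores and a positive best score")
    x1, x2 = ranked[0], ranked[1]
    # Delta Cn: relative drop from the best to the second-best score
    return x1, (x1 - x2) / x1


# Scores listed for spectrum 1 in the slide's example table (assumed reading)
spectrum_1 = [6.71, 2.58, 2.51, 2.41, 2.33, 2.31, 2.29, 2.23, 2.21, 2.18]
best, dcn = delta_cn(spectrum_1)
print(f"Xcorr = {best:.2f}, dCn = {dcn:.3f}")
```

A large ΔCn means that the best candidate stands well apart from the rest, which is what the slide describes as deviation from random behaviour.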


Information provided by the best score (Xcorr) and the delta score (ΔCn)

[Figure: score vs ranking of peptide sequences (1 to 9), with the random matching behaviour indicated]

• Best score: evaluates how good the best match is

• Delta score: evaluates how much the best score deviates from random behaviour

• The second-best score also evaluates the deviation of the best score






Search the MS/MS spectrum against a “random”(1) database

Scores of the ranked peptide candidates (one column per spectrum, best score first; the first row holds the Xcorr values):

Spectrum #     1     2     3     4     5     6     7     8     …   35.000
            2.65  2.48  2.63  2.21  1.52  1.65  3.11  1.52  1.56   1.35
            2.58  2.37  2.52  2.14  1.43  1.56  3.09  1.49  1.44   1.32
            2.51  2.19  2.39  2.07  1.34  1.47  2.92  1.41  1.38   1.22
            2.41  2.15  2.37  2.04  1.27  1.31  2.81  1.37  1.37   1.12
            2.33  2.09  2.30  1.96  1.25  1.30  2.80  1.33  1.33   1.11
            2.31  2.00  2.28  1.95  1.22  1.30  2.79  1.31  1.33   1.10
            2.29  2.00  2.24  1.94  1.20  1.28  2.78  1.28  1.26   1.07
            2.23  1.98  2.23  1.92  1.17  1.27  2.76  1.26  1.17   1.06
            2.21  1.97  2.12  1.90  1.17  1.27  2.74  1.25  1.12   1.03
            2.18  1.95  2.11  1.85  1.12  1.25  2.73  1.21  1.09   0.97

$\Delta C_n = \frac{x_1 - x_2}{x_1}$

(1) See later


Construction of probability-score distributions for a single MS/MS spectrum

Take all scores from the search against the inverted database; sort by decreasing score and calculate the normalized rank (rank/N, where N is the number of scores)

rank      score   rank/N
1         6.71    0.0001
2         6.01    0.0002
3         5.64    0.0003
4         5.31    0.0004
5         3.2     0.0005
6         3.18    0.0006
7         3.13    0.0007
…         …       …
10,000    1.35    1

[Figure, two panels against Xcorr (0 to 4): Rank (0 to 10000) and Probability (rank/N, 0 to 1). Panel labels: “Score distribution” and “Accumulated frequency distribution”]
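The rank table can be built in a few lines. The following is only a sketch: the function name `normalized_ranks` and the short score list are illustrative, and ties between scores are not treated specially.

```python
def normalized_ranks(scores):
    """Sort scores in decreasing order and return (rank, score, rank/N) rows."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    return [(rank, score, rank / n) for rank, score in enumerate(ranked, start=1)]


# Eight illustrative scores; a real search would contribute thousands
scores = [1.35, 3.18, 6.71, 5.31, 3.13, 6.01, 3.20, 5.64]
table = normalized_ranks(scores)
for rank, score, fraction in table:
    print(f"{rank:>4}  {score:5.2f}  {fraction:.4f}")
```

Plotting rank/N against score gives the accumulated frequency distribution of the slide.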


Single-spectrum distributions(1) and definition of p-value

• Search the MS/MS spectrum against a very large collection of random sequence candidates

• Sort the matches by the chosen score and determine the normalized rank

• The p-value of a score x is the probability that the spectrum matches a peptide sequence by chance with a score equal to or better than x

[Figure: probability (rank/N, 0 to 1) vs Xcorr (0 to 4); the p-value is read on the curve at score x]

(1) Also called survival functions by Fenyo & Beavis, 2003
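Read this way, the empirical p-value of a score x is simply the fraction of random matches scoring at least x. A minimal sketch, with an illustrative function name and made-up null scores:

```python
from bisect import bisect_left


def empirical_p_value(null_scores, x):
    """Fraction of random-match scores that are equal to or better than x."""
    ordered = sorted(null_scores)
    # bisect_left gives the index of the first score >= x
    n_equal_or_better = len(ordered) - bisect_left(ordered, x)
    return n_equal_or_better / len(ordered)


null_scores = [1.0, 1.5, 2.0, 2.5, 3.0]  # illustrative random-match scores
print(empirical_p_value(null_scores, 2.0))
print(empirical_p_value(null_scores, 3.5))
```

Note that the empirical estimate cannot resolve p-values below 1/N and returns 0 for any score above the best random match.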


Fitting of score distributions to calculate p-values empirically from database search results

• Construct the frequency distribution of all peptide scores obtained while an MS/MS spectrum is searched against a database

• Plot the distribution on a semi-log scale

• Fit the high-scoring portion of the curve to a linear function

• Use the fitting parameters to calculate p-values and E-values for the scores

Original reference of TANDEM, using SONAR: Fenyo & Beavis, 2003. This fitting and extrapolation procedure is justified by the properties of extreme-value (Gumbel) distributions

[Figure: rank/N (log scale, from 0.1 down to 0.0000001) vs score (0 to 5)]
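The fitting steps can be sketched as follows. This is not the TANDEM implementation: the base-10 logarithm, the choice of the top 10% of scores as the "high-scoring portion", and all function names are assumptions made for illustration.

```python
import math


def fit_survival_tail(scores, tail_fraction=0.1):
    """Least-squares line through log10(rank/N) vs score for the top-scoring tail."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    k = max(2, int(n * tail_fraction))
    xs = ranked[:k]
    ys = [math.log10(rank / n) for rank in range(1, k + 1)]
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(score, slope, intercept, n_candidates):
    """Extrapolated p-value and E-value (E = N * p) for a score."""
    p = min(1.0, 10 ** (slope * score + intercept))
    return p, p * n_candidates


# Synthetic null scores built so that log10(rank/N) = -score exactly
n = 1000
null_scores = [-math.log10(rank / n) for rank in range(1, n + 1)]
slope, intercept = fit_survival_tail(null_scores)
p, e = extrapolate(5.0, slope, intercept, n_candidates=n)
print(f"slope = {slope:.3f}, p(5.0) = {p:.2e}, E = {e:.2e}")
```

The extrapolation is what allows a p-value to be assigned to a score higher than any random match actually observed.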


p-value and expectation or E-value (I)

• Suppose that we get a peptide match with p = 0.001 after searching against N = 1,000 sequence candidates: is that significant?

• The probability of getting one or more matches with a p-value of p or lower when the event is repeated N times equals one minus the probability that none of the N candidates reaches it (each one fails with probability 1 − p), i.e.:

$P(\text{one or more matches with } p \text{ or lower}) = 1 - (1 - p)^N \approx Np$ (the approximation holds when $Np \ll 1$)

• In this case Np = 1: on average, one random peptide sequence is expected to give a match with p = 0.001, and therefore the match is not significant!

• We need to take into account the number N of sequence candidates to calculate statistical significance, i.e. the expectation value
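The arithmetic of this example can be checked directly. The sketch below assumes independent candidates; the function name is illustrative.

```python
def prob_at_least_one(p, n):
    """Probability of one or more chance matches among n independent candidates."""
    return 1.0 - (1.0 - p) ** n


p, n = 0.001, 1000
exact = prob_at_least_one(p, n)  # exact value, about 0.63
expected_matches = n * p         # the product N*p, here 1
print(f"P(at least one) = {exact:.3f}, N*p = {expected_matches:.1f}")
```

With N·p = 1 the exact probability of at least one chance match is about 0.63, so a single match at p = 0.001 carries no evidence; the linear approximation N·p is only close to the exact value when N·p is much smaller than 1.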


p-value and expectation or E-value (II)

• Expectation value or E-value: the expected number of matches with a given p-value (or lower) that would be obtained when N sequence candidates are searched:

$E = N \cdot p$

The E-value is a very common statistical confidence parameter in bioinformatics, and is used by database-searching programs like BLAST


p-value and expectation or E-value (III): conceptual difference

• The p-value is the probability that the MS/MS spectrum gets a given score (or better) against a single random peptide sequence

• The E-value is the expected number of times the MS/MS spectrum gets that score (or better) against a collection of random peptide sequence candidates; when E is much smaller than 1, it approximates the probability that the best score arises by chance


Conclusions on probability models

• Probabilities associated with individual peptide matches are computed by statistical, theoretical or mixed models

• The final outcome is the E-value


Scenario:

• We search a large collection of MS/MS spectra against a protein database

• We want a compromise between the number of identified peptides and statistical confidence

• A) We need to control the sensitivity, accuracy, specificity… of the set of identified peptides

• B) We need to control, and keep at a reasonable level, the proportion of false peptide assignments


Performance parameters related to a “2-class prediction problem” (also called “binary classification”): “contingency table” or “confusion matrix”

Test result | Actual condition: present | Actual condition: absent
Positive | True Positive | False Positive (Type I error)
Negative | False Negative (Type II error) | True Negative

We are searching for a compromise between Type I and Type II errors


Statistical concepts: sensitivity and FPR

• True positive rate (TPR), equivalent to hit rate, recall, sensitivity: TPR = TP / P = TP / (TP + FN)

• False positive rate (FPR), equivalent to false alarm rate, fall-out: FPR = FP / N = FP / (FP + TN)

• False negative rate (FNR): FNR = FN / (TP + FN) = 1 − TPR

• Accuracy (ACC): ACC = (TP + TN) / (P + N)

• Specificity (SPC): SPC = TN / (FP + TN) = 1 − FPR

• Positive predictive value (PPV), equivalent to precision: PPV = TP / (TP + FP)

[Figure: score distributions of true and false assignments, split by a threshold into TP, FN, FP and TN]

Note that the FPR equals the statistical significance (the probability of having a false positive by chance)!
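For reference, the definitions above translate directly into code. The function name and the counts are illustrative only.

```python
def classification_rates(tp, fp, tn, fn):
    """Performance parameters of a binary classifier from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),                   # sensitivity, recall
        "FPR": fp / (fp + tn),                   # false alarm rate
        "FNR": fn / (tp + fn),                   # 1 - TPR
        "SPC": tn / (fp + tn),                   # specificity, 1 - FPR
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "PPV": tp / (tp + fp),                   # precision
    }


rates = classification_rates(tp=90, fp=10, tn=890, fn=10)
for name, value in rates.items():
    print(f"{name} = {value:.3f}")
```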


Statistical concepts: FDR

• False Discovery Rate (FDR)(1), or simply error rate, is the estimated proportion of false assignments among the set of identified peptides:

$FDR = \frac{FP}{TP + FP}$

[Figure: threshold splitting the score distributions into TP, FN, FP and TN]

(1) Incorrectly called FPR in some works (Elias & Gygi, 2007)


[Figure: threshold diagram (TP, FN, FP, TN) indicating the regions involved in FDR; in FPR and SPC; and in FNR and TPR]

To calculate FNR or TPR we need to know FN and TP in advance (not possible in real-world experiments)

In practice we establish a threshold with a given FPR (statistical significance) and calculate the associated FDR.

If it is not satisfactory, the significance is varied until a good FDR is reached


ROC* and identification/FDR curves

• Test dataset: ROC curve (TPR vs FPR, both from 0 to 1), used to test the performance of a discriminatory algorithm; the point of optimum performance is marked

• Real-world experiment: number of identified peptides (0 to 1,000) vs FDR (0 to 1), a graphical display of identification performance in an experiment; the 1% level is marked

*ROC: Receiver Operating Characteristic


Statistical concepts: FDR common questions

• What is the FDR if we identify 1,000 peptides and statistically expect 50 of them to be false?

• How many false positives are expected if we identify 100 peptides with an FDR of 0.1%?

• Less than 1 false peptide?? Is this statistically correct? Would it not be better to use a 1% FDR in this case?

• And if we only identify 10 peptides? What FDR should we use?
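The numbers behind these questions follow from FDR = (expected false identifications) / (identified peptides). A short worked sketch, with illustrative function names:

```python
def fdr_from_counts(expected_false, identified):
    """FDR as the expected proportion of false identifications."""
    return expected_false / identified


def expected_false_positives(identified, fdr):
    """Expected number of false identifications at a given FDR."""
    return identified * fdr


print(fdr_from_counts(50, 1000))             # 1,000 peptides, 50 expected false
print(expected_false_positives(100, 0.001))  # 100 peptides at 0.1% FDR
print(expected_false_positives(100, 0.01))   # 100 peptides at 1% FDR
print(expected_false_positives(10, 0.01))    # 10 peptides at 1% FDR
```

The first case gives an FDR of 5%. In the other cases the expected number of false identifications is 1 or less, which is why the FDR becomes hard to interpret for small sets of identified peptides.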


Statistical concepts: FDR and FWR

• The FWR (family-wise error rate) is controlled by requiring less than ONE expected false positive in the whole experiment.

• A good criterion is to use the FWR when the number of identified peptides is low, and the FDR when it is large

• Caution: low numbers of identified peptides are difficult to control with a proper statistical model.


Statistical concepts: FDR and multiple-hypothesis testing

• FDR is a parameter developed in the framework of multiple-hypothesis testing

• To identify 100 peptides is to assume that 100 individual and independent hypotheses are true

• When we say that we are performing multiple-hypothesis testing, we mean that we are using the FDR to control statistical significance


How to calculate FDR (I)

• To calculate the FDR it is useful to determine the distribution of best scores in the experiment when all matches are random (i.e., false). [These were called “average-score distributions” to differentiate them from the “single-spectrum distributions” used to calculate the p-value]

• The simplest case is a search engine with only ONE score parameter


Average vs Single-Spectrum distributions (1)

Scores of the ranked peptide candidates (one column per spectrum, best score first; the first row holds the Xcorr values):

Spectrum #     1     2     3     4     5     6     7     8     …     1000
            6.71  6.01  5.64  5.31  3.20  3.18  3.13  1.65  1.60   1.35
            2.58  2.37  2.52  2.14  1.43  1.56  3.09  1.49  1.44   1.32
            2.51  2.19  2.39  2.07  1.34  1.47  2.92  1.41  1.38   1.22
            2.41  2.15  2.37  2.04  1.27  1.31  2.81  1.37  1.37   1.12
            2.33  2.09  2.30  1.96  1.25  1.30  2.80  1.33  1.33   1.11
            2.31  2.00  2.28  1.95  1.22  1.30  2.79  1.31  1.33   1.10
            2.29  2.00  2.24  1.94  1.20  1.28  2.78  1.28  1.26   1.07
            2.23  1.98  2.23  1.92  1.17  1.27  2.76  1.26  1.17   1.06
            2.21  1.97  2.12  1.90  1.17  1.27  2.74  1.25  1.12   1.03
            2.18  1.95  2.11  1.85  1.12  1.25  2.73  1.21  1.09   0.97

$\Delta C_n = \frac{x_1 - x_2}{x_1}$

[The slide shows this table twice, once labelled “Average” and once labelled “Single-Spectrum”]

(1) As defined by Martín-Maroto (Martínez-Bartolomé et al., Mol. Cell Proteomics 2008)


Construction of SEQUEST average score distributions

Scores of the ranked peptide candidates (one column per spectrum, best score first; the first row holds the Xcorr values):

Spectrum #     1     2     3     4     5     6     7     8     …    10000
            6.71  6.01  5.64  5.31  3.20  3.18  3.13  1.65  1.60   1.35
            2.58  2.37  2.52  2.14  1.43  1.56  3.09  1.49  1.44   1.32
            2.51  2.19  2.39  2.07  1.34  1.47  2.92  1.41  1.38   1.22
            2.41  2.15  2.37  2.04  1.27  1.31  2.81  1.37  1.37   1.12
            2.33  2.09  2.30  1.96  1.25  1.30  2.80  1.33  1.33   1.11
            2.31  2.00  2.28  1.95  1.22  1.30  2.79  1.31  1.33   1.10
            2.29  2.00  2.24  1.94  1.20  1.28  2.78  1.28  1.26   1.07
            2.23  1.98  2.23  1.92  1.17  1.27  2.76  1.26  1.17   1.06
            2.21  1.97  2.12  1.90  1.17  1.27  2.74  1.25  1.12   1.03
            2.18  1.95  2.11  1.85  1.12  1.25  2.73  1.21  1.09   0.97

Take the set of best scores
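Taking the set of best scores amounts to keeping one value per spectrum and ranking those values. A minimal sketch, assuming the scores are held in a dictionary keyed by spectrum number (the data layout and the function name are illustrative):

```python
def average_score_distribution(scores_by_spectrum):
    """Rank the best score of every spectrum; return (rank, score, rank/N) rows."""
    best = sorted((max(s) for s in scores_by_spectrum.values()), reverse=True)
    n = len(best)
    return [(rank, score, rank / n) for rank, score in enumerate(best, start=1)]


# Top three candidate scores of four spectra (illustrative subset)
scores_by_spectrum = {
    1: [6.71, 2.58, 2.51],
    2: [6.01, 2.37, 2.19],
    3: [5.64, 2.52, 2.39],
    4: [5.31, 2.14, 2.07],
}
distribution = average_score_distribution(scores_by_spectrum)
print(distribution)
```

In contrast, a single-spectrum distribution would rank all candidate scores of one spectrum instead of one best score per spectrum.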


Construction of average score distributions

Take all (best) scores from the search against the inverted database; sort by decreasing score and calculate the normalized rank (rank/N, where N is the number of scores)

rank      score   rank/N
1         6.71    0.0001
2         6.01    0.0002
3         5.64    0.0003
4         5.31    0.0004
5         3.2     0.0005
6         3.18    0.0006
7         3.13    0.0007
…         …       …
10,000    1.35    1

[Figure, two panels against Xcorr (0 to 4): Rank (0 to 10000) and Probability (rank/N, 0 to 1). Panel labels: “Score distribution” and “Accumulated frequency distribution”]


Construction of average score distributions

rank      score   rank/N
1         6.71    0.0001
2         6.01    0.0002
3         5.64    0.0003
4         5.31    0.0004
5         3.2     0.0005
6         3.18    0.0006
7         3.13    0.0007
…         …       …
10,000    1.35    1

• Classify the scores in categories (bins)

• Construct the histogram

• Obtain the curve

[Figure, three panels: probability density (0 to 8) vs Xcorr (0 to 4). Label: “Probability density distribution”]
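The binning step can be sketched as below. The bin width of 0.5 and the function name are arbitrary choices for illustration; dividing each count by N times the bin width makes the total area under the histogram equal to 1.

```python
def probability_density(scores, bin_width=0.5):
    """Histogram of scores normalized so that the total area equals 1."""
    n = len(scores)
    counts = {}
    for s in scores:
        index = int(s // bin_width)  # bin that contains the score
        counts[index] = counts.get(index, 0) + 1
    # Key: lower edge of the bin; value: density = count / (N * bin width)
    return {i * bin_width: c / (n * bin_width) for i, c in sorted(counts.items())}


scores = [0.2, 0.7, 0.8, 1.2]  # illustrative best scores
density = probability_density(scores, bin_width=0.5)
area = sum(d * 0.5 for d in density.values())
print(density, area)
```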


Statistical interpretation of average score distributions

[Figure, two panels against Xcorr (0 to 4). Left: probability (rank/N, 0 to 1), with Probability* read on the curve at score x. Right: probability density (0 to 8), with Probability* shown as the area under the curve beyond x. Total area under the curve = 1]

* Probability of observing a score equal to or better than x in the experiment

• This is not the p-value!!!


Average score distributions: Probability and False Discovery Rate

[Figure: probability density vs score for false positives and true positives; a p-threshold (FP/F) delimits the FP and TP areas]

FDR = FP/(TP+FP)


How to calculate FDR (II)

• When there is more than one score parameter, the situation is more complex; we have to integrate all the parameters into a single one

• If the scores were truly independent, we could simply multiply the probabilities

• SEQUEST yields several score parameters (Xcorr, ΔCn, RSp,…) that are not independent

• Existing methods differ only in the way they integrate the scores


SEQUEST: the best score (Xcorr) and the delta score (ΔCn)

[Figure: score vs ranking of peptide sequences (1 to 9), with the random matching behaviour indicated]

• Best score: evaluates how good the best match is

• Delta score: evaluates how much the best score deviates from random behaviour

• The second-best score also evaluates the deviation of the best score






Average score distributions: Probability and False Discovery Rate

[Figure: probability density vs integrated score for false positives and true positives; a p-threshold (FP/F) delimits the FP and TP areas]

FDR = FP/(TP+FP)


Use of random (decoy) databases to calculate FDR

• Not all methods use decoy databases. However, this approach is very robust and widely accepted, and hence is the method of choice

• Construct a decoy database of identical size by reversing the amino acid sequence of each protein (the C-terminus becomes the N-terminus and vice versa), or by reversing the amino acid sequence of each peptide while keeping the C-terminal basic residue of each tryptic peptide in place (pseudoreversal)

• Search against the target and decoy databases

• Use the results from the decoy database to estimate the number of false positives at a given score threshold (or set of thresholds)

Peng et al., 2003; Strittmatter et al., 2004; Cargile et al., 2004; Qian et al., 2005; Elias and Gygi, 2007
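Both ways of building decoy sequences can be sketched in a few lines. The sketch makes several assumptions that are not in the slides: peptides are tryptic, cleavage occurs after every K or R (no proline rule, no missed cleavages), and pseudoreversal keeps the C-terminal K or R of each peptide in place. All function names are illustrative.

```python
import re


def reversed_protein(sequence):
    """Decoy by full reversal: the C-terminus becomes the N-terminus."""
    return sequence[::-1]


def tryptic_peptides(sequence):
    """Simplified in-silico digestion: cleave after every K or R."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)


def pseudo_reversed_peptide(peptide):
    """Reverse a peptide but keep a C-terminal K or R in place."""
    if peptide and peptide[-1] in "KR":
        return peptide[-2::-1] + peptide[-1]
    return peptide[::-1]


def pseudo_reversed_protein(sequence):
    """Decoy by pseudoreversal of each tryptic peptide."""
    return "".join(pseudo_reversed_peptide(p) for p in tryptic_peptides(sequence))


protein = "MKTAYRGG"  # hypothetical sequence
print(reversed_protein(protein))
print(tryptic_peptides(protein))
print(pseudo_reversed_protein(protein))
```

Pseudoreversal keeps the peptide masses and cleavage sites of the target database, so target and decoy peptides fall in the same precursor mass windows.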


Inverted or Randomized databases?

• Randomized databases do not behave identically to the target database and overestimate error rates (repetition of domains and/or motifs in the target database reduces its total number of unique sequences)

Elias and Gygi, 2007


The two methods to estimate FDR using decoy databases

• Using separate database searches:
  – Search against the target and decoy databases and select tentative identifications using the same criteria:
    • D = number of peptides identified in the decoy database (false identifications)
    • T = number of peptides identified in the target database (total)
  – Calculate the proportion of false identifications: FDR = D / T

• Using a concatenated database:
  – Construct a composite database by joining the target and decoy databases
  – Search against the composite database and count how many matches were made against the decoy database (here D and T are the matches against decoy and target sequences in the composite search)
  – The number of false positive assignments is estimated by doubling the number of matches against the decoy database:
    • FDR = 2*D / (D+T)

• The first method was demonstrated to overestimate the number of false positives (good MS/MS spectra produce high scores in the decoy database). The second method avoids this effect by allowing target and decoy sequences to compete for the best scores.
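The two estimators differ only in how the decoy hits are counted. A minimal sketch with made-up counts (function names are illustrative):

```python
def fdr_separate(decoy_hits, target_hits):
    """Separate searches: FDR = D / T."""
    return decoy_hits / target_hits


def fdr_concatenated(decoy_hits, target_hits):
    """Concatenated search: FDR = 2*D / (D + T)."""
    return 2 * decoy_hits / (decoy_hits + target_hits)


# Separate searches: 1000 target and 50 decoy identifications above threshold
print(fdr_separate(decoy_hits=50, target_hits=1000))
# Concatenated search: 975 target and 25 decoy matches above threshold
print(fdr_concatenated(decoy_hits=25, target_hits=975))
```

The factor of 2 in the concatenated case reflects the assumption that a random match is equally likely to fall on a target or on a decoy sequence, so every decoy match stands for one undetected false target match.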


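A refined (integrated) strategy to calculate FDR from decoy databases

[Figure: scatter plot of Xcorr (decoy) (0 to 4) vs Xcorr (target) (0 to 6), divided into regions labelled tb, db, do, to, tu and du; counts shown on the plot: 202 (171), 33 (21), 122 (96), 1666 (1613)]

$FDR_{SD} = \frac{do + tb + db}{to + tb + db} \quad (1)$

$FDR_{CD} = \frac{2\,(db + do)}{to + tb + db + do} \quad (2)$

$FDR_{TD} = \frac{do + 2\,db}{to + tb + db} \quad (3)$

Navarro and Vázquez (J. Proteome Res. 2009, in press)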
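To compare the three estimators numerically, the sketch below evaluates them for one set of counts. Everything here is hedged: the formulas are written as read from the slide, FDR_SD = (do + tb + db)/(to + tb + db), FDR_CD = 2(db + do)/(to + tb + db + do) and FDR_TD = (do + 2·db)/(to + tb + db); the interpretation of the symbols (to/do: only the target/decoy match passes the threshold; tb/db: both pass and the target/decoy match scores better) is an assumption; and the counts are hypothetical, not those of the figure.

```python
def decoy_fdr_estimates(to, tb, db, do):
    """Three decoy-based FDR estimates from region counts (assumed meaning:
    to/do = only target/decoy above threshold, tb/db = both above threshold
    with the target/decoy match scoring better)."""
    fdr_sd = (do + tb + db) / (to + tb + db)      # separate databases
    fdr_cd = 2 * (db + do) / (to + tb + db + do)  # concatenated database
    fdr_td = (do + 2 * db) / (to + tb + db)       # refined strategy
    return fdr_sd, fdr_cd, fdr_td


# Hypothetical counts, chosen only to illustrate the arithmetic
fdr_sd, fdr_cd, fdr_td = decoy_fdr_estimates(to=1600, tb=150, db=30, do=20)
print(f"FDR_SD = {fdr_sd:.4f}, FDR_CD = {fdr_cd:.4f}, FDR_TD = {fdr_td:.4f}")
```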


The refined strategy is more sensitive

$FDR_{SD} = \frac{do + tb + db}{to + tb + db} \quad (1)$

$FDR_{CD} = \frac{2\,(db + do)}{to + tb + db + do} \quad (2)$

$FDR_{TD} = \frac{do + 2\,db}{to + tb + db} \quad (3)$

[Figure, three panels: FDR_SD or FDR_CD (0 to 0.2) plotted against FDR_TD (0 to 0.2). Labels: SEQUEST Xcorr, Mascot, Prob. Ratio]


Statistics and peptide identification: the key points

• Statistical significance of peptide identification from a single MS/MS spectrum is given by the E-value

• E-values are calculated from single-spectrum score distributions. These distributions are determined empirically or using theoretical models

• In large-scale peptide identification experiments we must use the FDR criterion

• FDR can be calculated from the average score distributions and/or by searching against a decoy (inverted) database


Statistics and Quantification

• All relative quantification approaches (arrays, 2DE, DIGE, isotope labeling, label-free…) use the same statistical background:
  – A null hypothesis (NH) model (no expression changes) is established
  – An expression change is considered significant on the basis of the NH probability:
    • The probability that the change can be explained by the null hypothesis


The three major problems in quantitative experiments

1.- Establishing a NH model adequate for the experimental approach

2.- Checking that the NH model can be applied to our particular experiment

3.- Determining statistical significance on the basis of a multiple-hypothesis testing approach


Gaussian distribution is the most common NH model

• The Central Limit Theorem establishes that the mean of a large number of independent determinations is approximately normally distributed

• Expression ratios are affected by multiplicative factors

• Expressing ratios on a log scale makes the factors additive

• Hence the most probable model for log(ratios) is a Gaussian distribution

• Caution: a Gaussian model implies a constant variance!!


1.- Validity of the NH model for the experimental approach

• It is usually assumed a priori (use of commercial software packages). A normal distribution is almost always used

• The NH model has been validated in only a very limited number of cases

• NH models from arrays or other experimental approaches are often extrapolated to other situations


2.- Checking that the NH is valid for our own experiment

• Only a very limited number of papers test the validity of the NH model employed

• The validity of the NH model may be tested in the same experiment or, preferably, in a separate test experiment


3.- Multiple hypothesis testing

• We detect several expression changes (not only one); hence we are testing the NH multiple times

• Example: we detect 100 expression changes among 1,000 quantified proteins with p<0.05. How many are expected to belong to the NH (i.e. to be false)? (taken from a recent MCP paper)

• When we study more than one expression change we must use the FDR instead of the probability!!


3.- Multiple hypothesis testing: calculation of the FDR

1.- Establish the NH (e.g. a normal distribution)

2.- Calculate the p(NH) for each one of the proteins

3.- Establish a p-threshold and calculate the expected number of false positives, Np (N = number of quantified proteins)

4.- Count how many positives we observe below the p-threshold (O)

5.- Calculate FDR = Np/O

6.- Change p, recalculate the FDR and iterate until the desired FDR is obtained
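The procedure can be sketched as follows, assuming a list of NH p-values (one per quantified protein) is already available. The function names are illustrative, the candidate thresholds are taken from the observed p-values themselves, and the p-values are synthetic, built to mimic the earlier example of 100 changes with p<0.05 among 1,000 proteins.

```python
def fdr_at_threshold(p_values, threshold):
    """FDR = N*p / O, with O the number of p-values at or below the threshold."""
    n = len(p_values)
    observed = sum(1 for p in p_values if p <= threshold)
    if observed == 0:
        return None
    return min(1.0, n * threshold / observed)


def largest_threshold_for_fdr(p_values, target_fdr):
    """Scan the observed p-values and keep the largest one meeting the target FDR."""
    best = None
    for threshold in sorted(set(p_values)):
        fdr = fdr_at_threshold(p_values, threshold)
        if fdr is not None and fdr <= target_fdr:
            best = threshold
    return best


# Synthetic data: 100 proteins with very low p-values, 900 with p >= 0.06
p_values = [1e-5 * (i + 1) for i in range(100)] + [0.06 + 0.001 * i for i in range(900)]
print(fdr_at_threshold(p_values, 0.05))           # FDR at the p < 0.05 cut
print(largest_threshold_for_fdr(p_values, 0.05))  # threshold giving FDR <= 5%
```

At p<0.05 the expected number of false changes is Np = 1,000 × 0.05 = 50, so half of the 100 detected changes may belong to the NH (FDR = 50%); with these synthetic p-values a much stricter threshold is needed to bring the FDR down to 5% or less.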
