Basic Statistical Concepts. Jesús Vázquez. CBMSO
Basic Statistical Concepts
Jesús Vázquez
Laboratorio de Química de Proteínas y Proteómica
Centro de Biología Molecular Severo Ochoa-CSIC
Córdoba, January 2009

[Figure: overview of scoring approaches (SCORES); labels: SCOPE, hypergeometrical, OMSSA, MASCOT, ???, OLAV (Phenyx)]

Scenario:
• We search one MS/MS spectrum against a protein database
• We want to identify in the database the peptide that produced the MS/MS spectrum
• The first approach is to “score” the peptide candidates and select the peptide yielding the “best score”

Correlation scoring: How does SEQUEST work?
[Figure: observed spectrum and theoretical spectrum; axes: % relative intensity vs. m/z]
SEQUEST measures the degree of correlation

Scenario:
• We search one MS/MS spectrum against a protein database
• We want to determine the sequence of the peptide that produced the MS/MS spectrum
• The first approach is to “score” the peptide candidates and select the peptide yielding the “best score”
• From previous experience with such a score we can approximately infer whether the peptide candidate is correct or not (e.g. from a very high Xcorr score)

Scenario:
• We search one MS/MS spectrum against a protein database; we want to determine the sequence of the peptide that produced the MS/MS spectrum
• To determine the statistical confidence associated with such a peptide identification, we have to use the “probability distributions” corresponding to the “null hypothesis”
• These distributions can be constructed empirically or on the basis of an “MS/MS-sequence matching model” (theoretically)
• The outcome is a p-value and an E-value for the peptide match

SEQUEST: best score (Xcorr) and delta score (ΔCn)

Xcorr scores of the candidate peptides for each spectrum (best score in the first row):

Spectrum #   1      2      3      4      5      6      7      8      …      10,000
             6.71   6.01   5.64   5.31   3.20   3.18   3.13   1.65   1.60   1.35
             2.58   2.37   2.52   2.14   1.43   1.56   3.09   1.49   1.44   1.32
             2.51   2.19   2.39   2.07   1.34   1.47   2.92   1.41   1.38   1.22
             2.41   2.15   2.37   2.04   1.27   1.31   2.81   1.37   1.37   1.12
             2.33   2.09   2.30   1.96   1.25   1.30   2.80   1.33   1.33   1.11
             2.31   2.00   2.28   1.95   1.22   1.30   2.79   1.31   1.33   1.10
             2.29   2.00   2.24   1.94   1.20   1.28   2.78   1.28   1.26   1.07
             2.23   1.98   2.23   1.92   1.17   1.27   2.76   1.26   1.17   1.06
             2.21   1.97   2.12   1.90   1.17   1.27   2.74   1.25   1.12   1.03
             2.18   1.95   2.11   1.85   1.12   1.25   2.73   1.21   1.09   0.97

$$\Delta Cn = \frac{x_1 - x_2}{x_1}$$
where x_1 and x_2 are the best and second-best Xcorr scores of a spectrum

Information provided by the best score (Xcorr) and the delta score (ΔCn)
[Figure: score vs. ranking of peptide sequences (1 to 9), with the random matching behaviour indicated]
• Best score: evaluates how good the best match is
• Delta score: evaluates how much the best score deviates from random behaviour
• The second-best score also evaluates the deviation of the best score
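
The relation between best score and delta score can be sketched in a few lines. This is only an illustration, assuming the usual definition ΔCn = (x1 − x2)/x1; the function name `delta_cn` and the example score lists are hypothetical and not part of the original slides.

```python
def delta_cn(scores):
    """Return (best Xcorr, delta Cn) for the candidate scores of one spectrum.

    Assumes at least two candidate scores and a positive best score.
    """
    ranked = sorted(scores, reverse=True)
    best, second = ranked[0], ranked[1]
    return best, (best - second) / best


# Hypothetical candidate scores for two spectra
well_separated = [6.71, 2.58, 2.51, 2.41]   # best score far above the rest
random_like = [1.35, 1.32, 1.22, 1.12]      # best score close to the rest

print(delta_cn(well_separated))
print(delta_cn(random_like))
```

A best score that stands well above the second-best one gives a ΔCn close to 1, whereas a best score that follows the random matching behaviour gives a ΔCn close to 0.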

Search the MS/MS spectrum against a “random”(1) database

Xcorr scores of the candidate peptides for each spectrum (best score in the first row):

Spectrum #   1      2      3      4      5      6      7      8      …      35,000
             2.65   2.48   2.63   2.21   1.52   1.65   3.11   1.52   1.56   1.35
             2.58   2.37   2.52   2.14   1.43   1.56   3.09   1.49   1.44   1.32
             2.51   2.19   2.39   2.07   1.34   1.47   2.92   1.41   1.38   1.22
             2.41   2.15   2.37   2.04   1.27   1.31   2.81   1.37   1.37   1.12
             2.33   2.09   2.30   1.96   1.25   1.30   2.80   1.33   1.33   1.11
             2.31   2.00   2.28   1.95   1.22   1.30   2.79   1.31   1.33   1.10
             2.29   2.00   2.24   1.94   1.20   1.28   2.78   1.28   1.26   1.07
             2.23   1.98   2.23   1.92   1.17   1.27   2.76   1.26   1.17   1.06
             2.21   1.97   2.12   1.90   1.17   1.27   2.74   1.25   1.12   1.03
             2.18   1.95   2.11   1.85   1.12   1.25   2.73   1.21   1.09   0.97

$$\Delta Cn = \frac{x_1 - x_2}{x_1}$$

(1) See later

Construction of probability-score distributions for a single MS/MS spectrum

Take all scores from the search against the inverted database; sort by decreasing score and calculate the normalized rank (rank/N, where N is the total number of scores):

rank     score   rank/N
1        6.71    0.0001
2        6.01    0.0002
3        5.64    0.0003
4        5.31    0.0004
5        3.2     0.0005
6        3.18    0.0006
7        3.13    0.0007
…        …       …
10,000   1.35    1

[Figure: plots of rank (0 to 10,000) and of probability (rank/N, 0 to 1) against Xcorr (0 to 4); panel labels: “Score distribution”, “Accumulated frequency distribution”]

Single-spectrum distributions($) and definition of p-value(*)
• Make a database search of the MS/MS spectrum against a very large collection of random sequence candidates
• Sort the matches according to the particular score and determine the normalized rank
• (*) The p-value is the probability that the spectrum matches by chance a peptide sequence with a score equal to or better than x
[Figure: probability (rank/N, 0 to 1) vs. Xcorr (0 to 4); the p-value is read on the curve at score x]
($) Also called survival functions by Fenyo & Beavis, 2003
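
As a minimal sketch of this definition, the empirical p-value of a score x is the fraction of random-match scores that are equal to or better than x (the normalized rank). The function name `empirical_p_value` and the ten example scores are assumptions made for illustration; a real search would use thousands of candidates.

```python
from bisect import bisect_left


def empirical_p_value(null_scores, x):
    """Fraction of random-match scores that are >= x (empirical survival function)."""
    ordered = sorted(null_scores)
    n_at_or_above = len(ordered) - bisect_left(ordered, x)
    return n_at_or_above / len(ordered)


# Hypothetical scores of one spectrum against random sequence candidates
null_scores = [1.35, 1.60, 1.65, 3.13, 3.18, 3.20, 5.31, 5.64, 6.01, 6.71]

print(empirical_p_value(null_scores, 3.2))   # 5 of the 10 scores are >= 3.2
print(empirical_p_value(null_scores, 7.0))   # no score reaches 7.0
```

Note that the empirical estimate drops to zero for any score above the best random match, which is one reason why the high-scoring tail is usually fitted and extrapolated.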

Fitting of score distributions to calculate p-values empirically from database search results
• Construct the frequency distribution of all peptide scores obtained when an MS/MS spectrum is searched against a database
• Plot the distribution on a semi-log scale
• Fit the high-scoring portion of the curve to a linear function
• Use the fitting parameters to calculate p- and E-values for the scores
Original reference of TANDEM, using SONAR: Fenyo & Beavis, 2003. This fitting and extrapolation procedure is justified by the properties of extreme-value (Gumbel) distributions
[Figure: semi-log plot of rank/N (from 0.1 down to 0.0000001) vs. score (0 to 5)]
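
The fitting steps can be sketched as follows. This is a simplified illustration, not the procedure of any particular search engine: it assumes that log10(rank/N) is linear in the score over the high-scoring tail, fits that line by ordinary least squares, and extrapolates it. The names `fit_tail`, `p_and_e_value` and the parameter `tail_fraction` are hypothetical.

```python
import math


def fit_tail(scores, tail_fraction=0.1):
    """Least-squares fit of log10(rank/N) = a + b * score on the high-scoring tail."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    k = max(2, int(n * tail_fraction))          # number of top scores used in the fit
    xs = ordered[:k]
    ys = [math.log10((rank + 1) / n) for rank in range(k)]
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


def p_and_e_value(score, intercept, slope, n_candidates):
    """Extrapolate the fitted line to get p, then E = N * p."""
    p = min(1.0, 10 ** (intercept + slope * score))
    return p, n_candidates * p


# Synthetic scores whose survival function is exactly log-linear: rank/N = 10**(-score)
n = 1000
scores = [-math.log10(rank / n) for rank in range(1, n + 1)]

a, b = fit_tail(scores)
print(p_and_e_value(5.0, a, b, n))   # a score beyond the observed range
```

The extrapolation is what allows a p-value to be assigned to scores higher than any random match actually observed.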

p-value and expectation or E-value (I)
• Suppose that we get a peptide match with p=0.001 after searching against N=1,000 sequence candidates: is that significant?
• The probability of getting one or more matches with p or lower when the event is repeated N times is one minus the probability that none of the N candidates gives such a match, each with probability (1−p), i.e.:
$$\mathrm{prob}(\text{one or more matches with } p \text{ or lower}) = 1 - (1-p)^N \approx Np \quad (\text{for } Np \ll 1)$$
• In this case Np=1, and therefore the match is not significant! In other words, we would expect on average one peptide sequence to give a match with p=0.001 by chance
• We need to take into account the number N of sequence candidates to calculate statistical significance, i.e. the expectation value
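
The numbers in this example can be checked directly. The short sketch below (the function name `prob_at_least_one` is an assumption for illustration) compares the exact probability 1 − (1−p)^N with the product N·p, which approximates it only when N·p is much smaller than 1.

```python
def prob_at_least_one(p, n):
    """Probability of one or more chance matches with p or lower among n candidates."""
    return 1.0 - (1.0 - p) ** n


p, n = 0.001, 1000
print(prob_at_least_one(p, n))     # about 0.63: a chance match is more likely than not
print(n * p)                       # N * p = 1 expected chance match

# With a much smaller p the two quantities nearly coincide
print(prob_at_least_one(1e-6, n), n * 1e-6)
```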

p-value and expectation or E-value (II)
• Expectation or E-value: defined as the expected number of matches with a p-value of p or lower that would be obtained by chance when N sequence candidates are searched:
$$E = N \cdot p$$
The E-value is a very common statistical confidence parameter in bioinformatics, and is used by database-searching programs like BLAST

p-value and expectation or E-value (III): conceptual difference
• The p-value is the probability that the MS/MS spectrum gets a given score (or better) against a random peptide sequence
• The E-value is the expected number of matches (for small values, approximately the probability) with which the MS/MS spectrum gets a given best score (or better) against a collection of random peptide sequence candidates

Conclusions on probability models
• Probabilities associated with individual peptide matches are computed by statistical, theoretical or mixed models
• The final outcome is the E-value

Scenario:
• We search a large collection of MS/MS spectra against a protein database
• We want a compromise between the number of identified peptides and statistical confidence
• A) We need to control the sensitivity, accuracy, specificity… of the set of identified peptides
• B) We need to control and maintain at a reasonable level the proportion of false peptide assignments

Performance parameters related to a “2-class prediction problem” (also called “binary classification”): the “contingency table” or “confusion matrix”

                          Actual condition: Present                     Actual condition: Absent
Test result: Positive     True Positive                                 False Positive (Type I error)
Test result: Negative     False (invalid) Negative (Type II error)      True (accurate) Negative

We are searching for a compromise between Type I and Type II errors

Statistical concepts: sensitivity and FPR
• true positive rate (TPR), eqv. with hit rate, recall, sensitivity: TPR = TP / P = TP / (TP + FN)
• false positive rate (FPR), eqv. with false alarm rate, fall-out: FPR = FP / N = FP / (FP + TN)
• false negative rate (FNR): FNR = FN / (TP + FN) = 1 − TPR
• accuracy (ACC): ACC = (TP + TN) / (P + N)
• specificity (SPC): SPC = TN / (FP + TN) = 1 − FPR
• positive predictive value (PPV), eqv. with precision: PPV = TP / (TP + FP)
[Figure: score distributions of true and false assignments, split by a threshold into TP, FN, FP and TN]
Note that the FPR equals the statistical significance (the probability of having a false positive by chance)!
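
The definitions above translate directly into code. The sketch below is only an illustration; the function name `classification_rates` and the example counts are hypothetical.

```python
def classification_rates(tp, fp, tn, fn):
    """Performance parameters of a binary classifier from its confusion matrix."""
    return {
        "TPR": tp / (tp + fn),                    # sensitivity, recall
        "FPR": fp / (fp + tn),                    # false alarm rate
        "FNR": fn / (tp + fn),                    # 1 - TPR
        "SPC": tn / (fp + tn),                    # specificity, 1 - FPR
        "ACC": (tp + tn) / (tp + fp + tn + fn),   # accuracy
        "PPV": tp / (tp + fp),                    # precision
    }


# Hypothetical counts for a test dataset where the truth is known
rates = classification_rates(tp=90, fp=10, tn=890, fn=10)
for name, value in rates.items():
    print(f"{name} = {value:.3f}")
```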

Statistical concepts: FDR
• False Discovery Rate (FDR) (1), or simply error rate, is the estimated proportion of false assignments among the set of identified peptides:
$$FDR = \frac{FP}{FP + TP}$$
[Figure: score distributions of true and false assignments, split by a threshold into TP, FN, FP and TN]
(1) Incorrectly called FPR in some works (Elias & Gygi, 2007)

[Figure: score distributions of true and false assignments, split by a threshold into TP, FN, FP and TN; labels: FDR; FPR, SPC; FNR, TPR]
To calculate FNR or TPR we need to know FN and TP in advance (not possible in real-world experiments)
In practice we establish a threshold with a given FPR (statistical significance) and calculate the associated FDR.
If the FDR is not satisfactory, the significance is varied until a good FDR is reached

ROC* and identification/FDR curves
[Figure, two panels. Test dataset: ROC curve, TPR (0 to 1) vs. FPR (0 to 1), used to test the performance of a discriminatory algorithm. Real-world experiment: identified peptides (0 to 1,000) vs. FDR (0 to 1), a graphical display of identification performance in an experiment. Annotations: “optimum performance”, “1%”]
*ROC: Receiver Operating Characteristic

Statistical concepts: FDR common questions
• What is the FDR if we identify 1,000 peptides and statistically expect that 50 are false?
• How many false positives are expected if we identify 100 peptides with an FDR of 0.1%?
• Less than 1 false peptide?? Is this statistically correct? Would it not be better to use a 1% FDR in this case?
• And if we only identify 10 peptides? What FDR would we use?
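
The arithmetic behind the first two questions, worked out from the definition FDR = FP/(FP + TP) as a quick check:

```latex
\[
\mathrm{FDR} = \frac{50}{1000} = 5\%
\qquad\text{and}\qquad
\text{expected false positives} = 100 \times 0.001 = 0.1
\]
```

The second result, an expectation of less than one false peptide, is what motivates the remaining questions.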

Statistical concepts: FDR and FWR
• FWR (family-wise error rate) is controlled by requiring that fewer than ONE false positive is expected in an experiment.
• A good criterion is to use FWR when we havelow numbers of identified peptides and FDR when the numbers are large
• Caution: low numbers of identified peptides are difficult to control with a proper statistical model.

Statistical concepts: FDR and multiple-hypothesis testing
• FDR is a parameter developed in the framework of multiple-hypothesis testing
• To identify 100 peptides is to assume that 100 individual and independent hypotheses are true
• When we say that we are performing multiple-hypothesis testing, we mean that we are using the FDR to control statistical significance

How to calculate FDR (I)
• To calculate the FDR it is useful to determine the distribution of best scores in the experiment when all matches are random (i.e., false). [These were called “average-score distributions” to differentiate them from the “single-spectrum distributions” used to calculate the p-value]
• The simplest case is a search engine with only ONE score parameter

Average vs Single-Spectrum distributions (1)
[Figure: the matrix of Xcorr scores (columns: spectrum # 1, 2, …, 1000; rows: candidate scores, with $\Delta Cn = (x_1 - x_2)/x_1$) shown twice, once labelled “Average” and once labelled “Single-Spectrum”]
(1) As defined by Martín-Maroto (Martínez-Bartolomé et al., Mol. Cell Proteomics 2008)

Construction of SEQUEST average score distributions
[Figure: matrix of Xcorr scores (columns: spectrum # 1, 2, …, 10,000; rows: candidate scores)]
Take the set of best scores
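
A minimal sketch of this step, assuming the search results are available as a mapping from spectrum number to the list of candidate scores (the function name `average_score_distribution` and the example data are hypothetical): only the best score of each spectrum is kept, and the best scores are then sorted and ranked.

```python
def average_score_distribution(scores_by_spectrum):
    """Sort the best score of every spectrum and attach rank and rank/N."""
    best = sorted((max(scores) for scores in scores_by_spectrum.values()),
                  reverse=True)
    n = len(best)
    return [(rank, score, rank / n) for rank, score in enumerate(best, start=1)]


# Hypothetical candidate scores for four spectra
scores_by_spectrum = {
    1: [6.71, 2.58, 2.51],
    2: [2.37, 6.01, 2.19],
    3: [1.49, 1.41, 1.65],
    4: [1.35, 1.32, 1.22],
}

for rank, score, normalized_rank in average_score_distribution(scores_by_spectrum):
    print(rank, score, normalized_rank)
```

In contrast, a single-spectrum distribution would be built from all the candidate scores of one spectrum.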

Construction of average score distributions

Take all (best) scores from the search against the inverted database; sort by decreasing score and calculate the normalized rank (rank/N, where N is the total number of scores):

rank     score   rank/N
1        6.71    0.0001
2        6.01    0.0002
3        5.64    0.0003
4        5.31    0.0004
5        3.2     0.0005
6        3.18    0.0006
7        3.13    0.0007
…        …       …
10,000   1.35    1

[Figure: plots of rank (0 to 10,000) and of probability (rank/N, 0 to 1) against Xcorr (0 to 4); panel labels: “Score distribution”, “Accumulated frequency distribution”]

Construction of average score distributions
Starting from the sorted best scores (rank, score, rank/N):
• Classify the scores in categories (bins)
• Construct the histogram
• Obtain the curve
[Figure: three panels of probability density (0 to 8) vs. Xcorr (0 to 4), labelled “Probability density distribution”]
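
The binning step can be sketched as follows, assuming non-negative scores and a fixed bin width (the function name `density_histogram`, the parameter `bin_width` and the example scores are illustrative assumptions). Counts are divided by N times the bin width, so that the total area of the histogram is 1.

```python
def density_histogram(scores, bin_width=0.5):
    """Bin scores and scale the counts so that the total area of the bars is 1.

    Returns a dict mapping the left edge of each occupied bin to its density.
    """
    n = len(scores)
    counts = {}
    for score in scores:
        index = int(score // bin_width)           # bin number of this score
        counts[index] = counts.get(index, 0) + 1
    return {index * bin_width: count / (n * bin_width)
            for index, count in sorted(counts.items())}


# Hypothetical best scores from a decoy search
best_scores = [1.35, 1.60, 1.65, 3.13, 3.18, 3.20, 5.31, 5.64, 6.01, 6.71]
histogram = density_histogram(best_scores)

for left_edge, density in histogram.items():
    print(f"[{left_edge:.1f}, {left_edge + 0.5:.1f}): {density:.2f}")

# The area under the histogram adds up to 1
print(sum(density * 0.5 for density in histogram.values()))
```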

Statistical interpretation of average score distributions
[Figure, two panels: probability (rank/N, 0 to 1) vs. Xcorr, where the probability* is read on the curve at score x; and probability density (0 to 8) vs. Xcorr, where the probability* is the area under the curve from x onwards. Total area under the curve = 1]
• (*) Probability of observing a score equal to or better than x in the experiment
• This is not the p-value!

Average score distributions: Probability and False Discovery Rate
[Figure: probability density vs. score for false positives and true positives; a p-threshold (FP/F) marks the FP and TP areas]
FDR = FP/(TP+FP)

How to calculate FDR (II)
• When there is more than one score parameter, the situation is more complex; we have to integrate all the parameters into a single one
• If the scores were truly independent, we could simply multiply the probabilities
• SEQUEST yields several score parameters (Xcorr, ΔCn, RSp,…) that are not independent
• Existing methods differ only in the way they integrate the scores

SEQUEST: the best score (Xcorr) and the delta score (ΔCn)
[Figure: score vs. ranking of peptide sequences (1 to 9), with the random matching behaviour indicated]
• Best score: evaluates how good the best match is
• Delta score: evaluates how much the best score deviates from random behaviour
• The second-best score also evaluates the deviation of the best score

Average score distributions: Probability and False Discovery Rate
[Figure: probability density vs. integrated score for false positives and true positives; a p-threshold (FP/F) marks the FP and TP areas]
FDR = FP/(TP+FP)

Use of random (decoy) databases to calculate FDR
• Not all methods use decoy databases. However, this approach is very robust and widely accepted, and hence is the method of choice
• Construct a decoy database by reversing the amino acid sequence of each protein (the C-terminus becomes the N-terminus and vice versa), or by reversing the amino acid sequence of each peptide while maintaining the C-terminal basic ends (pseudoreversal)
• Search against the target and decoy databases
• Use the results from the decoy database to estimate the number of false positives at the given score threshold(s)
Peng et al., 2003; Strittmatter et al., 2004; Cargile et al., 2004; Qian et al., 2005; Elias and Gygi, 2007

Inverted or Randomized databases?
• Randomized databases do not behave identically and overestimate error rates (the repetition of domains and/or motifs in the target database reduces the total number of unique sequences)
Elias and Gygi, 2007

The two methods to estimate FDR using decoy databases
• Using separate database searches:
– Search against the target and decoy databases and select tentative identifications by using the same criteria:
  • D = number of peptides identified in the decoy database (false)
  • T = number of peptides identified in the target database (total)
– Calculate the proportion of false identifications: FDR = D / T
• Using a concatenated database:
– Construct a composite database by joining the target and decoy databases
– Search against the composite database and count how many matches were made against the decoy database
– The number of false positive assignments is estimated by doubling the matches against the decoy database (D and T are now counted in the composite search):
  • FDR = 2*D / (D+T)
• The first method was demonstrated to overestimate the number of false positives (good MS/MS spectra produce high scores in the decoy database). The second method avoids this effect by allowing target and decoy sequences to compete for best scores.
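
The two estimates can be sketched as follows. This is an illustration under simplifying assumptions: one score per spectrum and a single score threshold. The function names, the data layout (a list of scores, or of `(score, is_decoy)` pairs) and the example values are hypothetical.

```python
def fdr_separate(target_scores, decoy_scores, threshold):
    """Separate searches: FDR = D / T, counting matches at or above the threshold."""
    t = sum(score >= threshold for score in target_scores)
    d = sum(score >= threshold for score in decoy_scores)
    return d / t if t else 0.0


def fdr_concatenated(best_matches, threshold):
    """Concatenated search: FDR = 2*D / (D + T).

    best_matches holds one (score, is_decoy) pair per spectrum, i.e. the best
    match after target and decoy sequences have competed for the spectrum.
    """
    d = sum(1 for score, is_decoy in best_matches if is_decoy and score >= threshold)
    t = sum(1 for score, is_decoy in best_matches if not is_decoy and score >= threshold)
    return 2 * d / (d + t) if d + t else 0.0


# Hypothetical best scores of the same spectra in separate target and decoy searches
print(fdr_separate([4.2, 3.9, 2.8, 1.2], [3.1, 2.0, 1.5, 1.1], threshold=2.5))

# Hypothetical best matches of five spectra against the composite database
composite = [(4.2, False), (3.9, False), (3.1, True), (2.8, False), (1.5, True)]
print(fdr_concatenated(composite, threshold=2.5))
```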

A refined (integrated) strategy to calculate FDR from decoy databases
[Figure: scatter plot of Xcorr (decoy) (0 to 4) vs. Xcorr (target) (0 to 6), divided into regions labelled tb, db, do, to, tu and du; counts shown: 202 (171), 33 (21), 122 (96), 1666 (1613)]
$$FDR_{SD} = \frac{tb + db + do}{tb + db + to} \qquad (1)$$
$$FDR_{CD} = \frac{2\,(db + do)}{tb + db + do + to} \qquad (2)$$
$$FDR_{TD} = \frac{2\,db + do}{tb + db + to} \qquad (3)$$
Navarro and Vázquez (J. Proteome Res. 2009, in press)

The refined strategy is more sensitive
$$FDR_{SD} = \frac{tb + db + do}{tb + db + to} \qquad (1)$$
$$FDR_{CD} = \frac{2\,(db + do)}{tb + db + do + to} \qquad (2)$$
$$FDR_{TD} = \frac{2\,db + do}{tb + db + to} \qquad (3)$$
[Figure: panels plotting FDR_SD or FDR_CD (0 to 0.2) against FDR_TD (0 to 0.2); panel titles: “SEQUEST Xcorr”, “Mascot Prob. Ratio”]

Statistics and peptide identification: the key points
• Statistical significance of peptide identification from a single MS/MS spectrum is given by the E-value
• E-values are calculated from single-spectrum score distributions. These distributions are determined empirically or using theoretical models
• In large-scale peptide identification experiments we must use the FDR criterion
• FDR can be calculated from the average score distributions and/or by searching against a decoy (inverted) database

Statistics and Quantification
• All relative quantification approaches (arrays, 2DE, DIGE, isotope labeling, label-free…) use the same statistical background:
– A null hypothesis (NH) model (no expression changes) is established
– An expression change is considered significant on the basis of the NH probability:
  • The probability that the change can be explained by the null hypothesis

The three major problems in quantitative experiments
1.- Establishing an NH model adequate for the experimental approach
2.- Checking that the NH model can be applied to our particular experiment
3.- Determining statistical significance on the basis of a multiple-hypothesis testing approach

Gaussian distribution is the most common NH model
• The Central Limit Theorem establishes that the mean of a large number of independent determinations approximately follows a normal distribution
• Expression ratios are affected by multiplicative factors
• Taking ratios on a log scale makes these factors additive
• Hence the most probable model for log(Ratios) is a Gaussian distribution
• Caution: a Gaussian model implies a constant variance!!

1.- Validity of the NH model for the experimental approach
• It is usually assumed a priori (use of commercial software packages). A normal distribution is almost always used
• The NH model has been validated in only a very limited number of cases
• NH models from arrays or other experimental approaches are often extrapolated to other situations

2.- Checking that the NH is valid for our own experiment
• Only a very limited number of papers test the validity of the NH model employed
• The validity of the NH model may be tested in the same experiment or, preferably, by performing a test experiment

3.- Multiple hypothesis testing
• We detect several expression changes (not only one); hence we are checking the NH multiple times
• Example: we detect 100 expression changes among 1,000 quantified proteins with p<0.05. How many are expected to belong to the NH (i.e. to be false)? (taken from a recent MCP paper)
• When we study more than one expression change we must use the FDR instead of the probability!!
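
A worked version of this example, under the simplifying assumption that essentially all 1,000 quantified proteins follow the NH, so that the expected number of false changes is N·p:

```latex
\[
\text{expected false changes} = N \cdot p = 1000 \times 0.05 = 50,
\qquad
\mathrm{FDR} \approx \frac{50}{100} = 50\%
\]
```

Under that assumption, about half of the 100 detected changes would be expected to arise from the NH, even though each of them individually has p<0.05.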

3.- Multiple hypothesis testing: calculation of the FDR
1.- Establish the NH (e.g. a normal distribution)
2.- Calculate p(NH) for each one of the proteins
3.- Establish a p-threshold and calculate the expected number of false positives = Np
4.- Count how many positives (O) we observe below the p-threshold
5.- Calculate FDR = Np/O
6.- Change p, recalculate the FDR and iterate until the desired FDR is obtained
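
Steps 3 to 6 can be sketched as below, assuming the p(NH) values of all proteins have already been computed. The function names, the choice of scanning the observed p-values from largest to smallest, and the synthetic data are assumptions made for illustration only.

```python
def fdr_at_threshold(p_values, p_threshold):
    """Steps 3-5: expected false positives N*p divided by observed positives O."""
    n = len(p_values)
    observed = sum(p <= p_threshold for p in p_values)
    expected_false = n * p_threshold
    fdr = expected_false / observed if observed else 0.0
    return fdr, observed


def find_p_threshold(p_values, target_fdr=0.05):
    """Step 6: try the observed p-values as thresholds, most permissive first."""
    for p_threshold in sorted(set(p_values), reverse=True):
        fdr, observed = fdr_at_threshold(p_values, p_threshold)
        if fdr <= target_fdr:
            return p_threshold, fdr, observed
    return None


# Synthetic example: 20 proteins with very small p-values plus 980 proteins
# whose p-values are spread evenly, as expected under the NH
p_values = [0.0001] * 20 + [i / 1000 for i in range(21, 1001)]

print(fdr_at_threshold(p_values, 0.05))   # a fixed p<0.05 cut gives a high FDR
print(find_p_threshold(p_values))         # threshold reaching a 5% FDR
```

In this synthetic case a plain p<0.05 cut accepts 50 proteins with an estimated FDR of 100%, whereas the iterated threshold keeps only the 20 proteins with very small p-values.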