Analysis of tandem mass spectra - I
Prof. William Stafford NobleGENOME 541
Intro to Computational Molecular Biology
Outline
• Tandem mass spectrometry overview• Peptide identification
– De novo– Peptide database search– Spectrum database search
• Score calibration• Statistical confidence estimation
What are the pros and cons of the de novo approach?
+ Dynamic programming can quickly find peptides that fit a given spectrum graph.
- Many real spectra don’t yield a very good spectrum graph.+ It is possible to match against a noisy spectrum by allowing missing
peaks.- If you allow lots of missing peaks, then the DP is slow and leads to
many false positive matches.- If instead you search a database, you limit the number of possible
false positives.- In practice, the de novo approach does not identify many spectra.+ The de novo approach is the only one that can find previously
unknown peptides.
0
10
20
30
40
50
60
SEQUEST theoretical spectrum
y-ion, charge 1 b-ion, charge 2
flanks
H2O lossNH3 loss
a-ion
SEQUEST cross-correlation score
• Define Ri as the scalar product of the two spectra, with one offset by i.
• The score, called XCorr, is R0 minus the average Ri for i in -75, …, 75.
75
750
1
151
1
ii
n
jijji
RRX
yxR
The X!Tandem hyperscore
Number of matched b-ions
Number of matched y-ions Boolean: Is the peak at
m/z value i a b- or y-ion?
Intensity of the peak at m/z value i
n
iiiyb PINNHyperscore
0
!!
A third approach to identifying spectra
1. De novo identification
2. Search against a database of theoretical spectra
3. Search against a database of previously observed, identified spectra
Spectrum Identification
Database: fasta file
…
SEQUESTPeptide ID list
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
>MEKK1 (kinase)MDRILARMKKSTRRGGDKNITPVRRLERR…
>ATMKK5 (kinase kinase)MKPIQSPSGVASPMKNRLRKRPDLSPPLPHRDVALAVLP…
MS/MS query spectra
ID proteins from
peptides…
Scan1 0.7 EGSSDEEVP…Scan1 0.3 TFAEILNPI…Scan1 0.2 ARFDLNNHD…-------------------Scan2 0.5 EDEESIRAV…Scan2 0.2 WLGDDCFMV…Scan2 0.1 IDRAAWKAV…-------------------Scan3 0.2 EITTRDMGN…Scan3 0.1 GRNMCTAKL…
BiblioSpec
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
3 NGISLTIVR
3 QWDKEPPR
2 FMACSDEK
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
1 CGCCLYNT
2 GDTIENFK
Library of identified spectra
Spectrum Identification
SEQUESTPeptide ID list
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
MS/MS query spectra
Scan1 0.7 EGSSDEEVP…Scan1 0.3 TFAEILNPI…Scan1 0.2 ARFDLNNHD…-------------------Scan2 0.5 EDEESIRAV…Scan2 0.2 WLGDDCFMV…Scan2 0.1 IDRAAWKAV…-------------------Scan3 0.2 EITTRDMGN…Scan3 0.1 GRNMCTAKL…
BiblioSpec
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
3 NGISLTIVR
3 QWDKEPPR
2 FMACSDEK
Ab
un
da
nce
m/z
Ab
un
da
nce
m/z
1 CGCCLYNT
2 GDTIENFK
Library of identified spectra765.1
940.4
593.9
300.4
522.3
m/z 594.2
score = 0.2
What are the pros and cons of library searching?
+ Because the spectrum library contains peak intensity information, matching can be done accurately.
+ Library searching is faster than database searching.
J Library searching can only identify peptides that have been previously identified.
Database search toolsGreylagMASCOTMS-Tag (ProteinProspector)MassmatrixMyriMatchOMSSAOlavPepfrag (Prowl)PepprobePepsplicePfindPhenyxProLuCID (YADA)ProbIDRAId_DbSSEQUESTSpectrumMillVEMSX!Tandem
Landscape of Peptide Identification Software
Spectral matching toolsBibliospecSpectraSTX! P3
De novo sequencing toolsLutefiskPEAKSPepNovoSequit
Sequence tag/hybrid approachesByOnic/LookupPeaksGutenTagInspectParagonPopitam
OthersProteinlynxSonarMS/MS (knexus)Xproteo
Identifying spectra requires solving two distinct problems well
PSMs
Task 1: Ranking candidate peptides with respect to a single spectrum
Task 2: Ranking PSMs with respect to one another
Estimating a p-value can improve calibration
• The probability of observing a score >4 is the area under the curve to the right of 4.
X!Tandem and Comet use a log linear fit
XCorr
log(
coun
t)
XCorr of 5 corresponds to
count of 10-5
(Eng J Proteome Research 2009)
Calibration improves statistical power to identify spectra
Calibrated
Uncalibrated
FDR threshold
Iden
tifie
d sp
ectr
a
Spectrum identification must account for two types of multiple testing
MSEDEIER
VDPSSWFNN
CSSSTEAEQR
CIVGLTK
QFIDFSTVFQP
ISLSGK
ALNDVGK
Minimump-value
Falsediscoveryrate
Decoy peptides can be used to estimate FDR.
MSEDEIER
ISLSGK
CSSSTEAEQR
CIVGLTK
ALNDVGK
Search
DecoyTarget
MSEDEIER 2.2
ISLSGK 1.6
CSSSTEAEQR 1.9
CIVGLTK 2.8
ALNDVGK 2.7
VDPSSWFNN 1.2
QFIDFSTVFQP 1.7 FDR = 1/4 = 25%
Elias & Gygi Nat Biotech 2007
stage 1 square root
XCorr can be calculated as a dot product between two spectra
Observed spectrum
stage 2 normalize regions
stage 3 cross-correlation pre-processing
Theoretical spectrum
VNIQEELGK
for each peptide bond: b iony ionneutral losses
dot product
XCorr score(Eng J Proteome Research 2008)
XCorr can be refactored to make the theoretical spectrum binary
stage 3
observed spectrum theoretical spectrum
fingerprint of b / y / neutral lossescentered at mi = 347
vector of cleavage evidence
sum of evidence forcleavage at mi = 347
VNIQEELGK
binary markers of backbone cleavage
dot product
refactored Xcorr score(Howbert Molecular & Cellular Proteomics 2014)
Aa mmEsms
i iamCC ,,
amam mEmmEs CC 110,62,
30mE
30s
Distribution of XCorr scores can be computed using dynamic programming
Each column holds a score distribution for all peptides with a given mass
O( m ( smax – smin ) | AA | )
~ 1 sec for m = 1500
XCorr p-values must be corrected for multiple testing
• Sidak correction is similar to Bonferroni
• Accounts for fact that database search considers many peptides for each spectrum
)())(1(1)( xnpxpxp nSidak
The resulting p-values are distributed uniformly
log(rank p-value)
log
(p-v
alu
e)
p-v
alu
e
Frequency
Top Related