Deriving statistical models for predicting MS/MS product ion intensities
1
Terry Speed & Frédéric Schütz
Division of Genetics & Bioinformatics, The Walter and Eliza Hall Institute of Medical Research
In collaboration with the Joint ProteomicS Laboratory (WEHI/LICR)
2
Introduction
• Proteomics is critical to our understanding of cellular biological processes
• Mass Spectrometry (MS) has emerged as a key platform in proteomics for the high-throughput identification of proteins
• Sophisticated algorithms, such as Mascot or Sequest, exist for database searching of MS/MS data
• Major bottleneck: results must often be manually validated
• More robust algorithms are needed before the identification of MS/MS data can be fully automated
3
What is a Mass Spectrometer?
“An analytical device that determines the molecular weight of chemical compounds by separating molecular ions according to their mass-to-charge ratio (m/z)”
[Figure: ionisation, then separation by m/z, then detection. Example sample: molecular weight 600 Da (abundance 50 %), 400 Da (abundance 20 %) and 300 Da (abundance 30 %), detected as peaks at m/z 601, 401 and 301 with relative heights 50, 20 and 30.]
4
[Figure: the same ionisation, separation by m/z and detection diagram for the same three compounds (600 Da, 50 %; 400 Da, 20 %; 300 Da, 30 %), with m/z labels 601, 401, 301 and 201 and intensity labels 50, 30, 10 and 20.]
5
Tandem MS (MS/MS)
To gain structural information about the detected masses:
– different molecules of the same substance can split in different ways.
– in each molecule, only the pieces that retain one of the charges will be observed and present in the spectrum; the others are discarded.
[Figure: one product is selected, then collision with a gas, then second MS (separation & detection).]
6
How to use MS for protein identification
Peptide mass fingerprinting
[Figure: protein sample, 2D gel, excise, digest, MS (m/z spectrum).]
• The exact protein needs to be in the database
• Works only with single protein fragmentations
Example: peaks at m/z 333, 336, 406, 448, 462, 889. The only protein in the database that would produce these peaks is
MALK|CGIR|GGSRPFLR|ATSK|ASR|SDD
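The fingerprinting example can be checked with a short in-silico digest. This is a minimal sketch, assuming the usual trypsin rule (cleave after K or R unless the next residue is P), monoisotopic residue masses and singly protonated peptides; the function names are illustrative and do not come from any search engine.

```python
import re

# Monoisotopic residue masses (Da) for the residues used in this example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "D": 115.02694, "K": 128.09496, "M": 131.04049, "F": 147.06841,
    "R": 156.10111,
}
WATER = 18.01056   # H2O added to the sum of residue masses
PROTON = 1.00728   # one proton for a singly charged ion


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def mh_plus(peptide):
    """m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


peptides = tryptic_digest("MALKCGIRGGSRPFLRATSKASRSDD")
peaks = sorted(int(mh_plus(p)) for p in peptides)  # integer part of each m/z
print(peptides)
print(peaks)
```

Under these assumptions the digest returns the six peptides shown on the slide, and the integer parts of their [M+H]+ values are 333, 336, 406, 448, 462 and 889.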
7
Tandem MS for protein identification
[Figure: workflow: 2-D gel (or 1-D gel), in-gel digest (trypsin), capillary column RP-HPLC (on-line; 60 min gradient), MS analysis (ESI ion trap), MS data, CID of the most intense ion, MS/MS data. Inset on electrospray ionisation: original droplet, solvent evaporates from droplet, positive ions.]
CID = Collision Induced Dissociation
8
Example MS/MS spectrum
[Figure: MS/MS spectrum; x-axis: m/z (150 to 950); y-axis: relative abundance (0 to 100).]
Tryptic fragment: Val Phe Gly Lxx Lxx Asp Glu Asp Lys
Annotated ions: b2 to b8, y2 to y8, a2
Labelled peaks (m/z): 205.0, 219.0, 247.0, 248.1, 262.1, 304.0, 305.1, 318.1, 372.2, 391.1, 417.2, 418.1, 431.1, 468.4, 506.2, 530.2, 619.2, 645.3, 732.2, 774.4, 789.3, 889.4, 904.5, 936.4, 937.4
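The b and y labels in this spectrum can be reproduced from the sequence. The sketch below assumes monoisotopic masses, singly charged fragments, and writes Lxx as L (Leu and Ile have the same mass); the helper names are illustrative.

```python
# Monoisotopic residue masses (Da); L stands in for Lxx (Leu/Ile are isobaric).
RESIDUE_MASS = {"V": 99.06841, "F": 147.06841, "G": 57.02146, "L": 113.08406,
                "D": 115.02694, "E": 129.04259, "K": 128.09496}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b ions: N-terminal fragments b1..b(n-1)."""
    masses, total = [], PROTON
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def y_ions(peptide):
    """Singly charged y ions: C-terminal fragments y1..y(n-1)."""
    masses, total = [], WATER + PROTON
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


peptide = "VFGLLDEDK"
for i, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{i} = {b:7.2f}   y{i} = {y:7.2f}")
```

With these assumptions b2 comes out at about 247.14 and y2 at about 262.14, in line with the peaks labelled 247.0 and 262.1.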
9
Interpretation of MS/MS data
• Direct interpretation ("de novo sequencing")
– spectrum must be of good quality
– the only identification method if the spectrum is not in the database
– can give useful information (partial sequence) for database search
• General approach for database searching:
– extract from the database all peptides that have the same mass as the precursor ion of the uninterpreted spectrum
– compare each of them to the uninterpreted spectrum
– select the peptide that is most likely to have produced the observed data
• MASCOT:
– simple probabilistic model
– calculate the probability that a peptide could have produced the given spectrum by chance
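The general database-search loop can be sketched in a few lines. Everything here is an assumption made for illustration: the toy database, the 0.5 Da tolerance, and the shared-peak count used as the score, which is a simple stand-in and not the scoring used by MASCOT or SEQUEST.

```python
def candidates(database, precursor_mass, tol=0.5):
    """Peptides whose mass lies within tol (Da) of the precursor mass."""
    return [p for p in database if abs(p["mass"] - precursor_mass) <= tol]


def shared_peaks(fragments, spectrum, tol=0.5):
    """Count predicted fragment m/z values that match an observed peak."""
    return sum(any(abs(f - mz) <= tol for mz in spectrum) for f in fragments)


def best_match(database, precursor_mass, spectrum, tol=0.5):
    """Score every candidate and return (score, name) of the best one."""
    scored = [(shared_peaks(p["fragments"], spectrum, tol), p["name"])
              for p in candidates(database, precursor_mass, tol)]
    return max(scored) if scored else None


# Hypothetical toy database: names, masses and fragment m/z values are made up.
toy_db = [
    {"name": "peptide_A", "mass": 1000.2, "fragments": [200.1, 350.2, 480.3, 700.4]},
    {"name": "peptide_B", "mass": 1000.4, "fragments": [210.1, 350.2, 500.3, 720.4]},
    {"name": "peptide_C", "mass": 1200.0, "fragments": [200.1, 350.2, 480.3, 700.4]},
]
observed = [200.0, 350.3, 480.1, 650.0]
print(best_match(toy_db, 1000.3, observed))
```

Here peptide_C is excluded by the precursor-mass filter even though its fragments match, which is the point of the first step.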
10
Interpretation of MS/MS data
• SEQUEST:
– generate a predicted spectrum for each potential peptide using a simple fragmentation model (all b and y ions have the same intensity; possible losses from b and y have a lower intensity)
– compute a "cross-correlation" score and find the best-matching peptide
– since this operation is very time-consuming, a simpler preliminary score is used to find the 500 peptides in the database that are most likely to be the correct identification
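The effect of a uniform-intensity prediction can be illustrated with a toy score. This sketch is not SEQUEST's cross-correlation: it assumes unit-width m/z bins, arbitrary intensity values for the predicted ions, and uses a cosine similarity as a simple substitute. The m/z values are peak labels from the example spectrum; the intensities are hypothetical.

```python
import math


def predicted_spectrum(b_y_mz, loss_mz, main=1.0, minor=0.2):
    """Uniform model: every b/y ion gets `main`, every loss ion `minor`."""
    spec = {}
    for mz in loss_mz:
        spec[round(mz)] = minor
    for mz in b_y_mz:
        spec[round(mz)] = main
    return spec


def binned(peaks):
    """Observed (m/z, intensity) pairs -> {integer m/z bin: summed intensity}."""
    spec = {}
    for mz, intensity in peaks:
        spec[round(mz)] = spec.get(round(mz), 0.0) + intensity
    return spec


def cosine_score(pred, obs):
    """Cosine similarity between predicted and observed binned spectra."""
    dot = sum(v * obs.get(k, 0.0) for k, v in pred.items())
    norm = (math.sqrt(sum(v * v for v in pred.values()))
            * math.sqrt(sum(v * v for v in obs.values())))
    return dot / norm if norm else 0.0


pred = predicted_spectrum([247.1, 262.1, 304.2, 391.2], loss_mz=[])
# Hypothetical intensities: one even spectrum, one dominated by a single ion.
even = binned([(247.0, 40), (262.1, 40), (304.0, 40), (391.1, 40)])
skewed = binned([(247.0, 5), (262.1, 100), (304.0, 5), (391.1, 5)])
print(cosine_score(pred, even), cosine_score(pred, skewed))
```

Both toy spectra contain every predicted ion, yet the one dominated by a single fragment scores much lower against the uniform prediction.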
11
An unusual spectrum
[Figure: MS/MS spectrum of VLSIGDGIAR; x-axis: m/z (300 to 1000); y-axis: relative abundance (0 to 100); the y4 ion is labelled.]
• MASCOT: correct sequence is the 2nd scoring peptide
• SEQUEST: correct sequence is not in the top 10 scoring peptides
12
Intermediate conclusions
• All current MS/MS database search algorithms use a simplified fragmentation model: "peptides fragment in a uniform manner under low-energy collision induced dissociation (CID) conditions"
• This approach works well for identifying most peptides
• Several peptides exhibit fragment ions that differ greatly from this simple model
• Those peptides often yield low or insignificant scores, thus preventing a positive identification
• A better understanding of the fragmentation of peptides in the gas phase is required to build more robust search engines.
13
How does a peptide fragment?
• Peptides usually fragment at their amide (= peptide) bond, producing b and y ions
• ‘Mobile proton’ hypothesis: cleavage is initiated by migration of the charge from the initial site of protonation
• Arginine (R), a very basic residue, can sequester a proton
• Other basic residues (Lys, His) can hinder proton mobility
• If no mobile proton is available:
– peptide will usually fragment poorly
– other fragmentation mechanisms take precedence
  • cleavage at Asp-Xaa = cD (which we saw two slides back)
  • cleavage at Glu-Xaa = cE
[Figure label: VLSIGDGIAR++]
14
Fragmentation example
[Figure: MS/MS spectrum of VFIMDNCEELIPEYLNFIR++ (ox on M, Pe on C); x-axis: m/z (400 to 2000); y-axis: relative abundance (0 to 100). Labelled ions: y4, y5, y6, y8, y9, b10, b11.]
• nP cleavage
• cP cleavage
Pe (pyridylethyl cysteine) = loss from C; ox = metox (methionine sulfoxide) = loss from M
15
Fragmentation example, II
[Figure: MS/MS spectrum of RVFIMDNCEELIPEYLNFIR++ (ox on M, Pe on C); x-axis: m/z (400 to 2000); y-axis: relative abundance (0 to 100). Labelled features: losses of CH3SOH, Pe and (CH3SOH + Pe); ions b6, y6, y8, y10, y11, y14; the segment MDNCE is marked.]
• metox loss
• Pe loss
• cD cleavage
• cE cleavage
• nP cleavage
Difficult to interpret due to N- and C-terminal arginines.
16
Factors influencing fragmentation
• Some factors have been known for a long time:
– Xaa-Pro (nP) cleavage usually enhanced
– Asp-Xaa (cD) enhanced when no mobile proton is available
• Several recent attempts to improve this knowledge
• Concentrated only on small subsets of data
– Breci et al.
  • database of 168 Pro-containing peptides
  • analyse fragmentation at the Xaa-Pro (nP) bond
  • most abundant ions observed when Xaa is Val, His, Asp, Ile and Leu
– Tabb et al.
  • determined if residues are more likely to cleave on their N rather than their C-terminal
– Huang et al.
  • analysis of 505 doubly-charged tryptic peptides
  • cleavage at Asp-Xaa (cD) is more prominent for peptides that also contain an internal histidine residue
17
Find factors influencing fragmentation
• Data:
– about 11,000 spectra from an Ion-Trap mass spectrometer
– identified using SEQUEST
– manually validated to ensure correct identification
• 5,500 unique sequences
• Preliminary calculations: Cleavage Intensity Ratios (CIR)

\[ \mathrm{CIR}_s = \frac{\sum_{z=1}^{Z} \left( b_s^{z+} + y_s^{z+} \right)}{\frac{1}{N} \sum_{i=1}^{N} \sum_{z=1}^{Z} \left( b_i^{z+} + y_i^{z+} \right)} \]

CIR < 1: reduced cleavage; CIR = 1: average cleavage; CIR > 1: enhanced cleavage
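A Cleavage Intensity Ratio can be computed per bond once the b and y intensities have been summed. The sketch below assumes that the ratio compares the summed b and y intensity (over all charge states) at one bond with the mean of that quantity over all bonds of the same peptide; the input numbers are hypothetical.

```python
def cleavage_intensity_ratios(bond_intensities):
    """bond_intensities[i] = summed b and y ion intensities (all charge
    states) for cleavage at bond i of one peptide.
    Returns one ratio per bond: intensity / mean intensity over all bonds."""
    mean = sum(bond_intensities) / len(bond_intensities)
    if mean == 0:
        raise ValueError("no fragment intensity observed for this peptide")
    return [x / mean for x in bond_intensities]


# Hypothetical intensities for a peptide with four bonds.
cir = cleavage_intensity_ratios([10.0, 10.0, 60.0, 20.0])
print(cir)
```

By construction the ratios of one peptide average to 1, so a value above 1 marks an enhanced cleavage and a value below 1 a reduced one.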
18
Quantifying the Asp-Xaa (cD) bond cleavage
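[Table: columns Mobile, Partially-Mobile, Non-Mobile; rows 1+, 2+, 3+; one entry "-" in the 1+ row.]
Basic-residue labels, in reading order: K1, R1, R1, R2, H1K1, K1R1, R3, R2, K1, K2, K1R1, H1K1R1, K1R2, H1K2R1, H1K1R2
Values, in reading order: 5.10 (126), 0.81 (358), 1.04 (316), 4.96 (92), 0.88 (54), 0.91 (37), 3.63 (12), 1.31 (24), 2.37 (238), 1.66 (276), 2.06 (301), 1.63 (79), 2.51 (21), 1.94 (23), 2.71 (10)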
‘Relative Proton Mobility’ Scale
• If number of Arg residues ≥ number of charges: Non-mobile
• If number of Arg, Lys & His < number of charges: Mobile
• Otherwise they are designated Partially-mobile
Entries: average CIR (# peptides), stratified by # basic residues
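The three rules of the scale translate directly into a small classifier. This is a minimal sketch of the rule as stated on the slide; the function name and the lower-case labels are illustrative.

```python
def proton_mobility(sequence, charge):
    """Classify a peptide ion on the 'relative proton mobility' scale."""
    n_arg = sequence.count("R")
    n_basic = n_arg + sequence.count("K") + sequence.count("H")
    if n_arg >= charge:
        return "non-mobile"          # every charge can be held by an Arg
    if n_basic < charge:
        return "mobile"              # more charges than basic residues
    return "partially-mobile"


print(proton_mobility("LEGLTDEINFLR", 1))
print(proton_mobility("RAELEAK", 2))
```

For LEGLTDEINFLR with one charge, the single Arg is enough to give the non-mobile class; RAELEAK with two charges has one Arg and one Lys, so it falls in the partially-mobile class.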
19
Influence on scoring
• Already known: the charge state has an influence on search scores
• Proton mobility also influences search scores
Dashed line: currently accepted cut-off; spectra scoring below it are not identified without manual intervention.
20
Find factors influencing fragmentation, II
• Data categorized into 9 different strata, according to
– charge state (1, 2 or 3+)
– ‘relative proton mobility’ scale
• Each spectrum was individually normalised
[Figure: number of peptides per stratum]
1+: Mobile (8), Partially-mobile (741), Non-mobile (378)
2+: Mobile (1115), Partially-mobile (2035), Non-mobile (238)
3+: Mobile (284), Partially-mobile (572), Non-mobile (20)
4+: Mobile (53), Partially-mobile (55)
21
Find factors influencing fragmentation, III
• Intensity at cleavage Xaa-Yaa is modeled by:
log(intensity of the cleavage) = baseline cleavage intensity
  + increase/decrease due to the residue cleaved on its C-terminal side (Xaa)
  + increase/decrease due to the residue cleaved on its N-terminal side (Yaa)
  + (pos) + (pos²) + log2(peptide length)
• where
– intensity of the cleavage = sum of intensities of all ions (b, y, etc.) produced by cleavage at this bond
– baseline cleavage intensity = average cleavage intensity if no factor has a special effect on fragmentation
– increase/decrease = indicator variables
– pos = relative position of the cleavage inside the peptide (0..1)
– log(peptide length) = accounts for the lower intensity, due to the normalisation process, of a given cleavage when it occurs in a longer peptide
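The predictors of this model can be written out for a single bond. The sketch below is an illustration only: it assumes that pos is the bond index divided by the peptide length, that the length term is a base-2 logarithm, and that one indicator is used per residue on each side of the bond. The coefficient values in the example are hypothetical, not fitted values.

```python
import math


def cleavage_features(peptide, bond):
    """Predictors for the bond between peptide[bond-1] (Xaa) and
    peptide[bond] (Yaa); bond runs from 1 to len(peptide) - 1."""
    n = len(peptide)
    pos = bond / n                       # assumed definition of relative position
    return {
        "intercept": 1.0,                # baseline cleavage intensity
        "C_" + peptide[bond - 1]: 1.0,   # indicator: Xaa, cleaved on its C-terminal side
        "N_" + peptide[bond]: 1.0,       # indicator: Yaa, cleaved on its N-terminal side
        "pos": pos,
        "pos2": pos ** 2,
        "log2_length": math.log2(n),
    }


def predict_log_intensity(features, coefficients):
    """Linear predictor: sum of feature value times coefficient."""
    return sum(v * coefficients.get(name, 0.0) for name, v in features.items())


features = cleavage_features("LEGLTDEINFLR", 6)   # the Asp-Glu bond
toy_coefficients = {"intercept": 1.0, "C_D": 0.8, "pos": 0.5}  # hypothetical values
print(features)
print(predict_log_intensity(features, toy_coefficients))
```

Residues without a coefficient act as the reference level, which matches the idea of dropping one residue per side from the model.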
22
Find factors influencing fragmentation, IV
• Linear regression is performed to estimate the effect of each of these variables on the fragmentation process
• Variable selection: ensure that only variables that have a real effect on the fragmentation process are retained
– for each "side" (C or N), the factor that is the closest to the average intensity is removed from the model. In other words, one of the residues of each side is selected as the reference, the residue that "does nothing"
– backward selection is then performed to remove all variables that are not significantly different from 0 (at the 1% level)
23
How to find factors influencing fragmentation
• The regression was always significant (i.e. at least one factor was significant)
• In practice:
– the pos and log(length) terms were always retained
– in each regression, several residues were selected
Factors influencing fragmentation
25
Predicting ion intensities
• Use the same kind of linear model as before
• Fit separate models for the different types of ions that we want to predict
• Currently, only b and y ions are predicted
• The influences of residues and positional factors are taken into account for the prediction
• This (and everything before) is valid only on an Ion-Trap mass spectrometer
26
Prediction example: LEGLTDEINFLR, 1+
[Figure panels: Observed spectrum; SEQUEST prediction; Prediction with LM (linear model)]
• ‘non-mobile’ peptide, which usually gives bad scores
• correlation between observed and LM predicted spectrum: 0.97
27
Testing our predictions
• Predictions were tested on a set of 283 peptides not used for fitting the model
• correlation between predicted and observed spectrum: median: 0.73, interquartile range: 0.27
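The evaluation described here amounts to one correlation per peptide followed by a summary over the test set. The sketch below assumes a Pearson correlation on intensity vectors aligned ion by ion, with non-constant intensities in both vectors; all numbers are hypothetical and are not the values behind the reported results.

```python
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long intensity vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def summarise(correlations):
    """Median and interquartile range of per-peptide correlations."""
    q1, median, q3 = statistics.quantiles(correlations, n=4)
    return median, q3 - q1


# Hypothetical predicted and observed intensities for one peptide.
predicted = [0.10, 0.40, 0.30, 0.20]
observed = [0.15, 0.35, 0.30, 0.20]
print(pearson(predicted, observed))

# Hypothetical per-peptide correlations for a small test set.
print(summarise([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]))
```

Reporting the median with the interquartile range keeps a single badly predicted peptide from dominating the summary.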
28
Testing our predictions, II
• Worst scoring peptide (correlation = -0.19): RAELEAK, doubly-charged
• Explanation
– Most peptides in the training set are tryptic peptides
– Proton will usually sit at the C-terminal of the peptide (K)
– Under this assumption, y-ions are usually more intense than b-ions
– Because of the miscleavage, the proton actually sits at the N-terminal
– Consequently, b-ions are more intense than y-ions
– The model performs badly
• Charge localisation should be taken into account
29
Ongoing work
• More known effects (e.g. charge localisation) must be taken into account in the model, plus some interactions
• Other effects, still unknown, also have an influence on the fragmentation, and should be looked for
• Predict other ion series (neutral losses, etc.)
• Test if the predictions can help discriminate between correct and incorrect identifications
• Build a new search algorithm that takes into account these predictive models
30
Conclusions
• Prediction of spectra is becoming feasible
• Better search algorithms are expected
• The ‘relative proton mobility’ scale helps the interpretation of database search scores
• Optimized thresholds can be used for different subsets of the data
• It should improve the sensitivity and specificity of the identification process
• These are important steps towards fully automated identification of peptide MS/MS data
31
Acknowledgments
• JPSL, Ludwig Institute, Melbourne
– Eugene Kapp
– James Eddes
– Gavin Reid
– Lisa Connolly
– David Frecklington
– Robert Moritz
– Richard Simpson
• Bioinformatics, WEHI
– Frédéric Schütz
• Dept. of Chemistry, Melbourne University
– Richard O’Hair
Part of this work will appear in Analytical Chemistry