Deriving statistical models for predicting MS/MS product ion intensities


Terry Speed & Frédéric Schütz

Division of Genetics & Bioinformatics, The Walter and Eliza Hall Institute of Medical Research

In collaboration with the Joint ProteomicS Laboratory (WEHI/LICR)


Introduction

• Proteomics is critical to our understanding of cellular biological processes

• Mass Spectrometry (MS) has emerged as a key platform in proteomics for the high-throughput identification of proteins

• Sophisticated algorithms, such as Mascot or Sequest, exist for database searching of MS/MS data

• Major bottleneck: results must often be manually validated

• More robust algorithms are needed before the identification of MS/MS data can be fully automated

What is a Mass Spectrometer?

“An analytical device that determines the molecular weight of chemical compounds by separating molecular ions according to their mass-to-charge ratio (m/z)”

[Figure: schematic of a mass spectrometer: ionisation, separation by m/z, detection. Example compounds: molecular weight = 600 Da, abundance = 50 %; molecular weight = 300 Da, abundance = 30 %. Peaks at m/z 301, 401 and 601; abundance values 50, 30 and 20.]

[Figure: schematic of a mass spectrometer: ionisation, separation by m/z, detection. Example compound: molecular weight = 300 Da, abundance = 30 %. Peaks at m/z 201, 301, 401 and 601; abundance values 50, 30, 20 and 10.]

Tandem MS (MS/MS)

To gain structural information about the detected masses:

[Figure: one product is selected → collision with a gas → separation & detection in a second MS]

– different molecules of the same substance can split in different ways.
– in each molecule, only the pieces that retain one of the charges will be observed and present in the spectrum; the others are discarded.

How to use MS for protein identification

Peptide mass fingerprinting

[Figure: workflow: Sample → Proteins → 2D-GEL → EXCISE → DIGEST → MS (spectrum over m/z)]

• The exact protein needs to be in the database
• Works only with single protein fragmentations

Example: peaks at m/z 333, 336, 406, 448, 462, 889. The only protein in the database that would produce these peaks is

MALK|CGIR|GGSRPFLR|ATSK|ASR|SDD
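The fingerprinting example can be checked with a short Python sketch. It assumes that trypsin cleaves after K or R except when the next residue is P, uses standard monoisotopic residue masses, and reports singly protonated [M+H]+ values. The function names are illustrative and do not come from any search engine.

```python
import re

# Monoisotopic residue masses (Da), listed only for residues in the example.
RESIDUE_MASS = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "F": 147.06841,
    "G": 57.02146, "I": 113.08406, "K": 128.09496, "L": 113.08406,
    "M": 131.04049, "P": 97.05276, "R": 156.10111, "S": 87.03203,
    "T": 101.04768,
}
WATER = 18.01056   # added once per peptide
PROTON = 1.00728   # charge carrier for [M+H]+


def tryptic_digest(protein):
    """Split after K or R, but not when the next residue is P (assumed rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def mh_plus(peptide):
    """m/z of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


protein = "MALKCGIRGGSRPFLRATSKASRSDD"
peptides = tryptic_digest(protein)
for pep in peptides:
    print(f"{pep:10s} {mh_plus(pep):8.2f}")
```

Under these assumptions the six peptides land within about half a dalton of the six peaks quoted in the example. Note that GGSRPFLR stays intact because its internal R is followed by P.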

Tandem MS for protein identification

[Figure: workflow: 2-D Gel (or 1-D Gel) → In-gel Digest (Trypsin) → Capillary Column RP-HPLC (On-line; 60 min Gradient) → MS Analysis (ESI Ion Trap) → MS data → CID (most intense ion) → MS/MS data. Inset on electrospray ionisation: original droplet, solvent evaporates from droplet, positive ions.]

CID = Collision Induced Dissociation

Example MS/MS spectrum

[Figure: MS/MS spectrum, relative abundance (0 to 100) against m/z (150 to 950), with labelled ions a2, b2 to b8 and y2 to y8. Labelled peaks (m/z): 205.0, 219.0, 247.0, 248.1, 262.1, 304.0, 305.1, 318.1, 372.2, 391.1, 417.2, 418.1, 431.1, 468.4, 506.2, 530.2, 619.2, 645.3, 732.2, 774.4, 789.3, 889.4, 904.5, 936.4, 937.4. Residues read along the spectrum: Glu Asp Lxx Lxx Gly Phe.]

Tryptic fragment: Val Phe Gly Lxx Lxx Asp Glu Asp Lys (b ions b2 to b8, y ions y8 to y2)
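The ion labels in the example spectrum can be related to the sequence with a small Python sketch. It assumes singly charged fragments, standard monoisotopic residue masses, and writes Lxx as L (Leu and Ile have the same mass). The helper name `ion_ladder` is made up for illustration.

```python
# Monoisotopic residue masses (Da), listed only for residues in this peptide.
RESIDUE_MASS = {
    "V": 99.06841, "F": 147.06841, "G": 57.02146, "L": 113.08406,
    "D": 115.02694, "E": 129.04259, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728
CO = 27.99491  # an a ion is the corresponding b ion minus CO


def ion_ladder(peptide):
    """Return {label: m/z} for singly charged a, b and y ions."""
    ions = {}
    for i in range(1, len(peptide)):
        # b ion: first i residues plus the charge-carrying proton
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        # y ion: remaining residues plus water plus proton
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        ions[f"b{i}"] = b
        ions[f"a{i}"] = b - CO
        ions[f"y{len(peptide) - i}"] = y
    return ions


ions = ion_ladder("VFGLLDEDK")  # Lxx written as L
for label in ("a2", "b2", "y2", "b8", "y8"):
    print(label, round(ions[label], 1))
```

With these assumptions the computed values fall within a few tenths of a dalton of labelled peaks such as 219.0, 247.0, 262.1, 889.4 and 936.4.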

Interpretation of MS/MS data

• Direct interpretation ("de novo sequencing")
– spectrum must be of good quality
– the only identification method if the peptide is not in the database
– can give useful information (partial sequence) for database search

• General approach for database searching:
– extract from the database all peptides that have the same mass as the precursor ion of the uninterpreted spectrum
– compare each of them to the uninterpreted spectrum
– select the peptide that is most likely to have produced the observed data

• MASCOT:
– simple probabilistic model
– calculate the probability that a peptide could have produced the given spectrum by chance
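The general database-search approach (filter by precursor mass, compare, select the best) can be sketched in Python as follows. The scoring function here is a plain count of matched b and y ions, which is a stand-in chosen for brevity and is not the MASCOT or SEQUEST score. The toy database, the tolerances and the precursor mass (computed from the sequence, not quoted in the slides) are all assumptions.

```python
# Monoisotopic residue masses (Da), listed only for residues used below.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "E": 129.04259, "F": 147.06841,
    "G": 57.02146, "K": 128.09496, "L": 113.08406, "M": 131.04049,
    "V": 99.06841,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of the peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def fragment_mz(seq):
    """Singly charged b and y ions for every peptide bond."""
    ions = []
    for i in range(1, len(seq)):
        ions.append(sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON)
        ions.append(sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER + PROTON)
    return ions


def shared_peaks(seq, peaks, tol=0.5):
    """Stand-in score: number of predicted ions found among the observed peaks."""
    return sum(any(abs(mz - p) <= tol for p in peaks) for mz in fragment_mz(seq))


def search(database, precursor_mass, peaks, mass_tol=1.0):
    """Keep candidates matching the precursor mass, return the best-scoring one."""
    candidates = [p for p in database
                  if abs(peptide_mass(p) - precursor_mass) <= mass_tol]
    return max(candidates, key=lambda p: shared_peaks(p, peaks), default=None)


# Labelled peaks from the example spectrum; the database is hypothetical.
peaks = [205.0, 219.0, 247.0, 248.1, 262.1, 304.0, 305.1, 318.1, 372.2,
         391.1, 417.2, 418.1, 431.1, 468.4, 506.2, 530.2, 619.2, 645.3,
         732.2, 774.4, 789.3, 889.4, 904.5, 936.4, 937.4]
database = ["VFGLLDEDK", "VFGLLEDDK", "MALK"]
best = search(database, precursor_mass=1034.5, peaks=peaks)
print(best)
```

The two nine-residue candidates have the same mass, so only the fragment comparison can separate them. That is the step where a more realistic intensity model would matter.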

Interpretation of MS/MS data

• SEQUEST:
– generate a predicted spectrum for each potential peptide using a simple fragmentation model (all b and y ions have the same intensity; possible losses from b and y have a lower intensity)
– compute a "cross-correlation" score and find the best-matching peptide
– since this operation is very time-consuming, a simpler preliminary score is used to find the 500 peptides in the database that are most likely to be the correct identification
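The two-stage idea (a cheap preliminary score to build a shortlist, then an expensive score on the shortlist only) can be sketched generically in Python. Both scoring functions below are toy placeholders and are not SEQUEST's actual scores. Only the shortlist size of 500 comes from the slide.

```python
import heapq


def two_stage_search(candidates, preliminary_score, full_score, shortlist_size=500):
    """Rank everything with the cheap score, then rescore only the shortlist."""
    shortlist = heapq.nlargest(shortlist_size, candidates, key=preliminary_score)
    return max(shortlist, key=full_score)


# Toy illustration with made-up scores against a known target sequence.
target = "VLSIGDGIAR"
candidates = ["VLSIGDGIAR", "VLSIGDGLAR", "AAAAAAAAAA", "RAIGDGISLV"]


def cheap(peptide):
    """Cheap placeholder score: number of residue types shared with the target."""
    return len(set(peptide) & set(target))


def expensive(peptide):
    """Costlier placeholder score: number of positions identical to the target."""
    return sum(a == b for a, b in zip(peptide, target))


best = two_stage_search(candidates, cheap, expensive, shortlist_size=3)
print(best)
```

The design trades some risk (the correct peptide may be dropped by the preliminary score) for a much smaller number of expensive comparisons.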

An unusual spectrum

[Figure: MS/MS spectrum of VLSIGDGIAR, relative abundance (0 to 100) against m/z (300 to 1000), with the y4 ion labelled]

• MASCOT: correct sequence is the 2nd scoring peptide
• SEQUEST: correct sequence is not in the top 10 scoring peptides

Intermediate conclusions

• All current MS/MS database search algorithms use a simplified fragmentation model: "peptides fragment in a uniform manner under low-energy collision induced dissociation (CID) conditions"

• This approach works well for identifying most peptides
• Several peptides exhibit fragment ions that differ greatly from this simple model
• Those peptides often yield low or insignificant scores, thus preventing a positive identification

• A better understanding of the fragmentation of peptides in the gas phase is required to build more robust search engines.

How does a peptide fragment?

• Peptides usually fragment at their amide (= peptide) bond, producing b and y ions
• ‘mobile proton’ hypothesis: cleavage is initiated by migration of the charge from the initial site of protonation
• aRginine (a very basic residue) can sequester a proton
• Other basic residues (Lys, His) can hinder proton mobility
• If no mobile proton is available:
– peptide will usually fragment poorly
– other fragmentation mechanisms take precedence
• cleavage at Asp-Xaa = cD (which we saw two slides back)
• cleavage at Glu-Xaa = cE

VLSIGDGIAR++

Fragmentation example

[Figure: MS/MS spectrum of VFIMDNCEELIPEYLNFIR (ox, Pe), ++, relative abundance (0 to 100) against m/z (400 to 2000). Labelled ions: y4, y5, y6, y8, y9, b10, b11.]

• nP cleavage
• cP cleavage

Pe (pyridylethyl cysteine) = loss from C; ox = metox (methionine sulfoxide) = loss from M

Fragmentation example, II

[Figure: MS/MS spectrum of RVFIMDNCEELIPEYLNFIR (ox, Pe), ++, relative abundance (0 to 100) against m/z (400 to 2000). Labels: -CH3SOH, -Pe, -(CH3SOH + Pe), b6, y6, y8, y10, y11, y14, MDNCE.]

• metox loss
• Pe loss
• cD cleavage
• cE cleavage
• nP cleavage

Difficult to interpret due to N- and C-terminal aRginines.

Factors influencing fragmentation

• Some factors have been known for a long time:
– Xaa-Pro (nP) cleavage usually enhanced
– Asp-Xaa (cD) enhanced when no mobile proton is available
• Several recent attempts to improve this knowledge
• Concentrated only on small subsets of data
– Breci et al.
• database of 168 Pro-containing peptides
• analyse fragmentation at the Xaa-Pro (nP) bond
• most abundant ions observed when Xaa is Val, His, Asp, Ile and Leu
– Tabb et al.
• determined if residues are more likely to cleave on their N-terminal rather than their C-terminal side
– Huang et al.
• analysis of 505 doubly-charged tryptic peptides
• cleavage at Asp-Xaa (cD) is more prominent for peptides that also contain an internal histidine residue

Find factors influencing fragmentation

• Data:
– about 11,000 spectra from an Ion-Trap mass spectrometer
– identified using SEQUEST
– manually validated to ensure correct identification
• 5,500 unique sequences
• Preliminary calculations: Cleavage Intensity Ratios (CIR)

$$
\mathrm{CIR}_s = \frac{\sum_{z=1}^{Z} \left( b_s^{z+} + y_s^{z+} \right)}{\frac{1}{N} \sum_{i=1}^{N} \sum_{z=1}^{Z} \left( b_i^{z+} + y_i^{z+} \right)}
$$

CIR < 1: cleavage reduced; CIR = 1: average; CIR > 1: cleavage enhanced
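As a minimal Python sketch, the CIR can be read as the summed b and y intensity at one cleavage site divided by the average of that sum over all cleavage sites of the peptide. This reading of the symbols (sites indexed over the peptide, charge states summed per site) is an assumption based on the slide, and the intensities below are made up.

```python
def cleavage_intensity_ratios(sites):
    """sites[s] is a list of (b, y) intensity pairs, one pair per fragment charge state."""
    totals = [sum(b + y for b, y in pairs) for pairs in sites]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def describe(cir, tol=1e-9):
    """Map a CIR value onto the reduced / average / enhanced scale."""
    if cir < 1 - tol:
        return "reduced"
    if cir > 1 + tol:
        return "enhanced"
    return "average"


# Four cleavage sites, one charge state each (hypothetical intensities).
sites = [[(4.0, 6.0)], [(10.0, 20.0)], [(5.0, 15.0)], [(0.0, 20.0)]]
cirs = cleavage_intensity_ratios(sites)
print([(c, describe(c)) for c in cirs])
```

By construction the ratios of one peptide average to 1, so a value such as 1.5 means the site carries 50 % more intensity than the typical bond of that peptide.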


Quantifying the Asp-Xaa (cD) bond cleavage

Mobile Partially-Mobile Non-Mobile

1+ 5.10 (126) -

2+K1

R1

0.81 (358)1.04 (316)

R1

4.96 (92)R2

3+

H1K1

K1R1

0.88 (54)0.91 (37) 3.63 (12)R3

1.31 (24)R2

2.37 (238)

1.66 (276)2.06 (301)

K1

K2

K1R1

1.63 (79)2.51 (21)

H1K1R1

K1R2

1.94 (23)2.71 (10)

H1K2R1

H1K1R2

‘Relative Proton Mobility’ Scale

• If number of Arg residues ≥ number of charges: Non-mobile
• If number of Arg, Lys & His < number of charges: Mobile
• otherwise they are designated Partially-mobile

Entries: average CIR (#peptides), stratified by # basic residues
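The ‘relative proton mobility’ rule translates directly into a few lines of Python. The function name is illustrative, and the sketch assumes an unmodified one-letter sequence and counts only R, K and H as basic residues, as in the rule.

```python
def proton_mobility(sequence, charge):
    """Classify a peptide ion on the 'relative proton mobility' scale."""
    n_arg = sequence.count("R")
    n_basic = n_arg + sequence.count("K") + sequence.count("H")
    if n_arg >= charge:
        return "non-mobile"
    if n_basic < charge:
        return "mobile"
    return "partially-mobile"


for seq, z in [("LEGLTDEINFLR", 1), ("VLSIGDGIAR", 2), ("RAELEAK", 2)]:
    print(seq, f"{z}+", proton_mobility(seq, z))
```

The two tests cannot both succeed for the same ion, because having at least as many Arg residues as charges implies at least as many basic residues as charges.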

Influence on scoring

• Already known: The charge state has an influence on search scores
• Proton mobility also influences search scores

[Figure] Dashed line: currently accepted cut-off; spectra below it are not identified without manual intervention.

Find factors influencing fragmentation, II

• Data categorized into 9 different strata, according to
– charge state (1, 2 or 3+)
– ‘relative proton mobility’ scale
• Each spectrum was individually normalised

[Figure: one panel per stratum, labelled: 1+ Mobile (8); 1+ Non-mobile (378); 2+ Mobile (1115); 2+ Non-mobile (238); 3+ Mobile (284); 3+ Partially-mobile (572); 3+ Non-mobile (20); 4+ Mobile (53); 4+ Partially-mobile (55)]

Find factors influencing fragmentation, III

• Intensity at cleavage Xaa-Yaa is modeled by:

log(intensity of the cleavage) =
    baseline cleavage intensity
    + increase/decrease due to residue on C-term (Xaa)
    + increase/decrease due to residue on N-term (Yaa)
    + (pos) + (pos²) + log2(peptide length)

• where
– intensity of the cleavage = sum of intensities of all ions (b, y, etc) produced by cleavage at this bond
– baseline cleavage intensity = average cleavage intensity if no factor has a special effect on fragmentation
– increase/decrease = indicator variables
– pos = relative position of the cleavage inside the peptide (0..1)
– log(peptide length) = accounts for the lower intensity, due to the normalisation process, of a given cleavage when it occurs in a longer peptide
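Written with symbols, the model for a cleavage between residues Xaa and Yaa might look as follows. The Greek letters are notation introduced here for illustration and do not appear in the slides. The base of the logarithm is left unspecified, and an error term is added on the assumption of an ordinary linear model.

```latex
\log I_{\mathrm{Xaa\text{-}Yaa}}
  = \mu
  + \alpha_{\mathrm{Xaa}}
  + \beta_{\mathrm{Yaa}}
  + \gamma_1\,\mathrm{pos}
  + \gamma_2\,\mathrm{pos}^2
  + \delta\,\log(\text{peptide length})
  + \varepsilon
```

Here $\mu$ plays the role of the baseline cleavage intensity, $\alpha_{\mathrm{Xaa}}$ and $\beta_{\mathrm{Yaa}}$ are the coefficients of the residue indicator variables on either side of the bond, and $\gamma_1$, $\gamma_2$, $\delta$ are the coefficients of the positional and length terms.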

Find factors influencing fragmentation, IV

• Linear regression is performed to estimate the effect of each of these variables on the fragmentation process

• Variable selection: ensure that only variables that have a real effect on the fragmentation process are retained
– for each "side" (C or N), the factor that is the closest to the average intensity is removed from the model. In other words, one of the residues of each side is selected as the reference, the residue that "does nothing"
– backward selection is then performed to remove all variables that are not significantly different from 0 (at the 1% level)
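The regression step can be illustrated with a small NumPy sketch: residues on each side of the bond are encoded as indicator variables with one reference residue dropped, and the coefficients are estimated by least squares. The data are synthetic and noise-free, the effect sizes are made up, and the reference residue "A" is an arbitrary choice (the slides pick the factor closest to the average intensity). Backward selection and significance testing are not included.

```python
import itertools
import numpy as np

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
REFERENCE = "A"  # hypothetical reference residue, dropped from both sides
LEVELS = [r for r in RESIDUES if r != REFERENCE]
NAMES = (["baseline"] + [f"C:{r}" for r in LEVELS] + [f"N:{r}" for r in LEVELS]
         + ["pos", "pos^2", "log(length)"])


def design_row(xaa, yaa, pos, length):
    """Regressors for one cleavage Xaa-Yaa at relative position pos (0..1)."""
    c_side = [float(xaa == r) for r in LEVELS]  # indicator variables for Xaa
    n_side = [float(yaa == r) for r in LEVELS]  # indicator variables for Yaa
    return [1.0] + c_side + n_side + [pos, pos ** 2, np.log(length)]


# Synthetic, noise-free log-intensities generated from made-up effects.
true = dict.fromkeys(NAMES, 0.0)
true.update({"baseline": 1.0, "C:D": 0.8, "N:P": 1.2,
             "pos": 0.5, "pos^2": -0.5, "log(length)": -0.3})
rows = [design_row(x, y, (k % 7 + 1) / 8, 6 + k % 11)
        for k, (x, y) in enumerate(itertools.product(RESIDUES, repeat=2))]
X = np.array(rows)
y = X @ np.array([true[n] for n in NAMES])

# Ordinary least squares fit of all coefficients at once.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = dict(zip(NAMES, coef))
print({n: round(float(v), 3) for n, v in fitted.items() if abs(v) > 1e-6})
```

Dropping a reference residue on each side keeps the design matrix full rank. Without it, the indicator columns of one side would sum to the baseline column and the coefficients would not be identifiable.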

How to find factors influencing fragmentation

• The regression was always significant (i.e. at least one factor was significant)
• In practice:
– the pos and log(length) terms were always retained
– in each regression, several residues were selected


Factors influencing fragmentation

Predicting ion intensities

• Use the same kind of linear model as before
• Fit separate models for the different types of ions that we want to predict
• Currently, only b and y ions are predicted
• Influence of residues and positional factors are taken into account for the prediction
• This (and everything before) is valid only on an Ion-Trap mass spectrometer

Prediction example: LEGLTDEINFLR, 1+

[Figure: three panels: Observed spectrum; SEQUEST prediction; Prediction with LM]

• ‘non-mobile’ peptide, which usually gives bad scores
• correlation between observed and LM predicted spectrum: 0.97

Testing our predictions

• Predictions were tested on a set of 283 peptides not used for fitting the model
• correlation between predicted and observed spectrum: median: 0.73, interquartile range: 0.27
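The evaluation can be sketched in Python as one Pearson correlation per peptide, followed by the median and interquartile range over all peptides. The sketch assumes that predicted and observed intensities are already aligned ion by ion. The numbers are invented, and the quartile convention of `statistics.quantiles` may differ from the one used for the reported figures.

```python
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long intensity vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def summarise(correlations):
    """Median and interquartile range of per-peptide correlations."""
    q1, _, q3 = statistics.quantiles(correlations, n=4)
    return statistics.median(correlations), q3 - q1


# Hypothetical (predicted, observed) intensity vectors for five peptides.
pairs = [
    ([1, 2, 3, 4], [2, 4, 6, 8]),
    ([1, 2, 3, 4], [1, 3, 2, 4]),
    ([5, 1, 4, 2], [4, 2, 5, 1]),
    ([1, 2, 3, 4], [4, 3, 2, 1]),
    ([3, 1, 2, 5], [3, 2, 1, 4]),
]
correlations = [pearson(p, o) for p, o in pairs]
median, iqr = summarise(correlations)
print(round(median, 2), round(iqr, 2))
```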

Testing our predictions, II

• Worst scoring peptide (correlation = -0.19): RAELEAK, doubly-charged

• Explanation
– Most peptides in the training set are tryptic peptides
– Proton will usually sit at the C-terminal of the peptide (K)
– Under this assumption, y-ions are usually more intense than b-ions
– Because of the miscleavage, the proton actually sits at the N-terminal
– Consequently, b-ions are more intense than y-ions
– The model performs badly

• Charge localisation should be taken into account

Ongoing work

• More known effects (e.g. charge localisation) must be taken into account in the model, plus some interactions
• Other effects, still unknown, also have an influence on the fragmentation, and should be looked for
• Predict other ion series (neutral losses, etc)
• Test if the predictions can help discriminate between correct and incorrect identifications
• Build a new search algorithm that takes into account these predictive models

Conclusions

• Prediction of spectra is becoming feasible
• Better search algorithms are expected

• The ‘relative proton mobility’ scale helps the interpretation of database search scores

• Optimized thresholds can be used for different subsets of the data

• It should improve the sensitivity and specificity of the identification process

• These are important steps towards fully automated identification of peptide MS/MS data

Acknowledgments

• JPSL, Ludwig Institute, Melbourne
– Eugene Kapp
– James Eddes
– Gavin Reid
– Lisa Connolly
– David Frecklington
– Robert Moritz
– Richard Simpson

• Bioinformatics, WEHI
– Frédéric Schütz

• Dept. of Chemistry, Melbourne University
– Richard O’Hair

Part of this work will appear in Analytical Chemistry
