Protein Sequencing and Identification by Mass Spectrometry.

Transcript of Protein Sequencing and Identification by Mass Spectrometry.

Page 1: Protein Sequencing and Identification by Mass Spectrometry.

Protein Sequencing and Identification by Mass Spectrometry

Page 2: Protein Sequencing and Identification by Mass Spectrometry.

Masses of Amino Acid Residues

Page 3: Protein Sequencing and Identification by Mass Spectrometry.

Protein Backbone

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Side chains: R(i-1), R(i), R(i+1)

AA residue i-1, AA residue i, AA residue i+1

N-terminus ... C-terminus

Page 4: Protein Sequencing and Identification by Mass Spectrometry.

Peptide Fragmentation

• Peptides tend to fragment along the backbone.

• Fragments can also lose neutral chemical groups like NH3 and H2O.

H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

H+

Prefix Fragment Suffix Fragment

Collision Induced Dissociation

Page 5: Protein Sequencing and Identification by Mass Spectrometry.

Breaking Protein into Peptides and Peptides into Fragment Ions

• Proteases, e.g. trypsin, break protein into peptides.

• A Tandem Mass Spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.

• The Mass Spectrometer accelerates the fragmented ions; heavier ions accelerate more slowly than lighter ones.

• The Mass Spectrometer measures the mass/charge ratio of an ion.
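The protease step in the first bullet can be sketched in a few lines. This is a minimal illustration, assuming the commonly used simplified trypsin rule (cleave after K or R unless the next residue is P); the function name `tryptic_digest` and the input sequence are hypothetical, and real digests also contain missed cleavages.

```python
def tryptic_digest(protein):
    """Split a protein after K or R, unless the next residue is P (simplified rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_is_proline = i + 1 < len(protein) and protein[i + 1] == "P"
        if residue in "KR" and not next_is_proline:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep whatever remains after the last cleavage site.
        peptides.append(protein[start:])
    return peptides


# Hypothetical sequence, chosen only to exercise both rules.
print(tryptic_digest("AKGRPLKE"))  # ['AK', 'GRPLK', 'E']
```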

Page 6: Protein Sequencing and Identification by Mass Spectrometry.

N- and C-terminal Peptides

N-terminal peptides

C-terminal peptides

Page 7: Protein Sequencing and Identification by Mass Spectrometry.

Terminal peptides and ion types

Peptide: Mass (Da) 57 + 97 + 147 + 114 = 415

Peptide without H2O: Mass (Da) 57 + 97 + 147 + 114 – 18 = 397
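The two sums above can be checked with a short sketch. It assumes nominal (integer) residue masses and reads the four terms as the residues G, P, F, N; that reading, the table subset, and the helper name `residue_sum` are assumptions made for illustration, not something the slide states.

```python
# Nominal residue masses in daltons (a small subset of the standard table).
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}
WATER = 18  # nominal mass of H2O


def residue_sum(peptide):
    """Sum of nominal residue masses of a peptide string."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


mass = residue_sum("GPFN")
print(mass)          # 415
print(mass - WATER)  # 397
```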

Page 8: Protein Sequencing and Identification by Mass Spectrometry.

N- and C-terminal Peptides

N-terminal peptides (masses): 57, 154, 301, 415

C-terminal peptides (masses): 71, 185, 332, 429

Full peptide (mass): 486

Page 9: Protein Sequencing and Identification by Mass Spectrometry.

N- and C-terminal Peptides


Page 10: Protein Sequencing and Identification by Mass Spectrometry.

N- and C-terminal Peptides


Page 11: Protein Sequencing and Identification by Mass Spectrometry.

N- and C-terminal Peptides

Fragment masses: 57, 71, 154, 185, 301, 332, 415, 429, 486

Reconstruct the peptide from the set of masses of its fragment ions (the mass spectrum).
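To see where such a set of masses comes from, the forward direction is easy to sketch: list the masses of every N-terminal (prefix) and C-terminal (suffix) peptide. The peptide GPFNA used here is an assumption inferred from the nominal masses on these slides, and the function names are hypothetical.

```python
# Nominal residue masses in daltons (subset needed for this example).
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}


def prefix_masses(peptide):
    """Masses of all N-terminal peptides, shortest first."""
    masses, total = [], 0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def suffix_masses(peptide):
    """Masses of all C-terminal peptides, shortest first."""
    return prefix_masses(peptide[::-1])


peptide = "GPFNA"  # assumed peptide, inferred from the masses
print(prefix_masses(peptide))  # [57, 154, 301, 415, 486]
print(suffix_masses(peptide))  # [71, 185, 332, 429, 486]
print(sorted(set(prefix_masses(peptide) + suffix_masses(peptide))))
```

The sorted union reproduces the nine masses on the slide; the sequencing problem is the inverse of this computation.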

Page 12: Protein Sequencing and Identification by Mass Spectrometry.

Peptide Fragmentation

Figure: backbone of a four-residue peptide (side chains R1 to R4, ending in COOH) with cleavage sites labeled by ion type: a2, a3, b2, b3 (N-terminal fragments) and y1, y2, y3 (C-terminal fragments).

Neutral-loss ions shown: b2-H2O, b3-NH3, y3-H2O, y2-NH3.

Page 13: Protein Sequencing and Identification by Mass Spectrometry.

Mass Spectra

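Peptide: G V D L K

Figure: peaks along a mass axis starting at 0. Prefix fragments spell G V D L K (57 Da = ‘G’, 99 Da = ‘V’); suffix fragments spell the reverse, K L D V G. Additional figure labels: D, H2O.

• The peaks in the mass spectrum:

• Prefix and Suffix Fragments.

• Fragments with neutral losses (-H2O, -NH3)

• Noise and missing peaks.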

Page 14: Protein Sequencing and Identification by Mass Spectrometry.

Protein Identification with MS/MS

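Figure: the peptide G V D L K and its spectrum (intensity vs. mass, axis starting at 0). MS/MS goes from the peptide to the spectrum; Peptide Identification goes from the spectrum back to the peptide.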

Page 15: Protein Sequencing and Identification by Mass Spectrometry.

Tandem Mass-Spectrometry

Page 16: Protein Sequencing and Identification by Mass Spectrometry.

Breaking Proteins into Peptides

Figure: the protein MPSERGTDIMRPAKID...... is broken into peptides (MPSER……, GTDIMRPAKID……), which pass through HPLC to MS/MS.

Page 17: Protein Sequencing and Identification by Mass Spectrometry.

Mass Spectrometry

Matrix-Assisted Laser Desorption/Ionization (MALDI)

From lectures by Vineet Bafna (UCSD)

Page 18: Protein Sequencing and Identification by Mass Spectrometry.

Tandem Mass Spectrometry

Figure (LC): base-peak chromatogram, relative abundance vs. time (min), RT 0.01 - 80.02, Full ms [300.00 - 2000.00], NL 1.52E8.

Figure (MS, scan 1707): RT 54.44, Full ms [300.00 - 2000.00], NL 2.41E7; relative abundance vs. m/z. Labeled peaks include 638.0, 638.9 and 801.0.

Figure (MS/MS, scan 1708): RT 54.47, Full ms2 638.00 [165.00 - 1925.00], NL 5.27E6; relative abundance vs. m/z. Labeled peaks: 226.9, 326.0, 397.1, 425.0, 489.1, 524.9, 588.1, 589.2, 629.0, 687.3, 850.3, 851.4, 949.4, 1048.6, 1049.6.

Instrument: Ion Source → MS-1 → collision cell → MS-2

Page 19: Protein Sequencing and Identification by Mass Spectrometry.

Protein Identification by Tandem Mass Spectrometry

Figure: Sequence → MS/MS instrument → MS/MS spectrum (scan 1708, relative abundance vs. m/z).

Database search

• Sequest

de Novo interpretation

• Sherenga

Page 20: Protein Sequencing and Identification by Mass Spectrometry.

Tandem Mass Spectrum

• Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal peptides

• Spectrum consists of different ion types because peptides can be broken in several places.

• Chemical noise often complicates the spectrum.

• Represented in 2-D: mass/charge axis vs. intensity axis

Page 21: Protein Sequencing and Identification by Mass Spectrometry.

De Novo vs. Database Search

Figure: MS/MS spectrum (scan 1708, relative abundance vs. m/z) interpreted in two ways.

De Novo: graph with edges labeled W, R, A, C, V, G, E, K, D, W, L, P, T, L, T.

Database Search: database peptides annotated with Mass, Score.

Result shown: AVGELTK

Database of all peptides = 20^n

AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAI,

AVGELTI, AVGELTK, AVGELTL, AVGELTM,

YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY

Database of known peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT,

HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC,

GVFGSVLRA, EKLNKAATYIN..

Page 22: Protein Sequencing and Identification by Mass Spectrometry.

De Novo vs. Database Search: A Paradox

• The database of all peptides is huge ≈ O(20^n).

• The database of all known peptides is much smaller ≈ O(10^8).

• However, de novo algorithms can be much faster, even though their search space is much larger!

• A database search scans every peptide in the database of known peptides to find the best one.

• De novo eliminates the need to scan the database of all peptides by modeling the problem as a graph search.
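A quick calculation shows how fast the space of all peptides outgrows a database of about 10^8 known peptides. This is only an arithmetic sketch; the names `all_peptides` and `KNOWN_PEPTIDES` are hypothetical, and 10^8 is the order of magnitude quoted above.

```python
KNOWN_PEPTIDES = 10 ** 8  # order of magnitude quoted on the slide


def all_peptides(n):
    """Number of distinct peptides of length n over 20 amino acids."""
    return 20 ** n


for n in (5, 6, 7, 10):
    print(n, all_peptides(n), all_peptides(n) > KNOWN_PEPTIDES)

# Smallest length at which the space of all peptides exceeds the known database.
smallest = next(n for n in range(1, 50) if all_peptides(n) > KNOWN_PEPTIDES)
print(smallest)  # 7
```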

Page 23: Protein Sequencing and Identification by Mass Spectrometry.

De novo Peptide Sequencing

Figure: MS/MS spectrum (scan 1708, relative abundance vs. m/z) → Sequence

Page 24: Protein Sequencing and Identification by Mass Spectrometry.

Theoretical Spectrum

Page 25: Protein Sequencing and Identification by Mass Spectrometry.

Theoretical Spectrum (cont’d)

Page 26: Protein Sequencing and Identification by Mass Spectrometry.

Theoretical Spectrum (cont’d)

Page 27: Protein Sequencing and Identification by Mass Spectrometry.

Building Spectrum Graph

• How to create vertices (from masses)

• How to create edges (from mass differences)

• How to score paths

• How to find best path

Page 28: Protein Sequencing and Identification by Mass Spectrometry.

Figure: b-ion peaks of the peptide S E Q U E N C E along the Mass/Charge (M/Z) axis.

Page 29: Protein Sequencing and Identification by Mass Spectrometry.

Figure: a-ion peaks of the peptide S E Q U E N C E along the Mass/Charge (M/Z) axis.

Page 30: Protein Sequencing and Identification by Mass Spectrometry.

Figure: peaks of the peptide S E Q U E N C E along the Mass/Charge (M/Z) axis.

a is an ion type shift in b

Page 31: Protein Sequencing and Identification by Mass Spectrometry.

Figure: y-ion peaks along the Mass/Charge (M/Z) axis, spelling the peptide in reverse: E C N E U Q E S.

Page 32: Protein Sequencing and Identification by Mass Spectrometry.

Figure: Intensity vs. Mass/Charge (M/Z).

Page 33: Protein Sequencing and Identification by Mass Spectrometry.

Figure: Intensity vs. Mass/Charge (M/Z).

Page 34: Protein Sequencing and Identification by Mass Spectrometry.

Figure: noise peaks along the Mass/Charge (M/Z) axis.

Page 35: Protein Sequencing and Identification by Mass Spectrometry.

MS/MS Spectrum

Figure: Intensity vs. Mass/Charge (M/Z).

Page 36: Protein Sequencing and Identification by Mass Spectrometry.

Some Mass Differences between Peaks Correspond to Amino Acids

Figure: MS/MS spectrum in which mass differences between pairs of peaks are labeled with the amino acids S, E, Q, U, E, N, C, E.

Page 37: Protein Sequencing and Identification by Mass Spectrometry.

Ion Types

• Some masses correspond to fragment ions, others are just random noise

• Knowing ion types Δ={δ1, δ2,…, δk} lets us distinguish fragment ions from noise

• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.

Page 38: Protein Sequencing and Identification by Mass Spectrometry.

Example of Ion Type

• Δ={δ1, δ2,…, δk}

• Ion types {b, b-NH3, b-H2O} correspond to Δ={0, 17, 18}

*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity

Page 39: Protein Sequencing and Identification by Mass Spectrometry.

Match between Spectra and the Shared Peak Count

• The match between two spectra is the number of masses (peaks) they share (Shared Peak Count or SPC)

• In practice mass-spectrometrists use the weighted SPC that reflects intensities of the peaks

• Match between experimental and theoretical spectra is defined similarly
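The unweighted shared peak count can be sketched as follows. The mass tolerance of 0.5 Da, the function name, and both example spectra are assumptions for illustration; the weighted variant mentioned above would additionally use peak intensities and is not shown.

```python
def shared_peak_count(theoretical, experimental, tolerance=0.5):
    """Count theoretical masses that have an experimental peak within tolerance."""
    return sum(
        any(abs(t - e) <= tolerance for e in experimental)
        for t in theoretical
    )


# Hypothetical spectra: prefix masses of an assumed peptide vs. noisy measured peaks.
theoretical = [57, 154, 301, 415, 486]
experimental = [57.1, 154.0, 200.3, 415.2]
print(shared_peak_count(theoretical, experimental))  # 3
```

Counting from the theoretical side means a noise peak such as 200.3 neither adds to nor subtracts from the match.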

Page 40: Protein Sequencing and Identification by Mass Spectrometry.

Peptide Sequencing Problem

Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• Δ: set of possible ion types

• m: parent mass

Output:

• P: peptide with mass m, whose theoretical spectrum matches the experimental spectrum S the best

Page 41: Protein Sequencing and Identification by Mass Spectrometry.

Vertices of Spectrum Graph

• Masses of potential N-terminal peptides

• Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk}

• Every N-terminal peptide of mass m can generate up to k ions: m-δ1, m-δ2, …, m-δk

• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides

• Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
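The forward shift (peptide mass to ion masses) and the reverse shift (peak to candidate vertices) can be sketched together. The ion-type offsets Δ = {0, 17, 18} come from the earlier ion-type example; the function names, the two-peak spectrum, and the parent mass 486 are hypothetical.

```python
DELTAS = [0, 17, 18]  # b, b-NH3, b-H2O (simplified offsets from the ion-type example)


def ions_of_prefix(m, deltas=DELTAS):
    """Ion masses an N-terminal peptide of mass m can generate: m - delta."""
    return [m - d for d in deltas]


def vertices_for_peak(s, deltas=DELTAS):
    """Reverse shifts: candidate N-terminal peptide masses for a peak s."""
    return [s + d for d in deltas]


def spectrum_graph_vertices(spectrum, parent_mass, deltas=DELTAS):
    """Initial vertex (0), terminal vertex (parent mass), and V(s) for every peak."""
    vertices = {0, parent_mass}
    for s in spectrum:
        vertices.update(vertices_for_peak(s, deltas))
    return sorted(vertices)


print(ions_of_prefix(154))     # [154, 137, 136]
print(vertices_for_peak(136))  # [136, 153, 154]
print(spectrum_graph_vertices([136, 154], 486))
# [0, 136, 153, 154, 171, 172, 486]
```

Every ion generated by the prefix of mass 154 maps back to a vertex set containing 154, which is why the true prefix masses survive among the extra vertices.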

Page 42: Protein Sequencing and Identification by Mass Spectrometry.

Reverse Shifts

Shift in H2O+NH3

Shift in H2O

Page 43: Protein Sequencing and Identification by Mass Spectrometry.

Edges of Spectrum Graph

• Two vertices with mass difference corresponding to an amino acid A:

• Connect with an edge labeled by A

• Gap edges for di- and tri-peptides

Page 44: Protein Sequencing and Identification by Mass Spectrometry.

Paths

• Paths in the labeled graph spell out amino acid sequences

• There are many paths; how do we find the correct one?

• We need scoring to evaluate paths
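A small sketch makes the "many paths" problem concrete: build edges between vertices whose mass difference equals a residue mass, then enumerate the sequences spelled by paths from the initial to the terminal vertex. The residue table is a deliberately small subset of nominal masses, the vertex list reuses the fragment masses from the N- and C-terminal peptide slides, and the function names are hypothetical; gap edges are not modeled.

```python
# Nominal residue masses in daltons (subset; a full table would add more edges).
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}


def build_edges(vertices):
    """Connect u -> v with label A when v - u equals the mass of amino acid A."""
    by_mass = {mass: aa for aa, mass in RESIDUE_MASS.items()}
    edges = {v: [] for v in vertices}
    for u in vertices:
        for v in vertices:
            if v - u in by_mass:
                edges[u].append((v, by_mass[v - u]))
    return edges


def spell_paths(edges, start, end):
    """All amino acid strings spelled by paths from start to end."""
    if start == end:
        return [""]
    return [
        aa + rest
        for nxt, aa in edges[start]
        for rest in spell_paths(edges, nxt, end)
    ]


vertices = [0, 57, 71, 154, 185, 301, 332, 415, 429, 486]
print(spell_paths(build_edges(vertices), 0, 486))  # ['GPFNA', 'ANFPG']
```

Even this clean example yields two candidates, a peptide and its reverse, because prefix and suffix masses are mixed in one spectrum; this is where path scoring comes in.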

Page 45: Protein Sequencing and Identification by Mass Spectrometry.

Path Score

• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}

• p(P, s) = the probability that peptide P generates a peak s

• Scoring = computing probabilities

• $p(P,S) = \prod_{s \in S} p(P, s)$

Page 46: Protein Sequencing and Identification by Mass Spectrometry.

Peak Score

• For a position t that represents ion type δj:

$p(P, s_t) = \begin{cases} q_j, & \text{if a peak is generated at } t \\ 1 - q_j, & \text{otherwise} \end{cases}$

Page 47: Protein Sequencing and Identification by Mass Spectrometry.

Peak Score (cont’d)

• For a position t that is not associated with an ion type:

$p_R(P, s_t) = \begin{cases} q_R, & \text{if a peak is generated at } t \\ 1 - q_R, & \text{otherwise} \end{cases}$

• qR = the probability of a noisy peak that does not correspond to any ion type

Page 48: Protein Sequencing and Identification by Mass Spectrometry.

Finding Optimal Paths in the Spectrum Graph

• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

$p(P', S) = \max_{P} p(P, S)$

• Peptides = paths in the spectrum graph

• P’ = the optimal path in the spectrum graph

Page 49: Protein Sequencing and Identification by Mass Spectrometry.

Ions and Probabilities

• Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

• δi-ions of a partial peptide are produced independently with probabilities qi

Page 50: Protein Sequencing and Identification by Mass Spectrometry.

Ions and Probabilities

• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$

• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$

• A peptide also produces "random noise" with uniform probability qR in any position.
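Because the ion types are assumed independent, both probabilities are plain products. The sketch below uses made-up values for q1..qk; only the two product formulas come from the slide.

```python
import math


def prob_all_peaks(q):
    """Probability that all k ion types of a partial peptide are observed."""
    return math.prod(q)


def prob_no_peaks(q):
    """Probability that none of the k ion types is observed."""
    return math.prod(1 - q_i for q_i in q)


q = [0.5, 0.2, 0.1]  # hypothetical ion-type probabilities
print(prob_all_peaks(q))  # approximately 0.01
print(prob_no_peaks(q))   # approximately 0.36
```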

Page 51: Protein Sequencing and Identification by Mass Spectrometry.

Ratio Test Scoring for Partial Peptides

• Incorporates premiums for observed ions and penalties for missing ions.

• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1, δ2, δ4.

The score is calculated as:

$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{(1 - q_3)}{(1 - q_R)} \cdot \frac{q_4}{q_R}$
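The k=4 example can be evaluated numerically. The probabilities below are invented for illustration, and `ratio_score` is a hypothetical helper: an observed ion contributes q_j / q_R, a missing one contributes (1 - q_j) / (1 - q_R).

```python
def ratio_score(q, q_r, observed):
    """Ratio-test score of one partial peptide."""
    score = 1.0
    for q_j, seen in zip(q, observed):
        if seen:
            score *= q_j / q_r              # premium for an observed ion
        else:
            score *= (1 - q_j) / (1 - q_r)  # penalty for a missing ion
    return score


q = [0.7, 0.5, 0.4, 0.3]               # hypothetical q1..q4
q_r = 0.1                              # hypothetical noise probability
observed = [True, True, False, True]   # ions δ1, δ2, δ4 seen; δ3 missing
print(ratio_score(q, q_r, observed))   # approximately 70
```

With these numbers the three observed ions contribute factors of 7, 5 and 3, while the missing δ3 scales the score by 0.6 / 0.9.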

Page 52: Protein Sequencing and Identification by Mass Spectrometry.

Scoring Peptides

• T: set of all positions.

• Ti = {tδ1, tδ2, ..., tδk}: set of positions that represent ions of partial peptide Pi.

• A peak at position tδj is generated with probability qj.

• R = T − ∪ Ti: set of positions that are not associated with any partial peptides (noise).

Page 53: Protein Sequencing and Identification by Mass Spectrometry.

Probabilistic Model

• For a position $t_{\delta_j} \in T_i$, the probability p(t, P, S) that peptide P produces a peak at position t is:

$p(t, P, S) = \begin{cases} q_j, & \text{if a peak is generated at position } t \\ 1 - q_j, & \text{otherwise} \end{cases}$

• Similarly, for $t \in R$, the probability that P produces a random noise peak at t is:

$p_R(t) = \begin{cases} q_R, & \text{if a peak is generated at position } t \\ 1 - q_R, & \text{otherwise} \end{cases}$

Page 54: Protein Sequencing and Identification by Mass Spectrometry.

Probabilistic Score

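• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

$\frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$

where $t_{i\delta_j}$ is the position in $T_i$ that corresponds to ion type $\delta_j$.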
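The double product can be evaluated in log space, which avoids underflow when many small ratios are multiplied. This is a sketch under the independence assumptions stated earlier; the function name, the probabilities, and the observation pattern are all hypothetical.

```python
import math


def peptide_log_score(q, q_r, observed_per_prefix):
    """Log of the ratio-test score: sum over partial peptides i and ion types j."""
    total = 0.0
    for observed in observed_per_prefix:       # one row per partial peptide
        for q_j, seen in zip(q, observed):     # one entry per ion type
            ratio = q_j / q_r if seen else (1 - q_j) / (1 - q_r)
            total += math.log(ratio)
    return total


q = [0.8, 0.4]    # hypothetical probabilities for k = 2 ion types
q_r = 0.1         # hypothetical noise probability
observed = [
    [True, True],   # partial peptide 1: both ions observed
    [True, False],  # partial peptide 2: second ion missing
]
log_score = peptide_log_score(q, q_r, observed)
print(math.exp(log_score))  # equals 8 * 4 * 8 * (0.6 / 0.9) up to rounding
```

Since the logarithm is monotonic, the peptide that maximizes the log score also maximizes the ratio itself.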

Page 55: Protein Sequencing and Identification by Mass Spectrometry.

De Novo vs. Database Search

Figure: MS/MS spectrum (scan 1708, relative abundance vs. m/z) interpreted in two ways.

De Novo: graph with edges labeled W, R, A, C, V, G, E, K, D, W, L, P, T, L, T.

Database Search: lookup in the database of known peptides.

Result shown: AVGELTK

Database of known peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT,

HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC,

GVFGSVLRA, EKLNKAATYIN..