Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

Page 1: Proteomics Informatics –

Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

Page 2: Proteomics Informatics –

Peptide Mapping - Mass Accuracy

Page 3: Proteomics Informatics –

Peptide Mapping - Database Size

C. elegans

S. cerevisiae

Human

Page 4: Proteomics Informatics –

Peptide Mapping - Cys-Containing Peptides

C. elegans

S. cerevisiae

Human

Page 5: Proteomics Informatics –

Identification – Peptide Mass Fingerprinting

[Workflow diagram. Sample side: Digestion, MS. Database side: Sequence DB, Pick Protein, All Peptide Masses, MS. Then: Compare, Score, Test Significance; Repeat for each protein; Identified Proteins.]
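The loop on this slide (pick a protein, derive its peptide masses, compare with the measured masses, repeat) can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: the score is a plain count of matched masses, the two database sequences are hypothetical, cleavage is assumed to be tryptic (after K or R, not before P) with no missed cleavages, and the 0.5 Da tolerance is an arbitrary choice.

```python
import re

# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.0215, "A": 71.0371, "S": 87.0320, "P": 97.0528, "V": 99.0684,
    "T": 101.048, "C": 103.009, "L": 113.084, "I": 113.084, "N": 114.043,
    "D": 115.027, "Q": 128.059, "K": 128.095, "E": 129.043, "M": 131.040,
    "H": 137.059, "F": 147.068, "R": 156.101, "Y": 163.063, "W": 186.079,
}
WATER = 18.0106


def tryptic_peptides(sequence):
    """Cleave after K or R unless followed by P (needs Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, sequence, tolerance=0.5):
    """Number of measured masses explained by the protein's peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(
        any(abs(m - t) <= tolerance for t in theoretical) for m in measured
    )


def rank_proteins(measured, database, tolerance=0.5):
    """Score every protein in the database and sort best first."""
    scores = {
        name: count_matches(measured, seq, tolerance)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-protein database and a simulated measurement.
database = {"protA": "SGFLEEDELKAAAR", "protB": "MKWVTFISLLR"}
measured = [peptide_mass("SGFLEEDELK") + 0.1, peptide_mass("AAAR") - 0.1]
ranking = rank_proteins(measured, database)
```

A real search would replace the match count with a probabilistic score and test its significance, as the later slides discuss.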

Page 6: Proteomics Informatics –

ProFound – Search Parameters

http://prowl.rockefeller.edu/

Page 7: Proteomics Informatics –

ProFound – Protein Identification by Peptide Mapping

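$$
P(k \mid D I) \propto P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{\sqrt{\frac{2}{\pi}}\,\frac{m_{\max}-m_{\min}}{\sigma_i}\,\sum_{j=1}^{g_i}\exp\left[-\frac{(m_i-m_{ij0})^2}{2\sigma_i^2}\right]\right\}F_{\text{pattern}}
$$

W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000) 2482-2489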

Page 8: Proteomics Informatics –

ProFound Results

Page 9: Proteomics Informatics –

Peptide Mapping – Mass Accuracy

[Figure, two panels, each plotted against Mass Tolerance (Da), 0 to 2. ProFound: -log(e), axis 0 to 7. Mascot: Score, axis 0 to 140.]

Page 10: Proteomics Informatics –

Peptide Mapping - Database Size

Panels: S. cerevisiae, Fungi, All Taxa

Expectation values (peptide mapping example):

S. cerevisiae   4.8e-7
Fungi           8.4e-6
All Taxa        2.9e-4

Page 11: Proteomics Informatics –

Database size

Page 12: Proteomics Informatics –

Missed Cleavage Sites

Panels: u = 1, u = 2, u = 4

Expectation values (peptide mapping example):

u = 1   4.8e-7
u = 2   1.1e-5
u = 4   6.8e-4
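One reason the expectation value degrades as more missed cleavages are allowed is that each protein yields more candidate peptides, so random matches become more likely. The sketch below counts those candidates. It assumes that u is the maximum number of missed cleavage sites per peptide (the slide does not define it) and uses a hypothetical sequence with tryptic cleavage after K or R, not before P.

```python
import re


def tryptic_peptides(sequence, max_missed=0):
    """All tryptic peptides with up to `max_missed` missed cleavage sites."""
    # Fully cleaved fragments (zero-width split needs Python 3.7+).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for start in range(len(fragments)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(fragments):
                break
            # Joining adjacent fragments models skipped cleavage sites.
            peptides.append("".join(fragments[start:end]))
    return peptides


sequence = "AAKGGRLLKSSRTTK"  # hypothetical protein with five tryptic fragments
candidate_counts = {
    u: len(tryptic_peptides(sequence, u)) for u in (0, 1, 2, 4)
}
```

For this toy sequence the candidate list grows from 5 peptides with no missed cleavages to 15 with four.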

Page 13: Proteomics Informatics –

Peptide Mapping - Partial Modifications

Panels: No Modifications; Phosphorylation (S, T, or Y)

Expectation values:

Protein    Searched without possible modifications   Searched with phosphorylation of S/T/Y
DARPP-32   0.00006                                   0.01
CFTR       0.00002                                   0.005

Even if the protein is modified it is usually better to search a protein sequence database without specifying possible modifications using peptide mapping data.

Page 14: Proteomics Informatics –

Peptide Mapping - Ranking by Direct Calculation of the Significance

Page 15: Proteomics Informatics –

General Criteria for a Good Protein Identification Algorithm

The response to random input data should be random.

Maximum number of correct identifications and minimum number of incorrect identifications for any data set.

Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.

The statistical significance of the results should be calculated.

The searches should be fast.

Page 16: Proteomics Informatics –

Response to Random Data

[Figure: y-axis Normalized Frequency.]

Page 17: Proteomics Informatics –

Peptide Fragmentation

Ion Source -> Mass Analyzer 1 -> Fragmentation -> Mass Analyzer 2 -> Detector

Fragment ion types: b, y

Page 18: Proteomics Informatics –

Identification – Tandem MS

Page 19: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000).]

Page 20: Proteomics Informatics –

Tandem MS – Sequence Confirmation

Residue (as drawn): K L E D E E L F G S
b ions: 1166 1020 907 778 663 534 405 292 145 88

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000).]

Page 21: Proteomics Informatics –

Tandem MS – Sequence Confirmation

Residue (as drawn): K L E D E E L F G S
y ions: 147 260 389 504 633 762 875 1022 1080 1166
b ions: 1166 1020 907 778 663 534 405 292 145 88

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000).]
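The b and y ion values on this slide can be checked with a short calculation. The sketch assumes singly charged, monoisotopic fragments and reads the peptide from N- to C-terminus as SGFLEEDELK, that is, the slide's KLEDEELFGS drawn in reverse so that the first b ion (88) falls on S. Function and constant names are illustrative.

```python
# Monoisotopic residue masses (Da) for the residues in this peptide.
RESIDUE_MASS = {
    "S": 87.032, "G": 57.0215, "F": 147.068, "L": 113.084,
    "E": 129.043, "D": 115.027, "K": 128.095,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Singly charged b ions (b1..bn-1) and y ions (y1..yn-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:            # N-terminal prefixes
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):   # C-terminal suffixes
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def precursor_mh(peptide):
    """Singly protonated precursor mass, M+H."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


b_ions, y_ions = fragment_ions("SGFLEEDELK")
mh = precursor_mh("SGFLEEDELK")
```

Truncated to whole numbers, the results follow the slide's ladders. The y9 ion comes out near 1079.5, which would explain why it appears as 1080 in the spectrum and as 1079 in the de novo tables.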

Page 22: Proteomics Informatics –

Tandem MS – Sequence Confirmation

Residue (as drawn): K L E D E E L F G S
y ions: 147 260 389 504 633 762 875 1022 1080 1166
b ions: 1166 1020 907 778 663 534 405 292 145 88

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]

Page 24: Proteomics Informatics –

Tandem MS – Sequence Confirmation

Residue (as drawn): K L E D E E L F G S
y ions: 147 260 389 504 633 762 875 1022 1080 1166
b ions: 1166 1020 907 778 663 534 405 292 145 88

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080. A mass difference of 113 is marked twice.]

Page 25: Proteomics Informatics –

Tandem MS – Sequence Confirmation

Residue (as drawn): K L E D E E L F G S
y ions: 147 260 389 504 633 762 875 1022 1080 1166
b ions: 1166 1020 907 778 663 534 405 292 145 88

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080. A mass difference of 129 is marked twice.]

Page 26: Proteomics Informatics –

Tandem MS – Sequence Confirmation

Residue (as drawn): K L E D E E L F G S
y ions: 147 260 389 504 633 762 875 1022 1080 1166
b ions: 1166 1020 907 778 663 534 405 292 145 88

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]

Page 29: Proteomics Informatics –

Tandem MS – de novo Sequencing

[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs. m/z (250 to 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]

Mass Differences

Amino acid masses

1-letter  3-letter  Chemical formula  Monoisotopic  Average
A         Ala       C3H5ON            71.0371       71.0788
R         Arg       C6H12ON4          156.101       156.188
N         Asn       C4H6O2N2          114.043       114.104
D         Asp       C4H5O3N           115.027       115.089
C         Cys       C3H5ONS           103.009       103.139
E         Glu       C5H7O3N           129.043       129.116
Q         Gln       C5H8O2N2          128.059       128.131
G         Gly       C2H3ON            57.0215       57.0519
H         His       C6H7ON3           137.059       137.141
I         Ile       C6H11ON           113.084       113.159
L         Leu       C6H11ON           113.084       113.159
K         Lys       C6H12ON2          128.095       128.174
M         Met       C5H9ONS           131.04        131.193
F         Phe       C9H9ON            147.068       147.177
P         Pro       C5H7ON            97.0528       97.1167
S         Ser       C3H5O2N           87.032        87.0782
T         Thr       C4H7O2N           101.048       101.105
W         Trp       C11H10ON2         186.079       186.213
Y         Tyr       C9H9O2N           163.063       163.176
V         Val       C5H9ON            99.0684       99.1326

Sequences consistent with spectrum

Page 30: Proteomics Informatics –

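Tandem MS – de novo Sequencing

Mass differences between peaks (each row starts with a peak m/z, followed by its differences to every higher peak):

260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079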

260 32 129 145 244 274 373 403 502 518 615 647 760 762 819

292 97 113 212 242 341 371 470 486 583 615 728 730 787

389 16 115 145 244 274 373 389 486 518 631 633 690

405 99 129 228 258 357 373 470 502 615 617 674

504 30 129 159 258 274 371 403 516 518 575

534 99 129 228 244 341 373 486 488 545

633 30 129 145 242 274 387 389 446

663 99 115 212 244 357 359 416

762 16 113 145 258 260 317

778 97 129 242 244 301

875 32 145 147 204

907 113 115 172

1020 2 59

1022 57

Page 32: Proteomics Informatics –

Tandem MS – de novo Sequencing

Mass differences that equal a residue mass, replaced by the residue letter:

260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079

260 32 E 145 244 274 373 403 502 518 615 647 760 762 819

292 P I/L 212 242 341 371 470 486 583 615 728 730 787

389 16 D 145 244 274 373 389 486 518 631 633 690

405 V E 228 258 357 373 470 502 615 617 674

504 30 E 159 258 274 371 403 516 518 575

534 V E 228 244 341 373 486 488 545

633 30 E 145 242 274 387 389 446

663 V D 212 244 357 359 416

762 16 I/L 145 258 260 317

778 P E 242 244 301

875 32 145 F 204

907 I/L D 172

1020 2 59

1022 G

Partial sequences read from the two ion ladders:

…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…

Peptide M+H = 1166

1166 - 1079 = 87 => S

SGF(I/L)EEDE(I/L)…

1166 - 1020 - 18 = 128 => K or Q

SGF(I/L)EEDE(I/L)(K/Q)
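The mass-difference reasoning on these slides can be sketched in a few lines. The example below works with nominal (integer) masses, as the slides do, labels every peak pair whose difference equals a residue mass, and then reads the residues along a chosen ladder of peaks. Function names are illustrative, and the two ladders passed in at the end are taken from the slides rather than discovered automatically.

```python
# Peak list from the slides (nominal m/z values).
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
         875, 907, 1020, 1022, 1079]

# Nominal residue masses; I/L and K/Q cannot be told apart at this precision.
NOMINAL_RESIDUES = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def residue_differences(peaks):
    """Map each (low, high) peak pair to a residue when the gap matches one."""
    peaks = sorted(peaks)
    labels = {}
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            residue = NOMINAL_RESIDUES.get(high - low)
            if residue:
                labels[(low, high)] = residue
    return labels


def read_ladder(ladder, labels):
    """Residues spelled out by consecutive peaks of one ion series."""
    return [labels[(a, b)] for a, b in zip(ladder, ladder[1:])]


labels = residue_differences(PEAKS)
y_ladder = read_ladder([260, 389, 504, 633, 762, 875, 1022, 1079], labels)
b_ladder = read_ladder([292, 405, 534, 663, 778, 907, 1020], labels)

# Terminal residues from the precursor mass, as on the slide.
n_terminal = NOMINAL_RESIDUES[1166 - 1079]        # 87
c_terminal = NOMINAL_RESIDUES[1166 - 1020 - 18]   # 128
```

A full de novo algorithm would also have to choose between competing paths (for example the P and V differences in the table) and cope with the challenges listed on the next slide.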

Page 33: Proteomics Informatics –

Tandem MS – de novo Sequencing

Challenges in de novo sequencing

Neutral loss (-H2O, -NH3)

Modifications

Background peaks

Incomplete information

Page 34: Proteomics Informatics –

Tandem MS – Database Search

[Workflow diagram. Sample side: Lysis, Fractionation, Digestion, LC-MS, MS/MS. Database side: Sequence DB, Pick Protein, Pick Peptide, All Fragment Masses, MS/MS. Then: Compare, Score, Test Significance; Repeat for all peptides; Repeat for all proteins.]

Page 35: Proteomics Informatics –

Algorithms

Page 36: Proteomics Informatics –

Comparing and Optimizing Algorithms

[Figure, for Algorithm 1 and Algorithm 2: distributions of Score for True and False identifications, and the corresponding curves of Sensitivity vs. 1-Specificity.]
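The sensitivity vs. 1-specificity curves in this figure can be computed from two lists of scores, one for known true identifications and one for known false ones. The sketch below does so and summarizes each curve by its area. The score lists are invented purely for illustration, and using the area under the curve to compare algorithms is an assumption rather than something stated on the slide.

```python
def roc_points(true_scores, false_scores):
    """(1 - specificity, sensitivity) pairs for every score threshold."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        false_rate = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((false_rate, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented scores: algorithm 1 separates true from false perfectly,
# algorithm 2 interleaves them.
auc_1 = area_under_curve(roc_points([8, 9, 10], [1, 2, 3]))
auc_2 = area_under_curve(roc_points([4, 6, 8], [3, 5, 7]))
```

An algorithm whose true and false score distributions overlap less gives a curve closer to the top-left corner and therefore a larger area.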

Page 37: Proteomics Informatics –

MS/MS - Parent Mass Error and Enzyme Specificity

$$\Big(\sum_i I_i\Big)\, n_b!\; n_y!$$

Expectation values (MS/MS example):

Δm = 2, trypsin          2.5e-5
Δm = 100, trypsin        2.5e-5
Δm = 2, non-specific     7.9e-5
Δm = 100, non-specific   1.6e-4

Page 38: Proteomics Informatics –

Sequest

Cross-correlation

Page 39: Proteomics Informatics –

X! Tandem - Search Parameters

http://www.thegpm.org/

Page 40: Proteomics Informatics –

X! Tandem - Search Parameters

Page 41: Proteomics Informatics –

X! Tandem - Search Parameters

Page 42: Proteomics Informatics –

Conventional, single stage searching

[Diagram labels: sequences, spectra, generic search engine, sequences]

Test all cleavages, modifications, & mutations for all sequences

Page 43: Proteomics Informatics –

Some hard problems in MS/MS analysis in proteomics

Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete

Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient

Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
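The exponential cost of potential modifications is easy to see by enumeration: with n modifiable sites, each site is either modified or not, so there are 2^n variants of one peptide to test. The sketch below lists them for phosphorylation of S, T, or Y. The choice of modification and the example peptides are illustrative assumptions.

```python
from itertools import product


def modification_states(peptide, modifiable="STY"):
    """Yield every combination of modified site positions in the peptide."""
    sites = [i for i, aa in enumerate(peptide) if aa in modifiable]
    for pattern in product((False, True), repeat=len(sites)):
        yield tuple(site for site, on in zip(sites, pattern) if on)


# One modifiable residue gives 2 variants, three give 8, ten give 1024.
variants_one_site = list(modification_states("SGFLEEDELK"))
variants_three_sites = list(modification_states("ASTYK"))
variants_ten_sites = sum(1 for _ in modification_states("S" * 10))
```

Each variant needs its own set of fragment masses, so every potential modification added to a search multiplies the work for all peptides that contain a candidate site.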

Page 44: Proteomics Informatics –

Multi-stage searching

[Diagram labels: sequences, spectra, X! Tandem, sequences]

Stages: Tryptic cleavage; Modifications #1; Modifications #2; Point mutation

Page 45: Proteomics Informatics –

Search Results

Page 46: Proteomics Informatics –

Search Results

Page 47: Proteomics Informatics –

Sequence Annotations

Page 48: Proteomics Informatics –

Search Results

Page 49: Proteomics Informatics –

Search Results

Page 50: Proteomics Informatics –

Identification – Spectrum Library Search

[Workflow diagram. Sample side: Lysis, Fractionation, Digestion, LC-MS/MS, MS/MS. Library side: Spectrum Library, Pick Spectrum. Then: Compare, Score, Test Significance; Repeat for all spectra; Identified Proteins.]

Page 51: Proteomics Informatics –

Steps in making an Annotated Spectrum Library (ASL):

1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.

2. Add the spectra together and normalize the intensity values.

3. Assign a “quality” value: the median expectation value of the 10 spectra used.

4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
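The four steps for building a library entry can be sketched as a single function. Several details are assumptions made for the sketch, not taken from the slides: "best" is read as lowest expectation value, peaks are summed in 1 Da bins, intensities are normalized to the strongest peak, and the input layout (a list of dicts with "expect" and "peaks" keys) is hypothetical.

```python
from collections import defaultdict
from statistics import median


def build_library_entry(spectra, sequence, charge, parent_mz, accessions,
                        n_spectra=10, n_peaks=20, bin_width=1.0):
    """Build one annotated library entry from spectra of the same peptide."""
    # Step 1: keep the best spectra (assumed: lowest expectation value).
    best = sorted(spectra, key=lambda s: s["expect"])[:n_spectra]

    # Step 2: add the spectra together in m/z bins and normalize.
    summed = defaultdict(float)
    for spectrum in best:
        for mz, intensity in spectrum["peaks"]:
            summed[round(mz / bin_width) * bin_width] += intensity
    strongest_peak = max(summed.values())
    normalized = {mz: i / strongest_peak for mz, i in summed.items()}

    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(s["expect"] for s in best)

    # Step 4: record the most intense peaks plus the annotations.
    top = sorted(normalized.items(), key=lambda p: p[1], reverse=True)
    return {
        "sequence": sequence, "charge": charge, "parent_mz": parent_mz,
        "accessions": accessions, "quality": quality,
        "peaks": sorted(top[:n_peaks]),
    }


# Two invented spectra of the same peptide.
spectra = [
    {"expect": 1e-3, "peaks": [(100.1, 10.0), (200.2, 5.0)]},
    {"expect": 1e-5, "peaks": [(100.0, 30.0), (300.4, 20.0)]},
]
entry = build_library_entry(spectra, "SGFLEEDELK", 2, 583.8, ["hypothetical_acc"])
```

Keeping only the strongest peaks makes the entry compact and fixes the number of library peaks, which the hypergeometric model on the later slides relies on.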

Page 52: Proteomics Informatics –

Spectrum Library Characteristics – Peptide Length

[Figure: fraction of library (%), 0 to 10, vs. peptide length, 0 to 50.]

Page 53: Proteomics Informatics –

Spectrum Library Characteristics – Protein Coverage

[Figure: % coverage, 0 to 50, vs. protein Mr (kDa), 10 to 190; series: residues, peptides.]

Page 54: Proteomics Informatics –

Identification – Spectrum Library Search

Library spectrum (5:25)

Test spectrum (5:25)

Results: 4 peaks selected, 1 peak missed

Page 55: Proteomics Informatics –

Identification – Spectrum Library Search

How likely is this?

Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.

Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
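The hypergeometric model can be evaluated directly with binomial coefficients. The sketch reads the slide's parameters as a draw of 4 test peaks from 25 possible m/z values, 5 of which are library peaks; that reading is an assumption, chosen because it reproduces the listed probabilities for one to four matches. The sketch covers only those four rows, since a draw of four peaks cannot give five matches.

```python
from math import comb


def match_probability(matches, possible=25, library_peaks=5, test_peaks=4):
    """Hypergeometric probability of exactly `matches` shared peaks."""
    hits = comb(library_peaks, matches)
    misses = comb(possible - library_peaks, test_peaks - matches)
    return hits * misses / comb(possible, test_peaks)


probabilities = {k: match_probability(k) for k in range(1, 5)}
```

Under the same model, enlarging the number of possible m/z values makes a given number of random matches far less likely.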

Page 56: Proteomics Informatics –

Identification – Spectrum Library Search

If you have 1000 possible m/z values and 20 peaks in test and library spectrum?

[Figure: p (log scale, 1.0E-14 to 1.0E+00) vs. matches (1 to 10).]

1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001

Page 57: Proteomics Informatics –

Identification – Spectrum Library Search

[Diagram: Experimental Mass Spectrum (M/Z) compared against a Library of Assigned Mass Spectra, giving the Best search result.]

Page 58: Proteomics Informatics –

X! Hunter

Page 59: Proteomics Informatics –

X! Hunter algorithm:

1. Use dot product to find a library spectrum that best matches a test spectrum.

2. Calculate p-value with hypergeometric distribution.

3. Use p-value to calculate expectation value, given the identification parameters.

4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
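The four steps can be illustrated with a small sketch. It is not X! Hunter's actual code. The assumptions are: spectra are dicts mapping an m/z bin to an intensity, the dot product is normalized, the p-value is the hypergeometric probability of at least the observed number of shared peaks, and the expectation value is the p-value multiplied by the number of library spectra compared. The library entries are invented.

```python
from math import comb, sqrt


def dot_product(a, b):
    """Normalized dot product of two spectra given as {mz_bin: intensity}."""
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return sum(a[m] * b[m] for m in set(a) & set(b)) / norm if norm else 0.0


def p_value(matches, possible, library_peaks, test_peaks):
    """Hypergeometric probability of at least `matches` shared peaks."""
    upper = min(library_peaks, test_peaks)
    ways = sum(
        comb(library_peaks, k) * comb(possible - library_peaks, test_peaks - k)
        for k in range(matches, upper + 1)
    )
    return ways / comb(possible, test_peaks)


def search(test, library, possible=1000):
    """Return (sequence, reported expectation value) for the best match."""
    best = max(library, key=lambda e: dot_product(test, e["peaks"]))   # step 1
    matches = len(set(test) & set(best["peaks"]))
    p = p_value(matches, possible, len(best["peaks"]), len(test))      # step 2
    expect = p * len(library)                                          # step 3 (assumed form)
    return best["sequence"], max(expect, best["quality"])              # step 4


library = [
    {"sequence": "PEPTIDEA", "quality": 1e-6,
     "peaks": {100: 1.0, 200: 0.5, 300: 0.2}},
    {"sequence": "PEPTIDEB", "quality": 1e-6,
     "peaks": {150: 1.0, 250: 0.5, 350: 0.2}},
]
full_match = search({100: 0.9, 200: 0.6, 300: 0.1}, library)
partial_match = search({150: 1.0, 250: 0.4, 999: 0.3}, library)
```

In the first query all three peaks match, so the computed expectation value falls below the library entry's quality and the quality value is reported instead. In the second, only two peaks match and the computed value is reported.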

Page 60: Proteomics Informatics –

X! Hunter Result

Query Spectrum

Library Spectrum

Page 61: Proteomics Informatics –

Significance Testing

False protein identification is caused by random matching.

An objective criterion for testing the significance of protein identification results is necessary.

The significance of protein identifications can be tested once the distribution of scores for false results is known.

Page 62: Proteomics Informatics –

Significance Testing - Expectation Values

The majority of sequences in a collection will give a score due to random matching.

Page 63: Proteomics Informatics –

Significance Testing - Expectation Values

[Workflow diagram: Database Search (M/Z); List of Candidates; Distribution of Scores for Random and False Identifications; Extrapolate and Calculate Expectation Values; List of Candidates With Expectation Values.]
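One way to carry out the "extrapolate and calculate" step is survival-function extrapolation: count how many candidates score at least s, fit a straight line to the logarithm of that count over the upper part of the distribution, and extend the line to the score of the top candidate. The sketch below does this under the assumption that the tail is log-linear. The fitting range (scores at or above the median) and the toy scores are illustrative choices, not taken from the slides.

```python
from math import log10
from statistics import median


def survival_points(scores):
    """(score, number of scores >= score) for each distinct score."""
    ordered = sorted(scores)
    n = len(ordered)
    return [
        (s, n - i) for i, s in enumerate(ordered)
        if i == 0 or s != ordered[i - 1]
    ]


def linear_fit(points):
    """Least-squares slope and intercept for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expectation_value(best_score, other_scores):
    """Expected number of random candidates scoring at least best_score."""
    cutoff = median(other_scores)
    tail = [
        (s, log10(count))
        for s, count in survival_points(other_scores) if s >= cutoff
    ]
    slope, intercept = linear_fit(tail)
    return 10 ** (intercept + slope * best_score)


# Toy distribution in which the count halves with every score unit.
random_scores = [0] * 8 + [1] * 4 + [2] * 2 + [3, 4]
e_value = expectation_value(10, random_scores)
```

For this toy distribution the fitted line predicts 16 / 2^10, or about 0.016, random candidates with a score of 10 or more.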
