Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2...

49
Shotgun Proteomic Analysis Department of Cell Biology The Scripps Research Institute

Transcript of Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2...

Page 1: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Shotgun Proteomic Analysis

Department of Cell BiologyThe Scripps Research Institute

Page 2: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Biological/Functional Resolution of Experiments

Organelle

Cells/Tissues

Multiprotein Complex

Function

InteractionAnalysis

ExpressionAnalysis

InformationContent

Page 3: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Abu

ndan

ce

Time

Abu

ndan

ce

Time

µLC

“Shotgun Proteomics”

…SEQ

UES

T80 node Beowulf Computer

Cluster

Protein Mixture

proteolysis

Peptide Mixture

Output Filtering and Re-Assembly

DTA

Sele

ct

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

MS

3

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

32

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

2

MS/MS

1

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

1

Page 4: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Data Processing Issues with Shotgun Proteomics

1:1 Mixture of Unlabeled/15N-Labeled YeastSoluble Proteins Analyzed Using a Single 12h Analysis

304 (1.9x)157Protein ID’s(*2 peptides confirmed w/ RelEx)

891 (1.6x)559Protein ID’s (*1 peptide confirmed w/ RelEx)

86,950 (4.5 x)18,970MS/MS SpectraLTQLCQ

*RelEx was used to evaluate the presence of labeled isotopomer

Page 5: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Processing Tandem Mass SpectraProcessing Tandem Mass Spectra

Spectral Quality

Database Analysis

De Novo: Unusual,Unanticipated Features

Filter

SEQUEST™PepProb

GutenTag

Quantification RelX

Subtractive Analysis LibQUESTDTASelect/Contrast

Filter and Assembly DTASelectProtProb

Page 6: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Data Issues

• Data quality• How is a match determined?

– Protease issues?– Validation issues?

• Posttranslational Modifications?• Quantification?• Sampling Issues?

Page 7: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Spectral Filtering with Hand Crafted FeaturesSpectral Filtering with Hand Crafted Features

Called Good Called Bad %Correct+1 GOOD 671 75 89.9%+1 BAD 5585 11475 67.3%+2/+3 GOOD 3166 348 90.1%+2/+3 BAD 11611 26684 69.7%All GOOD 3837 423 90.1%All BAD 17196 38159 68.9%

Bern, Goldberg, MacDonald, Yates Bioinformatics (in press)

Page 8: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w
Page 9: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

•• Identification of Identification of PTMsPTMs

•• Sample is split into 3 aliquotsSample is split into 3 aliquots

•• Digest using 3 different proteasesDigest using 3 different proteases

•• Mix and analyze by LC/LCMix and analyze by LC/LC--MS/MSMS/MS

•• Interpret spectra using SEQUESTInterpret spectra using SEQUEST

MudPIT

SEQUEST

trypsin elastase subtilisin

MultiMulti--Enzyme Digestion ProceduresEnzyme Digestion Procedures

Page 10: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

High pH

Proteinase K

High pH/Proteinase K Method(hpPK Method)

Wu et al., Nat. Biotech. 21:532-538 (2003)Howell and Palade, J. Cell Biol. 92:822-832 (1982)

Page 11: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Overlapping Peptide Coverage(TM7) gi|13794265|ref|NP_056312.1|DKFZP564G2022 protein MAAAAWLQVL PVILLLLGAH PSPLSFFSAG PATVAAADRS KWHIPIPSGK NYFSFGKILF RNTTIFLKFD GEPCDLSLNI TWYLKSADCY NEIYNFKAEE VELYLEKLKE KRGLSGNIQT SSKLFQNCSE LFKTQTFSGD FMHRLPLLGE KQEAKENGTN LTFIGDKTAM HEPLQTWQDA PYIFIVHIGI SSSKESSKEN SLSNLFTMTV EVKGPYEYLT LEDYPLMIFF MVMCIVYVLF GVLWLAWSAC YWRDLLRIQF WIGAVIFLGM LEKAVFYAEF QNIRYKGESV QGALILAELL SAVKRSLART LVSIVSLGYG IVKPRLGVTL HKVVVAGALY LLFSGMEGVL RVTGAQTDLA SLAFIPLAFL DTALCWWIFI SLTQTMKLLK LRRNIVKLSL YRHFTNTLIL AVAASIVFII WTTMKFRIVT CQSDWRELWV DDAIWRLLFS MILFVIMVLW RPSANNQRFA FSPLSEEEEEDEQKEPMLKE SFEGMKMRST KQEPNGNSKV NKAQEDDLKW VEENVPSSVT DVALPALLDS DEERMITHFE RSKME

WVEENVPSSVTDVALPALLDS*DEERVEENVPSSVTDVALPALLDS*DEER

EENVPSSVTDVALPALLDS*DEERENVPSSVTDVALPALLDS*DEER

VPSSVTDVALPALLDS*DEERPSSVTDVALPALLDS*DEER

LPALLDS*DEERPALLDS*DEER

(TM6) gi|14249524|ref|NP_116213.1| hypothetical protein FLJ14681MVAACRSVAG LLPRRRRCFP ARAPLLRVAL CLLCWTPAAV RAVPELGLWL ETVNDKSGPL IFRKTMFNST DIKLSVKSFH CSGPVKFTIV WHLKYHTCHN EHSNLEELFQ KHKLSVDEDF CHYLKNDNCW TTKNENLDCN SDSQVFPSLN NKELINIRNV SNQERSMDVV ARTQKDGFHI FIVSIKTENT DASWNLNVSL SMIGPHGYIS ASDWPLMIFY MVMCIVYILY GILWLTWSAC YWKDILRIQF WIAAVIFLGM LEKAVFYSEY QNISNTGLST QGLLIFAELI SAIKRTLARL LVIIVSLGYG IVKPRLGTVM HRVIGLGLLY LIFAAVEGVM RVIGGSNHLA VVLDDIILAV IDSIFVWFIF ISLAQTMKTL RLRKNTVKFS LYRHFKNTLIFAVLASIVFM GWTTKTFRIA KCQSDWMERW VDDAFWSFLF SLILIVIMFL WRPSANNQRY AFMPLIDDSD DEIEEFMVTS ENLTEGIKLR ASKSVSNGTA KPATSENFDE DLKWVEENIP SSFTDVALPV LVDSDEEIMT RSEMAEKMFS SEKIM

WVEENIPSSFTDVALPVLVDS*DEEIMTRIPSSFTDVALPVLVDS*DEEIMTRPSSFTDVALPVLVDS*DEEIMTR

SFTDVALPVLVDS*DEEIMTRTDVALPVLVDS*DEEIMTRTDVALPVLVDS*DEEIMTRS

DVALPVLVDS*DEEIMTRVALPVLVDS*DEEIMTR

ALPVLVDS*DEEIMTRALPVLVDS*DEEIMTRS

LPVLVDS*DEEIMTRPVLVDS*DEEIMTRPVLVDS*DEEIMTRS

Page 12: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

gi|27229118|ref|NP_082129| RIKEN cDNA 0610006F02; S-adenosylmethionine-dependentmethyltransferase activity [Mus musculus]

MDALVLFLQL LVLLLTLPLH LLALLGCWQP ICKTYFPYFM AMLTARSYKK MESKKRELFS QIKDLKGTSG NVALLELGCG

TGANFQFYPQ GCKVTCVDPN PNFEKFLTKS MAENRHLQYE RFIVAYGENM KQLADSSMDV VVCTLVLCSV QSPRKVLQEVCVDPN PNFEKF

VTCVDPN PNFEKVTCVDPN PNFEKFLTK

QRVLRPGGLL FFWEHVAEPQ GSRAFLWQRV LEPTWKHIGD GCHLTRETWK DIERAQFSEV QLEWQPPPFR WLPVGPHIMGQFSEV QLEWQPPPFR WLPVGPHIM

EV QLEWQPPPFR WLPVGPHEV QLEWQPPPFR WLPVGPHEV QLEWQPPPFR WLPVGPHIMEV QLEWQPPPFR WLPVGPHIM**LEWQPPPFR WLPVGPHLEWQPPPFR WLPVGPHLEWQPPPFR WLPVGPHIMLEWQPPPFR WLPVGPHIMLEWQPPPFR WLPVGPHIMGEWQPPPFR WLPVGPHIMWQPPPFR WLPVGPH

KAVK

Page 13: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Golgi Peptide Synthetic Peptide

400 600 800 1000 1200 1400 1600 18000

20

40

60

80

100

y8b4

y14++

b9

b10b14

b13

b12

b11

y12

y11

y10y7y6

y5

y3

y12++

m/zR

elat

ive

Abu

ndan

ce

400 600 800 1000 1200 1400 1600 18000

20

40

60

80

100

y14++

b4b3 b9b10 b14

b13

b11

y12

y11

y10y8

y7y6

y5

y3

y12++

m/z

Rel

ativ

e A

bund

ance

Dimethyl Arginine Containing Peptide

Page 14: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Shotgun Proteomic Experiments Shotgun Proteomic Experiments and Sampling Issuesand Sampling Issues

•• Based on prior studies in yeast, we know Based on prior studies in yeast, we know not every protein present is not every protein present is id’did’d

•• Reproducibility is good for high abundance Reproducibility is good for high abundance proteins 70proteins 70--80%80%

•• Reproducibility is not as good for low Reproducibility is not as good for low abundance proteins. 20abundance proteins. 20--30%30%

•• Is this predictable?Is this predictable?

Page 15: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

2 4 6 8 10 12 14 161000

1500

2000

2500

3000

3500

4000

Num

ber o

f Ide

ntifi

ed P

rote

ins

Number of MudPIT Runs

Experimental Semi-empirical Model Random Model

Random Sampling Model for Data Dependent Acquisition

))/1(1(* SL NLnK −−=

nL= # of protein species at particular levelL = abundance levelN = total number of proteinsS = experiments

Page 16: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

2 4 6 8 10

100

200

300

400

500

600

700

Num

ber o

f Pro

tein

s

Num ber of M udPIT's

Distribution of Protein Identifications AfterRepeating Analysis 9 times

# of proteins identifiedin all 9 experiments

# of proteins identifiedin 1 experiment

Page 17: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

RelX software

DTASelect OutputPeptide SequenceLVNHFIQEFK

1270 1275 1280 1285 1290 12950

20

40

60

80

100

Rel

ativ

e Ab

unda

nce

m/z

1) Predict m/z Range

1270 1275 1280 1285 1290 12950

20

40

60

80

100

Rel

ativ

e Ab

unda

nce

m/z

2) Sum Signal in Range

32 33 34 35 36 370

20

40

60

80

100

Rel

ativ

e Ab

unda

nce

Time (min)

3) Store Mass Chromatograms

32 33 34 35 36 370

20

40

60

80

100

Rel

ativ

e A

bund

ance

Time (min)

4) Peak Detection

Sam

ple

Inte

nsity

Reference Intensity

5) Correlation

Page 18: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

1 2 3 4 5

1

2

3

4

5

6

7

Ave

rage

Pep

tide

Rat

io

Protein Mixture Ratio

TSA1 = 1.335±0.015

SSA1 = 1.004±0.018

ADH1 = 0.661±0.045

Systematic Errors are Present in Samples

Page 19: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Spectral Sampling for Relative QuantificationSpectral Sampling for Relative QuantificationCombined Data for 6 proteins added to Yeast SolubleCombined Data for 6 proteins added to Yeast Soluble

Cell Lysate at 4 different levelsCell Lysate at 4 different levels

y = 1.0613x + 0.2608R2 = 0.9997

0

5

10

15

20

25

30

0 5 10 15 20 25 30

% Protein Markers Added

% S

pect

rum

Obs

erve

d

Linear dynamicrange 2-orders

Measuring small changesis not as reliable

Page 20: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Synthesis of Ribosomal Proteins in Plasmodium falciparum

• Striking trend: almost all ribosomal proteins increase over ring to trophtransition

Ribosomal Protein Abundance

0

500

1000

1500

2000

2500

3000

3500

4000

Ring Troph Schiz Mero

Stage

Sp

ectr

al C

ou

nt

Page 21: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Standards

• 1. Data formats: McDonald et al MS1, MS2, and SQT - three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications, RCM 2004, 18, 2162-8.

• Database search standard based on: Washburn et al, Large Scale analysis of the yeast proteome via

multidimensional protein identification technology Nature Biotechnology 19, 242-247 (2001)

MacCoss et al , Probability Based Validation of Protein Identifications Using a Modified SEQUEST Algorithm, Analytical Chemistry 74, 5593-5599 (2002). Normalized Scores

Page 22: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Standards

• 4. Standards should not prevent innovation– Data formats should be practical e.g. storage space

• 7. Data processing tools should be transparent and validated, e.g. published

• 8. Data for publication: information to support biological conclusions- sequences of peptides id’d.

• 9. Archiving data: biological conclusions should be the most important part of the experiment

Page 23: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w
Page 24: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

0 5 10 15 200

2000

4000

6000

8000

10000

12000

Charge State +1Charge State +2Charge State +3

Num

ber o

f Spe

ctra

Probability Scores

Probability Distributions

Page 25: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Single Spectral Matches are Problematic: How Single Spectral Matches are Problematic: How to tell if they are correctto tell if they are correct

•• Searches determine closeness of fit based Searches determine closeness of fit based on some measure: Compare matches with on some measure: Compare matches with different programsdifferent programs

•• Probability scoringProbability scoring: P = random match : P = random match based on frequency of fragment ions in based on frequency of fragment ions in databasedatabase

•• SEQUESTSEQUEST: XCorr measures how close the : XCorr measures how close the spectrum fits to ideal spectrumspectrum fits to ideal spectrum

•• Manual validation, experimental validationManual validation, experimental validation•• de novode novo interpretationinterpretation

Page 26: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

•• Sample is split into 3 aliquotsSample is split into 3 aliquots

•• Digest using 3 different proteasesDigest using 3 different proteases

trypsin elastase subtilisin

MultiMulti--Enzyme DigestEnzyme Digest

Page 27: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

•• Sample is split into 3 aliquotsSample is split into 3 aliquots

•• Digest using 3 different proteasesDigest using 3 different proteases

•• Mix and analyze by LC/LCMix and analyze by LC/LC--MS/MSMS/MS

•• Interpret spectra using SEQUESTInterpret spectra using SEQUESTMudPIT

SEQUEST

trypsin elastase subtilisin

MultiMulti--Enzyme DigestEnzyme Digest

Page 28: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

0

10

20

30

40

50

60

70

80

90

100

% of Prote insIdentified

% of UniquePeptides

% of SpectrumCopies

P1P2P3P4

Properties of Data Dependent Data Acquisition

Most invariant property is spectral copy number

Page 29: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

GutenTagGutenTag: Partial de novo Sequencing : Partial de novo Sequencing of Tandem Mass Spectraof Tandem Mass Spectra

•• Database searching assumes minimal errors in the Database searching assumes minimal errors in the database and sequence variations between strains, database and sequence variations between strains, individuals and speciesindividuals and species

•• Modifications need to be specified in database Modifications need to be specified in database searches, so unanticipated modifications will be searches, so unanticipated modifications will be missed.missed.

•• Partial Partial De novoDe novo analysis of tandem mass spectra in analysis of tandem mass spectra in largelarge--scalescale can identify peptides containing can identify peptides containing sequence variations and unanticipated sequence variations and unanticipated modifications modifications

Page 30: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w
Page 31: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

GutenTagGutenTag

•• 6170 tandem mass spectra: LC/LC/MS/MS 6170 tandem mass spectra: LC/LC/MS/MS analysis of simple digested protein mixtureanalysis of simple digested protein mixture

•• 1987 spectra matched by SEQUEST1987 spectra matched by SEQUEST•• 1328 spectra matched by GutenTag1328 spectra matched by GutenTag•• 766 partial matches suggesting modifications and 766 partial matches suggesting modifications and

sequence variationssequence variations•• Total matching spectra by Total matching spectra by GutenTagGutenTag: 2,094: 2,094•• Partial de novo will extend identificationsPartial de novo will extend identifications•• Software is Software is AutomatedAutomated and and LargeLarge--ScaleScale

Page 32: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Improved Spectral Quality Effects Peptide IdentificationInfusion of 1 pmol/µl Angiotensin I

LCQ-Classic761 of 1000 MS/MS Spectra Matched the Correct Sequence

LTQ970 of 1000 MS/MS Spectra Matched the Correct Sequence

0.5 1.0 1.5 2.0 2.5 3.0 3.50

25

50

75

100

125

150

175

200

Freq

uenc

y

XCorr

0.5 1.0 1.5 2.0 2.5 3.0 3.50

50

100

150

200

250

300

Freq

uenc

y

XCorr

Page 33: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Database Searching with Tandem Database Searching with Tandem Mass SpectraMass Spectra

•• The goal is to identify peptides using MS/MS The goal is to identify peptides using MS/MS spectra and amino acid sequence databases.spectra and amino acid sequence databases.

•• Develop a probabilistic model that establishes a Develop a probabilistic model that establishes a relationship between the database sequences and relationship between the database sequences and the spectrum to complement quantitative measures the spectrum to complement quantitative measures of closenessof closeness--ofof--fit fit

•• Develop nonDevelop non--empirical probabilistic measures empirical probabilistic measures using crossusing cross--correlation measurementscorrelation measurements

Page 34: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Probability Model

• Null hypothesis: All fragment matches to MS/MS spectrum are by random.

• N, K – number of all fragments and all fragment matches, respectively. N1 is the number of fragments of a particular peptide which has K1 matches.

• We seek an amino acid sequence that has the smallest probability of being a random match.

1

111 *),( 11, NN

KNKN

KK

NK CCCNKP

−−=

Page 35: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Model Hypergeometric Distributions

0 5 10 15 20 25 300.00

0.05

0.10

0.15

0.20

0.25

K=300

K=200

K=100N=1000N1=30

Prob

abili

ty

Number of Successes

N, K – number of all fragments and all fragment matches, respectively. N1 is the number of fragments of a particular peptide which has K1 matches.

Page 36: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Significance of Peptide IdentificationSignificance of Peptide Identification

•• Not all identifications are significant: poor quality Not all identifications are significant: poor quality spectra of peptides , incomplete peptide spectra of peptides , incomplete peptide fragmentation, inaccuracies in database, fragmentation, inaccuracies in database, posttranslational modifications, MS/MS of posttranslational modifications, MS/MS of chemical noise and nonchemical noise and non--peptide molecules.peptide molecules.

•• Significance of a match (P_value) is also obtained Significance of a match (P_value) is also obtained from the hypergeometric distribution.from the hypergeometric distribution.

Page 37: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Significance of a Match

0 5 10 15 20 25 300.00

0.04

0.08

0.12

0.16

Significance of K1=14

H0

N=1000K=300N1=30

Prob

abili

ty

Number of Successes

Page 38: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Yeast Database

0 10 20 30 400.00

0.05

0.10

0.15

0.20

0.25Predicted DistributionObserved Frequency

N=5284150K=569160N1=40

Prob

abili

ty,F

requ

ency

Number of Fragment Matches

Page 39: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Cumulative Distribution

0 10 200.0

0.2

0.4

0.6

0.8

1.0C

umul

ativ

e D

istri

butio

nFu

nctio

n

Number of Fragment Matches

Page 40: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Mass/Charge State Dependence

• Scores that use closeness of fit measures can artificially inflate with weight/mass.

• This complicates use of uniform criteria for identification.

• Probabilities generated by hypergeometric distribution are charge/weight independent.

Page 41: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Cross-Correlation Score Distribution

0 1 2 3 4 5 6 7 80

500

1000

1500

2000

2500

3000

3500

Num

ber o

f Spe

ctra

Cross-Correlation Scores

XCorr+1 XCorr+2 XCorr+3

Page 42: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Database Dependence

• The probability inferred from the hypergeometric distribution is in principle database dependent.

• However, the dependency is very weak.

Page 43: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

NRP and Yeast Databases

0 10 20 300.00

0.05

0.10

0.15

0.20

0.25 NRP Database Yeast Database

Prob

abili

ty

Number of Fragment Ion Matches

Page 44: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Pep_Probe SummaryPep_Probe Summary

•• Implements 4 scoring schemes: hypergeometric, Implements 4 scoring schemes: hypergeometric, poissonpoisson, , maximum likelihood and crossmaximum likelihood and cross--correlation. Sorts results correlation. Sorts results either by hypergeometric or crosseither by hypergeometric or cross--correlation scores.correlation scores.

•• No enzyme specificity is assumed.No enzyme specificity is assumed.•• Reports significance of each match.Reports significance of each match.•• Can search for posttranslational modifications to three Can search for posttranslational modifications to three

different amino acids.different amino acids.•• Has been implemented to run on a standalone or compute Has been implemented to run on a standalone or compute

clusters.clusters.•• Runs on heterogeneous cluster of computers, in WINDOWS Runs on heterogeneous cluster of computers, in WINDOWS

and LINUX platforms.and LINUX platforms.

Page 45: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Processing Tandem Mass SpectraSpectral Quality

Database Analysis

Analyze Unusual orUnanticipated Features

Eliminate Poor Quality Spectra, Score Quality of

Other spectra

Multiple Methods With Different

Selectivity's

De novo or partialDe novo analysis

Assemble and Annotate Data Biological Significance

Page 46: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Increased Data Production Requires Automated Data AnalysisIncreased Data Production Requires Automated Data Analysis

MS/MS MS/MS ComparisonComparison

Library SearchingLibrary SearchingComparative AnalysisComparative AnalysisSubtractive AnalysisSubtractive Analysis

LIBQUEST LIBQUEST

Yates et al. Yates et al. Anal.Chem. Anal.Chem. 70, 3557 (1998)70, 3557 (1998)

Rel

ativ

e A

bund

ance

m/z

De NovoDe NovoSequencingSequencing

Alternate SplicingAlternate SplicingUnanticUnantic. Mod.. Mod.Unknown ORFsUnknown ORFs

GutenTagGutenTag

Tabb et al Anal. Tabb et al Anal. ChemChem

Existing SequenceExisting SequencePTM PTM

Database Search

Database Database SearchSearch

SEQUEST & Pep_ProbSEQUEST & Pep_ProbYates et al.,Yates et al., JASMS JASMS 5, 976 (1994)5, 976 (1994)Yates et al. Yates et al. Anal. Anal. ChemChem 67, 1426 (1995)67, 1426 (1995)Yates et al. Yates et al. Anal. ChemAnal. Chem. 67, 3205 (1995). 67, 3205 (1995)MacCoss et al. MacCoss et al. Anal. Anal. ChemChem (2002)(2002)Sadygov and Yates, Sadygov and Yates, Anal. Anal. ChemChem (in press)(in press)

DTASelectDTASelectPost Analysis Post Analysis

Review SoftwareReview Software

Link et al. Link et al. Nature Biotech. Nature Biotech. 1717, 676, 676--682 (1999)682 (1999)Tabb et. al. Tabb et. al. J. Proteome Res.J. Proteome Res. 1, 26, (2002) 1, 26, (2002)

Probability forProbability forProtein Protein

IdentificationIdentification

Link et al. Link et al. Nature Biotech. Nature Biotech. 1717, 676, 676--682 (1999)682 (1999)Tabb et. al. Tabb et. al. J. Proteome Res.J. Proteome Res. 1, 26, (2002) 1, 26, (2002)

Peptide IdPeptide IdProbabilityProbabilityPP--Val, S.C.Val, S.C.

Related SequenceRelated SequenceSNP AnalysisSNP AnalysisMutation AnalysisMutation Analysis

VariantVariantSearchSearch

SEQUESTSEQUEST--SNPSNP

Gatlin et al.Gatlin et al. Anal.Chem.Anal.Chem. 72, 757 (2000). 72, 757 (2000).

SpectralQuality

Bioinformatics (in press)Bioinformatics (in press)

Page 47: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Data Considerations

• Different types of experiments• Different types of data analysis

Page 48: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Integrated Multi-Dimensional Liquid Chromatography

hV

100 micron FSC

RPSCXRP

5 µM100-300 nL/min

WasteWaste

Page 49: Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2 peptides confirmed w/ RelEx) Protein ID’s 559 891 (1.6x) (*1 peptide confirmed w

Comprehensive Analysis of Complex Protein Mixtures

Total ProteinCharacterization

• Protein Identification: What’s there• Post Translational Modifications: Regulation• Quantification: Dynamics• Proteomic Data to Knowledge: Genetics, RNAi, siRNA

Multiprotein Complex/OrganelleCells/Tissues

Translation of technology development into biological discoveryTranslation of technology development into biological discovery