Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2...
Shotgun Proteomic Analysis
Department of Cell Biology, The Scripps Research Institute
Biological/Functional Resolution of Experiments
Organelle
Cells/Tissues
Multiprotein Complex
Function
Interaction Analysis
Expression Analysis
Information Content
[Figure: two chromatograms, Abundance vs. Time]
µLC
“Shotgun Proteomics”
…SEQUEST
80 node Beowulf Computer Cluster
Protein Mixture
proteolysis
Peptide Mixture
Output Filtering and Re-Assembly
DTASelect
[Figure: workflow diagram with MS and MS/MS spectra (Abundance vs. m/z); precursor ions numbered 1, 2, 3 in the MS scan, each followed by its MS/MS spectrum]
Data Processing Issues with Shotgun Proteomics
1:1 Mixture of Unlabeled/15N-Labeled Yeast Soluble Proteins Analyzed Using a Single 12 h Analysis

                                              LCQ       LTQ
MS/MS Spectra                                 18,970    86,950 (4.5x)
Protein IDs (*1 peptide confirmed w/ RelEx)   559       891 (1.6x)
Protein IDs (*2 peptides confirmed w/ RelEx)  157       304 (1.9x)

*RelEx was used to evaluate the presence of the labeled isotopomer
Processing Tandem Mass Spectra
Spectral Quality: Filter
Database Analysis: SEQUEST™, PepProb
De Novo (Unusual, Unanticipated Features): GutenTag
Quantification: RelEx
Subtractive Analysis: LibQUEST, DTASelect/Contrast
Filter and Assembly: DTASelect, ProtProb
Data Issues
• Data quality
• How is a match determined?
  – Protease issues?
  – Validation issues?
• Posttranslational Modifications?
• Quantification?
• Sampling Issues?
Spectral Filtering with Hand Crafted Features

Charge  True class  Called Good  Called Bad  % Correct
+1      GOOD        671          75          89.9%
+1      BAD         5585         11475       67.3%
+2/+3   GOOD        3166         348         90.1%
+2/+3   BAD         11611        26684       69.7%
All     GOOD        3837         423         90.1%
All     BAD         17196        38159       68.9%
Bern, Goldberg, MacDonald, Yates Bioinformatics (in press)
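The % Correct column can be recomputed from the counts in the table. The sketch below assumes that a GOOD spectrum is counted as correct when it is called good and a BAD spectrum when it is called bad; the function and variable names are illustrative and not part of the original slides.

```python
# Counts copied from the table: (called_good, called_bad) for each true class.
counts = {
    ("+1", "GOOD"): (671, 75),
    ("+1", "BAD"): (5585, 11475),
    ("+2/+3", "GOOD"): (3166, 348),
    ("+2/+3", "BAD"): (11611, 26684),
    ("All", "GOOD"): (3837, 423),
    ("All", "BAD"): (17196, 38159),
}

def percent_correct(true_class, called_good, called_bad):
    """Percentage of spectra of one true class that the filter classified correctly."""
    correct = called_good if true_class == "GOOD" else called_bad
    return 100.0 * correct / (called_good + called_bad)

# One rounded percentage per table row.
results = {key: round(percent_correct(key[1], *pair), 1) for key, pair in counts.items()}

for (charge, true_class), pct in results.items():
    print(f"{charge:6s} {true_class:5s} {pct:5.1f}%")
```

Read this way, the filter keeps about 90% of the good spectra while discarding roughly two thirds of the bad ones.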
Multi-Enzyme Digestion Procedures
• Identification of PTMs
• Sample is split into 3 aliquots
• Digest using 3 different proteases (trypsin, elastase, subtilisin)
• Mix and analyze by LC/LC-MS/MS (MudPIT)
• Interpret spectra using SEQUEST
High pH
Proteinase K
High pH/Proteinase K Method (hpPK Method)
Wu et al., Nat. Biotech. 21:532-538 (2003)
Howell and Palade, J. Cell Biol. 92:822-832 (1982)
Overlapping Peptide Coverage
(TM7) gi|13794265|ref|NP_056312.1| DKFZP564G2022 protein
MAAAAWLQVL PVILLLLGAH PSPLSFFSAG PATVAAADRS KWHIPIPSGK NYFSFGKILF RNTTIFLKFD GEPCDLSLNI TWYLKSADCY NEIYNFKAEE VELYLEKLKE KRGLSGNIQT SSKLFQNCSE LFKTQTFSGD FMHRLPLLGE KQEAKENGTN LTFIGDKTAM HEPLQTWQDA PYIFIVHIGI SSSKESSKEN SLSNLFTMTV EVKGPYEYLT LEDYPLMIFF MVMCIVYVLF GVLWLAWSAC YWRDLLRIQF WIGAVIFLGM LEKAVFYAEF QNIRYKGESV QGALILAELL SAVKRSLART LVSIVSLGYG IVKPRLGVTL HKVVVAGALY LLFSGMEGVL RVTGAQTDLA SLAFIPLAFL DTALCWWIFI SLTQTMKLLK LRRNIVKLSL YRHFTNTLIL AVAASIVFII WTTMKFRIVT CQSDWRELWV DDAIWRLLFS MILFVIMVLW RPSANNQRFA FSPLSEEEEE DEQKEPMLKE SFEGMKMRST KQEPNGNSKV NKAQEDDLKW VEENVPSSVT DVALPALLDS DEERMITHFE RSKME
Overlapping peptides:
WVEENVPSSVTDVALPALLDS*DEER
VEENVPSSVTDVALPALLDS*DEER
EENVPSSVTDVALPALLDS*DEER
ENVPSSVTDVALPALLDS*DEER
VPSSVTDVALPALLDS*DEER
PSSVTDVALPALLDS*DEER
LPALLDS*DEER
PALLDS*DEER
(TM6) gi|14249524|ref|NP_116213.1| hypothetical protein FLJ14681
MVAACRSVAG LLPRRRRCFP ARAPLLRVAL CLLCWTPAAV RAVPELGLWL ETVNDKSGPL IFRKTMFNST DIKLSVKSFH CSGPVKFTIV WHLKYHTCHN EHSNLEELFQ KHKLSVDEDF CHYLKNDNCW TTKNENLDCN SDSQVFPSLN NKELINIRNV SNQERSMDVV ARTQKDGFHI FIVSIKTENT DASWNLNVSL SMIGPHGYIS ASDWPLMIFY MVMCIVYILY GILWLTWSAC YWKDILRIQF WIAAVIFLGM LEKAVFYSEY QNISNTGLST QGLLIFAELI SAIKRTLARL LVIIVSLGYG IVKPRLGTVM HRVIGLGLLY LIFAAVEGVM RVIGGSNHLA VVLDDIILAV IDSIFVWFIF ISLAQTMKTL RLRKNTVKFS LYRHFKNTLI FAVLASIVFM GWTTKTFRIA KCQSDWMERW VDDAFWSFLF SLILIVIMFL WRPSANNQRY AFMPLIDDSD DEIEEFMVTS ENLTEGIKLR ASKSVSNGTA KPATSENFDE DLKWVEENIP SSFTDVALPV LVDSDEEIMT RSEMAEKMFS SEKIM
Overlapping peptides:
WVEENIPSSFTDVALPVLVDS*DEEIMTR
IPSSFTDVALPVLVDS*DEEIMTR
PSSFTDVALPVLVDS*DEEIMTR
SFTDVALPVLVDS*DEEIMTR
TDVALPVLVDS*DEEIMTR
TDVALPVLVDS*DEEIMTRS
DVALPVLVDS*DEEIMTR
VALPVLVDS*DEEIMTR
ALPVLVDS*DEEIMTR
ALPVLVDS*DEEIMTRS
LPVLVDS*DEEIMTR
PVLVDS*DEEIMTR
PVLVDS*DEEIMTRS
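To see how overlapping peptides from a non-specific digest converge on one site, each peptide can be located in the parent sequence. This is a minimal sketch: it assumes that the asterisk marks the residue immediately before it as modified, it uses only the C-terminal stretch of the TM7 sequence shown on the slide, and `locate_peptides` is a hypothetical helper, not part of SEQUEST or DTASelect. Residue numbers are therefore relative to the fragment, not to the full protein.

```python
def locate_peptides(protein, peptides):
    """Map each peptide to (start, end, marked_site), 1-based within `protein`.

    An asterisk in a peptide is assumed to follow the modified residue.
    Peptides that do not occur in the protein are skipped.
    """
    protein = protein.replace(" ", "")
    hits = {}
    for pep in peptides:
        bare = pep.replace("*", "")
        idx = protein.find(bare)  # 0-based offset of the first occurrence
        if idx < 0:
            continue
        # pep.index("*") counts the residues up to and including the marked one.
        site = idx + pep.index("*") if "*" in pep else None
        hits[pep] = (idx + 1, idx + len(bare), site)
    return hits

# C-terminal stretch of the TM7 protein, copied from the slide.
c_terminus = "NKAQEDDLKW VEENVPSSVT DVALPALLDS DEERMITHFE RSKME"
peptides = ["WVEENVPSSVTDVALPALLDS*DEER", "LPALLDS*DEER", "PALLDS*DEER"]

hits = locate_peptides(c_terminus, peptides)
for pep, (start, end, site) in hits.items():
    print(f"{pep:28s} residues {start}-{end}, marked residue {site}")
```

All three peptides end at the same residue and point to the same marked serine, which is the kind of redundancy that overlapping coverage provides.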
gi|27229118|ref|NP_082129| RIKEN cDNA 0610006F02; S-adenosylmethionine-dependent methyltransferase activity [Mus musculus]
MDALVLFLQL LVLLLTLPLH LLALLGCWQP ICKTYFPYFM AMLTARSYKK MESKKRELFS QIKDLKGTSG NVALLELGCG
TGANFQFYPQ GCKVTCVDPN PNFEKFLTKS MAENRHLQYE RFIVAYGENM KQLADSSMDV VVCTLVLCSV QSPRKVLQEV
QRVLRPGGLL FFWEHVAEPQ GSRAFLWQRV LEPTWKHIGD GCHLTRETWK DIERAQFSEV QLEWQPPPFR WLPVGPHIMG
KAVK
Overlapping peptides (second sequence line):
CVDPN PNFEKF
VTCVDPN PNFEK
VTCVDPN PNFEKFLTK
Overlapping peptides (third sequence line):
QFSEV QLEWQPPPFR WLPVGPHIM
EV QLEWQPPPFR WLPVGPH
EV QLEWQPPPFR WLPVGPH
EV QLEWQPPPFR WLPVGPHIM
EV QLEWQPPPFR WLPVGPHIM**
LEWQPPPFR WLPVGPH
LEWQPPPFR WLPVGPH
LEWQPPPFR WLPVGPHIM
LEWQPPPFR WLPVGPHIM
LEWQPPPFR WLPVGPHIMG
EWQPPPFR WLPVGPHIM
WQPPPFR WLPVGPH
[Figure: MS/MS spectra of the Golgi peptide and the synthetic peptide (Relative Abundance vs. m/z, 400-1800), annotated with b ions (b3, b4, b9-b14) and y ions (y3, y5-y8, y10-y12, y12++, y14++)]
Dimethyl Arginine Containing Peptide
Shotgun Proteomic Experiments and Sampling Issues
• Based on prior studies in yeast, we know not every protein present is id'd
• Reproducibility is good for high abundance proteins: 70-80%
• Reproducibility is not as good for low abundance proteins: 20-30%
• Is this predictable?
[Figure: Number of Identified Proteins (1000-4000) vs. Number of MudPIT Runs (2-16); series: Experimental, Semi-empirical Model, Random Model]
Random Sampling Model for Data Dependent Acquisition
$K = n_L \left(1 - \left(1 - L/N\right)^{S}\right)$
n_L = # of protein species at a particular level
L = abundance level
N = total number of proteins
S = number of experiments
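The random sampling model can be evaluated directly. The sketch below assumes the formula reads K = n_L (1 - (1 - L/N)^S) and that K is the expected number of species at abundance level L identified after S experiments. The numbers in the example are made up for illustration and are not taken from the yeast data.

```python
def expected_identified(n_L, L, N, S):
    """Expected number of species at abundance level L seen after S experiments.

    (1 - L/N) is taken as the chance of missing a species in one experiment,
    so (1 - L/N) ** S is the chance of missing it in all S experiments.
    """
    return n_L * (1.0 - (1.0 - L / N) ** S)

# Hypothetical values: 100 species at level L = 50 out of N = 1000 proteins.
for runs in (1, 2, 5, 10):
    print(runs, round(expected_identified(100, 50, 1000, runs), 2))
```

Under this model each added run contributes fewer new identifications than the last, so the total flattens as the number of runs grows.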
Distribution of Protein Identifications After Repeating Analysis 9 Times
[Figure: Number of Proteins (100-700) vs. Number of MudPITs; annotations mark the # of proteins identified in 1 experiment and the # of proteins identified in all 9 experiments]
RelEx software
DTASelect Output, Peptide Sequence: LVNHFIQEFK
1) Predict m/z Range
2) Sum Signal in Range
3) Store Mass Chromatograms
4) Peak Detection
5) Correlation
[Figure panels: Relative Abundance vs. m/z (1270-1295) for steps 1 and 2; Relative Abundance vs. Time (32-37 min) for steps 3 and 4; Sample Intensity vs. Reference Intensity for step 5]
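The five steps above can be sketched on toy data. This is not the RelEx implementation: the m/z windows are simply assumed (step 1 would predict them from the peptide sequence and labeling), peak detection is a plain threshold around the apex, and the ratio is a least-squares slope through the origin. All function names and the toy scans are hypothetical.

```python
from math import sqrt

def chromatogram(scans, mz_lo, mz_hi):
    """Steps 2-3: sum the signal inside [mz_lo, mz_hi] in every scan."""
    return [sum(i for mz, i in peaks if mz_lo <= mz <= mz_hi) for _, peaks in scans]

def peak_window(trace, fraction=0.1):
    """Step 4: contiguous index range around the apex that stays >= fraction * apex."""
    apex = max(range(len(trace)), key=trace.__getitem__)
    cutoff = fraction * trace[apex]
    lo = hi = apex
    while lo > 0 and trace[lo - 1] >= cutoff:
        lo -= 1
    while hi < len(trace) - 1 and trace[hi + 1] >= cutoff:
        hi += 1
    return lo, hi

def ratio_and_correlation(sample, reference):
    """Step 5: slope of sample vs. reference through the origin, and Pearson r."""
    slope = sum(s * r for s, r in zip(sample, reference)) / sum(r * r for r in reference)
    n = len(sample)
    ms, mr = sum(sample) / n, sum(reference) / n
    cov = sum((s - ms) * (r - mr) for s, r in zip(sample, reference))
    var_s = sum((s - ms) ** 2 for s in sample)
    var_r = sum((r - mr) ** 2 for r in reference)
    return slope, cov / sqrt(var_s * var_r)

# Toy data: each scan is (time_min, [(m/z, intensity), ...]); the sample signal
# is exactly twice the reference signal in every scan.
profile = [0, 10, 40, 100, 40, 10, 0]
scans = [(32.0 + 0.5 * k, [(1272.0, 2.0 * y), (1286.0, 1.0 * y)])
         for k, y in enumerate(profile)]

# Step 1 (assumed here): m/z windows for the two isotopic forms.
sample_trace = chromatogram(scans, 1270.0, 1275.0)
reference_trace = chromatogram(scans, 1284.0, 1289.0)

lo, hi = peak_window(reference_trace)
ratio, r = ratio_and_correlation(sample_trace[lo:hi + 1], reference_trace[lo:hi + 1])
print(f"ratio = {ratio:.3f}, correlation = {r:.3f}")
```

A correlation well below 1 would suggest that the two chromatograms do not share a peak shape, which is one way to flag an unreliable ratio.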
[Figure: Average Peptide Ratio (1-7) vs. Protein Mixture Ratio (1-5)]
TSA1 = 1.335±0.015
SSA1 = 1.004±0.018
ADH1 = 0.661±0.045
Systematic Errors are Present in Samples
Spectral Sampling for Relative Quantification
Combined Data for 6 proteins added to Yeast Soluble Cell Lysate at 4 different levels
[Figure: % Spectrum Observed vs. % Protein Markers Added (0-30); linear fit y = 1.0613x + 0.2608, R² = 0.9997]
Linear dynamic range: 2 orders of magnitude
Measuring small changes is not as reliable
Synthesis of Ribosomal Proteins in Plasmodium falciparum
• Striking trend: almost all ribosomal proteins increase over the ring to troph transition
[Figure: Ribosomal Protein Abundance; Spectral Count (0-4000) vs. Stage (Ring, Troph, Schiz, Mero)]
Standards
• 1. Data formats: McDonald et al., MS1, MS2, and SQT: three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications, RCM 18, 2162-8 (2004).
• Database search standard based on:
  – Washburn et al., Large Scale analysis of the yeast proteome via multidimensional protein identification technology, Nature Biotechnology 19, 242-247 (2001).
  – MacCoss et al., Probability Based Validation of Protein Identifications Using a Modified SEQUEST Algorithm, Analytical Chemistry 74, 5593-5599 (2002). Normalized Scores
Standards
• 4. Standards should not prevent innovation
  – Data formats should be practical, e.g. storage space
• 7. Data processing tools should be transparent and validated, e.g. published
• 8. Data for publication: information to support biological conclusions- sequences of peptides id’d.
• 9. Archiving data: biological conclusions should be the most important part of the experiment
Probability Distributions
[Figure: Number of Spectra (0-12000) vs. Probability Scores (0-20) for Charge State +1, Charge State +2, and Charge State +3]
Single Spectral Matches are Problematic: How to tell if they are correct
• Searches determine closeness of fit based on some measure: compare matches with different programs
• Probability scoring: P = random match based on frequency of fragment ions in database
• SEQUEST: XCorr measures how close the spectrum fits to ideal spectrum
• Manual validation, experimental validation
• de novo interpretation
Multi-Enzyme Digest
• Sample is split into 3 aliquots
• Digest using 3 different proteases (trypsin, elastase, subtilisin)
• Mix and analyze by LC/LC-MS/MS (MudPIT)
• Interpret spectra using SEQUEST
[Figure: bar chart (0-100%) of % of Proteins Identified, % of Unique Peptides, and % of Spectrum Copies for P1, P2, P3, and P4]
Properties of Data Dependent Data Acquisition
Most invariant property is spectral copy number
GutenTag: Partial de novo Sequencing of Tandem Mass Spectra
• Database searching assumes minimal errors in the database and sequence variations between strains, individuals and species
• Modifications need to be specified in database searches, so unanticipated modifications will be missed.
• Partial de novo analysis of tandem mass spectra at large scale can identify peptides containing sequence variations and unanticipated modifications
GutenTag
• 6170 tandem mass spectra: LC/LC/MS/MS analysis of simple digested protein mixture
• 1987 spectra matched by SEQUEST
• 1328 spectra matched by GutenTag
• 766 partial matches suggesting modifications and sequence variations
• Total matching spectra by GutenTag: 2,094
• Partial de novo will extend identifications
• Software is Automated and Large-Scale
Improved Spectral Quality Affects Peptide Identification
Infusion of 1 pmol/µl Angiotensin I
LCQ-Classic: 761 of 1000 MS/MS Spectra Matched the Correct Sequence
LTQ: 970 of 1000 MS/MS Spectra Matched the Correct Sequence
[Figure: two histograms of Frequency vs. XCorr (0.5-3.5), one per instrument]
Database Searching with Tandem Mass Spectra
• The goal is to identify peptides using MS/MS spectra and amino acid sequence databases.
• Develop a probabilistic model that establishes a relationship between the database sequences and the spectrum to complement quantitative measures of closeness-of-fit
• Develop non-empirical probabilistic measures using cross-correlation measurements
Probability Model
• Null hypothesis: all fragment matches to the MS/MS spectrum are random.
• N, K – number of all fragments and all fragment matches, respectively. N1 is the number of fragments of a particular peptide which has K1 matches.
• We seek an amino acid sequence that has the smallest probability of being a random match.
$P_{K,N}(K_1, N_1) = \dfrac{C_{K}^{K_1}\, C_{N-K}^{N_1-K_1}}{C_{N}^{N_1}}$
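The probability model is a hypergeometric distribution, so it can be evaluated with binomial coefficients. The sketch below assumes the formula P(K1, N1) = C(K, K1) C(N - K, N1 - K1) / C(N, N1), with N, K, N1 and K1 as defined in the bullets above. The function name is illustrative and is not taken from Pep_Probe, and the parameter values are the ones shown on the model-distribution slides (N = 1000, K = 300, N1 = 30).

```python
from math import comb

def match_probability(K1, N1, K, N):
    """Hypergeometric probability that exactly K1 of a peptide's N1 fragments
    match at random, given K matching fragments among N fragments overall."""
    return comb(K, K1) * comb(N - K, N1 - K1) / comb(N, N1)

N, K, N1 = 1000, 300, 30
pmf = [match_probability(k1, N1, K, N) for k1 in range(N1 + 1)]

most_likely = max(range(N1 + 1), key=pmf.__getitem__)
print("most likely number of random matches:", most_likely)
print("total probability:", round(sum(pmf), 6))
```

The candidate sequence whose observed K1 has the smallest probability under this null model is the one least likely to be a random match.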
Model Hypergeometric Distributions
[Figure: Probability (0.00-0.25) vs. Number of Successes (0-30) for K = 100, K = 200, and K = 300, with N = 1000 and N1 = 30]
Significance of Peptide Identification
• Not all identifications are significant: poor quality spectra of peptides, incomplete peptide fragmentation, inaccuracies in database, posttranslational modifications, MS/MS of chemical noise and non-peptide molecules.
• Significance of a match (P_value) is also obtained from the hypergeometric distribution.
Significance of a Match
[Figure: Probability (0.00-0.16) vs. Number of Successes (0-30) under H0 for N = 1000, K = 300, N1 = 30; annotation: Significance of K1 = 14]
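One common way to turn the hypergeometric distribution into a significance value is to sum the upper tail: the probability of seeing K1 or more matches under the null hypothesis. The slides do not spell out the exact definition used, so the tail sum below is an assumption, and `p_value` is an illustrative name. The parameters are the ones from the figure (N = 1000, K = 300, N1 = 30, K1 = 14).

```python
from math import comb

def hypergeom_pmf(k1, N1, K, N):
    """Probability of exactly k1 random fragment matches."""
    return comb(K, k1) * comb(N - K, N1 - k1) / comb(N, N1)

def p_value(K1, N1, K, N):
    """Upper-tail probability of K1 or more matches under the null hypothesis."""
    upper = min(N1, K)  # cannot match more fragments than exist
    return sum(hypergeom_pmf(k, N1, K, N) for k in range(K1, upper + 1))

p14 = p_value(14, 30, 300, 1000)
print(f"P(K1 >= 14) = {p14:.4f}")
```

A small tail probability means that many matches would rarely arise by chance, which is the sense in which a match is called significant.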
Yeast Database
[Figure: Predicted Distribution and Observed Frequency (0.00-0.25) vs. Number of Fragment Matches (0-40); N = 5284150, K = 569160, N1 = 40]
Cumulative Distribution
[Figure: Cumulative Distribution Function (0.0-1.0) vs. Number of Fragment Matches (0-20)]
Mass/Charge State Dependence
• Scores that use closeness of fit measures can artificially inflate with weight/mass.
• This complicates use of uniform criteria for identification.
• Probabilities generated by hypergeometric distribution are charge/weight independent.
Cross-Correlation Score Distribution
[Figure: Number of Spectra (0-3500) vs. Cross-Correlation Scores (0-8) for XCorr +1, XCorr +2, and XCorr +3]
Database Dependence
• The probability inferred from the hypergeometric distribution is in principle database dependent.
• However, the dependency is very weak.
NRP and Yeast Databases
[Figure: Probability (0.00-0.25) vs. Number of Fragment Ion Matches (0-30) for the NRP Database and the Yeast Database]
Pep_Probe Summary
• Implements 4 scoring schemes: hypergeometric, Poisson, maximum likelihood and cross-correlation. Sorts results either by hypergeometric or cross-correlation scores.
• No enzyme specificity is assumed.
• Reports significance of each match.
• Can search for posttranslational modifications to three different amino acids.
• Has been implemented to run on a standalone computer or on compute clusters.
• Runs on a heterogeneous cluster of computers, on WINDOWS and LINUX platforms.
Processing Tandem Mass Spectra
Spectral Quality: Eliminate Poor Quality Spectra, Score Quality of Other Spectra
Database Analysis: Multiple Methods With Different Selectivities
Analyze Unusual or Unanticipated Features: De novo or partial de novo analysis
Assemble and Annotate Data: Biological Significance
Increased Data Production Requires Automated Data Analysis
• MS/MS Comparison (Library Searching, Comparative Analysis, Subtractive Analysis): LIBQUEST. Yates et al., Anal. Chem. 70, 3557 (1998)
• De Novo Sequencing (Alternate Splicing, Unantic. Mod., Unknown ORFs): GutenTag. Tabb et al., Anal. Chem.
• Database Search (Existing Sequence, PTM): SEQUEST & Pep_Prob. Yates et al., JASMS 5, 976 (1994); Yates et al., Anal. Chem. 67, 1426 (1995); Yates et al., Anal. Chem. 67, 3205 (1995); MacCoss et al., Anal. Chem. (2002); Sadygov and Yates, Anal. Chem. (in press)
• Post Analysis Review Software: DTASelect. Link et al., Nature Biotech. 17, 676-682 (1999); Tabb et al., J. Proteome Res. 1, 26 (2002)
• Probability for Protein Identification (Peptide Id Probability, P-Val, S.C.): Link et al., Nature Biotech. 17, 676-682 (1999); Tabb et al., J. Proteome Res. 1, 26 (2002)
• Variant Search (Related Sequence, SNP Analysis, Mutation Analysis): SEQUEST-SNP. Gatlin et al., Anal. Chem. 72, 757 (2000)
• Spectral Quality: Bioinformatics (in press)
[Figure: MS/MS spectrum (Relative Abundance vs. m/z) at the center of the diagram]
Data Considerations
• Different types of experiments
• Different types of data analysis
Integrated Multi-Dimensional Liquid Chromatography
[Diagram labels: hV; 100 micron FSC; RP, SCX, RP; 5 µM; 100-300 nL/min; Waste]
Comprehensive Analysis of Complex Protein Mixtures
Total Protein Characterization
• Protein Identification: What’s there
• Post Translational Modifications: Regulation
• Quantification: Dynamics
• Proteomic Data to Knowledge: Genetics, RNAi, siRNA
Multiprotein Complex/Organelle
Cells/Tissues
Translation of technology development into biological discovery