EMBnet course Proteomics using Bioinformatics tools › CoursEMBnet › Basel08 ›...
Transcript of EMBnet course Proteomics using Bioinformatics tools › CoursEMBnet › Basel08 ›...
PMP
EMBnet course
Proteomics using Bioinformatics tools
MS identification tools
Patricia M. PalagiPIG, SIB, Geneva
PMP
The data: list of m/z values
840.6950 13.751676.9606 26.11498.8283 128.91045.564 845.22171.9670 2.56861.1073 371.2842.51458 53.71456.7274 12.9863.268365 3.1
MS
1163.7008 286.1105 220.142986.1738 13.7619102.0752 4.3810147.1329 57.3333185.1851 649.0953185.3589 5.3810186.1876 81.4286213.0791 1.4286
MS/MS
frag
men
t m
ass
valu
es
Peptide
mas
s va
lues
and inte
nsi
ties
frag
men
t in
tensi
ties
Parent mass chargeParent mass value
PMP
The tools
PMP
One direct access to all- ExPASy
http://www.expasy.org/tools/
PMP
Automatic protein identification
- Peptide mass fingerprinting – PMF
- MS/MS sequence search
- MS/MS spectra library search
- MS/MS prospective analysis (tag, open mod, de novo sequencing
PMP
Peptide mass fingerprinting = PMF MS database matching
Enzymaticdigestion
In-silicodigestion
Protein(s) Peptides840.6950861676.960631498.82831045.5642171.967066861.107346842.514581456.727405863.268365
Mass spectraPeaklist
…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV …
- MAIILAGGHSVR- FGPK- AFAEVNGETFYSR- VITLESTNMFNEIIISTNAQLATQFK- YPNVVIDDENHNDK…
Sequence database entry
861.107346838.6950861676.960631498.82831045.5642171.967066842.514581457.827405863.268453
Theoretical peaklist
Theoretical proteolytic peptides
Match
Result: ranked list of protein candidates
PMP
Peptide mass fingerprinting
What you have:- Set of peptide mass values- Information about the protein: molecular weight, pI, species.- Information about the experimental conditions: mass spectrometer precision, calibration used, possibility of missed-cleavages, possible modifications- Biological characteristics: post-translational modifications, fragments
What will do the tool:- Match between this information and a protein sequence database
What will you get:- a list of probable identified proteins
PMP
What is the expected information in a submission form?
• Place to upload a spectrum (many spectra)
• Description of the sample process used – Chemical process such as alkylation/reduction, – Cleavage properties (enzyme), – Mass tolerance (m/z tolerance)
• Search space – Sequence databank, – taxonomy restriction– Mw, pI restriction
• Scoring criteria and filters
PMP
One example of parameter effects on the search
• Accepted mass tolerancedue to imprecise measures and calibration problems
Source: Introduction to proteomics: tools for the new biology. Daniel C. Liebler. Human Press. 2002
PMP
Summary of PMF tools
Tool Source website
Aldente www.expasy.org/cgi-bin/aldente
Mascot www.matrixscience.com/
MS-Fit prospector.ucsf.edu/
ProFound prowl.rockefeller.edu/profound_bin/WebProFound.exe
PepMAPPER wolf.bms.umist.ac.uk/mapper/
PeptideSearch www.mann.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html
PepFrag prowl.rockefeller.edu/prowl/pepfragch.html
Non exhaustive list!
PMP
Scoring systems
• Essential for the identification! Gives a confidence value to each matched protein
• Three types of scores
• Shared peaks count (SPC): simply counts the number of matched mass values (peaks)
• Probabilistic scores: confidence value depends on probabilistic models or statistic knowledge used during the match (obtained from the databases)
• Statistic-learning: knowledge extraction from the influence of different properties used to match the proteins (obtained from the databases)
PMP
Mascot• Internet free version in the above website (commercial versions available too)•Choice of several databases.• Considers multiple chemical modifications.• 0 to 9 missed-cleavages. • Score based on a combination of probabilistic and statistic approaches (is based on Mowsescore).• Considers Swiss-Prot annotations for Splice Variants (in locally installed versions).
http://www.matrixscience.com/
PMP
Mascot - principles• Probability-based scoring• Computes the probability P that a match is
random • Significance threshold p< 0.05 (accepting that
the probability of the observed event occurring by chance is less than 5%)
• The significance of that result depends on the size of the database being searched.
• Mascot shades in green the insignificant hits• Score: -10Log10(P)
PMP
Mascot
Input
PMP
Hints about the significance of the score
Decoy Output
PMP
Sequence coverage
Peptides matched
Error function
Output
PMP
Aldente • SwissProt/TrEMBL db, indexed masses (trypsine and many others).
• Considers chemical modifications and user specified modifications.
• Considers biological modifications (annotations SWISS-PROT).
• 0 or 1 missed-cleavages.
• Use of robust alignment method (Hough transform):
• Determines deviation function of spectrometer
• Resolves ambiguities
• Less sensitive to noise
PMP
Aldente – summary
Exp
erim
enta
l m
asse
s /
pea
ks
Theoretical masses / peptides
Spectrometer calibration error
Spectrometer internal error
• The Hough Transform estimates from the experimental data the deviation function of the mass spectrometer (the calibration error function).
• The program optimizes the set of best matches, excluding noise and outliers, to find the best alignment.
PMP
Aldente - Input
PMP
Aldente - Input
PMP
Aldente - Input
PMP
Aldente - Input
PMP
Aldente - Input
PMP
Output
Hints about the significance of the score
PMP
Information from Swiss-Prot annotation. Processed protein (signal peptide is cleaved).
PMP
BioGraph
PMP
• A summary of the search parameters
• A list of potentially identified proteins (AC numbers) with scores and other evidences
• A detailed list of potentially identified peptides (associated or not to the potentially identified proteins) with scores
• Possibilities to validate/invalidate the provided results (info on the data processing, on the statistics, links to external resources, etc.)
• Possibilities to export the (validated) data in various formats
What is the expected information in an identification result?
PMP
- Exact primary structure- Splicing variants- Sequence conflicts- PTMs
1 protein entry does not represent1 unique molecule
Protein characterization with PMF data
Prediction tools• PTMs and AA substitutions • Oligosaccharide structures • Unspecific cleavages
FindModGlycoModFindPept
Characterization tools at ExPASy using peptide mass fingerprinting data http://www.expasy.org/tools/
PMP
SWISS-PROT feature table: active protein is more than just translation of
gene sequence (example: P20366)
PMP
Detection of PTMs in MS
624.
376
9.8
893.
4
994.
510
56.1
1326
.715
01.9
1759
.8 1923
.421
00.6
600 2200
624.
376
9.8
893.
4
994.
510
56.1
1326
.715
01.9
1759
.819
23.4
2100
.6
600 2200
624.
376
9.8
893.
4
994.
510
56.1
1326
.715
01.9
1759
.8 1923
.421
00.6
600 2200
624.
376
9.8
893.
4
994.
5
1070
.113
26.7
1501
.9
1759
.8 1923
.4
2100
.6
600 2200
Δ m/z => PTM Δ m/z => PTM
Unmodifiedtrypticmasses
Trypticmasses of a modifiedprotein
PMP
FindModhttp://www.expasy.org/tools/findmod/
DB entry
experimental
masses
experimental
options
AA modifications
PMP
FindMod Output
}
} unmodified peptides,modified peptidesknown in SWISS-PROT and chemically modified peptides
putatively modified peptides predictedby mass differences
+ putative AA substitutions
PMP
Modification rules can be defined from SWISS-PROT, PROSITE and the literature
modification amino acid rule exceptions
farnesylation Cys -
palmitoylation Cys Ser, ThrO-GlcNAc Ser, Thr Asn
amidation Xaa (C-term) where Gly followed Xaapyrrolidone carboxylic acid Gln (N-term) -
phosphorylation in eukaryotes: Ser, Thr, Asp, His, Tyr -in prokaryotes: Ser, Thr, Asp, His, Cys -
sulfatation in eukaryotes Tyr, PROSITE PDOC00003
some examples:
PMP
FindMod Output - Application of Rules
- potentially modified peptides that agree with rules are listed- amino acids that potentially carry modifications are shown
- peptides potentially modified only by mass difference
- predictions can be tested by MS-MS peptide fragmentation
PMP
• From MS (peptide mass fingerprint) data -detection of :– Matching peptides for unspecific cleavage– Masses resulting from possible contaminants– Matching peptides for specific cleavage (16 different
enzymes)– Peptides resulting from protease autolysis
FindPepthttp://www.expasy.org/tools/findpept.html
PMP
FindPep
PMP
FindPep
PMP
FindPep
PMP
FindPep
PMP
• Tag search- Tools that search peptides based on a MS/MS Sequence Tag
– MS-Tag and MS-Seq, PeptideSearch
• Ion search or PFF - Tools that match MS/MS experimental spectra with “theoretical spectra” obtained via in-silicofragmentation of peptides generated from a sequence database– Phenyx, Mascot, Sequest, X!Tandem, OMSSA, ProID, …
• de novo sequencing - Tools that directly interpret MS/MS spectra and try to deduce a sequence– Convolution/alignment (PEDENTA)– De-novo sequencing followed by sequence matching
(Peaks, Lutefisk, Sherenga, PeptideSearch)– Guided Sequencing (Popitam)
In all cases, the output is a peptide structure per MS/MS spectrum
MS/MS based identification tools
PMP
Peptide fragmentation fingerprinting = PFF = ion search
MS/MS database matchingEnzymaticdigestion
In-silicodigestion
Protein(s) Peptides340.695086676.96063498.8283545.5641171.967066261.107346342.51458456.727405363.268365
MS/MS spectra of peptides
Ions peaklists
…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV …
- MAIILAGGHSVR- FGPK- AFAEVNGETFYSR- VITLESTNMFNEIIIK- YPNVVIDDENNDK…
Sequence database entry
361.107346338.695086676.96063498.82831045.5641171.967066342.51458457.827405263.268453
Theoretical peaklist
Theoretical proteolytic peptides
Match
Result: ranked list of peptide
and protein
candidates
Theoretical fragmented
peptides
-MAIILAG-MAIILA-MAIIL-MAII-MAI-M-M-AIILAG
In-silicofragmentation
PMP
P' nterm P' cterm
offset-28-45-46
0-17-18
+17
[N] is the mass of the N-term group[M] is the mass of the sum of the neutral amino acid residue masses
It is very important to know the ionic series produced by a spectrometer, otherwise potential matches will be missed.
In the other hand, if an ion-type not present in the original spectrum is taken into account, it will contribute to get false positive matches.
Ion-types
+28+ 2-15-16
-15
PMP
Some PFF tools
Software Source website
InsPecT peptide.ucsd.edu/inspect.py
Mascot www.matrixscience.com/search_form_select.html
MS-Tag and MS-Seq prospector.ucsf.edu
PepFrag prowl.rockefeller.edu/prowl/pepfragch.html
Phenyx phenyx.vital-it.ch
Popitam www.expasy.org/tools/popitam
ProID (download) sashimi.sourceforge.net/software_mi.html
Sequest* fields.scripps.edu/sequest/index.html
Sonar 65.219.84.5/service/prowl/sonar.html
SpectrumMill* www.home.agilent.com
VEMS www.bio.aau.dk/en/biotechnology/vems.htm
X!Tandem (download) www.thegpm.org/TANDEM
*Commercialized
Same principle of a PMF, but using MS/MS spectra
Non exhaustive list!
PMP
PMP
Phenyx
PMP
The Phenyx Web Interface:One result, multiple views
Excel, xml and text exports
DesktopResultsviews
Submission
Management consoleResults comparison
P.A. Binz
PMP
The Proteins overview
List of identifiedproteins
Corresponding list of identified peptides
Protein group description
P.A. Binz
PMP
The Proteins overview
Hints about the significance of the score
PMP
Hints about the significance of the score
Better when high intensity peaks are matched and ion series are extended, without too many and too big holes
PMP
The scoring system in Phenyx
• The score is the sum of up to 12 basic scores such as:– presence of a, b, y, y++, B-H2O…; co-occurrence of ion series
(using HMMs), peak intensities, residue modifications (PTM or chemical), …
• True probabilistic approach for each peptide match
(likelihood of being correct)log --------------------------------
(likelihood of being random)
• Function of instruments and molecular types– Esquire 3000+, LCQ; iTRAQ vs. unmodified peptides
• Scores are normalised into z-scores
Search in a query database
Search in a randomized set of peptides
PMP
X!Tandem
www.thegpm.org
PMP
X!Tandem - output
1
2
3
PMP
The two-rounds searchMascot, Phenyx and X!Tandem
The identification process may be launched in 2-rounds
• Each round is defined with a set of search criteria– First round searches the selected database(s) with
stringent parameters,– Second round searches the proteins that have
passed the first round (relaxed parameters):
⇒Accelerate the job when looking for many variable modifications, or unspecific cleavages
⇒Appropriate when the first round defines stringent criteria to capture a protein ID, and the second round looks for looser peptide identifications
PMP
Example 2nd round
1rnd,Only 3 fixed mods131 valid, 75% cov.
2rnd,Add variable mods205 valid,84% cov.
2rnd,With all modsAnd half cleaved348 valid,90% cov.
PMP
Source of errors in assigning peptides
• Scores not adapted
• Parameters are too stringent or too loose
• Low MS/MS spectrum quality (many noise peaks, low signal to noise ratio, missing fragment ions, contaminants)
• Homologous proteins
• Incorrectly assigned charge state
• Pre-selection of the 2nd isotope (the parent mass is shifted of 1 Da. A solution is to take the parent mass tol. larger, but may drawn the good peptide too)…
• Novel peptide or variant
PMP
Hints to know when the identification is correct
• Good sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better
• Consider the length of the protein versus the number of matched theoretical peptides
•Better when high intensity peaks have been used in the identification
•Scores: the higher, the better. The furthest from the 2nd hit the better
• Filter on the correct species if you know it (reduces the search space, time, and errors)
• Better when the errors are more or less constants among all peptides found.
•If you have time, try many tools and compare the results
With MS
PMP
Hints to know when the identification is correct
• The higher the number of peptides identified per protein, the better
• Sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better
•Depends on the sample complexity and experiment workflow
• Scores: the higher, the better.
• Filter on the correct species if you know it (reduces the search space, time, and errors)
• Better when high intensity peaks are matched and ion series areextended, without too many and too big holes.
• Better when the errors are more or less constants among all ions.
• If you have time, try many tools and compare the results
With MS/MS
PMP
E-values
• For a given score S, it indicates the number of matches that are expected to occur by chance in a database with a score at least equal to S.
• The e-value takes into account the size of the database that was searched. As a consequence it has a maximum of the number of sequences in the database.
• The lower the e-value, the more significant the score is.
• An e-value depends on the calculation of the p-value.
PMP
p-value
• A p-value describes the probability, which assesses the chance of validly rejecting the null hypothesis. If the p-value is 10-5 then the rejection of the null hypothesis is due to chance with a probability of 10-5.
• A p-value has a maximum of 1.0.
• The larger the search space, the higher the p-value since the chance of a peptide being a random match increases.
• The lower the p-Value, the more significant is the match.
Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32
PMP
Z-score• Z-score is a dimensionless quantity
derived by subtracting the population mean from an individual (raw) score and then dividing the difference by the population standard deviation.
• The z score reveals how many units of the standard deviation a case is above or below the mean.
σμ−
=−x
scoreZ
Source: wikipedia
PMP
So what?
• For small (significant) p-values, p and e are approximately equal, so the choice of one or the other is often equivalent. It is therefore reasonable to assimilate low p-values in Phenyx to e-values. X!Tandem simply switches e-values to log values to remove the powers of 10
• For a single search (or set of sampled peptides), you can compare z-scores. However, when two or more searches are performed on different size spaces, you first need to look at the p-values before comparing z-scores.
Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32