EMBnet course Proteomics using Bioinformatics tools › CoursEMBnet › Basel08 ›...

PMP

EMBnet course

Proteomics using Bioinformatics tools

MS identification tools

Patricia M. PalagiPIG, SIB, Geneva

PMP

The data: list of m/z values

840.6950 13.751676.9606 26.11498.8283 128.91045.564 845.22171.9670 2.56861.1073 371.2842.51458 53.71456.7274 12.9863.268365 3.1

MS

1163.7008 286.1105 220.142986.1738 13.7619102.0752 4.3810147.1329 57.3333185.1851 649.0953185.3589 5.3810186.1876 81.4286213.0791 1.4286

MS/MS

frag

men

t m

ass

valu

es

Peptide

mas

s va

lues

and inte

nsi

ties

frag

men

t in

tensi

ties

Parent mass chargeParent mass value

PMP

The tools

PMP

One direct access to all- ExPASy

http://www.expasy.org/tools/

PMP

Automatic protein identification

- Peptide mass fingerprinting – PMF

- MS/MS sequence search

- MS/MS spectra library search

- MS/MS prospective analysis (tag, open mod, de novo sequencing

PMP

Peptide mass fingerprinting = PMF MS database matching

Enzymaticdigestion

In-silicodigestion

Protein(s) Peptides840.6950861676.960631498.82831045.5642171.967066861.107346842.514581456.727405863.268365

Mass spectraPeaklist

…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV …

- MAIILAGGHSVR- FGPK- AFAEVNGETFYSR- VITLESTNMFNEIIISTNAQLATQFK- YPNVVIDDENHNDK…

Sequence database entry

861.107346838.6950861676.960631498.82831045.5642171.967066842.514581457.827405863.268453

Theoretical peaklist

Theoretical proteolytic peptides

Match

Result: ranked list of protein candidates

PMP

Peptide mass fingerprinting

What you have:- Set of peptide mass values- Information about the protein: molecular weight, pI, species.- Information about the experimental conditions: mass spectrometer precision, calibration used, possibility of missed-cleavages, possible modifications- Biological characteristics: post-translational modifications, fragments

What will do the tool:- Match between this information and a protein sequence database

What will you get:- a list of probable identified proteins

PMP

What is the expected information in a submission form?

• Place to upload a spectrum (many spectra)

• Description of the sample process used – Chemical process such as alkylation/reduction, – Cleavage properties (enzyme), – Mass tolerance (m/z tolerance)

• Search space – Sequence databank, – taxonomy restriction– Mw, pI restriction

• Scoring criteria and filters

PMP

One example of parameter effects on the search

• Accepted mass tolerancedue to imprecise measures and calibration problems

Source: Introduction to proteomics: tools for the new biology. Daniel C. Liebler. Human Press. 2002

PMP

Summary of PMF tools

Tool Source website

Aldente www.expasy.org/cgi-bin/aldente

Mascot www.matrixscience.com/

MS-Fit prospector.ucsf.edu/

ProFound prowl.rockefeller.edu/profound_bin/WebProFound.exe

PepMAPPER wolf.bms.umist.ac.uk/mapper/

PeptideSearch www.mann.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html

PepFrag prowl.rockefeller.edu/prowl/pepfragch.html

Non exhaustive list!

PMP

Scoring systems

• Essential for the identification! Gives a confidence value to each matched protein

• Three types of scores

• Shared peaks count (SPC): simply counts the number of matched mass values (peaks)

• Probabilistic scores: confidence value depends on probabilistic models or statistic knowledge used during the match (obtained from the databases)

• Statistic-learning: knowledge extraction from the influence of different properties used to match the proteins (obtained from the databases)

PMP

Mascot• Internet free version in the above website (commercial versions available too)•Choice of several databases.• Considers multiple chemical modifications.• 0 to 9 missed-cleavages. • Score based on a combination of probabilistic and statistic approaches (is based on Mowsescore).• Considers Swiss-Prot annotations for Splice Variants (in locally installed versions).

http://www.matrixscience.com/

PMP

Mascot - principles• Probability-based scoring• Computes the probability P that a match is

random • Significance threshold p< 0.05 (accepting that

the probability of the observed event occurring by chance is less than 5%)

• The significance of that result depends on the size of the database being searched.

• Mascot shades in green the insignificant hits• Score: -10Log10(P)

PMP

Mascot

Input

PMP

Hints about the significance of the score

Decoy Output

PMP

Sequence coverage

Peptides matched

Error function

Output

PMP

Aldente • SwissProt/TrEMBL db, indexed masses (trypsine and many others).

• Considers chemical modifications and user specified modifications.

• Considers biological modifications (annotations SWISS-PROT).

• 0 or 1 missed-cleavages.

• Use of robust alignment method (Hough transform):

• Determines deviation function of spectrometer

• Resolves ambiguities

• Less sensitive to noise

PMP

Aldente – summary

Exp

erim

enta

l m

asse

s /

pea

ks

Theoretical masses / peptides

Spectrometer calibration error

Spectrometer internal error

• The Hough Transform estimates from the experimental data the deviation function of the mass spectrometer (the calibration error function).

• The program optimizes the set of best matches, excluding noise and outliers, to find the best alignment.

PMP

Aldente - Input

PMP

Output


PMP

Information from Swiss-Prot annotation. Processed protein (signal peptide is cleaved).

PMP

BioGraph

PMP

• A summary of the search parameters

• A list of potentially identified proteins (AC numbers) with scores and other evidences

• A detailed list of potentially identified peptides (associated or not to the potentially identified proteins) with scores

• Possibilities to validate/invalidate the provided results (info on the data processing, on the statistics, links to external resources, etc.)

• Possibilities to export the (validated) data in various formats

What is the expected information in an identification result?

PMP

- Exact primary structure- Splicing variants- Sequence conflicts- PTMs

1 protein entry does not represent1 unique molecule

Protein characterization with PMF data

Prediction tools• PTMs and AA substitutions • Oligosaccharide structures • Unspecific cleavages

FindModGlycoModFindPept

Characterization tools at ExPASy using peptide mass fingerprinting data http://www.expasy.org/tools/

PMP

SWISS-PROT feature table: active protein is more than just translation of

gene sequence (example: P20366)

PMP

Detection of PTMs in MS

624.

376

9.8

893.

4

994.

510

56.1

1326

.715

01.9

1759

.8 1923

.421

00.6

600 2200

624.

376

9.8

893.

4

994.

510

56.1

1326

.715

01.9

1759

.819

23.4

2100

.6

600 2200

624.

376

9.8

893.

4

994.

510

56.1

1326

.715

01.9

1759

.8 1923

.421

00.6

600 2200

624.

376

9.8

893.

4

994.

5

1070

.113

26.7

1501

.9

1759

.8 1923

.4

2100

.6

600 2200

Δ m/z => PTM Δ m/z => PTM

Unmodifiedtrypticmasses

Trypticmasses of a modifiedprotein

PMP

FindModhttp://www.expasy.org/tools/findmod/

DB entry

experimental

masses

experimental

options

AA modifications

PMP

FindMod Output

}

} unmodified peptides,modified peptidesknown in SWISS-PROT and chemically modified peptides

putatively modified peptides predictedby mass differences

+ putative AA substitutions

PMP

Modification rules can be defined from SWISS-PROT, PROSITE and the literature

modification amino acid rule exceptions

farnesylation Cys -

palmitoylation Cys Ser, ThrO-GlcNAc Ser, Thr Asn

amidation Xaa (C-term) where Gly followed Xaapyrrolidone carboxylic acid Gln (N-term) -

phosphorylation in eukaryotes: Ser, Thr, Asp, His, Tyr -in prokaryotes: Ser, Thr, Asp, His, Cys -

sulfatation in eukaryotes Tyr, PROSITE PDOC00003

some examples:

PMP

FindMod Output - Application of Rules

- potentially modified peptides that agree with rules are listed- amino acids that potentially carry modifications are shown

- peptides potentially modified only by mass difference

- predictions can be tested by MS-MS peptide fragmentation

PMP

• From MS (peptide mass fingerprint) data -detection of :– Matching peptides for unspecific cleavage– Masses resulting from possible contaminants– Matching peptides for specific cleavage (16 different

enzymes)– Peptides resulting from protease autolysis

FindPepthttp://www.expasy.org/tools/findpept.html

PMP

FindPep

PMP

• Tag search- Tools that search peptides based on a MS/MS Sequence Tag

– MS-Tag and MS-Seq, PeptideSearch

• Ion search or PFF - Tools that match MS/MS experimental spectra with “theoretical spectra” obtained via in-silicofragmentation of peptides generated from a sequence database– Phenyx, Mascot, Sequest, X!Tandem, OMSSA, ProID, …

• de novo sequencing - Tools that directly interpret MS/MS spectra and try to deduce a sequence– Convolution/alignment (PEDENTA)– De-novo sequencing followed by sequence matching

(Peaks, Lutefisk, Sherenga, PeptideSearch)– Guided Sequencing (Popitam)

In all cases, the output is a peptide structure per MS/MS spectrum

MS/MS based identification tools

PMP

Peptide fragmentation fingerprinting = PFF = ion search

MS/MS database matchingEnzymaticdigestion

In-silicodigestion

Protein(s) Peptides340.695086676.96063498.8283545.5641171.967066261.107346342.51458456.727405363.268365

MS/MS spectra of peptides

Ions peaklists

…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV …

- MAIILAGGHSVR- FGPK- AFAEVNGETFYSR- VITLESTNMFNEIIIK- YPNVVIDDENNDK…

Sequence database entry

361.107346338.695086676.96063498.82831045.5641171.967066342.51458457.827405263.268453

Theoretical peaklist

Theoretical proteolytic peptides

Match

Result: ranked list of peptide

and protein

candidates

Theoretical fragmented

peptides

-MAIILAG-MAIILA-MAIIL-MAII-MAI-M-M-AIILAG

In-silicofragmentation

PMP

P' nterm P' cterm

offset-28-45-46

0-17-18

+17

[N] is the mass of the N-term group[M] is the mass of the sum of the neutral amino acid residue masses

It is very important to know the ionic series produced by a spectrometer, otherwise potential matches will be missed.

In the other hand, if an ion-type not present in the original spectrum is taken into account, it will contribute to get false positive matches.

Ion-types

+28+ 2-15-16

-15

PMP

Some PFF tools

Software Source website

InsPecT peptide.ucsd.edu/inspect.py

Mascot www.matrixscience.com/search_form_select.html

MS-Tag and MS-Seq prospector.ucsf.edu

PepFrag prowl.rockefeller.edu/prowl/pepfragch.html

Phenyx phenyx.vital-it.ch

Popitam www.expasy.org/tools/popitam

ProID (download) sashimi.sourceforge.net/software_mi.html

Sequest* fields.scripps.edu/sequest/index.html

Sonar 65.219.84.5/service/prowl/sonar.html

SpectrumMill* www.home.agilent.com

VEMS www.bio.aau.dk/en/biotechnology/vems.htm

X!Tandem (download) www.thegpm.org/TANDEM

*Commercialized

Same principle of a PMF, but using MS/MS spectra

Non exhaustive list!

PMP

Phenyx

PMP

The Phenyx Web Interface:One result, multiple views

Excel, xml and text exports

DesktopResultsviews

Submission

Management consoleResults comparison

P.A. Binz

PMP

The Proteins overview

List of identifiedproteins

Corresponding list of identified peptides

Protein group description

P.A. Binz

PMP

The Proteins overview


PMP


Better when high intensity peaks are matched and ion series are extended, without too many and too big holes

PMP

The scoring system in Phenyx

• The score is the sum of up to 12 basic scores such as:– presence of a, b, y, y++, B-H2O…; co-occurrence of ion series

(using HMMs), peak intensities, residue modifications (PTM or chemical), …

• True probabilistic approach for each peptide match

(likelihood of being correct)log --------------------------------

(likelihood of being random)

• Function of instruments and molecular types– Esquire 3000+, LCQ; iTRAQ vs. unmodified peptides

• Scores are normalised into z-scores

Search in a query database

Search in a randomized set of peptides

PMP

X!Tandem

www.thegpm.org

PMP

X!Tandem - output

1

2

3

PMP

The two-rounds searchMascot, Phenyx and X!Tandem

The identification process may be launched in 2-rounds

• Each round is defined with a set of search criteria– First round searches the selected database(s) with

stringent parameters,– Second round searches the proteins that have

passed the first round (relaxed parameters):

⇒Accelerate the job when looking for many variable modifications, or unspecific cleavages

⇒Appropriate when the first round defines stringent criteria to capture a protein ID, and the second round looks for looser peptide identifications

PMP

Example 2nd round

1rnd,Only 3 fixed mods131 valid, 75% cov.

2rnd,Add variable mods205 valid,84% cov.

2rnd,With all modsAnd half cleaved348 valid,90% cov.

PMP

Source of errors in assigning peptides

• Scores not adapted

• Parameters are too stringent or too loose

• Low MS/MS spectrum quality (many noise peaks, low signal to noise ratio, missing fragment ions, contaminants)

• Homologous proteins

• Incorrectly assigned charge state

• Pre-selection of the 2nd isotope (the parent mass is shifted of 1 Da. A solution is to take the parent mass tol. larger, but may drawn the good peptide too)…

• Novel peptide or variant

PMP

Hints to know when the identification is correct

• Good sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better

• Consider the length of the protein versus the number of matched theoretical peptides

•Better when high intensity peaks have been used in the identification

•Scores: the higher, the better. The furthest from the 2nd hit the better

• Filter on the correct species if you know it (reduces the search space, time, and errors)

• Better when the errors are more or less constants among all peptides found.

•If you have time, try many tools and compare the results

With MS

PMP

Hints to know when the identification is correct

• The higher the number of peptides identified per protein, the better

• Sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better

•Depends on the sample complexity and experiment workflow

• Scores: the higher, the better.

• Filter on the correct species if you know it (reduces the search space, time, and errors)

• Better when high intensity peaks are matched and ion series areextended, without too many and too big holes.

• Better when the errors are more or less constants among all ions.

• If you have time, try many tools and compare the results

With MS/MS

PMP

E-values

• For a given score S, it indicates the number of matches that are expected to occur by chance in a database with a score at least equal to S.

• The e-value takes into account the size of the database that was searched. As a consequence it has a maximum of the number of sequences in the database.

• The lower the e-value, the more significant the score is.

• An e-value depends on the calculation of the p-value.

PMP

p-value

• A p-value describes the probability, which assesses the chance of validly rejecting the null hypothesis. If the p-value is 10-5 then the rejection of the null hypothesis is due to chance with a probability of 10-5.

• A p-value has a maximum of 1.0.

• The larger the search space, the higher the p-value since the chance of a peptide being a random match increases.

• The lower the p-Value, the more significant is the match.

Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32

PMP

Z-score• Z-score is a dimensionless quantity

derived by subtracting the population mean from an individual (raw) score and then dividing the difference by the population standard deviation.

• The z score reveals how many units of the standard deviation a case is above or below the mean.

σμ−

=−x

scoreZ

Source: wikipedia

PMP

So what?

• For small (significant) p-values, p and e are approximately equal, so the choice of one or the other is often equivalent. It is therefore reasonable to assimilate low p-values in Phenyx to e-values. X!Tandem simply switches e-values to log values to remove the powers of 10

• For a single search (or set of sampled peptides), you can compare z-scores. However, when two or more searches are performed on different size spaces, you first need to look at the p-values before comparing z-scores.

Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32

EMBnet course Proteomics using Bioinformatics tools › CoursEMBnet › Basel08 ›...

Documents

Transcript of EMBnet course Proteomics using Bioinformatics tools › CoursEMBnet › Basel08 ›...