Predicting Protein Sequences From Mass Spectral Data

Predicting Protein Sequences From Mass Spectral Data Gary Van Domselaar University of Alberta Canadian Proteomics Initiative May 18, 2004 Introduction Review: Protein Separation Cleavage Mass Spectra MS and MS/MS The Objective: Matching Mass Spectra to Protein Sequences

Transcript of Predicting Protein Sequences From Mass Spectral Data

Page 1: Predicting Protein Sequences From Mass Spectral Data

Predicting Protein Sequences From Mass

Spectral Data

Gary Van Domselaar

University of Alberta

Canadian Proteomics Initiative

May 18, 2004

Introduction

• Review:

– Protein Separation

– Cleavage

– Mass Spectra

– MS and MS/MS

• The Objective: Matching Mass Spectra to Protein Sequences

Page 2: Predicting Protein Sequences From Mass Spectral Data

Introduction

• Strategies:

– MALDI & Peptide Mass Fingerprinting

– MS/MS & Fragment Ion Searching

– MS/MS & Sequence Tag Searches

– MS/MS & De Novo Peptide Sequencing

Review: Protein Separation

High Performance Liquid Chromatography

2D Gel Electrophoresis

Page 3: Predicting Protein Sequences From Mass Spectral Data

Protein Separation: 2D Gel Electrophoresis

SDS-PAGE

Protein Separation: High Performance Liquid Chromatography (HPLC)

[Diagram: Solvent + Solvent → Mixer → Pump → Sample Injector → Column → Mass Spec]

Page 4: Predicting Protein Sequences From Mass Spectral Data

Protein Separation: 1D PAGE LC/MS

[Diagram: Complex Protein Mixture → SDS-PAGE → In-Gel Digestion → Sample Injector; Solvent + Solvent → Mixer → Pump → Sample Injector → Column → ESI-MS]

Protein Separation: 2D LC/MS

[Diagram: Complex Protein Mixture → In-Solution Digestion → Sample Injector; Solvent + Solvent → Mixer → Pump → Sample Injector → SCX → RPC → ESI-MS]

Page 5: Predicting Protein Sequences From Mass Spectral Data

Review: Cleavage

http://ca.expasy.org/tools/peptidecutter/peptidecutter_enzymes.html

Protease Cleavage Rules

Protease        Rule
Trypsin         XXX[KR]--[!P]XXX
Chymotrypsin    XX[FYW]--[!P]XXX
Lys C           XXXXXK--XXXXX
Asp N endo      XXXXX--DXXXXX
CNBr            XXXXXM--XXXXX
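The cleavage rules can be sketched as zero-width regular expressions, where `--` in the table becomes a split point. This is a minimal illustration, not the PeptideCutter implementation: the `CLEAVAGE_RULES` dictionary and the `digest` helper are hypothetical names, the Asp N rule is assumed to cut on the N-terminal side of D, and splitting on an empty match needs Python 3.7 or later.

```python
import re

# Each rule is a zero-width pattern: the lookbehind names the residue before
# the cut, the negative lookahead names a residue that blocks the cut.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",       # after K or R, unless followed by P
    "chymotrypsin": r"(?<=[FYW])(?!P)",  # after F, Y or W, unless followed by P
    "lys_c": r"(?<=K)",                  # after K
    "asp_n": r"(?=D)",                   # before D (assumed N-terminal cut)
    "cnbr": r"(?<=M)",                   # after M
}


def digest(sequence, enzyme="trypsin"):
    """Return the peptides of a complete digest (no missed cleavages)."""
    pieces = re.split(CLEAVAGE_RULES[enzyme], sequence.upper())
    return [p for p in pieces if p]  # drop the empty piece after a terminal cut


if __name__ == "__main__":
    print(digest("acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe"))
```

A peptide such as MASMGTLAFDEYGRPFLIIK stays intact under the trypsin rule because its only internal R is followed by P.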

Page 6: Predicting Protein Sequences From Mass Spectral Data

Missed Cleavages

• Proteases are not perfect enzymes

• Protease products are not confined to the predicted products

– Contaminating proteases

– PTMs at the recognition site block access

– Unexpected recognition sites:

• Ex: trypsin produces 'ragged termini' when two or more consecutive basic residues are present in the sequence

Missed Cleavages

Sequence

>Protein 1
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe

Tryptic Fragments (no missed cleavage)

acedfhsak (1007.4251)
dfqeasdfpk (1183.5266)
ivtmeeewendadnfek (2098.8909)
qwfe (609.2667)

Tryptic Fragments (1 missed cleavage)

acedfhsak (1007.4251)
dfqeasdfpk (1183.5266)
ivtmeeewendadnfek (2098.8909)
qwfe (609.2667)
acedfhsakdfqeasdfpk (2171.9338)
ivtmeeewendadnfekqwfe (2689.1398)
dfqeasdfpkivtmeeewendadnfek (3263.3997)

Page 7: Predicting Protein Sequences From Mass Spectral Data

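Autolysis Peaks

[Figure: spectrum, m/z 500 to 2500, with peaks at 450, 609, 698, 1007, 1199 and 2098, and trypsin autolysis peaks at 1940 (trp) and 2211 (trp)]

Review: The Mass Spectrum

[Figure: mass spectrum; axes: Relative Intensity vs. Mass / Charge (m/z)]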

Page 8: Predicting Protein Sequences From Mass Spectral Data

Average Mass and Monoisotopic Mass

• Monoisotopic mass is the mass determined using the masses of the most abundant isotopes

• Average mass is the abundance weighted mass of all isotopic components

http://www.matrixscience.com/help/mass_accuracy_help.html

Average Mass and Monoisotopic Mass

http://65.219.84.5/moverz/tutorials/pages/peak.html

Page 9: Predicting Protein Sequences From Mass Spectral Data

Calculating Peptide Masses

• Sum the monoisotopic residue masses

• Add mass of H2O (18.01056)

• Add mass of H+ (1.00728) to get M+H

• If Met is oxidized add 15.99491

• If Cys has acrylamide adduct add 71.0371

• If Cys is iodoacetylated add 58.0071

• Other modifications are listed at

– http://prowl.rockefeller.edu/aainfo/deltamassv2.html

• Only consider peptides with masses > 400

Amino Acid Residue Masses (Monoisotopic Mass)

Glycine         57.02147      Aspartic acid   115.02695
Alanine         71.03712      Glutamine       128.05858
Serine          87.03203      Lysine          128.09497
Proline         97.05277      Glutamic acid   129.04264
Valine          99.06842      Methionine      131.04049
Threonine      101.04768      Histidine       137.05891
Cysteine       103.00919      Phenylalanine   147.06842
Isoleucine     113.08407      Arginine        156.10112
Leucine        113.08407      Tyrosine        163.06333
Asparagine     114.04293      Tryptophan      186.07932
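The mass recipe and the monoisotopic table can be combined into a small calculator. This is a sketch under stated assumptions: `mh_mass` is a hypothetical helper, the proton mass is taken as 1.00728 (a value that reproduces the fragment masses quoted in the missed-cleavage example to within about 0.001), and only the Met oxidation shift is modelled.

```python
# Monoisotopic residue masses from the table above, keyed by one-letter code.
MONO = {
    "G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277, "V": 99.06842,
    "T": 101.04768, "C": 103.00919, "I": 113.08407, "L": 113.08407, "N": 114.04293,
    "D": 115.02695, "Q": 128.05858, "K": 128.09497, "E": 129.04264, "M": 131.04049,
    "H": 137.05891, "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932,
}
WATER = 18.01056   # H2O added once per peptide
PROTON = 1.00728   # assumed proton mass, gives the singly charged M+H
MET_OX = 15.99491  # shift per oxidized methionine


def mh_mass(peptide, oxidized_met=0):
    """Monoisotopic M+H of a peptide; `oxidized_met` counts oxidized Met residues."""
    residues = sum(MONO[aa] for aa in peptide.upper())
    return residues + WATER + PROTON + oxidized_met * MET_OX


if __name__ == "__main__":
    for pep in ("acedfhsak", "dfqeasdfpk", "ivtmeeewendadnfek", "qwfe"):
        print(f"{pep:20s} {mh_mass(pep):10.4f}")
```

Peptides whose M+H falls below 400 would then be filtered out before building a search list, as the last bullet suggests.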

Page 10: Predicting Protein Sequences From Mass Spectral Data

Amino Acid Residue Masses (Average Mass)

Glycine         57.0520      Aspartic acid   115.0886
Alanine         71.0788      Glutamine       128.1308
Serine          87.0782      Lysine          128.1742
Proline         97.1167      Glutamic acid   129.1155
Valine          99.1326      Methionine      131.1986
Threonine      101.1051      Histidine       137.1412
Cysteine       103.1448      Phenylalanine   147.1766
Isoleucine     113.1595      Arginine        156.1876
Leucine        113.1595      Tyrosine        163.1760
Asparagine     114.1039      Tryptophan      186.2133

Review: ESI-MS Spectrum

http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm

$m/z = \dfrac{MW + n\,H^+}{n}$
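In an ESI spectrum the same molecule appears at several charge states, each at the protonated mass divided by the charge n. The sketch below illustrates that relation and its inverse; the function names and the proton mass of 1.00728 are assumptions made for the example.

```python
PROTON = 1.00728  # assumed mass of H+


def mz(mw, n):
    """m/z of a molecule of neutral mass `mw` carrying `n` protons."""
    return (mw + n * PROTON) / n


def neutral_mass(mz_value, n):
    """Invert the relation: recover the neutral mass from an observed m/z."""
    return n * mz_value - n * PROTON


if __name__ == "__main__":
    # Charge-state series for a hypothetical 10 000 Da protein.
    for charge in range(5, 11):
        print(charge, round(mz(10000.0, charge), 3))
```

Two adjacent peaks of such a series are enough, in principle, to solve for both n and MW.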

Page 11: Predicting Protein Sequences From Mass Spectral Data

Review: ESI-MS/MS

MS1 MS2

Collision Cell

Review: ESI-MS/MS

AEGKLRFK(biotin)

b1   A         EGKLRFK(biotin)   a7
b2   AE        GKLRFK(biotin)    a6
b3   AEG       KLRFK(biotin)     a5
b4   AEGK      LRFK(biotin)      a4
b5   AEGKL     RFK(biotin)       a3
b6   AEGKLR    FK(biotin)        a2
b7   AEGKLRF   K(biotin)         a1

http://www.abrf.org/JBT/2000/December00/dec00bibbs.html
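The b-ion ladder for this peptide can be sketched numerically. The example assumes the usual convention that a singly charged b ion weighs the sum of its residue masses plus one proton (taken here as 1.00728); the biotinylated lysine is left out because its modification mass is not given, which is harmless since b1 to b7 cover only AEGKLRF. The name `b_ions` is hypothetical.

```python
from itertools import accumulate

# Monoisotopic residue masses for the residues that occur in AEGKLRF.
MONO = {"A": 71.03712, "E": 129.04264, "G": 57.02147, "K": 128.09497,
        "L": 113.08407, "R": 156.10112, "F": 147.06842}
PROTON = 1.00728  # assumed proton mass


def b_ions(prefix):
    """Singly charged b-ion m/z values b1..bn for the given N-terminal residues."""
    running_totals = accumulate(MONO[aa] for aa in prefix.upper())
    return [round(total + PROTON, 4) for total in running_totals]


if __name__ == "__main__":
    for i, value in enumerate(b_ions("AEGKLRF"), start=1):
        print(f"b{i}: {value:.4f}")
```

Neighbouring b ions differ by exactly one residue mass, which is what lets a fragment spectrum be read as sequence.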

Page 12: Predicting Protein Sequences From Mass Spectral Data

Review: MALDI Spectra

http://biop.ox.ac.uk/www/lj2000/endicott/endicott_7.html

• Generates Singly Charged Ions

• High Upper Detection Limit

Matching Spectra and Protein Sequences

[Figure: workflow. Protein → Digest → Mass Analysis → Experimental MS; Protein Database → In Silico Digestion → Theoretical MS. The experimental and theoretical spectra (% TIC vs. m/z) are then compared.]

Page 13: Predicting Protein Sequences From Mass Spectral Data

Strategies for Matching Mass Spectra with Protein Sequences

• MALDI & Peptide Mass Fingerprinting

• MS/MS & Fragment Ion Searching

• MS/MS & Sequence Tag Searches

• MS/MS & De Novo Peptide Sequencing

Peptide Mass Fingerprinting

• Used to identify protein spots on gels or protein peaks from an HPLC run

• Depends on the fact that if a protein is cut up or fragmented in a known way, the resulting fragments (and resulting masses) are unique enough to identify the protein

• Requires a database of known sequences

• Uses software to compare observed masses with masses calculated from the database

Page 14: Predicting Protein Sequences From Mass Spectral Data

Principles of Fingerprinting

>Protein 1
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe
Mass (M+H): 4842.05
Tryptic fragments: acedfhsak, dfqeasdfpk, ivtmeeewendadnfek, qwfe

>Protein 2
acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe
Mass (M+H): 4842.05
Tryptic fragments: acek, dfhsadfqeasdfpk, ivtmeeewenk, dadnfeqwfe

>Protein 3
acedfhsadfqekasdfpkivtmeeewendakdnfeqwfe
Mass (M+H): 4842.05
Tryptic fragments: acedfhsadfqek, asdfpk, ivtmeeewendak, dnfeqwfe

Principles of Fingerprinting

[Figure: the same three sequences, each with mass (M+H) 4842.05, shown beside their mass spectra]

Page 15: Predicting Protein Sequences From Mass Spectral Data

Preparing a Peptide Mass Fingerprint Database

• Take a protein sequence database (Swiss-Prot or nr-GenBank)

• Determine cleavage sites and identify resulting peptides for each protein entry

• Calculate the mass (M+H) for each peptide

• Sort the masses from lowest to highest

• Have a pointer for each calculated mass to each protein accession number in databank

Building A PMF Database

Sequence DB

>P12345
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe

>P21234
acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe

>P89212
acedfhsadfqekasdfpkivtmeeewendakdnfeqwfe

Calc. Tryptic Frags

P12345: acedfhsak, dfqeasdfpk, ivtmeeewendadnfek, qwfe
P21234: acek, dfhsadfqeasdfpk, ivtmeeewenk, dadnfeqwfe
P89212: acedfhsadfqek, asdfpk, ivtmeeewendak, dnfeqwfe

Mass List

 450.2017 (P21234)
 609.2667 (P12345)
 664.3300 (P89212)
1007.4251 (P12345)
1114.4416 (P89212)
1183.5266 (P12345)
1300.5116 (P21234)
1407.6462 (P21234)
1526.6211 (P89212)
1593.7101 (P89212)
1740.7501 (P21234)
2098.8909 (P12345)
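The database preparation steps (digest each entry, compute M+H, sort, keep a pointer to the accession) can be sketched as follows. This is an illustration rather than any search engine's actual code: `build_pmf_index` is a hypothetical name, the proton mass is assumed to be 1.00728, and Python 3.7 or later is assumed for splitting on an empty match. Computed masses agree with the list above to within about 0.01.

```python
import re

# Monoisotopic residue masses (one-letter codes).
MONO = {
    "G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277, "V": 99.06842,
    "T": 101.04768, "C": 103.00919, "I": 113.08407, "L": 113.08407, "N": 114.04293,
    "D": 115.02695, "Q": 128.05858, "K": 128.09497, "E": 129.04264, "M": 131.04049,
    "H": 137.05891, "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932,
}
WATER, PROTON = 18.01056, 1.00728  # proton mass is an assumed value

SEQUENCE_DB = {
    "P12345": "acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe",
    "P21234": "acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe",
    "P89212": "acedfhsadfqekasdfpkivtmeeewendakdnfeqwfe",
}


def build_pmf_index(db):
    """Return (M+H, accession, peptide) tuples sorted from lowest to highest mass."""
    index = []
    for accession, sequence in db.items():
        # Tryptic digest: cut after K or R unless followed by P.
        for peptide in re.split(r"(?<=[KR])(?!P)", sequence.upper()):
            if peptide:
                mass = sum(MONO[aa] for aa in peptide) + WATER + PROTON
                index.append((round(mass, 4), accession, peptide))
    return sorted(index)


PMF_INDEX = build_pmf_index(SEQUENCE_DB)

if __name__ == "__main__":
    for mass, accession, peptide in PMF_INDEX:
        print(f"{mass:10.4f} ({accession}) {peptide}")
```

Keeping the list sorted means a query mass can later be located with a binary search instead of a full scan.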

Page 16: Predicting Protein Sequences From Mass Spectral Data

The Simplest Scoring Scheme: Peptide Counting

• Take a mass spectrum of a trypsin-cleaved protein (from gel or HPLC peak)

• Identify as many masses as possible in spectrum (avoid autolysis peaks)

• Compare query masses with database masses and calculate # of matches or matching score (based on length and mass difference)

• Rank proteins by number of hits and return top scoring entry – this is the protein of interest

Query vs. Database

Query Masses

 450.2201
 609.3667
 698.3100
1007.5391
1199.4916
2098.9909

Database Mass List

 450.2017 (P21234)
 609.2667 (P12345)
 664.3300 (P89212)
1007.4251 (P12345)
1114.4416 (P89212)
1183.5266 (P12345)
1300.5116 (P21234)
1407.6462 (P21234)
1526.6211 (P89212)
1593.7101 (P89212)
1740.7501 (P21234)
2098.8909 (P12345)

Results

2 unknown masses
1 hit on P21234
3 hits on P12345

Conclude the query protein is P12345
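The peptide-counting comparison can be reproduced with a few lines. The sketch uses the database mass list and query masses shown above; the matching tolerance of 0.2 Da is an assumption chosen for this example, and `count_hits` is a hypothetical helper name.

```python
from collections import Counter

# Database mass list: (M+H, accession) pairs as given above.
DATABASE = [
    (450.2017, "P21234"), (609.2667, "P12345"), (664.3300, "P89212"),
    (1007.4251, "P12345"), (1114.4416, "P89212"), (1183.5266, "P12345"),
    (1300.5116, "P21234"), (1407.6462, "P21234"), (1526.6211, "P89212"),
    (1593.7101, "P89212"), (1740.7501, "P21234"), (2098.8909, "P12345"),
]
QUERY = [450.2201, 609.3667, 698.3100, 1007.5391, 1199.4916, 2098.9909]


def count_hits(query, database, tolerance=0.2):
    """Count query masses matching each protein; return (hits, unmatched masses)."""
    hits, unknown = Counter(), []
    for mass in query:
        matched = {acc for db_mass, acc in database if abs(db_mass - mass) <= tolerance}
        if matched:
            hits.update(matched)
        else:
            unknown.append(mass)
    return hits, unknown


if __name__ == "__main__":
    hits, unknown = count_hits(QUERY, DATABASE)
    print(hits.most_common())
    print("unknown masses:", unknown)
```

Ranking by raw hit count is what makes this scheme favour large proteins, since they contribute more candidate masses.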

Page 17: Predicting Protein Sequences From Mass Spectral Data

Peptide Counting

• Works well for high quality data

• Gives higher scores to larger proteins

• PeptIdent: http://us.expasy.org/tools/peptident.html

• PepSea: http://pepsea.protana.com/PA_PepSeaForm.html

• MS-Fit: http://prospector.ucsf.edu/ucsfhtml3.2/msfit.htm

MOWSE

• MOlecular Weight SEarch

• Scoring based on peptide frequency distribution from the OWL non redundant Database

Pappin DJC, Hojrup P, and Bleasby AJ (1993) Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3:327-332

Page 18: Predicting Protein Sequences From Mass Spectral Data

MOWSE

>Protein 1
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe
Mass (M+H): 4842.05
Tryptic fragments: acedfhsak, dfqeasdfpk, ivtmeeewendadnfek, qwfe

>Protein 2
acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe
Mass (M+H): 4842.05
Tryptic fragments: acek, dfhsadfqeasdfpk, ivtmeeewenk, dadnfeqwfe

>Protein 3
MASMGTLAFD EYGRPFLIIK DQDRKSRLMG LEALKSHIMA AKAVANTMRT SLGPNGLDKM MVDKDGDVTV TNDGATILSM MDVDHQIAKL MVELSKSQDD EIGDGTTGVV VLAGALLEEA EQLLDRGIHP IRIAD
Mass (M+H): 14563.36
Tryptic fragments: SQDDEIGDGTTGVVVLAGALLEEAEQLLDR, DGDVTVTNDGATILSMMDVDHQIAK, MASMGTLAFDEYGRPFLIIK, TSLGPNGLDK, LMGLEALK, LMVELSK, AVANTMR, SHIMAAK, GIHPIR, MMVDK, DQDR

Page 19: Predicting Protein Sequences From Mass Spectral Data

>Protein 1
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfel

>Protein 2
acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfekqwfei

MOWSE

2. For each protein, place fragments into 100 Da bins.

Protein 1
Mol. Wt.     Fragment             Bin (Da)
2098.8909    IVTMEEEWENDADNFEK    2000-2100
1183.5266    DFQEASDFPK           1100-1200
1007.4251    ACEDFHSAK            1000-1100
 722.3508    QWFEL                 700-800

Protein 2
Mol. Wt.     Fragment             Bin (Da)
1740.7500    DFHSADFQEASDFPK      1700-1800
1407.6460    IVTMEEEWENK          1400-1500
1456.6127    DADNFEQWFEK          1400-1500
 722.3508    QWFEI                 700-800

MOWSE

3. Divide the number of fragments for each bin by the total number of fragments for each 10 kDa protein interval.

Bin (Da)     Fragments                    Count   Frequency
2000-2100    IVTMEEEWENDADNFEK            1       0.125
1700-1800    DFHSADFQEASDFPK              1       0.125
1400-1500    IVTMEEEWENK, DADNFEQWFEK     2       0.250
1100-1200    DFQEASDFPK                   1       0.125
1000-1100    ACEDFHSAK                    1       0.125
 700-800     QWFEL, QWFEI                 2       0.250

All other bins between 400 and 2100 Da are empty (count 0, frequency 0.000).

Page 20: Predicting Protein Sequences From Mass Spectral Data

MOWSE

4. For each 10 kDa interval, normalize to the largest bin value.

Bin (Da)     Fragments                    Count   Frequency   Normalized
2000-2100    IVTMEEEWENDADNFEK            1       0.125       0.5
1700-1800    DFHSADFQEASDFPK              1       0.125       0.5
1400-1500    IVTMEEEWENK, DADNFEQWFEK     2       0.250       1
1100-1200    DFQEASDFPK                   1       0.125       0.5
1000-1100    ACEDFHSAK                    1       0.125       0.5
 700-800     QWFEL, QWFEI                 2       0.250       1

All other bins between 400 and 2100 Da are empty (count 0, frequency 0.000, normalized 0).

MOWSE

5. Compare spectrum masses against the fragment mass list for each protein in the database. Retrieve the frequency score for each match and multiply.

Spectrum mass   Bin (Da)     Normalized frequency
1740.7500       1700-1800    0.5
1456.6127       1400-1500    1
 722.3508        700-800     1

0.5 x 1 x 1 = 0.5

Page 21: Predicting Protein Sequences From Mass Spectral Data

MOWSE

6. Invert and multiply, and normalize to an 'average' protein of 50 000 Da (50 kDa):

$\text{Score} = \dfrac{50\,000}{P_N \times H}$

PN = product of distribution frequency scores = 0.5 x 1 x 1 = 0.5

H = 'Hit' Protein MW = 5672.48

$\text{Score} = \dfrac{50\,000}{0.5 \times 5672.48} \approx 17.63$
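The numbered MOWSE steps can be traced end to end on the two toy proteins. This is a simplified sketch, not the published implementation: it assumes a single 10 kDa protein interval (both toy proteins fall in it), takes the matched spectrum masses as already identified, and uses hypothetical names such as `frequency_table` and `mowse_score`. `math.prod` needs Python 3.8 or later.

```python
from collections import Counter
from math import prod

BIN_WIDTH = 100.0  # Da, as in step 2

# Fragment masses of the two toy proteins, as listed in step 2.
DB_FRAGMENTS = {
    "Protein 1": [2098.8909, 1183.5266, 1007.4251, 722.3508],
    "Protein 2": [1740.7500, 1407.6460, 1456.6127, 722.3508],
}


def bin_of(mass):
    """Index of the 100 Da bin containing `mass` (bin 17 covers 1700 to 1800)."""
    return int(mass // BIN_WIDTH)


def frequency_table(db):
    """Steps 2 to 4: bin all fragments, convert to frequencies, normalize to the largest."""
    counts = Counter(bin_of(m) for fragments in db.values() for m in fragments)
    total = sum(counts.values())
    frequencies = {b: c / total for b, c in counts.items()}
    largest = max(frequencies.values())
    return {b: f / largest for b, f in frequencies.items()}


def mowse_score(matched_masses, protein_mw, table):
    """Steps 5 and 6: multiply the matched bin scores, then invert and scale."""
    pn = prod(table[bin_of(m)] for m in matched_masses)
    return 50_000 / (pn * protein_mw)


if __name__ == "__main__":
    table = frequency_table(DB_FRAGMENTS)
    print(round(mowse_score([1740.7500, 1456.6127, 722.3508], 5672.48, table), 2))
```

Because rare fragment masses get small frequency scores, a match on them shrinks PN and raises the final score, which is the point of weighting by the database distribution.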


Page 22: Predicting Protein Sequences From Mass Spectral Data

MOWSE

• Takes into account relative abundance of peptides in the database when calculating scores.

• Protein size is compensated for.

• The model consists of numerous bins, each 100 Da wide (the average aa mass).

• Does not provide a measure of confidence for the prediction.

• MOWSE: http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/

• MS-Fit: http://prospector.ucsf.edu/ucsfhtml3.2/msfit.htm

Page 23: Predicting Protein Sequences From Mass Spectral Data

MASCOT

• Probability-based MOWSE

• The probability that the observed match between experimental data and a protein sequence is a random event is approximately calculated for each protein in the sequence database.

• Probability model details not published.

Perkins DN, Pappin DJC, Creasy DM, and Cottrell JS (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551-3567.

Extreme Value Distribution

[Figure: histogram; horizontal axis binned from <20 to >120, vertical axis from 0 to 8000]

$P(x) = 1 - e^{-e^{-x}}$

Page 24: Predicting Protein Sequences From Mass Spectral Data

MASCOT

Mascot/Mowse Scoring

• The Mascot Score is given as S = -10*log10(P), where P is the probability that the observed match is a random event

• Try to aim for probabilities where P<0.05 (less than a 5% chance the peptide mass match is random).
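The score and the probability are two views of the same number, so the P<0.05 guideline translates directly into a score threshold. A minimal sketch, with hypothetical function names and a base-10 logarithm assumed:

```python
import math


def mascot_score(p):
    """Score for a match whose probability of being random is `p`."""
    return -10.0 * math.log10(p)


def random_match_probability(score):
    """Inverse relation: probability of a random match for a given score."""
    return 10.0 ** (-score / 10.0)


if __name__ == "__main__":
    print(round(mascot_score(0.05), 1))    # threshold score for P = 0.05
    print(random_match_probability(20.0))  # probability behind a score of 20
```

On this scale every 10 points corresponds to a tenfold drop in the chance that the match is random.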

Page 25: Predicting Protein Sequences From Mass Spectral Data

ProFound

• Uses a Bayesian probability model

• Takes into account individual properties of each protein in the database.

Bayes Theorem

• Describes the probability of some event given that some other event has already occurred (conditional probability).

$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$

• “The probability of some event A occurring given that event B has occurred is equal to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring and divided by the probability of event B occurring”.

• P(A | B) is the posterior probability, P(B | A) is the likelihood, and P(A) is the prior probability.

Page 26: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem

• Example:

• 0.1% of women aged 40 have breast cancer.

• For 40 y/o women with cancer, a mammography will show positive 95% of the time.

• For 40 y/o women without cancer, a mammography will show positive 10% of the time.

• A 40 y/o woman tests positive by mammography for breast cancer.

• What is the probability she really does have breast cancer?

Bayes Theorem

P(Disease) = 0.001    P(No disease) = 0.999

P(Positive test | Disease) = 0.95    P(Negative test | Disease) = 0.05

P(Negative test | No disease) = 0.90    P(Positive test | No disease) = 0.10

$P(\text{Disease} \mid \text{Positive test}) = \dfrac{P(\text{Positive test} \mid \text{Disease})\, P(\text{Disease})}{P(\text{Positive test})}$

P(Positive test) = P(Disease) P(Positive test | Disease) + P(No disease) P(Positive test | No disease) = 0.001 x 0.95 + 0.999 x 0.10 = 0.101

P(Disease | Positive test) = (0.95 x 0.001) / 0.101 = 0.0094 (less than 1%).
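The mammography example can be checked numerically. The sketch below is only an illustration of Bayes' theorem with the numbers from the example; `posterior` and its parameter names are hypothetical.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(Disease | Positive test) from the prior and the two test rates."""
    # Total probability of a positive test over both groups.
    p_positive = prior * p_pos_given_disease + (1.0 - prior) * p_pos_given_healthy
    return prior * p_pos_given_disease / p_positive


if __name__ == "__main__":
    print(round(posterior(0.001, 0.95, 0.10), 4))
```

The result stays below 1% because the rare prior (0.1%) is swamped by false positives from the much larger healthy group.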

Page 27: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem

The Product Rule:

$P(AB \mid C) = P(B \mid AC)\, P(A \mid C)$

Example: Draw a card from a deck of playing cards. What is the probability that it is the king of clubs?

A = King, B = Clubs, C = Deck

$P(\text{King, Clubs} \mid \text{Deck}) = P(\text{Clubs} \mid \text{King, Deck})\, P(\text{King} \mid \text{Deck}) = \tfrac{1}{4} \times \tfrac{4}{52} = \tfrac{1}{52}$

Bayes Theorem and PMF

$P(k \mid DI) = \dfrac{P(D \mid kI)\, P(k \mid I)}{P(D \mid I)}$

k   The hypothesis “protein k is the protein being analyzed”
D   The experimental data = m1 ... mn
I   Background information

[Figure: experimental spectrum (% TIC) with peaks labelled m1 to m15]

Page 28: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

$P(k \mid DI) = \dfrac{P(D \mid kI)\, P(k \mid I)}{P(D \mid I)}$

k   The hypothesis “protein k is the protein being analyzed”
D   The experimental data = m1 ... mn
I   The available background information (species, approximate mass of the parent protein, cleavage enzyme, mass accuracy, etc.)

P(k|DI)   The posterior probability that the hypothesis is true given the data D and the background information I.

P(k|I)   The prior probability of the hypothesis given the background information I.

P(D|kI)   The likelihood: the probability that the data D would be observed if the hypothesis were true.

P(D|I)   A normalization constant, independent of k.

Bayes Theorem and PMF

$P(k \mid DI) \propto P(k \mid I)\, P(D \mid kI)$

P(k|I)   The prior probability of the hypothesis given the background information I

• Zero for every hypothesis that doesn't satisfy the background information (protein molecular weight, cleavage enzyme, species, etc.)

• Otherwise 2 possibilities:

1. A uniform probability for all hypotheses that satisfy the constraints (all proteins that have the correct MW, cleaved with the correct enzyme), therefore a constant.

2. The prior probability from a previous experiment (i.e. multiple digestions).

Page 29: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

P(k|I)   The prior probability of the hypothesis given the background information I

$P(k \mid I) = \begin{cases} 0 & \text{does not satisfy constraints} \\ \text{constant} & \text{no previous data available} \\ P(k \mid D_{prev} I) & \text{if previous data } D_{prev} \text{ is available} \end{cases}$

Bayes Theorem and PMF

P(D|kI)   The likelihood: the probability that the data D would be observed if the hypothesis were true.

[Figure: two spectra (% TIC vs. m/z); peaks in the second spectrum are labelled m1 to m15]

The data fall into 2 subsets: hits (H) and misses (M).

The 'Product Rule' P(AB|C) = P(B|AC)P(A|C) can be used to factor the data into the probability for hits and the probability for misses given the hits. With r hits and w misses:

$P(D \mid kI) = P(m^H_{1 \ldots r}\, m^M_{r+1 \ldots r+w} \mid kI) = P(m^H_{1 \ldots r} \mid kI)\; P(m^M_{r+1 \ldots r+w} \mid kI\, m^H_{1 \ldots r})$

Page 30: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

Likelihood probability for hits

• Factor as products of probabilities for individual hits by applying the product rule $P(AB \mid C) = P(B \mid AC)\, P(A \mid C)$:

$P(m^H_{1 \ldots r} \mid kI) = \prod_{i=1}^{r} P(m^H_i \mid kI\, m^H_{1 \ldots i-1})$, with $m^H_0$ defined as '1'.

Example: 3 hits, r = 3

$P(m^H_{1 \ldots 3} \mid kI) = P(m^H_1 m^H_2 m^H_3 \mid kI)$

Set m1 to 'A', m2m3 to 'B', kI to 'C':

$P(m^H_1 m^H_2 m^H_3 \mid kI) = P(m^H_2 m^H_3 \mid kI\, m^H_1)\, P(m^H_1 \mid kI)$

Now set m2 to 'A', m3 to 'B', kIm1 to 'C':

$P(m^H_2 m^H_3 \mid kI\, m^H_1) = P(m^H_3 \mid kI\, m^H_1 m^H_2)\, P(m^H_2 \mid kI\, m^H_1)$

Combining the two:

$P(m^H_1 m^H_2 m^H_3 \mid kI) = P(m^H_3 \mid kI\, m^H_1 m^H_2)\, P(m^H_2 \mid kI\, m^H_1)\, P(m^H_1 \mid kI)$

$m^H_i$ is the logical product of 2 hypotheses: 1) the i-th hit ($H_i$) originates from a particular peptide in the protein k and 2) its measured mass is $m_i$. Therefore

$m^H_i = H_i\, m_i$

Bayes Theorem and PMF

$P(m^H_{1 \ldots r} \mid kI) = \prod_{i=1}^{r} P(m^H_i \mid kI\, m^H_{1 \ldots i-1})$, with $m^H_i = H_i\, m_i$

The product rule can be applied to separate $H_i$ and $m_i$:

$P(m^H_i \mid kI\, m^H_{1 \ldots i-1}) = P(H_i\, m_i \mid kI\, m^H_{1 \ldots i-1}) = P(H_i \mid kI\, m^H_{1 \ldots i-1})\; P(m_i \mid kI\, m^H_{1 \ldots i-1} H_i)$

[Figure: two spectra (% TIC vs. m/z) with peaks labelled m1 to m15; a hit Hi links the theoretical mass mi0 to the measured mass mi]

Page 31: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

$P(m^H_i \mid kI\, m^H_{1 \ldots i-1}) = P(H_i \mid kI\, m^H_{1 \ldots i-1})\; P(m_i \mid kI\, m^H_{1 \ldots i-1} H_i)$

$P(H_i \mid kI\, m^H_{1 \ldots i-1})$: the probability for the i-th measured peptide to be a hit, given protein k and i-1 previous hits.

[Figure: two spectra (% TIC vs. m/z) with peaks labelled m1 to m15; annotations: i, i-1, N, N-i-1]

Bayes Theorem and PMF

$P(m_i \mid kI\, m^H_{1 \ldots i-1} H_i)$: the probability for the measured mass value to be $m_i$ given its theoretical mass $m_{i0}$.

[Figure: two spectra (% TIC vs. m/z) with peaks labelled m1 to m15; the theoretical mass mi0 and the measured mass mi are marked]

Page 32: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

$P(m_i \mid kI\, m^H_{1 \ldots i-1} H_i)$: the probability for the measured mass value to be $m_i$ given its theoretical mass $m_{i0}$. Measured masses are normally distributed:

$P(m_i \mid kI\, m^H_{1 \ldots i-1} H_i) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{(m_i - m_{i0})^2}{2\sigma^2}\right]$

[Figure: spectra (% TIC)]

Bayes Theorem and PMF

If more than one potential theoretical match exists within the tolerance of the measured mass, the probability for the i-th hit is summed over the $g_i$ potential matches j, with theoretical masses $m_{ij0}$.

[Figure: spectra (% TIC); annotations: mij0, gi, mi, 'j' potential matches]
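The statement that measured masses are normally distributed about the theoretical mass can be sketched as a probability density. This is only an illustration of the normal error model, not ProFound's published scoring code: `mass_match_density` is a hypothetical name, and the standard deviation of 0.05 Da used in the demonstration is an arbitrary assumption.

```python
import math


def mass_match_density(measured, theoretical, sigma):
    """Normal density of a measured mass about its theoretical mass."""
    z = (measured - theoretical) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


if __name__ == "__main__":
    theoretical = 1007.4251
    for measured in (1007.4251, 1007.4751, 1007.5391):
        print(measured, mass_match_density(measured, theoretical, sigma=0.05))
```

The density falls off quickly once the mass error exceeds a couple of standard deviations, which is why accurate mass calibration sharpens the discrimination between candidate proteins.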

Page 33: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

Probability for hits for all masses

$P(m^H_{1 \ldots r} \mid kI) = \prod_{i=1}^{r} P(m^H_i \mid kI\, m^H_{1 \ldots i-1})$ (product of individual hit probabilities)

Each factor is the probability that the i-th measured peptide is a hit multiplied by the probability that the masses match, modified for multiple possible matches.

Bayes Theorem and PMF

Probability for hits for all masses, expressed in terms of:

r      the number of observed hits
N      the total number of peptides
mi     the measured mass of the ith peptide
mij0   the theoretical mass for the ith hit
σ      the measured mass standard deviation

Page 34: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

Likelihood probability for misses given the hits: $P(m^M_{r+1 \ldots r+w} \mid kI\, m^H_{1 \ldots r})$

What are misses?

• The remaining measured masses that cannot be accounted for by the protein sequence (w).

• Errors in protein sequence, unknown modification, unexpected cleavage

• “Modified Peptides”

[Figure: two spectra (% TIC vs. m/z) with peaks labelled m1 to m15]

Bayes Theorem and PMF

Likelihood probability for misses given the hits

• The total number of peptides in protein k is N

• The number of misses is w

• All misses are 'modified peptides'

• The number of modified peptides is J, which is between w and N-r (i.e. J includes unobserved modified peptides).

[Figure: two spectra (% TIC vs. m/z) with peaks labelled m1 to m15]

Page 35: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

The probability for all misses, $P(m^M_{r+1 \ldots r+w} \mid kI\, m^H_{1 \ldots r})$, can be factored into separate terms.

Bayes Theorem and PMF

One term is the probability for there being J modified peptides, given protein k and r observed hits.

Page 36: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

The probability for observing a modified peptide, given protein k, J modified peptides, and r hits plus j-1 misses already observed.

[Figure: spectrum (% TIC) with peaks labelled m1 to m15; annotations: j-1; N-r-(j-1) = # available peptides; J-(j-1) = # remaining unobserved peptides]

Bayes Theorem and PMF

The likelihood probability for the modified peptide to have a measured mass $m_{r+j}$, for j = 1...w.

[Figure: spectrum (% TIC) with the limits mmin and mmax marked]

Page 37: Predicting Protein Sequences From Mass Spectral Data

Bayes Theorem and PMF

Probability for all misses

Bayes Theorem and PMF

$P(D \mid kI) = P(m^H_{1 \ldots r}\, m^M_{r+1 \ldots r+w} \mid kI) = P(m^H_{1 \ldots r} \mid kI)\; P(m^M_{r+1 \ldots r+w} \mid kI\, m^H_{1 \ldots r})$

$P(k \mid DI) \propto P(k \mid I)\, P(D \mid kI)$

[Equation: the full expression for P(k | DI), obtained by substituting the probability for hits and the probability for all misses into the relation above]

Page 38: Predicting Protein Sequences From Mass Spectral Data
Page 39: Predicting Protein Sequences From Mass Spectral Data

ProFound (PMF)

• Bayesian approach considered to be the most coherent, consistent and efficient of the statistical methods.

• Scores reflect the confidence level of the hypothesis that protein k is the sample protein based on the given information

• Scores improve with additional information (tag information)

• Can identify simple mixtures of proteins by fusing single proteins pairwise, in groups of three and so on.
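The Bayesian ranking idea can be illustrated with a minimal sketch: multiply each candidate's prior by its likelihood and normalize. The candidate names, priors, and likelihood values below are hypothetical, and this is not ProFound's actual likelihood model.

```python
def posterior_probabilities(priors, likelihoods):
    """Return normalized P(k | D, I) for each candidate protein k."""
    unnormalized = {k: priors[k] * likelihoods[k] for k in priors}
    total = sum(unnormalized.values())
    return {k: value / total for k, value in unnormalized.items()}


# Hypothetical candidates with equal priors and made-up likelihoods P(D | k, I).
priors = {"protein_A": 1 / 3, "protein_B": 1 / 3, "protein_C": 1 / 3}
likelihoods = {"protein_A": 2e-6, "protein_B": 5e-8, "protein_C": 1e-9}

posteriors = posterior_probabilities(priors, likelihoods)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Because the posteriors are normalized over all candidates, any extra information that sharpens the likelihoods moves probability toward the best-supported protein.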

ProFound Results


Advantages of PMF

• Uses a “robust” & inexpensive form of MS (MALDI)

• Doesn’t require too much sample optimization

• Can be done by a moderately skilled operator (don’t need to be an MS expert)

• Widely supported by web servers

• Improves as databases get larger & instrumentation gets better

• Very amenable to high throughput robotics (up to 500 samples a day)

Limitations With PMF

• Requires that the protein of interest already be in a sequence database

• Not suitable for searching EST databases

• Typically not all predicted peptides are detected

– Poor solubility

– Selective ionization

– Short peptide length

– Post-translational modification

– Unexpected cleavage

– Contamination

• Spurious or missing critical mass peaks always lead to problems.


Limitations With PMF

• Not suitable for identification of proteins in complex mixtures if unseparated mixtures are proteolyzed

• Mass resolution/accuracy is critical; best to have a mass accuracy better than 20 ppm.

• Generally found to only be about 40% effective in positively identifying gel spots
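To make the 20 ppm figure concrete, here is a small sketch of how a mass error in parts per million can be computed and used as a match window. The masses are hypothetical and the function names are illustrative only.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def matches_within(observed_mass, theoretical_mass, tolerance_ppm=20.0):
    """True if the observed mass lies inside the ppm tolerance window."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= tolerance_ppm


# Hypothetical peptide masses in Da.
theoretical = 1500.0
close_peak = 1500.015   # about 10 ppm away
far_peak = 1500.060     # about 40 ppm away

print(matches_within(close_peak, theoretical))
print(matches_within(far_peak, theoretical))
```

At 1500 Da, a 20 ppm window is only 0.03 Da wide, which is why instrument accuracy has such a strong effect on the number of false peptide matches.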

MS-MS and Fragment Ion Searching

• Provides precise sequence-specific data

• More informative than PMF

• Can be used for de novo sequencing

• Can be used to identify post-translational modifications.


SEQUEST

• Compares predicted MS-MS spectra against observed daughter ion spectra to identify and rank matches

Yates JR III, Eng JK, McCormack AL, and Schieltz D (1995) Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 67:1426-1436.

SEQUEST

[Figure: SEQUEST workflow. Sequences from the protein database undergo in silico digestion, each resulting peptide undergoes in silico fragmentation, and the predicted spectrum is compared with the observed MS/MS spectrum (m/z vs. % TIC) of the peptide, here I-T-T-T-Y-E-E-P-T-K.]

Example database sequence:

MRNSYRFLASSLSVVVSLLLIPEDVCEKIIGGNEVTPHSRPYMVLLSLDRKTICAGALIAKDWVLTAAHCNLNKRSQVILGAHSITTTYEEPTKQIMLVKKEFPYPCYDPATREGDLKLL
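The two in silico steps can be sketched as follows. This is a simplified illustration, not SEQUEST's implementation: it assumes the trypsin rule from the cleavage review (cut after K or R unless followed by P), no missed cleavages, standard monoisotopic residue masses, and singly charged b and y ions only.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z lists for a peptide."""
    masses = [MONO[residue] for residue in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


peptides = tryptic_digest("AAKPLRGGK")          # short made-up sequence
b_ions, y_ions = fragment_ions("ITTTYEEPTK")    # peptide shown in the figure
print(peptides)
print(len(b_ions), len(y_ions))
```

A peptide of n residues yields n - 1 b ions and n - 1 y ions, which is where the denominators in the Ions column of SEQUEST output come from.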


SEQUEST

Rank/Sp   (M+H)+     deltCn  XCorr   Sp      Ions   Reference  Peptide
1 / 1     2313.4863  0.0000  5.7752  2729.8  30/46  YOL086C    K.ATDGGAHGVINVSVSEAAIEASTR.Y
2 / 42    2313.3834  0.5288  2.7211   401.1  14/38  YDR409W    N.LMNDNDDDDDDRLMAEITSN.H
3 / 5     2311.5780  0.5544  2.5736   693.0  16/36  YLR058C    M.TTRGM*GEEDFHRIVQYINK.A
4 / 343   2313.8718  0.5605  2.5385   261.3  12/38  YMR173W-A  L.PTRRRVLMVPATTIRMVLTT.M
5 / 127   2314.7051  0.5681  2.4942   323.4  13/40  YPL168W    T.KFSAMEINLITSLVRGYKGEG.K

• Identify database peptides that match the parent mass

• Keep the 200 most intense peaks from the MS/MS spectrum

• Compare these fragment ions against the theoretical MS/MS spectrum from the database peptide and generate a preliminary score (Sp) based on the number of matching ions (Ions).

• Perform a cross-correlation analysis (XCorr) on the top 500 preliminary scoring peptides.

• Sort candidate peptides by XCorr.
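The first steps of this procedure can be sketched as below. This is a simplified illustration under stated assumptions: the tolerances, the example peptides, and the plain matched-ion count are hypothetical stand-ins, and the actual SEQUEST Sp formula is not reproduced here.

```python
def candidates_by_parent_mass(peptide_masses, parent_mass, tolerance=1.0):
    """Select database peptides whose mass matches the parent ion mass."""
    return [peptide for peptide, mass in peptide_masses.items()
            if abs(mass - parent_mass) <= tolerance]


def top_peaks(spectrum, n=200):
    """Keep the n most intense (m/z, intensity) peaks."""
    return sorted(spectrum, key=lambda peak: peak[1], reverse=True)[:n]


def count_matching_ions(peaks, theoretical_mz, tolerance=0.5):
    """Count theoretical fragment ions that have an observed peak nearby."""
    matched = 0
    for mz in theoretical_mz:
        if any(abs(mz - observed_mz) <= tolerance for observed_mz, _ in peaks):
            matched += 1
    return matched


# Hypothetical data for illustration.
peptide_masses = {"PEPTIDEA": 2313.4, "PEPTIDEB": 2315.0}
spectrum = [(100.0, 5.0), (200.0, 50.0), (300.0, 20.0), (400.0, 1.0)]

candidates = candidates_by_parent_mass(peptide_masses, 2313.49)
kept = top_peaks(spectrum, n=2)
matches = count_matching_ions(kept, [200.3, 300.9, 150.0])
print(candidates, kept, matches)
```

Filtering by parent mass first keeps the expensive spectrum comparison restricted to a small set of plausible peptides.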

Interpreting SEQUEST Output


Sp: The preliminary score, based on the number of matching ions. The higher the better. Larger peptides have bigger Sp values: a 20-residue peptide should have an Sp > 1000, a 6-residue peptide should have an Sp > 500.


Interpreting SEQUEST Output


Rank/Sp: The first number is the current rank (1, 2, 3, 4, 5), assigned after sorting by XCorr. The second number is the preliminary ranking by Sp.

Be wary of Rank/Sp values that move up dramatically (e.g. 4/343).

Ideally, look for 1/1 for a good hit.

Interpreting SEQUEST Output


DeltCn: The delta correlation value. Tells you how different the first hit is from the subsequent hits. Values of DeltCn > 0.1 indicate a good top hit.
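The deltCn column in the example output is consistent with a normalized difference between the top XCorr and each later XCorr, (XCorr_1 - XCorr_n) / XCorr_1. That formula is an assumption here, since the slide does not state it; the sketch below checks it against the XCorr values shown in the output.

```python
def delta_cn(xcorrs):
    """Normalized drop in XCorr relative to the top-ranked hit."""
    top = xcorrs[0]
    return [(top - value) / top for value in xcorrs]


# XCorr column from the example SEQUEST output, ranks 1 to 5.
xcorrs = [5.7752, 2.7211, 2.5736, 2.5385, 2.4942]
deltas = delta_cn(xcorrs)
print([round(d, 4) for d in deltas])
```

The computed values agree with the deltCn column (0.0000, 0.5288, 0.5544, 0.5605, 0.5681) to within the rounding of the printed XCorr values.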


Interpreting SEQUEST Output


XCorr: The cross-correlation value from the search. Used to produce the final ranking. XCorr values > 2.0 are usually good hits. Increases with increasing peptide size. For a 20-residue peptide, look for XCorr > 5. For a 6-residue peptide, look for XCorr > 1.5.
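A cross-correlation score of this kind can be sketched as the correlation of the observed and theoretical spectra at zero offset minus the mean correlation over a window of nonzero offsets. The sketch assumes both spectra are already binned into equal-length intensity lists and uses a window of 75 bins on each side; the spectrum preprocessing that SEQUEST applies is omitted, so the numbers are not comparable to real XCorr values.

```python
def correlation(x, y, offset):
    """Dot product of x with y shifted by `offset` bins."""
    total = 0.0
    for i, value in enumerate(x):
        j = i + offset
        if 0 <= j < len(y):
            total += value * y[j]
    return total


def xcorr(observed, theoretical, max_offset=75):
    """Zero-offset correlation minus the mean over nonzero offsets."""
    zero = correlation(observed, theoretical, 0)
    offsets = [t for t in range(-max_offset, max_offset + 1) if t != 0]
    background = sum(correlation(observed, theoretical, t)
                     for t in offsets) / len(offsets)
    return zero - background


def binned(peak_bins, size=200):
    """Build a toy binned spectrum with unit intensity at the given bins."""
    spectrum = [0.0] * size
    for b in peak_bins:
        spectrum[b] = 1.0
    return spectrum


observed = binned([10, 50, 120])
good_candidate = binned([10, 50, 120])
poor_candidate = binned([15, 60, 130])

good_score = xcorr(observed, good_candidate)
poor_score = xcorr(observed, poor_candidate)
print(good_score > poor_score)
```

Subtracting the off-center background penalizes candidates whose peaks line up with the observed spectrum only by chance at shifted positions.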

Interpreting SEQUEST Output


Ions: How many of the (top 200 most intense) experimental ions matched up with theoretical ions. 70% or 80% coverage is good.


SEQUEST Summary

• Gives a concise overview of a batch of search results without having to look at each individual SEQUEST output file.

• Performs protein identification by noting which proteins are most prevalent in a set of SEQUEST output results.

SEQUEST Summary

[Figure: SEQUEST Summary output with these columns labeled]

• MS Spectrum Number

• Total Ion Current (> 5E+5)

• Result File Number

• Charge State

• Delta Mass (Exp. - Theory)


SEQUEST Summary

[Figure: SEQUEST Summary output with these columns labeled]

• Experimental Mass

• XCorr (> 2.0)

• DeltCn (> 0.2)

• Sp (Preliminary Score)

• Rsp (< 10)

SEQUEST Summary

[Figure: SEQUEST Summary output with these columns labeled]

• Matching Ions (> 70%)

• Accession Number

• Database Occurrences

• Peptide


SEQUEST Summary

A “Prevalent” Protein

How many times the protein appeared in the SEQUEST output files in the 1st (top scoring) position, 2nd position, ..., down to the 5th position.

Consensus Score = 10x8 + 8x1 + 6x0 + 4x0 + 2x0 + 1x0 = 88
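The consensus score arithmetic can be written as a weighted sum of how often the protein appears at each rank. The weights below are read off the slide's example; the function name is illustrative.

```python
# Rank weights as they appear in the slide's worked example.
WEIGHTS = [10, 8, 6, 4, 2, 1]


def consensus_score(position_counts, weights=WEIGHTS):
    """Weighted sum of the number of appearances at each rank."""
    return sum(w * count for w, count in zip(weights, position_counts))


# The example protein: 8 top-ranked hits and 1 second-ranked hit.
score = consensus_score([8, 1, 0, 0, 0, 0])
print(score)  # 88
```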

SEQUEST


• Popular

• Uses heuristics to score results

• Output is complicated, requires user input to assess the validity of a result.

• Confidence cannot be assessed numerically.


PeptideProphet and ProteinProphet

PeptideProphet

• Reads in SEQUEST summary HTML files.

• http://peptideprophet.sourceforge.net/


PeptideProphet

• Validates peptide assignments to MS/MS spectra from SEQUEST (and others).

• Looks at search scores and peptide properties among correct and incorrect peptides:

– Number of termini compatible with enzymatic cleavage (for unconstrained searches)

– Mass differences with respect to the precursor ion

• Uses those distributions to compute a probability that each assignment is correct

PeptideProphet

• Performed an experiment to identify SEQUEST hits and misses:

• Prepared a sample of 18 control proteins from various organisms (bovine, chicken, rabbit, E. coli, S. cerevisiae, and B. licheniformis).

– Appended the database sequences for the control proteins to a database of Drosophila proteins.

– Searched the modified database with the control protein MS-MS spectra and SEQUEST.

– All identifications from Drosophila are 'misses'.


PeptideProphet

• Performed discriminant function analysis to weight the various SEQUEST scores according to their ability to discriminate hits from misses.

$$F(XCorr, RankSP, Ions, \Delta Cn, \Delta Mass) = c_0 + c_1\,XCorr + c_2\,RankSP + c_3\,Ions + c_4\,\Delta Cn + c_5\,\Delta Mass$$

• Actually used a transformation of the XCorr score to achieve better discrimination and reduce the dependence of XCorr on peptide length:

$$F(XCorr, RankSP, Ions, \Delta Cn, \Delta Mass) = c_0 + c_1\,XCorr' + c_2\,RankSP + c_3\,Ions + c_4\,\Delta Cn + c_5\,\Delta Mass$$

$$XCorr' = \begin{cases} \dfrac{\ln(XCorr)}{\ln(N_L)}, & \text{if } L < L_C \\[2ex] \dfrac{\ln(XCorr)}{\ln(N_C)}, & \text{if } L \ge L_C \end{cases}$$

L = # aa, N_L = # expected frag. ions

L_C = XCorr independence threshold

N_C = Corresponding exp. frag. threshold

PeptideProphet

• Plotted the positive and negative hits as a function of the discriminant score.

$$P(F \mid +) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}$$

$$P(F \mid -) = \frac{(F-\gamma)^{\alpha-1}\, e^{-(F-\gamma)/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}$$


PeptideProphet

• The probability of getting a correct result, given the discriminant score, is calculated using our old friend, Bayes' Law:

$$P(+ \mid F) = \frac{P(F \mid +)\,P(+)}{P(F \mid +)\,P(+) + P(F \mid -)\,P(-)}$$

$$P(+ \mid F) = \frac{\dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct})}{\dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct}) + \dfrac{(F-\gamma)^{\alpha-1}\, e^{-(F-\gamma)/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}\,(\text{Total Incorrect})}$$

PeptideProphet

• Adding extra information to improve the score: Number of tryptic termini (NTT)

– The majority of correctly assigned peptides have 2 tryptic termini:

• A.KMCDPTYR.F

– The majority of incorrectly assigned peptides have 0 tryptic termini:

• AGMCDPTYHF

• This information can be used to improve the score


PeptideProphet

• Examine the training set data for relationship between predictions and NTT:

– Correct: NTT0 = .03, NTT1 = .28, NTT2 = .69

– Incorrect: NTT0 = .80, NTT1 = .19, NTT2 = .01

• Modify the scoring scheme, e.g. for NTT = 2:

$$P(+ \mid F) = \frac{\dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct})\,(0.69)}{\dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct})\,(0.69) + \dfrac{(F-\gamma)^{\alpha-1}\, e^{-(F-\gamma)/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}\,(\text{Total Incorrect})\,(0.01)}$$
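The Bayes calculation with and without the NTT term can be sketched as below. The Gaussian and gamma forms follow the slides; the distribution parameters and the totals of correct and incorrect assignments are hypothetical values chosen only for illustration, while the NTT fractions are the training-set values quoted above.

```python
import math


def gaussian_pdf(f, mu, sigma):
    """Density of discriminant scores among correct assignments."""
    return math.exp(-((f - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def gamma_pdf(f, alpha, beta, shift):
    """Shifted gamma density of scores among incorrect assignments."""
    x = f - shift
    if x <= 0:
        return 0.0
    return x ** (alpha - 1) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))


# NTT fractions from the training set quoted on the slide.
NTT_CORRECT = {0: 0.03, 1: 0.28, 2: 0.69}
NTT_INCORRECT = {0: 0.80, 1: 0.19, 2: 0.01}


def prob_correct(f, total_correct, total_incorrect, params, ntt=None):
    """P(+ | F), optionally adjusted by the number of tryptic termini."""
    positive = gaussian_pdf(f, params["mu"], params["sigma"]) * total_correct
    negative = gamma_pdf(f, params["alpha"], params["beta"], params["shift"]) * total_incorrect
    if ntt is not None:
        positive *= NTT_CORRECT[ntt]
        negative *= NTT_INCORRECT[ntt]
    return positive / (positive + negative)


# Hypothetical distribution parameters and totals.
params = {"mu": 2.0, "sigma": 1.5, "alpha": 3.0, "beta": 0.8, "shift": -4.0}
p_plain = prob_correct(0.5, 300, 700, params)
p_ntt2 = prob_correct(0.5, 300, 700, params, ntt=2)
p_ntt0 = prob_correct(0.5, 300, 700, params, ntt=0)
print(round(p_plain, 3), round(p_ntt2, 3), round(p_ntt0, 3))
```

For a borderline discriminant score, two tryptic termini push the probability strongly toward correct and zero tryptic termini push it strongly toward incorrect.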

PeptideProphet

• PeptideProphet uses an Expectation Maximization Algorithm to adjust the probabilities of correct and incorrect assignments from the training set to real datasets.


PeptideProphet

PeptideProphet


PeptideProphet

PeptideProphet

http://www.proteomecenter.org/course/20040113-Day2.pdf


ProteinProphet

• Reads in PeptideProphet results

• Calculates the probability that the peptides identified from PeptideProphet correspond to identified proteins from a protein database.

• http://proteinprophet.sourceforge.net/

ProteinProphet

Assuming each peptide assignment to a spectrum is considered independent evidence for its corresponding protein, the protein probability can be calculated as:

$$P = 1 - \prod_i \left(1 - \max_j\, p(+ \mid D_i^j)\right)$$
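This protein probability can be sketched directly: for each distinct peptide take the best-scoring spectrum, then combine the peptides as independent evidence. The probabilities in the example are illustrative values, not results from a real search.

```python
def protein_probability(peptide_probs):
    """Combine peptide evidence for one protein.

    peptide_probs holds, for each distinct peptide i, a list of
    probabilities p(+ | D_i^j), one per spectrum j assigned to it.
    """
    prob_all_wrong = 1.0
    for spectrum_probs in peptide_probs:
        prob_all_wrong *= 1.0 - max(spectrum_probs)
    return 1.0 - prob_all_wrong


# Three peptides; the first was identified from two spectra.
p_protein = protein_probability([[0.91, 0.40], [0.65], [0.85]])
print(round(p_protein, 6))
```

Taking the maximum over spectra means repeated identifications of the same peptide are not counted as separate evidence.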


ProteinProphet

Adjusting for observed peptide grouping:

Correct peptide assignments tend to correspond to “multihit” proteins.

Incorrect peptide assignments tend to correspond to proteins with no other hits.

MRNSYRFLASSLSVVVSLLLIPEDVCEKIIGGNEVTPHSRPYMVLLSLDRKTICAGALIKDWVLTAAHCNLNRSQVILGAHSITTTYEEPTKQIMLVKKEFPYPCYDPATREGDLKLL

MASMGTLAFDEYGRPFLIIKDQDRKSRLMGLEALKSHIMAAKAVANTMRTSLGPNGLDKMMVDKDGDVTVTNDGATILSMMDVDHQIAKLMVELSKSQDDEIGDGTTGVVVLAGALLEE

$$NSP_i = \sum_{\{m \mid m \neq i\}} p(+ \mid D_m)$$

IIGGNEVTPHSR = .91

TICAGALIK = .65

ITTTYEEPTK = .85

NSP(EGDLK) = .91 + .65 + .85 = 2.41
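The NSP calculation for the example protein can be reproduced with a few lines. The three sibling probabilities come from the slide; the probability given to EGDLK itself is a hypothetical value, included only to show that a peptide's own probability is left out of its NSP.

```python
def nsp_values(peptide_probs):
    """NSP_i: sum of the probabilities of all other peptides of the protein."""
    return {
        peptide: sum(p for other, p in peptide_probs.items() if other != peptide)
        for peptide in peptide_probs
    }


peptide_probs = {
    "IIGGNEVTPHSR": 0.91,
    "TICAGALIK": 0.65,
    "ITTTYEEPTK": 0.85,
    "EGDLK": 0.30,  # hypothetical; not given on the slide
}
nsp = nsp_values(peptide_probs)
print(round(nsp["EGDLK"], 2))  # 2.41
```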

ProteinProphet

Adjusting for observed peptide grouping:

$$NSP_i = \sum_{\{m \mid m \neq i\}} p(+ \mid D_m)$$

$$p(+ \mid D, NSP) = \frac{p(+ \mid D)\, p(NSP \mid +)}{p(+ \mid D)\, p(NSP \mid +) + p(- \mid D)\, p(NSP \mid -)}$$

D: number of tryptic termini, database search scores, number of missed cleavages, etc.

p(+ | D): the peptide probability scores from PeptideProphet


ProteinProphet

$$p(+ \mid D, NSP) = \frac{p(+ \mid D)\, p(NSP \mid +)}{p(+ \mid D)\, p(NSP \mid +) + p(- \mid D)\, p(NSP \mid -)}$$

p(+ | D,NSP): The probability that the peptide assignment is correct, given the data and the number of sibling peptides

p(NSP | +): The probability of having a particular NSP value, according to the distribution of correct peptide assignments

p(NSP | -): The probability of having a particular NSP value, according to the distribution of incorrect peptide assignments.

ProteinProphet

$$p(+ \mid D, NSP) = \frac{p(+ \mid D)\, p(NSP \mid +)}{p(+ \mid D)\, p(NSP \mid +) + p(- \mid D)\, p(NSP \mid -)}$$

To calculate the various NSP-related distributions, the NSP values are made discrete by placing them into bins (0-0.5, 0.5-1, 1-1.5, 1.5-2, 2-2.5). The probability that a correctly assigned peptide has an NSP value in bin k is computed by summing over the peptide probabilities in bin k:

$$p(NSP_k \mid +) = \frac{1}{N\,P(+)} \sum_{\{i \mid NSP_i \in \text{bin } k\}} p(+ \mid D_i, NSP_i)$$

N: the total number of peptide assignments in the data set

P(+): the prior probability of a correct peptide assignment, computed by summing over all peptides i:

$$P(+) = \frac{1}{N} \sum_i p(+ \mid D_i, NSP_i)$$

The NSP distribution for incorrect assignments is computed analogously:

$$p(NSP_k \mid -) = \frac{1}{N\,P(-)} \sum_{\{i \mid NSP_i \in \text{bin } k\}} p(- \mid D_i, NSP_i)$$

$$P(-) = \frac{1}{N} \sum_i p(- \mid D_i, NSP_i)$$

ProteinProphet

• NSP distributions will change from sample to sample due to data set size, protein sequence database, proteins in the sample set, data quality etc.

• The EM algorithm is used to find p(NSP | +) and p(NSP | -)
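One pass of the binned NSP estimate and the resulting probability adjustment can be sketched as below. This is a single update under stated assumptions (five bins of width 0.5, with larger NSP values placed in the last bin, and made-up assignments); the full method repeats the update until the distributions stop changing.

```python
def nsp_bin(nsp, width=0.5, n_bins=5):
    """Index of the NSP bin; values past the last edge go in the last bin."""
    return min(int(nsp / width), n_bins - 1)


def nsp_distributions(assignments, n_bins=5):
    """Estimate p(bin | +) and p(bin | -) from (probability, NSP) pairs."""
    positive = [0.0] * n_bins
    negative = [0.0] * n_bins
    for p, nsp in assignments:
        k = nsp_bin(nsp, n_bins=n_bins)
        positive[k] += p
        negative[k] += 1.0 - p
    positive_total, negative_total = sum(positive), sum(negative)
    return ([v / positive_total for v in positive],
            [v / negative_total for v in negative])


def adjust(p, nsp, positive_dist, negative_dist):
    """NSP-adjusted probability p(+ | D, NSP) via Bayes' law."""
    k = nsp_bin(nsp, n_bins=len(positive_dist))
    numerator = p * positive_dist[k]
    denominator = numerator + (1.0 - p) * negative_dist[k]
    return numerator / denominator if denominator > 0 else p


# Hypothetical (peptide probability, NSP) pairs.
assignments = [(0.9, 2.4), (0.8, 2.1), (0.95, 1.7),
               (0.2, 0.0), (0.1, 0.3), (0.5, 0.1), (0.05, 0.0)]
positive_dist, negative_dist = nsp_distributions(assignments)

raised = adjust(0.5, 2.4, positive_dist, negative_dist)
lowered = adjust(0.5, 0.1, positive_dist, negative_dist)
print(round(raised, 3), round(lowered, 3))
```

A peptide with many high-probability siblings has its probability raised, while an isolated peptide with the same original score has it lowered.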


ProteinProphet

[Figure: distributions of incorrect NSP values and correct NSP values]

ProteinProphet

• Degenerate Peptides

MRNSYRFLASSLSVVVSLLLIPEDVCEKIIGGNEVTPHSRPYMVLLSLDRKTICAGALIKDWVLTAAHCNLNRSQVILGAHSITTTYEEPTKQIMLVKKEFPYPCYDPATREGDLKLLEE

MASMGTLAFDEYGRPFLIIKDQDRKSRLMGLEALKSHIMAAKPYMVLLSLDRKAVANTMRTSLGPNGLDKMMVDKDGDVTVTNDGATILSMMDVDHQIAKLMVELSKSQDDEIGDGTTGV

– Some peptides assigned from MS/MS spectra can be found in more than one protein, thus they are 'degenerate'.

– How does one figure out which is the true corresponding protein?


ProteinProphet

Weight the peptides according to the probability of that protein being in the sample.

If peptide i corresponds to N_s different proteins, the relative weight w_i^n that this peptide actually corresponds to protein n (n = 1...N_s) is determined according to the probability of protein n relative to those of all N_s proteins:

$$w_i^n = \frac{P_n}{\sum_{n'=1}^{N_s} P_{n'}}$$

ProteinProphet

$$w_i^n = \frac{P_n}{\sum_{n'=1}^{N_s} P_{n'}}$$

The protein probability function is then modified to account for degeneracy:

$$P_n = 1 - \prod_i \left(1 - w_i^n \max_j\, p(+ \mid D_i^j, NSP_i^n)\right)$$
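The degeneracy adjustment can be sketched in two steps: split a shared peptide among its candidate proteins in proportion to their probabilities, then use the weights when combining evidence. The protein names and probabilities are hypothetical, and in practice the weights and protein probabilities depend on each other and would be recomputed iteratively.

```python
def peptide_weights(protein_probs):
    """Share of a degenerate peptide apportioned to each candidate protein."""
    total = sum(protein_probs.values())
    return {name: p / total for name, p in protein_probs.items()}


def weighted_protein_probability(peptides):
    """Protein probability from (weight, spectrum probabilities) pairs."""
    prob_all_wrong = 1.0
    for weight, spectrum_probs in peptides:
        prob_all_wrong *= 1.0 - weight * max(spectrum_probs)
    return 1.0 - prob_all_wrong


# A peptide shared by two proteins with current probabilities 0.9 and 0.1.
weights = peptide_weights({"protein_1": 0.9, "protein_2": 0.1})

# protein_2 is supported only by the shared peptide (probability 0.8).
p_protein_2 = weighted_protein_probability([(weights["protein_2"], [0.8])])
print(round(weights["protein_1"], 2), round(p_protein_2, 2))
```

The shared peptide contributes little to the protein that has no other support, so that protein's probability stays low.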


ProteinProphet

[Figure: ProteinProphet output with these fields labeled]

• Protein Probability

• NSP-adjusted peptide prob

• Original Probability

• # tryptic termini

• NSPs

• # peptides in NSP Bin

• Shared peptide weight

• Protein Coverage