I This is a Part 7 - Statistics (dept.stat.lsa.umich.edu/~jizhu/pubs/tmp/QinEtAl-Chapman08.pdf)




Contents

I This is a Part

1 Statistical methods for classifying mass spectrometry database search results
Zhaohui S. Qin, Peter J. Ulintz, Ji Zhu, Philip Andrews
Department of Biostatistics, Bioinformatics Program, Department of Statistics, National Resource for Proteomics and Pathways, University of Michigan, Ann Arbor, MI 48109

1.1 Introduction
1.2 Background on proteomics
1.3 Classification methods
1.3.1 Mixture Model Approach in Peptide Prophet
1.3.2 Machine learning techniques
1.4 Data and implementation
1.4.1 Reference Datasets
1.4.2 Dataset Parameters
1.4.3 Implementation
1.5 Results and Discussion
1.5.1 Classification Performance on the ESI-SEQUEST dataset
1.5.2 The impact of individual attributes on the final prediction accuracy
1.6 Conclusions
1.7 Acknowledgement

References

0-8493-0052-5/00/$0.00+$.50 © 2001 by CRC Press LLC


List of Tables

1.1 Number of spectra examples for the ESI-SEQUEST dataset.
1.2 SEQUEST attribute descriptions. Attribute names in bold are treated as discrete categorical variables.


List of Figures

1.1 Schematic diagram of a typical proteomics experiment. Proteins are extracted from a biological sample, digested, and fractionated (separated) prior to introduction into the mass spectrometer. Spectra generated by the mass spectrometer are interpreted using algorithms that compare them to amino acid sequences from a database, returning a list of database peptides which best match each spectrum. Peptide predictions are validated and combined to produce a list of proteins predicted to be present in the biological sample.

1.2 Example of a Sequest search result (*.out file): a Sequest report resulting from the search of an MS/MS peaklist against a sequence database. In addition to search parameter information and some background statistics of the search (the top portion of the report), a ranked list of matching peptides is provided. Each row in this list is a potential identification of a peptide that explains the mass spectrum, the top match being the peptide that the algorithm believes is the best match. Various scores or attributes are provided for each potential "hit": Sp, XCorr, deltCn, etc. These attributes, or additional attributes calculated based on the information provided for each hit, are the primary data used in this work.


1.3 Performance of boosting and random forest methods on the ESI-SEQUEST dataset. A. ROC plot of classification of the test set by PeptideProphet, SVM, boosting, and random forest methods using attribute groups I and II. The plot on the right is a blowup of the upper-left region of the figure on the left. Also displayed are points corresponding to several sets of SEQUEST scoring statistics used as linear threshold values in published studies. The following criteria were applied for choosing correct hits, the +1, +2, +3 numbers indicating peptide charge: a) +1: XCorr ≥ 1.5, NTT = 2; +2, +3: XCorr ≥ 2.0, NTT = 2; b) ∆Cn > 0.1, +1: XCorr ≥ 1.9, NTT = 2; +2: XCorr ≥ 3 OR 2.2 ≤ XCorr ≤ 3.0, NTT ≥ 1; +3: XCorr ≥ 3.75, NTT ≥ 1; c) ∆Cn ≥ 0.08, +1: XCorr ≥ 1.8; +2: XCorr ≥ 2.5; +3: XCorr ≥ 3.5; d) ∆Cn ≥ 0.1, +1: XCorr ≥ 1.9, NTT = 2; +2: XCorr ≥ 2.2, NTT ≥ 1; +3: XCorr ≥ 3.75, NTT ≥ 1; e) ∆Cn ≥ 0.1, Sp Rank ≤ 50, NTT ≥ 1, +1: not included; +2: XCorr ≥ 2.0; +3: XCorr ≥ 2.5. It can be seen that all machine learning approaches provide a significant improvement over linear scoring thresholds. B. Results of the random forest method using various sets of attributes. The black line represents the result of the random forest using six attributes defined in Table 1.2 as groups I and II: the SEQUEST XCorr, Sp Rank, ∆Cn, Delta Parent Mass, Length, and Number of Tryptic Termini (NTT). The red line is the result using fourteen attributes, groups I, III, and IV (no NTT). The blue line represents the result using all attribute groups I-IV, all fifteen variables. C. ROC plot of the boosting method using attribute groups I and II (black); I, III, and IV (red); and I-IV (green).

1.4 Relative importance of data attributes used for classification by boosting and random forest methods. A. SEQUEST attribute importance from boosting and random forest classification using attribute groups I and II. B. SEQUEST attribute importance using attribute groups I, III, and IV. C. SEQUEST attribute importance using all attributes in random forest and boosting methods.


Part I

This is a Part


Chapter 1

Statistical methods for classifying mass spectrometry database search results

Zhaohui S. Qin, Peter J. Ulintz, Ji Zhu, Philip Andrews

Department of Biostatistics, Bioinformatics Program, Department of Statistics, National Resource for Proteomics and Pathways, University of Michigan, Ann Arbor, MI 48109

1.1 Introduction
1.2 Background on proteomics
1.3 Classification methods
1.4 Data and implementation
1.5 Results and Discussion
1.6 Conclusions
1.7 Acknowledgement

1.1 Introduction

Proteomics, the large-scale analysis of proteins, holds great promise to enhance our understanding of cellular biological processes in normal and diseased tissue (Pennington and Dunn 2005). The proteome is defined as the complete set of proteins expressed by a cell, tissue, or organism. Transcript-level analysis, as typically measured by DNA microarray technologies (Schena et al. 1995, Lockhart et al. 1995), does not provide complete information on the proteome in that the DNA transcriptional template of an organism is static, whereas the proteome is constantly changing in response to environmental signals and stress. Recent studies have been unable to establish a consistent relationship between the transcriptome and proteome (Anderson and Seilhamer 1997, Pandey and Mann 2000). The main reason is that transcript profiles do not provide any information on posttranslational modifications of proteins (such as phosphorylation and glycosylation) that are crucial for protein transport, localization, and function. Studies have shown that the correlation between mRNA levels and protein abundance is poor for low-expression proteins (Gygi et al. 1999). Posttranslational modifications such as glycosylation, phosphorylation, ubiquitination, and proteolysis produce further changes, resulting in altered protein abundance.

The ultimate goal of proteomics is to identify and characterize all proteins



expressed in cells collected from a variety of conditions to facilitate comparisons (Pandey and Mann 2000, Aebersold and Goodlett 2001). Although transcriptome data such as those from microarray studies reflect the genome's objectives for protein synthesis, they do not provide information about the final results of those objectives. Proteomics analysis provides a more accurate view of biological processes, thus offering a better understanding of the physiologic and pathologic states of an organism, and serving as an important step in the development and validation of diagnostics and therapeutics.

In the fast-advancing proteomics field, increasingly high-throughput assays for the identification and characterization of proteins are being developed. Mass spectrometry is the primary experimental method for protein identification; tandem mass spectrometry (MS/MS) in particular is now the de facto standard identification technology, providing the ability to rapidly characterize thousands of peptides in a complex mixture. In such assays, proteins are first digested into smaller peptides and subjected to reverse-phase chromatography. Peptides are then ionized and fragmented to produce signature MS/MS spectra that are used for identification. More details about these techniques can be found in Link et al. 1999 and Washburn et al. 2001.

Instrument development continues to improve the sensitivity, accuracy, and throughput of analysis. Current instruments are capable of routinely generating several thousand spectra per day, detecting sub-femtomolar levels of protein at better than 10 ppm mass accuracy. Such an increase in instrument performance is of limited value, however, without effective tools for automated data analysis. In fact, the primary bottleneck in high-throughput proteomics production "pipelines" is in many cases the quality analysis and interpretation of the results to generate confident protein assignments. This bottleneck arises primarily because it is often difficult to distinguish true hits from false positives in the results generated by automated mass spectrometry database search algorithms. In most cases, peptide identifications are made by searching MS/MS spectra against a sequence database to find the best-matching database peptide (Johnson et al. 2005). All MS database search approaches produce scores describing how well a peptide sequence matches experimental data, yet classifying hits as "correct" or "incorrect" based on a simple score threshold frequently produces unacceptable false positive/false negative rates. Consequently, manual validation is often required to be truly confident in the assignment of a database protein to a spectrum. Design and implementation of an efficient algorithm to automate the peptide identification process is of great interest and presents a grand challenge for the statistician and bioinformatician in the post-genomics era.

An array of strategies for automated and accurate spectral identification, ranging from heuristics to probabilistic models, has been proposed (Eng et al. 1994, Perkins et al. 1999, Clauser et al. 1999, Bafna and Edwards 2001, Havilio et al. 2003, Craig and Beavis 2004, Geer et al. 2004). Discrimination of correct and incorrect hits is thus an ongoing effort in the proteomics community (Moore et al. 2002, MacCoss et al. 2002, Zhang et al. 2002, Keller et al.


FIGURE 1.1: Schematic diagram of a typical proteomics experiment. Proteins are extracted from a biological sample, digested, and fractionated (separated) prior to introduction into the mass spectrometer. Spectra generated by the mass spectrometer are interpreted using algorithms that compare them to amino acid sequences from a database, returning a list of database peptides which best match each spectrum. Peptide predictions are validated and combined to produce a list of proteins predicted to be present in the biological sample.

2002a, Sadygov and Yates 2003, Fenyo and Beavis 2003, Anderson et al. 2003, Eriksson and Fenyo 2004, Sun et al. 2004, Ulintz et al. 2006), with the ultimate goal being completely automated MS data interpretation. In this chapter, we provide a concise and up-to-date review of automated techniques for classification of database search results, with more detail on a probabilistic approach and a recent machine learning-based approach. The remainder of this chapter is organized as follows. In Section 2, we provide a concise overview of current proteomics technologies. In Section 3, we summarize the methods proposed for classifying database search results. In Section 4, we describe a dataset and the detailed procedure for analyzing it. In Sections 5 and 6, we present results and close with concluding remarks and discussion.

1.2 Background on proteomics

A typical proteomics experiment involves analyzing a biological sample to identify the proteins present. Generating a list of protein identifications follows a somewhat standard set of steps, briefly outlined in Figure 1.1 (see Steen and Mann 2004 for a general review).


An experiment begins by extracting proteins from a biological sample. This complex mixture of proteins is reduced to disrupt the tertiary structure of the individual proteins, the cysteines are blocked to prevent their refolding, and the proteins are digested with a proteolytic enzyme, typically trypsin, into shorter peptide sequences to render the mixture more amenable to mass spectrometry. This complex mixture of peptides must be fractionated into simpler mixtures for introduction into the mass spectrometer. There are a variety of methods for resolving complex peptide mixtures, all exploiting different physical properties of individual peptides, e.g. size, charge, hydrophobicity, or isoelectric point (pI). Two-dimensional separation of peptide mixtures via liquid chromatography (2D-LC) is now standard. In such an experiment, the multiple simplified fractionated mixtures are analyzed individually in the mass spectrometer and the results combined to produce a final dataset of evidence for the presence of all proteins in the mixture.

Once a peptide mixture from 2D-LC is introduced, the mass spectrometer measures the masses of the individual peptides. A measurement of the intact mass of an individual peptide is rarely sufficient to identify it unambiguously; therefore, tandem mass spectrometry (MS/MS) is typically employed. Tandem mass spectrometry involves two sequential measurements. In the first stage, the masses of all peptides in the mixture are measured as described above. From this measurement, very narrow mass ranges corresponding to these individual "precursor" masses may be selected for further analysis. These selections, ideally consisting of a homogeneous mixture of identical peptides, undergo further fragmentation. The fragmentation is induced by adding energy to the isolated peptides, typically by collision with an inert gas in a process known as collision-induced dissociation (CID). The additional energy causes these peptides to fragment in a predictable manner. The masses of these fragments are detected, producing the tandem mass spectrum. This spectrum may be interpreted to infer the amino acid sequence of the peptide from which it was produced. A raw tandem mass spectrum may be interpreted manually, but this process is time-consuming; a typical mass spectrum requires 15-45 minutes to interpret by a trained expert. Consequently, the aforementioned algorithms such as Sequest (Eng et al. 1994) and Mascot (Perkins et al. 1999) are typically used to interpret the spectra automatically. These algorithms make use of a sequence database, a list of protein amino acid sequences, although it must be noted that a number of algorithms exist which attempt to infer the peptide sequence directly from spectra "de novo", without the use of a sequence database (see Johnson et al. 2005 for a general review of both standard "uninterpreted" and "de novo" methods). The primary goal of the database search algorithms is to return the peptide sequence in the database which best explains the experimental spectrum. A significant amount of attention has been paid over the past fourteen years to algorithms and heuristics for ranking and scoring sequence/spectrum matches, but most follow the same underlying principle: for each peptide in the database, the algorithm generates a theoretical tandem mass spectrum and matches this theoretical spectrum to the experimental spectrum, producing a score for the match. The result of performing this matching process is a ranked list of amino acid sequences which best explain the experimental spectrum (see Figure 1.2). Each experimental spectrum generated by the instrument will have this associated list of search results. If successful, the top-scoring result or "hit" will be the correct one. However, the top-scoring hit is frequently incorrect. Therefore, the fundamental problem becomes one of identifying which hits are correct given the data returned by the search algorithm for each experimental mass spectrum.

1.3 Classification methods

The most straightforward approach to automated analysis of mass spectrometry database search results is to define specific score-based filtering thresholds as discriminators of correctness, e.g. accepting Sequest scores of doubly-charged fully tryptic peptides with XCorr greater than 2.2 and delta Cn values of at least 0.1 (Washburn et al. 2001); such thresholds are typically published as the criteria by which correctness is defined. Other efforts have focused on establishing statistical methods for inferring the likelihood that a given hit is a random event. A well-known example is the significance threshold calculated by the Mascot search algorithm, which by default displays a threshold indicating assignments with a greater than 5% predicted probability of being false positives, based on the size of the database. Use of a reverse database search to provide a measure of the false positive rate is another frequently used method (Moore et al. 2002, Peng et al. 2003). More formally, Sadygov and Yates model the frequency of fragment ion matches from a peptide sequence database matching a spectrum as a hypergeometric distribution (Sadygov and Yates 2003), a model also incorporated into the openly available X!Tandem algorithm (Craig and Beavis 2004, Fenyo and Beavis 2003); Geer et al. model this distribution as a Poisson distribution (Geer et al. 2004).

Several of these approaches have been implemented directly in the scoring calculation of new search algorithms (Craig and Beavis 2004, Geer et al. 2004, Eriksson and Fenyo 2004). Alternatively, external algorithms may be developed that process the output of the more standard search platforms such as Mascot or Sequest, classifying results as either correct or incorrect with an associated probability. Examples of the latter type include PeptideProphet (Keller et al. 2002a) and QScore (Moore et al. 2002). These tools have the advantage of being able to accommodate results from existing, well-established search engines that may already be in place in a production lab; conversely, approaches in which the quality measures are built


FIGURE 1.2: Example of a Sequest search result (*.out file): a Sequest report resulting from the search of an MS/MS peaklist against a sequence database. In addition to search parameter information and some background statistics of the search (the top portion of the report), a ranked list of matching peptides is provided. Each row in this list is a potential identification of a peptide that explains the mass spectrum, the top match being the peptide that the algorithm believes is the best match. Various scores or attributes are provided for each potential "hit": Sp, XCorr, deltCn, etc. These attributes, or additional attributes calculated based on the information provided for each hit, are the primary data used in this work.


into the search algorithm's scoring are arguably more user-friendly in that they eliminate the extra post-search processing step of having to run a second algorithm.

Keller et al. were among the first to implement a generic tool for classifying the results of common search algorithms as either correct or incorrect (Keller et al. 2002a). Their PeptideProphet tool is arguably the most widely used openly available package implementing a probabilistic approach to assess the validity of peptide assignments generated by MS database search algorithms. Their approach contains elements of both supervised and unsupervised learning, and achieves a much higher sensitivity than conventional methods based on simple scoring thresholds. One concern with PeptideProphet, however, is the degree to which the supervised component of the model can be generalized to new types of data, and the ease with which new, potentially useful information can be added to the algorithm.

Ulintz et al. attempt to address these difficulties by applying a set of standard "over the counter" methods to the challenging peptide identification problem (Ulintz et al. 2006). Anderson et al. demonstrated that support vector machines could perform well on ion-trap spectra searched using the Sequest algorithm (Anderson et al. 2003). Our approaches, based on the latest machine learning techniques, extend this idea, providing further support for the flexibility and generality of using machine learning tools with these data. We focus on establishing the effectiveness of two statistical pattern classification approaches, boosting and random forests, at improving peptide identification, even in cases in which individual pieces of information obtained from a particular search result are not completely independent or strongly discriminatory (but are easily obtained). Such work will hopefully result in software tools that are easily installed in a production laboratory setting and allow convenient filtering of false identifications with acceptably high accuracy, either as new tools or as a complement to existing software. The problem of classifying mass spectrometry-based peptide identifications, a binary classification problem, seems well suited to these algorithms and could lead to more readily usable software for automated analysis of the results of mass spectrometry experiments.

1.3.1 Mixture Model Approach in Peptide Prophet

Among the methods that have been proposed in the literature for the peptide identification problem, the mixture model approach implemented in the PeptideProphet algorithm (Keller et al. 2002a) is perhaps the best known. In this method, a discriminant score function F(x1, x2, ..., xS) = c0 + c1x1 + ... + cSxS is defined to combine database search scores x1, x2, ..., xS, where the ci's are weights. Based on a training dataset, a Gaussian distribution is chosen to model the discriminant scores corresponding to correct peptide assignments, and a Gamma distribution is selected to model the asymmetric discriminant scores corresponding to incorrect peptide assignments. All the scores are


therefore represented by a mixture model

p(x) = r f1(x) + (1 − r) f2(x),

where f1(x) and f2(x) represent the density functions of the two types of discriminant scores, and r is the proportion of correct peptide identifications. For each new test dataset, the EM algorithm (Dempster et al. 1977) is used to estimate the probability that each identified peptide is correct. A decision can be made by comparing this probability to a pre-specified threshold. When compared to conventional means of filtering data based on Sequest scores and other criteria, the mixture model approach achieves much higher sensitivity.

A crucial part of the above approach is the choice of the discriminant score function F. In Keller et al. 2002a, the ci's are derived to maximize the between- versus within-class variation under a multivariate normal assumption, using training data with peptide assignments of known validity. For this method to work, one has to assume that the training data and the test data are generated from the same source. In other words, when a new set of discriminant scores requiring classification is generated, one has to retrain the weight parameters ci using a new corresponding training set; the discriminant function F is data dependent. In an area such as proteomics, in which there is considerable heterogeneity in instrumentation, protocol, database, and database searching software, it is fairly common to come across data which display significant differences. It is unclear to what degree the results of a classification algorithm are sensitive to these differences, hence it is desirable to automate the discriminant function training step. Another potential issue is the normal and Gamma distributions used to model the two types of discriminant scores. There is no theoretical explanation for why the discriminant scores should follow these two distributions; in fact, a Gamma distribution rather than a normal distribution may be appropriate for both positive and negative scores when using the Mascot algorithm (Perkins et al. 1999). It is likely that for a new set of data generated by different mass spectrometers and/or different search algorithms, the two distributions may not fit the discriminant scores well. If they are applied generically, significantly higher classification errors may be produced.
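To make the E- and M-steps of the mixture model concrete, the sketch below fits a two-component mixture of discriminant scores by EM. It is an illustrative simplification, not the PeptideProphet implementation: both components are modeled as Gaussians rather than the Gaussian/Gamma pair described above, and the quantile-based initialization is our own choice.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(scores, iters=100):
    # Fit p(x) = r*f1(x) + (1 - r)*f2(x), with f1 modeling scores of
    # correct assignments and f2 scores of incorrect assignments.
    scores = sorted(scores)
    n = len(scores)
    r = 0.5
    mu2, mu1 = scores[n // 4], scores[3 * n // 4]        # crude quantile starts
    s1 = s2 = (scores[-1] - scores[0]) / 4 or 1.0
    for _ in range(iters):
        # E-step: posterior probability that each score came from f1.
        post = [r * normal_pdf(x, mu1, s1) /
                (r * normal_pdf(x, mu1, s1) + (1 - r) * normal_pdf(x, mu2, s2))
                for x in scores]
        # M-step: re-estimate mixing proportion, means, and spreads.
        w1 = sum(post)
        w2 = n - w1
        r = w1 / n
        mu1 = sum(p * x for p, x in zip(post, scores)) / w1
        mu2 = sum((1 - p) * x for p, x in zip(post, scores)) / w2
        s1 = math.sqrt(sum(p * (x - mu1) ** 2 for p, x in zip(post, scores)) / w1) or 1e-6
        s2 = math.sqrt(sum((1 - p) * (x - mu2) ** 2 for p, x in zip(post, scores)) / w2) or 1e-6
    return r, (mu1, s1), (mu2, s2)

def prob_correct(x, r, g1, g2):
    # Posterior probability that a discriminant score x is a correct assignment;
    # compare this to a pre-specified threshold to make the accept/reject call.
    num = r * normal_pdf(x, *g1)
    return num / (num + (1 - r) * normal_pdf(x, *g2))
```

On a set of scores with a low cluster of incorrect hits and a high cluster of correct ones, the fitted posterior prob_correct separates the two groups; the decision rule is simply prob_correct(x, ...) > threshold.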

1.3.2 Machine learning techniques

Distinguishing correct from incorrect peptide assignments can be regarded as a classification problem, or supervised learning, a major topic in the statistical learning field. Many powerful methods have been developed, such as CART, SVM, random forests, boosting, and bagging (Hastie et al. 2001). Each of these approaches has unique features that enable it to perform well in certain scenarios; SVMs, for example, are an ideal tool for small sample size, large feature space situations. At the same time, all of these approaches are quite flexible and have been applied to an array of biomedical problems,


from classifying tissue types using microarray data (Brown et al. 2000) to predicting the functions of single nucleotide polymorphisms (Bao and Cui, 2005). In this chapter, we apply state-of-the-art machine learning approaches to the peptide assignment problem.

Boosting

The boosting idea, first introduced by Freund and Schapire with their AdaBoost algorithm (Freund and Schapire 1995), is one of the most powerful learning techniques introduced during the past decade. It is a procedure that combines many "weak" classifiers to achieve a final powerful classifier. Here we give a concise description of boosting in the two-class classification setting. Suppose we have a set of training samples, where xi is a vector of input variables (in this case, various scores and features of an individual MS database search result produced by an algorithm such as Sequest; see Figure 1.2) and yi is the output variable coded as -1 or 1, indicating whether the sample is an incorrect or correct assignment of a database peptide to a spectrum. Assume we have an algorithm that can build a classifier T(x) from weighted training samples so that, when given a new input x, T(x) produces a prediction taking one of the two values -1 or 1. Boosting then proceeds as follows: start with equally weighted training samples and build a classifier T1(x); if a training sample is misclassified, i.e., an incorrect peptide is assigned to the spectrum, the weight of that sample is increased (boosted). A second classifier T2(x) is then built from the training samples using the new, no longer equal, weights. Again, misclassified samples have their weights boosted, and the procedure is repeated M times. Typically, one may build hundreds or thousands of classifiers this way. A final score is then assigned to any input x, defined as a linear (weighted) combination of the classifiers, F(x) = w1 T1(x) + w2 T2(x) + ... + wM TM(x), where each wm indicates the relative importance of the corresponding classifier. A high score indicates that the sample is most likely a correct assignment, while a low score indicates that it is most likely an incorrect hit. By choosing a particular value of the score as a threshold, one can select a desired specificity or a desired ratio of correct to incorrect assignments. In this work, we use decision trees with 40 leaves for the "weak" classifier, and fix M equal to 1000.
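The loop described above can be made concrete with a small sketch. The following is a hypothetical, minimal AdaBoost over one-dimensional decision stumps on toy data, not the implementation used in this work (which used 40-leaf trees and M = 1000):

```python
import math

def stump(threshold, sign):
    """Weak classifier: predicts sign if x > threshold, else -sign."""
    return lambda x: sign if x > threshold else -sign

def adaboost(xs, ys, candidates, M=10):
    """Combine M weak learners, reweighting misclassified samples each round."""
    n = len(xs)
    w = [1.0 / n] * n                      # start with equal sample weights
    ensemble = []
    for _ in range(M):
        # pick the candidate stump with the lowest weighted error
        T = min(candidates,
                key=lambda t: sum(wi for wi, x, y in zip(w, xs, ys) if t(x) != y))
        err = sum(wi for wi, x, y in zip(w, xs, ys) if T(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # weight w_m of this classifier
        ensemble.append((alpha, T))
        # boost the weights of misclassified samples, then renormalize
        w = [wi * math.exp(-alpha * y * T(x)) for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def score(ensemble, x):
    """Weighted vote F(x): the sign gives the class, the magnitude a confidence."""
    return sum(alpha * T(x) for alpha, T in ensemble)

# toy data: scores above 2.0 correspond to "correct" (+1) assignments
xs = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
ys = [-1, -1, -1, 1, 1, 1]
cands = [stump(t, s) for t in (1.0, 2.0, 3.0) for s in (1, -1)]
model = adaboost(xs, ys, cands)
print([1 if score(model, x) > 0 else -1 for x in xs])
```

Thresholding `score(model, x)` at values other than zero is what trades sensitivity against specificity, as described above.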

Random Forests

Similar to boosting, the random forest (Breiman 2001) is also an ensemble method that combines many decision trees. However, there are three primary differences in how the trees are grown: 1. Instead of assigning different weights to the training samples, the method randomly selects, with replacement, n samples from the original training data; 2. Instead of considering all input variables at each split of the decision tree, a small group of input variables on which to split is randomly selected; 3. Each tree is grown to the largest extent possible. To classify a new sample from an input, one runs the input down each of the trees in the forest. Each tree gives a classification (vote). The forest chooses the classification having the most votes over all the trees in the forest. The random forest enjoys several nice features: it is robust with respect to noise and overfitting, and it gives estimates of what variables


are important in the classification. A discussion of the relative importance of the different parameters used in our analysis of MS search results is given in the results section. The performance of the random forest depends on the strength of the individual tree classifiers in the forest and the correlation between them. Reducing the number of randomly selected input variables at each split reduces both the strength and the correlation; increasing it increases both. Somewhere in between is an optimal range, which is usually quite wide.
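As a sketch of the three growing rules above (bootstrap resampling, a random feature subset at each split, majority vote), here is a hypothetical miniature forest of depth-1 trees on toy two-class data; a real random forest grows each tree to full depth, as the randomForest package does:

```python
import random

def train_tree(X, y, n_feat_try):
    """A depth-1 'tree': bootstrap the samples (rule 1), then split on the best
    threshold among a random subset of features (rule 2)."""
    n = len(X)
    idx = [random.randrange(n) for _ in range(n)]        # sample with replacement
    bs_X, bs_y = [X[i] for i in idx], [y[i] for i in idx]
    feats = random.sample(range(len(X[0])), n_feat_try)  # random feature subset
    best = None
    for f in feats:
        for t in sorted({row[f] for row in bs_X}):
            for sign in (1, -1):
                pred = [sign if row[f] > t else -sign for row in bs_X]
                err = sum(p != yy for p, yy in zip(pred, bs_y))
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    _, f, t, sign = best
    return lambda row: sign if row[f] > t else -sign

def forest_predict(trees, row):
    """Each tree votes; the forest returns the majority class (rule 3 aside)."""
    votes = sum(tree(row) for tree in trees)
    return 1 if votes > 0 else -1

random.seed(0)
# toy data: features 0 and 1 are informative (XCorr- and deltaCn-like),
# feature 2 is pure noise
X = [[0.5, 0.3, 1.0], [1.0, 0.8, 3.0], [1.2, 1.1, 2.0],
     [2.8, 2.6, 3.0], [3.0, 2.9, 1.0], [3.6, 3.4, 2.0]]
y = [-1, -1, -1, 1, 1, 1]
trees = [train_tree(X, y, n_feat_try=2) for _ in range(25)]
print([forest_predict(trees, row) for row in X])
```

Even though individual trees may occasionally split on the noise feature, the majority vote across the forest remains accurate, illustrating the robustness to noise noted above.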

Support vector machines

The support vector machine (SVM) is another successful learning technique introduced in the past decade (Vapnik 1999). It typically produces a nonlinear classification boundary in the original input space by constructing a linear boundary in a transformed version of that space. The dimension of the transformed space can be very large, even infinite in some cases. This seemingly prohibitive computation is made feasible through a positive definite reproducing kernel, which gives the inner product in the transformed space. The SVM also has a nice geometrical interpretation as finding the hyperplane in the transformed space that separates the two classes by the biggest margin in the training samples, although this is usually only an approximate statement due to a cost parameter. The SVM has been successfully applied to diverse scientific and engineering problems, including bioinformatics (Jaakkola et al., 1999; Brown et al., 2000; Furey et al., 2000). Anderson et al. (2003) introduced the SVM to MS/MS spectra analysis, classifying Sequest results as correct and incorrect peptide assignments. Their results indicate that the SVM yields fewer false positives and false negatives than other cutoff approaches.
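The kernel idea can be illustrated in a few lines. The support vectors, dual coefficients, and gamma below are made-up values for illustration, not a trained model; they show how a radial (RBF) kernel turns a linear decision function in the transformed space into a nonlinear one in the input space:

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2): the inner product of x and z
    in an implicit (infinite-dimensional) transformed space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision(x, support_vectors, dual_coefs, b, gamma):
    """Kernel SVM decision function: f(x) = sum_i c_i K(s_i, x) + b,
    where c_i = a_i * y_i. Linear in the transformed space, nonlinear here."""
    return sum(c * rbf_kernel(s, x, gamma)
               for s, c in zip(support_vectors, dual_coefs)) + b

# hypothetical support vectors and coefficients, for illustration only
svs = [[1.0, 1.0], [3.0, 3.0]]
coefs = [1.0, -1.0]
print(decision([1.1, 0.9], svs, coefs, b=0.0, gamma=0.5))  # positive: class +1
```

Only the sign of `decision(...)` is returned as the classification, which is exactly the limitation discussed next: the raw value is a margin, not a probability.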

However, one weakness of the SVM is that it only estimates the category of the classification, while the probability p(x) is often of interest itself, where p(x) = P(Y = 1|X = x) is the conditional probability of a sample being in class 1 (i.e., a correctly identified peptide) given the input x. Another problem with the SVM is that it is not trivial to select the best tuning parameters for the kernel and the cost. Often a grid search scheme has to be employed, which can be time consuming. In comparison, boosting and the random forest are very robust, and the amount of tuning needed is rather modest compared with the SVM.

1.4 Data and implementation

1.4.1 Reference Datasets

In the remainder of this chapter, we present performance comparison results from the aforementioned algorithms on a published dataset of Keller et al. (2002b) in which the protein content is known. In particular, we intend to


TABLE 1.1: Number of spectra examples for the ESI-SEQUEST dataset.

            Training   Testing
Correct         1930       827
Incorrect      24001     10286

benchmark the performance of boosting and random forest methods against other approaches using a "gold-standard" dataset. The assay used to generate this dataset represents one of the two most common protein MS ionization approaches: electrospray ionization (ESI) (the other being matrix-assisted laser desorption ionization (MALDI)). The results are selected and summarized from our previous publication (Ulintz et al. 2006). This dataset is referred to as the ESI-Sequest dataset from here on.

The electrospray dataset was kindly provided by Andy Keller, as described in Keller et al. (2002a and b). These data are combined MS/MS spectra generated from 22 different LC/MS/MS runs on a control sample of 18 known (non-human) proteins mixed in varying concentrations. A ThermoFinnigan ion trap mass spectrometer was used to generate the dataset. In total, the data consist of 37,044 spectra of three parent ion charge states: [M + H]+, [M + 2H]2+, and [M + 3H]3+. Each spectrum was searched by Sequest against a human protein database with the known protein sequences appended. The top-scoring peptide hit was retained for each spectrum; top hits against the eighteen known proteins were labeled as "correct" and manually verified by Keller et al. All peptide assignments corresponding to proteins other than the eighteen in the standard sample mixture and common contaminants were labeled as "incorrect". In all, 2698 (7.28%) peptide assignments were determined to be correct. The distribution of hits is summarized in Table 1.1.

1.4.2 Dataset Parameters

The attributes extracted from SEQUEST assignments are listed in Table 2. Attributes include typical scores generated by the SEQUEST algorithm (preliminary score (Sp), Sp rank, deltaCn, XCorr), as well as other statistics included in a SEQUEST report (total intensity, number of matching peaks, fragment ion ratio). Number of tryptic termini (NTT) is a useful measure for search results obtained by specifying no proteolytic enzyme, and is used extensively in Keller et al. (2002a). Other parameters include features readily obtainable from the candidate peptide sequence: C-term residue (K='1', R='2', others='0'), number of prolines, and number of arginines. A new statistic, the Mobile Proton Factor (MPF), is calculated, which attempts to provide a simple measure of the mobility of protons in a peptide; it is a theoretical measure of the ease with which a peptide may be fragmented in the


gas phase (Wysocki et al. 2000, Kapp et al. 2003, Tabb et al. 2004). We include MPF and the other derived parameters to demonstrate how easily additional information can be accommodated in the classification algorithms.
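As an illustration of how easily such sequence-derived attributes are computed, here is a hypothetical extractor for a few of the features readily obtainable from the candidate peptide sequence (using the C-term coding K='1', R='2', others='0' given above); the example peptide is arbitrary:

```python
def peptide_features(peptide):
    """Derive simple attributes from a candidate peptide sequence:
    length, C-term residue code (K=1, R=2, other=0), and residue counts."""
    cterm = {"K": 1, "R": 2}.get(peptide[-1], 0)
    return {
        "length": len(peptide),
        "c_term_residue": cterm,
        "num_prolines": peptide.count("P"),
        "num_arginines": peptide.count("R"),
    }

print(peptide_features("LVNELTEFAK"))
```

Any new derived statistic (an MPF-like score, say) would simply be one more key in the returned dictionary, which is what makes these classifiers easy to extend.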


TABLE 1.2: SEQUEST attribute descriptions. Attribute names in bold are treated as discrete categorical variables.

Attribute Name          SEQUEST Name           Description

Group I (PeptideProphet):
  Delta MH+             (M+H)+                 Parent ion mass error between observed and theoretical
  Sp Rank               Rank/Sp                Initial peptide rank based on preliminary score
  Delta Cn              deltCn                 1 - Cn: difference in normalized correlation scores between next-best and best hits
  XCorr                 XCorr                  Cross-correlation score between experimental and theoretical spectra
  Length                Inferred from Peptide  Length of the peptide sequence

Group II (NTT):
  NTT                   Inferred from Peptide  Measures whether the peptide is fully tryptic (2), partially tryptic (1), or non-tryptic (0)

Group III (Additional):
  Parent Charge         (+1), (+2), (+3)       Charge of the parent ion
  Total Intensity       total inten            Normalized summed intensity of peaks
  DB peptides within    # matched peptides     Number of database peptides matching the parent
  mass window                                  peak mass within the specified mass tolerance
  Sp                    Sp                     Preliminary score for a peptide match
  Ion Ratio             Ions                   Fraction of theoretical peaks matched in the preliminary score
  C-term Residue        Inferred from Peptide  Amino acid residue at the C-term of the peptide (1 = 'R', 2 = 'K', 0 = 'other')
  Number of Prolines    Inferred from Peptide  Number of prolines in the peptide
  Number of Arginines   Inferred from Peptide  Number of arginines in the peptide

Group IV (Calculated):
  Proton Mobility Factor  calculated           A measure of the ratio of basic amino acids to free protons for a peptide


1.4.3 Implementation

From the ESI-Sequest dataset, we construct balanced training and testing datasets by random selection. To be specific, correct-labeled and incorrect-labeled spectra were sampled separately so that the training and the testing datasets contain the same proportion of correctly identified spectra. For all results, evaluation was performed on a test set that does not overlap the training set. Two-thirds of all data were used for training and one-third for testing.
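The sampling scheme just described amounts to a stratified split. A minimal sketch (the class counts here are illustrative, not the Table 1.1 counts):

```python
import random

def stratified_split(labels, train_frac=2/3, seed=0):
    """Split so training and testing keep the same proportion of
    correct-labeled spectra: sample each class separately, without overlap."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train += idx[:cut]     # two-thirds of this class for training
        test += idx[cut:]      # remaining third for testing
    return train, test

# illustrative class imbalance: correct hits are the small minority
labels = ["correct"] * 9 + ["incorrect"] * 90
train, test = stratified_split(labels)
print(len(train), len(test))  # 66 33
```

Because each class is sampled separately, the ~9% proportion of correct spectra is preserved in both halves, and the index lists are disjoint by construction.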

The PeptideProphet application used in this analysis was downloaded from peptideprophet.sourceforge.net. The *.out search result files from Sequest were parsed, and only the top hit for each spectrum was kept. PeptideProphet was run by executing the runPeptideProphet script with default parameters.

For the boosting and random forest approaches, we used contributed packages for the R programming language.1 In general, we did not fine-tune the parameters (i.e., tree size, number of trees, etc.) of the random forest and boosting, for two reasons: the classification performance of both the random forest and boosting is fairly robust to these parameters, and our ultimate goal is to provide a portable software tool that can be easily used in a production laboratory setting. We therefore wanted to demonstrate the superior performance of these methods in a user-friendly way.

For the AdaBoost (Freund and Schapire 1995) analysis, we used a decision tree with 40 leaves as the weak classifier and fixed the number of boosting iterations at 1000. For the random forest, the default number of attributes for each tree (one-third of the total number of attributes) was used, except for the five-variable case, in which the number of attributes was fixed at two. The default number of trees in the forest was 500, and each tree in the forest was grown until each leaf was either pure or had only five samples. We did not use cross-validation to fine-tune the random forest (as we do below for the SVM). All settings reflect the defaults of the randomForest v4.5-8 package available in the R statistical programming language.

With the support vector machine, we chose a radial kernel to classify the spectra, as implemented in the libSVM package (version 2.7).2 The radial kernel is flexible and performed well in preliminary studies. To select the optimal set of tuning parameters for the radial kernel, a grid search scheme was adopted using a modified version of the grid.py python script distributed with the libSVM package. The training set was randomly decomposed into two subsets, and we used cross-validation to find the best combination of tuning parameters. The optimal parameters for these data found via coarse grid search were γ = 3.052e-05 and cost = 32768.0.
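The coarse grid search amounts to the following loop. The scoring function below is a hypothetical stand-in for training an RBF-kernel SVM and cross-validating it; a real run would call out to libSVM at that point:

```python
import itertools

def coarse_grid_search(cv_accuracy, gammas, costs):
    """Try every (gamma, cost) pair on a log-spaced grid and keep the
    combination with the best cross-validated accuracy."""
    return max(itertools.product(gammas, costs),
               key=lambda gc: cv_accuracy(*gc))

# hypothetical stand-in for "train an RBF-kernel SVM and cross-validate it";
# its peak is placed arbitrarily near gamma=1e-4, cost=1000 for illustration
def fake_cv_accuracy(gamma, cost):
    return -(abs(gamma - 1e-4) * 1e4 + abs(cost - 1000) / 1000)

gammas = [2 ** e for e in range(-15, 0, 2)]   # log-spaced grid, grid.py style
costs = [2 ** e for e in range(-5, 16, 2)]
print(coarse_grid_search(fake_cv_accuracy, gammas, costs))
```

The exhaustive product over the two grids is exactly why this step can be time consuming: each pair requires a full cross-validated training run.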

1 http://www.r-project.org
2 http://www.csie.ntu.edu.tw/~cjlin/libsvm/


1.5 Results and Discussion

1.5.1 Classification Performance on the ESI-SEQUEST dataset

Each mass spectrometry database search result can be sorted based on the resulting scoring statistics, from most confident to least confident. For PeptideProphet, the samples are ordered highest to lowest on the basis of a posterior probability, as described in Keller et al. (2002a). The machine learning algorithms discussed here return, in addition to a correct or incorrect label, an additional "fitness" term. For the random forest, the fitness term can be interpreted as the probability of the identification being correct. A probability score can be generated from the boosting fitness measure as well using a simple transformation. The SVM returns a classification and a measure of the distance to a distinguishing hyperplane in attribute space that can be considered a confidence measure. When samples are ordered in this way, the results of such classification and scoring can be represented as a Receiver Operating Characteristic (ROC) plot, which displays the fraction of true positive classifications (sensitivity) against the fraction of false positives (1 − specificity) as a function of a variable test threshold chosen on the ranked ordering of results produced by the classifier. Decision problems such as this are always a tradeoff between selecting the true positives and not selecting too many false positives. If we set our scoring threshold very high, we can minimize or eliminate the number of false positives, but at the expense of missing a number of true positives; conversely, as we lower the scoring threshold, we select more true positives but more false positives as well. The slopes of the ROC plot measure the rate at which one group is included at the expense of the other.
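Computing the ROC points from a ranked list is straightforward: sweep the threshold down the ranking and count true and false positives. A minimal sketch on toy scores:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over the ranked scores and record
    (1 - specificity, sensitivity) at each cut."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1      # a true positive enters the accepted set
        else:
            fp += 1      # a false positive enters the accepted set
        points.append((fp / neg, tp / pos))
    return points

# toy scores: higher should mean "more likely a correct assignment"
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels))
```

Each point corresponds to one choice of threshold; a better-ranked list pushes the curve toward the upper-left corner, which is the region blown up in Figure 1.3.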

The ESI-Sequest dataset allows us to compare all four classification approaches: boosting, random forests, PeptideProphet, and the SVM. ROC plots showing the results of classifying correct vs. incorrect peptide assignments of the ESI-Sequest dataset using these methods are shown in Figure 1.3. All methods perform well on the data. As can be seen, the boosting and random forest methods provide a slight performance improvement over PeptideProphet and the SVM classification using the same six attributes. At a false positive rate of roughly 0.05%, boosting and the random forest achieve a sensitivity of 99%, while PeptideProphet and the SVM provide 97-98% sensitivity. We note that, although a systematic difference of 1-2% can be seen in these results, this corresponds to a relatively small number of total spectra. Also indicated in Figure 1.3 are points corresponding to well-known thresholds from several literature citations. Each point shows the sensitivity and specificity that would be obtained on the test dataset by applying these published thresholds to the Sequest attributes Charge, XCorr, Delta Cn, and NTT.


FIGURE 1.3: Performance of boosting and random forest methods on the ESI-SEQUEST dataset. A. ROC plot of classification of the test set by PeptideProphet, SVM, boosting, and random forest methods using attribute groups I and II. The plot on the right is a blowup of the upper left region of the figure on the left. Also displayed are points corresponding to several sets of SEQUEST scoring statistics used as linear threshold values in published studies. The following criteria were applied for choosing correct hits, the +1, +2, +3 numbers indicating peptide charge: a) +1: XCorr ≥ 1.5, NTT = 2; +2, +3: XCorr ≥ 2.0, NTT = 2; b) ∆Cn > 0.1, +1: XCorr ≥ 1.9, NTT = 2; +2: XCorr ≥ 3 or 2.2 ≤ XCorr ≤ 3.0, NTT ≥ 1; +3: XCorr ≥ 3.75, NTT ≥ 1; c) ∆Cn ≥ 0.08, +1: XCorr ≥ 1.8; +2: XCorr ≥ 2.5; +3: XCorr ≥ 3.5; d) ∆Cn ≥ 0.1, +1: XCorr ≥ 1.9, NTT = 2; +2: XCorr ≥ 2.2, NTT ≥ 1; +3: XCorr ≥ 3.75, NTT ≥ 1; e) ∆Cn ≥ 0.1, Sp Rank ≤ 50, NTT ≥ 1, +1: not included; +2: XCorr ≥ 2.0; +3: XCorr ≥ 2.5. It can be seen that all machine learning approaches provide a significant improvement over linear scoring thresholds. B. Results of the random forest method using various sets of attributes. The black line represents the result of the random forest using the six attributes defined in Table 2 as groups I and II: the SEQUEST XCorr, Sp Rank, ∆Cn, Delta Parent Mass, Length, and Number of Tryptic Termini (NTT). The red line is the result using fourteen attributes, groups I, III, and IV (no NTT). The blue line represents the result using all attribute groups I-IV, all fifteen variables. C. ROC plot of the boosting method using attribute groups I and II (black); I, III, and IV (red); and I-IV (green).


FIGURE 1.4: Relative importance of data attributes used for classification by boosting and random forest methods. A. SEQUEST attribute importance from boosting and random forest classification using attribute groups I and II. B. SEQUEST attribute importance using attribute groups I, III, and IV. C. SEQUEST attribute importance using all attributes in random forest and boosting methods.


Panels B and C of Figure 1.3 compare the performance of the boosting and random forest methods using different sets of input attributes, as shown in Table 2. The panels contain the results of these algorithms using three combinations of features: 1) attribute groups I and II, the six attributes used by the PeptideProphet algorithm (SEQUEST XCorr, Delta Cn, Sp Rank, Delta Parent Mass, Length, and NTT); 2) attribute groups I, III, and IV (all attributes except NTT); and 3) attribute groups I-IV (all fifteen variables shown in Table 2). Overall, it can be seen that both machine learning approaches provide an improvement over the scoring thresholds described in the literature. The best performance was obtained by including all fifteen variables, indicating that accommodating additional information is beneficial. The random forest appears to be slightly more sensitive to the presence of the NTT variable than boosting. Of note is the fact that effective classification is attained by the boosting and random forest tools even in the explicit absence of the NTT variable, as demonstrated by feature combination 2), despite the fact that the ESI dataset was generated using the no-enzyme feature of Sequest. No enzyme specificity in the database search is often time-prohibitive in routine production work; it is much more common to restrict searches to tryptic peptides (or those of whatever other proteolytic enzyme was used to digest the protein sample). Restricting to trypsin restricts results to having NTT = 2, rendering the attribute non-discriminatory. It must be noted, however, that in this analysis the C-term Residue attribute is not completely independent of NTT, in that it contains residue information on one of the termini.

SVM classification performed well but showed the most error of all the methods; this performance could likely be improved by more precise parameter tuning.


1.5.2 The impact of individual attributes on the final prediction accuracy

It is interesting to examine the relative usefulness of the various parameters used by the machine learning algorithms to classify search results. One can gain a sense of the relative importance of each variable by randomly scrambling the values of that variable amongst all search results and re-running each search result (with the 'noised up' variable) through the classifier. The increase in misclassification that results is a measure of the relative importance of the randomized variable. This is valuable information, since the development of new scoring measures is an active area of investigation in proteomics research.
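The scrambling procedure described above is a permutation importance. A minimal sketch, with a toy classifier that deliberately ignores its second feature:

```python
import random

def permutation_importance(classify, X, y, feature, seed=0):
    """Scramble one feature's values across all samples and measure how
    much the misclassification rate increases."""
    rng = random.Random(seed)
    def error(rows):
        return sum(classify(r) != yy for r, yy in zip(rows, y)) / len(y)
    base = error(X)
    col = [row[feature] for row in X]
    rng.shuffle(col)                       # 'noise up' this one variable
    noised = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
    return error(noised) - base

# toy classifier that only looks at feature 0 (an XCorr-like score)
classify = lambda row: 1 if row[0] > 2.0 else 0
X = [[1.0, 5.0], [1.5, 1.0], [2.5, 4.0], [3.0, 2.0]]
y = [0, 0, 1, 1]
print(permutation_importance(classify, X, y, feature=0),
      permutation_importance(classify, X, y, feature=1))
```

Scrambling the ignored feature changes nothing, so its importance is exactly zero; scrambling the informative one can only hurt, and the size of the increase ranks the variables.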

Figure 1.4 displays the relative importance of each of the features for the SEQUEST approaches using the various variables. Results for classification of the ESI-SEQUEST dataset incorporating the five features used by PeptideProphet are shown in A. All five features contribute to the discrimination, with the most important contribution from the Number of Tryptic Termini (NTT) categorical variable. PeptideProphet incorporates only the first four variables in the calculation of the discriminant function, introducing NTT distributions through a separate joint probability calculation. The coefficients for their discriminant score weight XCorr highest, followed by Delta Cn, with much lower contributions from Delta M+H and Sp Rank. Our results indicate a larger contribution from Delta Cn, followed by XCorr, with moderate contributions from Sp Rank and Delta M+H. These results agree with those of Anderson et al., with the exception of Delta M+H, which showed very little contribution using their SVM approach. The five features display a high importance when used in conjunction with the other six variables, as indicated in Figure 1.4B. Of these additional six, the Sp score shows a surprising contribution, as this scoring measure is rarely used for discrimination in popular usage. Also significant is the Ion Ratio measure. These results are in agreement with the Fisher scores calculated by Anderson et al. The number-of-arginines and number-of-prolines measures, as well as parent charge, appear to provide very little discriminative value.

The NTT variable provides by far the most important contribution, particularly for the boosting approach, but is only obtainable in non-enzyme-specific searches. The results above indicate, however, that the machine learning approaches perform quite well even in the absence of this variable. The relative importances of the other measures in the absence of this variable are shown in Figure 1.4C. In this scenario, the Delta Cn measure provides the most important contribution.

Unsupervised Learning and Generalization/Comparison to PeptideProphet

In general, there are two primary classes of algorithms for addressing the pattern recognition problem: the rule-based approach and the model-based approach. Algorithms such as boosting and RFs are rule-based (see Vapnik 1999 for a general description). Rule-based approaches are characterized by


choosing one particular classification rule, out of a set of rules, that minimizes the number of mistakes on the training data. Model-based approaches are based on estimating distributions for each of the classes being modeled and using these distributions to calculate the probability that a data point belongs to each class. PeptideProphet is an example of this latter type, modeling correct identifications with a Gaussian distribution and incorrect identifications with a Gamma distribution. If the distributions describing the different classes accurately reflect the physical processes by which the data are generated, the modeling approach works well even for a small amount of training data. On the other hand, if the data diverge from the modeled distributions in a significant way, classification errors proportional to the degree of divergence result. Rule-based approaches are a less risky option if there is little knowledge of the class distributions of the data, and they become increasingly safe, approaching optimality, as the data size increases. Keller et al. (2002a) and Nesvizhskii et al. (2003) demonstrate that, for their data, the distributions described in their approach model the data well. Whether these distributions are appropriate for all types of instruments and MS search engines, and whether they are optimal, is a research question. Given the large amount of mass spectrometry data available, rule-based approaches may generalize well to all types of data.
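The model-based calculation can be sketched as Bayes' rule over the two class densities. The parameters and prior below are invented for illustration; PeptideProphet estimates its own from each dataset via EM:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the Gaussian modeling correct identifications."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gamma_pdf(x, k, theta):
    """Density of the Gamma (shape k, scale theta) modeling incorrect ones."""
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)

def posterior_correct(score, prior_correct, mu, sigma, k, theta):
    """P(correct | score) by Bayes' rule under the two-component mixture."""
    p1 = prior_correct * gaussian_pdf(score, mu, sigma)
    p0 = (1 - prior_correct) * gamma_pdf(score, k, theta)
    return p1 / (p1 + p0)

# hypothetical parameter values, chosen only to make the shapes visible
print(round(posterior_correct(3.0, prior_correct=0.07,
                              mu=3.0, sigma=0.8, k=2.0, theta=0.5), 3))
```

The sensitivity of this posterior to the assumed densities is precisely the risk discussed above: if real scores diverge from the Gaussian/Gamma shapes, the probabilities are off in proportion.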

We note that both PeptideProphet and the boosting/random forest methods are supervised approaches, relying on training data for their functionality. PeptideProphet uses training data to learn coefficients in the calculation of the discriminant score; it subsequently uses these scores to establish the basic shape of the probability distributions modeling correct and incorrect search hits as a function of parent peptide charge. For each unique dataset, the distribution parameters are refined using an EM algorithm. Our approach provides a framework for performing the supervised aspect of the problem in a more general way, using established out-of-the-box functionality. This approach can be coupled with an unsupervised component to provide the same functionality, assuming appropriate training datasets are available that match the input data. The degree to which an individual training dataset provides adequate parameterization for a particular test set is an open question. Certainly, training sets will need to be search-algorithm specific, but whether instrument-specific datasets are necessary is an area of investigation.

1.6 Conclusions

A serious challenge faced by researchers in a proteomics lab is curating large lists of protein identifications based on the various confidence statistics generated by the search engines. The common methodology for selecting true hits from


false positives is based on linear thresholding. These approaches can lead to a large number of false positives (using a more promiscuous threshold) or a large number of false negatives (using a relatively stringent threshold). Machine learning approaches such as boosting and the random forest provide a more accurate method for classifying the results of MS/MS search engines as either correct or incorrect. Additionally, ongoing algorithmic development continues to improve the ability of automated tools to discriminate true search results, and can complement the standard scoring measures generated by popular search engines. Flexible methods that accommodate these new scoring measures are necessary so that they can be easily incorporated into production use. Tools such as PeptideProphet require significant work to accommodate any new features, and are based on statistical models that may not generalize well to all situations. Modern machine learning approaches can perform equally well, if not better, out of the box with very little tuning. Improved results could very likely be obtained by tuning these tools to particular data sets, e.g., by making use of class prior probabilities to accommodate the imbalanced sizes of the correct and incorrect datasets. These approaches can additionally be used to generate measures of the relative importance of scoring variables, and may be useful in the development of new scoring approaches.

Peptide identification is not the ultimate goal in proteomics; accurately identifying which proteins are present is what allows comparison between different samples and is the end result of a biological analysis.

1.7 Acknowledgement

We thank our colleagues in the National Resource for Proteomics and Pathways for fruitful discussions on proteomics research and Ms. Rhiannon Popa for editorial assistance.


References

[1] Aebersold, R., Goodlett, D. R. (2001) Mass spectrometry in proteomics. Chem. Rev. 101, 269-95.

[2] Anderson, D. C., Li, W., Payan, D. G., Noble, W. S. (2003) A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J. Proteome Res. 2, 137-46.

[3] Anderson, L., Seilhamer, J. (1997) A comparison of selected mRNAand protein abundances in human liver. Electrophoresis 18, 533-7.

[4] Bafna, V., Edwards, N. (2001) SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 17, Suppl. 1, S13-21.

[5] Bao, L., Cui, Y. (2005) Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 21, 2185-90.

[6] Breiman, L. (2001) Random Forests. Machine Learning 45, 5-32.

[7] Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M. Jr., Haussler, D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U. S. A. 97, 262-7.

[8] Clauser, K. R., Baker, P., Burlingame A. L. (1999) Role of accuratemass measurement (+/- 10 ppm) in protein identification strategiesemploying MS or MS/MS and database searching. Anal. Chem. 71,2871-82.

[9] Craig, R., Beavis, R. C. (2004), TANDEM: matching proteins withtandem mass spectra. Bioinformatics 20, 1466-7.

[10] Dempster, A. P., Laird, N. M., Rubin, D. B. (1977) Maximumlikelihood from incomplete data via EM algorithm. J Roy Stat Soc

Series B /bf 39, 1-38.

[11] Eng, J., McCormack, A., Yates, J. (1994) An approach to correlatetandem mass spectral data of peptides with amino acid sequencesin a protein database. J. Am. Soc. Mass Spec. 5, 976-989.

0-8493-0052-5/00/$0.00+$.50c© 2001 by CRC Press LLC 31

Page 32: I This is a Part 7 - Statisticsdept.stat.lsa.umich.edu/~jizhu/pubs/tmp/QinEtAl-Chapman08.pdf · statsitician and bioinformatician in the post-genomics era. An array of strategies

32 References

[12] Eriksson, J, Fenyo, D. (2004) Probity: a protein identification al-gorithm with accurate assignment of the statistical significance ofthe results. J Proteome Res. 3, 32-6.

[13] Fenyo, D., Beavis, R. C. (2003) A method for assessing the statisti-cal significance of mass spectrometry-based protein identificationsusing general scoring schemes. Anal Chem. 75, 768-74.

[14] Freund, Y. and Schapire, R. (1995) A decision theoretic generaliza-tion of on-line learning and an application to boosting. Proceedings

of the 2nd European Conference on Computational Learning The-

ory. 23-37, Springer, New York.

[15] Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schum-mer, M., Haussler, D. (2000) Support vector machine classificationand validation of cancer tissue samples using microarray expressiondata. Bioinformatics 16, 906-914.

[16] Geer, L. Y. , Markey, S. P., Kowalak, J. A., Wagner, L., Xu, M.,Maynard, D. M., Yang, X., Shi, W., Bryant, S. H. (2004) Openmass spectrometry search algorithm. J Proteome Res. 3, 958-64.

[17] Gygi, S. P., Rochon. Y., Franza, B.R., Aebersold, R. (1999) Corre-lation between protein and mRNA abundance in yeast. Mol. Cell

Biol. 19, 1720-30.

[18] Hastie, T., Tibshirani, R., Friedman, J. H. (2001) Elements of

Statistical Learning. Springer, New York.

[19] Havilio, M., Haddad, Y., Smilansky, Z. (2003) Intensity-based sta-tistical scorer for tandem mass spectrometry. Anal. Chem. 75, 435-44.

[20] Jaakkola, T., Diekhans, M. Haussler, D. (1999) Using the Fisherkernel method to detect remote protein homologies. Proc. Int.

Conf. Intell. Syst. Mol. Bio., 149-158. AAAI Press, Menlo Park,CA.

[21] Johnson, R. S., Davis, M. T., Taylor, J. A., Patterson, S. D. (2005)Informatics for protein identification by mass spectrometry. Meth-

ods. 35, 223-236.

[22] Kapp, E. A., Schutz, F., Reid, G. E., Eddes, J. S., Moritz, R.L, O’Hair, R. A., Speed, T. P., Simpson, R. J. (2003) Mining atandem mass spectrometry database to determine the trends andglobal factors influencing peptide fragmentation. Anal. Chem. 75,6251-64.

[23] Keller, A., Nesvizhskii, A. I., Kolker, E. Aebersold, R.. (2002) Em-pirical Statistical Model To Estimate the Accuracy of Peptide Iden-

Page 33: I This is a Part 7 - Statisticsdept.stat.lsa.umich.edu/~jizhu/pubs/tmp/QinEtAl-Chapman08.pdf · statsitician and bioinformatician in the post-genomics era. An array of strategies

References 33

tifications Made by MS/MS and Database Search. Anal. Chem. 74,5383-5392.

[24] Keller, A., Purvine, S., Nesvizhskii, A. I., Stolyar, S., Goodlett, D.R., Kolker, E. (2002) Experimental protein mixture for validatingtandem mass spectral analysis. OMICS. 6, 207- 12.

[25] Link, A. J., Eng, J., Schieltz, D. M., Carmack, E., Mize, G. J.,Morris, D. R., Garvik, B. M., Yates, J. R. III (1999) Direct analysisof protein complexes using mass spectrometry. Nat. Biotechnol. 17,676- 682.

[26] Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo,M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M.,Horton, H. et al. (1996) Expression monitoring by hybridization tohigh-density oligonucleotide arrays. Nat. Biotechnol. 14 1675-1680.

[27] MacCoss, M. J., Wu, C. C., Yates, J. R. 3rd. (2002) Probability-based validation of protein identifications using a modified SE-QUEST algorithm. Anal. Chem. 74, 5593-9.

[28] Moore, R. E., Young, M. K., Lee, T. D. (2002) Qscore: an algo-rithm for evaluating SEQUEST database search results. J. Am.

Soc. Mass Spectrom. 13, 378-86.

[29] Nesvizhskii, A. I., Keller, A., Kolker, E., Aebersold, R. (2003) Astatistical model for identifying proteins by tandem mass spectrom-etry. Anal. Chem. 75, 4646-58.

[30] Pandey, A., Mann, M. (2000) Proteomics to study genes andgenomes. Nature 405, 837-46.

[31] Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J., Gygi, S.P. (2003) Evaluation of multidimensional chromatography coupledwith tandem mass spectrometry (LC/LC-MS/MS) for large-scaleprotein analysis: the yeast proteome. J. Proteome Res. 2, 43-50.

[32] Pennington, K., Cotter, D., Dunn, M. J. (2005) The role of pro-teomics in investigating psychiatric disorders. Br. J. Psychiatry

187, 4-6.

[33] Perkins, D., Pappin, D., Creasy, D., Cottrell, J. (1999) Probability-based protein identification by searching sequence databases usingmass-spectromety data. Electorphoresis 20, 3551-3567.

[34] Sadygov, R. G., Yates, J. R. 3rd. (2003) A hypergeometric proba-bility model for protein identification and validation using tandemmass spectral data and protein sequence databases. Anal. Chem.

75, 3792-8. 30

Page 34: I This is a Part 7 - Statisticsdept.stat.lsa.umich.edu/~jizhu/pubs/tmp/QinEtAl-Chapman08.pdf · statsitician and bioinformatician in the post-genomics era. An array of strategies

34 References

[35] Schena, M., Shalon, D., Davis, R. W., Brown, P. O. (1995) Quanti-tative monitoring of gene expression patterns with acomplementaryDNA microarray. Science 270, 467-470.

[36] Steen, H., Mann, M. (2004) The ABC’s (and XYZ’s) of peptidesequencing. Nat. Rev. Mol. Cell Biol. 5, 699-711.

[37] Sun W, Li F, Wang J, Zheng D, Gao Y. (2004) AMASS: softwarefor automatically validating the quality of MS/MS spectrum fromSEQUEST results. Mol. Cell Proteomics. 3, 1194-9.

[38] Tabb D. L., Huang, Y., Wysocki, V. H., Yates, J. R. 3rd. (2004) In-fluence of basic residue content on fragment ion peak intensities inlow-energy collision-induced dissociation spectra of peptides. Anal.

Chem. 76, 1243-8.

[39] Ulintz P. J., Zhu J., Qin, Z. S., Andrews P. C. (2006) Improvedclassification of mass spectrometry database search results usingnewer machine learning approaches. Mol. Cell Proteomics 5, 497-509.

[40] Vapnik V. N. (1999) The Nature of Statistical Learning Theory.

30, 138-167. Springer, New York.

[41] Washburn, M. P., Wolters, D., Yates, J. R. 3rd. (2001) Large-scale analysis of the yeast proteome by multidimensional proteinidentification technology. Nat. Biotechnol. 19, 242-7.

[42] Wysocki, V. H., Tsaprailis, G., Smith, L. L., Breci, L.A. (2000)Mobile and localized protons: a framework for understanding pep-tide dissociation. J. Mass Spectrom. 35, 1399-406.

[43] Zhang, N., Aebersold, R., Schwikowski, B. (2002) ProbID: a prob-abilistic algorithm to identify peptides through sequence databasesearching using tandem mass spectral data. Proteomics 2, 1406-12.