Assembling proteomics data as a prerequisite for the analysis of large scale experiments

Chemistry Central Journal

Open Access Research article

Frank Schmidt*1,4, Monika Schmid1, Bernd Thiede2, Klaus-Peter Pleißner3, Martina Böhme3 and Peter R Jungblut1

Address: 1Max Planck Institute for Infection Biology, Core Facility Protein Analysis, Berlin, Germany, 2The Biotechnology Centre of Oslo, University of Oslo, Oslo, Norway, 3Max Planck Institute for Infection Biology, Core Facility Bioinformatics, Berlin, Germany and 4Interfaculty Institute for Genetics and Functional Genomics, University of Greifswald, Greifswald, Germany

Email: Frank Schmidt* - [email protected]; Monika Schmid - [email protected]; Bernd Thiede - [email protected]; Klaus-Peter Pleißner - [email protected]; Martina Böhme - Boehme@mpiib-berlin.mpg.de; Peter R Jungblut - [email protected]

* Corresponding author

Published: 23 January 2009

Chemistry Central Journal 2009, 3:2 doi:10.1186/1752-153X-3-2

Received: 10 October 2008
Accepted: 23 January 2009

This article is available from: http://journal.chemistrycentral.com/content/3/1/2

© 2009 Schmidt et al

Abstract

Background: Despite the complete determination of the genome sequences of a huge number of bacteria, their proteomes remain relatively poorly defined. Besides new methods to increase the number of identified proteins, new database applications are necessary to store and present the results of large-scale proteomics experiments.

Results: In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as the 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java based data parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS. A minimum set of the proteomics data was transferred into our public 2D-PAGE database using a Java based interface (Data Transfer Tool) in accordance with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable out of SQL-LIMS via XML.

Conclusion: The Oracle based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java based parser, post-processed data of different approaches such as LC/ESI-MS, MALDI-MS, 1-DE and 2-DE were stored in SQL-LIMS. Thus, the individual data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared to multi-software solutions, especially if personnel fluctuations are high. Moreover, large-scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS in order to query results in a systematic manner. On the other hand, these database systems are expensive and require at least one full-time administrator and a specialized lab manager. Moreover, the rapid technical development in proteomics may make it difficult to accommodate new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling, especially in skilled processes such as gel electrophoresis or mass spectrometry, and fulfilled the PSI standardization criteria. The data transfer into a public domain via DTT facilitated validation of proteomics data. Additionally, evaluation of mass spectra by post-processing using MS-Screener improved the reliability of mass analysis and prevented the storage of junk data.



Background

A major goal of proteomics is the large-scale study of proteins, particularly their structures and functions, including the global qualitative and quantitative analysis of proteins in defined biological systems. The term proteomics was chosen by analogy with genomics, but proteomics is significantly more complex. As a result of alternative splicing, point mutations, degradation, and co- and post-translational modifications, the number of protein species [1] in a proteome far exceeds the number of protein-coding genes in the corresponding genome. In the past, qualitative proteome profiling overcame limitations in protein identification thanks to developments in mass spectrometry. Increased sensitivity and mass accuracy, in conjunction with comprehensive database annotations, allow the high-throughput identification of proteins. Quantitative profiling, an essential part of proteomics, in turn requires technologies that quantify proteins accurately, reproducibly, and comprehensively. In recent years, novel mass spectrometry-based methods such as ICAT [2], SILAC [3] and iTRAQ [4] were developed for relative quantification. The amount of identification and quantification data has increased dramatically in recent years and has resulted in the accumulation of "metadata", that is, data about data. The manufacturers of ESI-MS and MALDI-MS instruments and of image analysis software have endeavored to close the gap between the increased amount of information and its interpretation. However, this mostly resulted in individual solutions for each company, which hampered the exchange of experimental data. Besides commercial solutions, some open LIMS systems such as PROTEIOS [5] or the open source laboratory information management system for 2-D gel electrophoresis-based proteomics workflows [6] are available free of charge, and some of them were compared in more detail by Piggee et al. [7].

The representation of protein data must be standardized in order to compare proteomics results worldwide. For this purpose, some solutions were proposed, such as the Proteome Standards Initiative (PSI) [8,9] and PEDRo [10]. The latter led to the adoption of XML or of the specialized formats mzXML [11] and mzML [12], which are open file formats for data exchange.

In our concept, the Oracle-based data repository system SQL-LIMS™ (Applied Biosystems, Foster City, USA) plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as the 20S proteasome. Technical operations of our proteomics workflow were used as the standard for SQL-LIMS™ template creation. Post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS™ by using a Java-based data parser. A minimum set of the proteomics data was transferred into the web-accessible Proteome Database System for Microbial Research (http://www.mpiib-berlin.mpg.de/2D-PAGE/) [13] using a Java-based interface (Data Transfer Tool) in accordance with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable out of SQL-LIMS™ as XML documents.
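To make the idea of extracting stored results as XML documents more concrete, the following is a minimal Python sketch. It is only an illustration: the element names, the record fields and the identifiers are hypothetical, and the layout does not reproduce the actual SQL-LIMS™ export or the PEDRo schema.

```python
import xml.etree.ElementTree as ET


def identification_to_xml(record):
    """Serialize one protein identification record (a dict) to an XML string."""
    # Hypothetical layout: the submission identifier becomes an attribute,
    # every other field becomes a child element.
    root = ET.Element("ProteinIdentification",
                      submissionId=str(record["submission_id"]))
    for key in ("spot_id", "accession", "protein_name",
                "score", "sequence_coverage"):
        child = ET.SubElement(root, key)
        child.text = str(record[key])
    return ET.tostring(root, encoding="unicode")


# All values below are placeholders, not data from the study.
example = {
    "submission_id": 1001,
    "spot_id": "A12",
    "accession": "P00000",
    "protein_name": "example protein",
    "score": 87,
    "sequence_coverage": 34.5,
}
xml_text = identification_to_xml(example)
print(xml_text)
```

Keying every exported record by its submission identifier mirrors the role that identifier plays in the repository, so that an exported document can be traced back to the stored experiment.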

Results and discussion

Concept for integration of proteomics data

We applied a variety of 2-DE and LC-based approaches for the comprehensive proteome analysis of microorganisms and other protein complexes. These technologies included 2-DE/MS coupled with image analysis, 1-DE/MS, ICAT/1-DE/MS, ICAT/2-DE/MS, LC/MS and ICAT/LC/MS (Figure 1). The enormous amount of information generated by these proteome analyses required suitable programs for data integration and a repository in order to gain maximum benefit from the experimental results. In the past, the urgent need for such programs has often been emphasized, but their development was hampered by the large diversity of data formats. SQL*LIMS™ enabled the integration and storage of data emerging from all kinds of proteome analyses, e.g. sample preparation and 2-DE analyses, as well as raw and evaluated MS data. This allowed efficient data handling, in particular for the evaluation of large experimental datasets. Moreover, the storage of metadata produced during the laboratory work was established. For this purpose, specific templates were created in which the entire workflows and the protocols were implemented, including all information about the biological samples and the applied preparation steps. In order to connect the metadata with the results from 2-DE and mass spectrometry measurements, binary files such as those from image analysis calculations, mass spectrometry peak lists or identification results from Mascot or SEQUEST were parsed into an Oracle database. Raw data such as 2-DE gel images or MS spectra were stored as attachments. Furthermore, most of the experimental data were post-processed before being stored in SQL*LIMS™, e.g. by MS-Screener [14], to decrease the amount of junk data.

However, there is no doubt that the administration of programs such as SQL*LIMS™ is time-consuming due to difficulties in template and interface programming. Thus, SQL*LIMS™ needed to be maintained by at least one full-time administrator and a specialized lab manager. To avoid extensive training in SQL*LIMS™ and to make proteomics data available, we have developed a data transfer tool (DTT), as shown in Figure 1. With this interface, experimental data stored in SQL*LIMS™ can be transferred automatically into the Proteome Database System, which makes the results easily accessible. In this domain, authorized persons have access to all evaluated data. In the Proteome Database, experimental data were linked with protein databases such as Swiss-Prot/UniProt, NCBI or KEGG. Moreover, a higher-level investigation of the data can be performed using the large number of sophisticated functions and packages of R, the software environment for statistical computing and graphics (http://www.r-project.org/). The advantage of this concept is that all information from different experiments is gathered in one system used for daily laboratory needs, which complements the web-accessible database system used for data dissemination. The users have a single, easy point of access to complex data sets. Moreover, already published experimental data can be transferred into the public internet domain.

Figure 1: Concept of proteomics data assembling. Data of the proteomics workflow, such as 2-DE/MS coupled with image analysis, ICAT/2-DE/MS, ICAT/1-DE/MS, LC/MS, and ICAT/LC/MS, were stored in SQL*LIMS™. Raw data were stored as an attachment or as a link to their repository. As a result, the data were transferred and unified into one system. Subsequently, a minimum set of information about the proteomics experiments and results was transferred into the web-accessible Proteome Database System for Microbial Research using the data transfer tool (DTT). Diagram labels: 2-DE, 2-DE/MS, ICAT/2-DE/MS, Image analysis, LC/MS, ICAT/LC/MS, 1-DE/MS, ICAT/1-DE/MS, MS, ICAT, Java Interface, Data Transfer Tool (DTT), Firewall, Intranet, Internet, Proteome Database (www.mpiib-berlin.mpg.de/2D-PAGE).

Data storing in SQL*LIMS™

The requirements for data storage in SQL*LIMS™ depend on the experimental workflow. As a result, the data management system must contain specifically designed features (Figure 2). To structure different experiments, SQL*LIMS™ allowed studies to be defined. During study initialization, only predefined attributes had to be stored to record the framework and the goal of the experiment. For sample preparation, flexibility was also very important in order to track experimental workflows, which often included a complex sequence of operations. To meet these requirements, predefined basic sample types were combined in a hierarchical parent-child relationship tree, and new attributes could be added to the predefined types. The subsequent protein separation step required full integration with 2-DE gel image analysis tools (e.g. Topspot and PDQuest). Data of detected 2-DE spots or 1-DE bands were automatically acquired (uploaded) from the output files of the image processing tools, along with the gel images. In this case, structured data with fixed formats, such as spot coordinates, intensities or spot volumes, come together with unstructured raw data that have no common format, such as gel images and native report files. The information was spread across all these different records in the storage system, but it was available to the user as a single unit through a gel viewer widget that allowed drilling down into the spot information.

In the MS analysis step, both structured and unstructured data must be managed, as in the protein separation step. In addition, features for direct, real-time, bi-directional data exchange with MS instruments were provided for uploading work lists and downloading peak lists. A more complex strategy for unstructured data management was required due to the massive amounts of raw data in proprietary formats. Instead of passively storing all the data in the SQL*LIMS™ database without any chance to extract the content for data searching, raw data files were saved on the storage server and/or on permanent storage media. Their locations can then be tracked in SQL*LIMS™. For protein identification, MS peak lists were submitted to database search engines such as Mascot, MS-Fit, MS-Tag, and SEQUEST. Queries can be performed directly from SQL*LIMS™ or by using the search engine front-end interface. In both cases, protein identification results were stored in the SQL*LIMS™ database. Again, the need to manage unstructured electronic records was met. The stored gel information and protein identification results of proteomics studies can now easily be reported by the Report Builder, printed out or sent by e-mail.

Figure 2: The core of the SQL*LIMS™ system and the integrated applications SQL*GT (microtiter plate solution) and Proteomics Solution. In a first step, proteomics studies can be defined by SQL*LIMS™, or by SQL*GT for microtiter plates, and structured by the Proteomics Workflow Manager. As a result, every experiment gets a unique submission and sample identifier. Using the Proteomics DB Objects, gel images (Universal Gel Loader) and mass spectra (Universal Peak Loader) can be assigned to a specific study and evaluated by the Protein Searcher. Moreover, existing identification results from Mascot (.html, .dat), MS-Fit (.html), SEQUEST (.xls), or Lynx (.txt) can be parsed into SQL*LIMS™. Furthermore, experiment-specific data can be queried and reported by the program Query Builder.
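The hierarchical parent-child sample tree and the tracking of raw-data locations described in this section can be pictured with a small data structure. The Python sketch below is an assumption-laden illustration, not the SQL*LIMS™ data model: the class name, the field names, the sample identifiers and the file path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One node of a hypothetical parent-child sample tree."""
    sample_id: str
    sample_type: str                                  # a predefined basic sample type
    attributes: Dict[str, str] = field(default_factory=dict)  # extensible attributes
    raw_data_location: Optional[str] = None           # a reference only; the file lives elsewhere
    children: List["Sample"] = field(default_factory=list)

    def add_child(self, child: "Sample") -> "Sample":
        """Attach a derived sample and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, sample) pairs in parent-before-child order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Placeholder workflow: lysate -> 2-DE gel -> excised spot.
lysate = Sample("S-1", "cell lysate", {"organism": "Helicobacter pylori"})
gel = lysate.add_child(
    Sample("S-1.1", "2-DE gel", raw_data_location="/archive/gels/S-1.1.tif"))
spot = gel.add_child(Sample("S-1.1.1", "gel spot", {"x": "120", "y": "340"}))

lineage = [(depth, s.sample_id) for depth, s in lysate.walk()]
print(lineage)
```

Storing only the location of the gel image on the node, rather than the image itself, corresponds to the strategy of keeping bulky raw files on a storage server while tracking their locations in the repository.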

Transfer of SQL*LIMS™ data into the Intranet/Internet database via DTT

In order to share the experimental results with other laboratories easily, the DTT was designed to facilitate the transfer of data out of SQL*LIMS™ into the proteome database system (Figure 3). The DTT provides a GUI that enables the user to select the datasets to be transferred. First, only the necessary data records were selected from the vast amount of data stored in SQL*LIMS™. The DTT displayed the gel data from SQL*LIMS™ corresponding to the identification numbers in the 2-DE database (Figure 3B). The user can choose a gel and the relevant data for the displayed spots. This included, for example, the sequence coverage, score, rank and molecular weight of the identified proteins for each spot on the gel in question (Figure 3C). It is also possible to check for newer entries for a spot in the 2-DE database, which will be updated in the transfer process. A release number was assigned to the protein identification data selected for the transfer. These release numbers can be used to control the degree of accessibility of the data in the 2-DE database. Thus, it was possible to restrict the view to certain release numbers. The usage of the DTT was password protected and all data transactions were logged. If the button "Transfer" was selected and "Save" was pressed, an existing release number was selected for this spot or a new one was assigned. The concept of release numbers was established in order to control public access and to inform SQL*LIMS™ managers if a new protein could be identified due to new batch searches using the newest database releases. Currently, only the data displayed in Figure 3C can be transferred, but it is planned to include MS data extracted from the MS peak table of SQL*LIMS™ (Table 1).
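How release numbers can gate the visibility of transferred identifications may be easier to follow with a small example. The Python sketch below is a simplified illustration under stated assumptions: the record fields, the accession strings and the mapping of release numbers to the public and intranet views are hypothetical, and the DTT itself was written in Java.

```python
def assign_release(record, release):
    """Return a copy of the record tagged with a release number (input is not modified)."""
    tagged = dict(record)
    tagged["release"] = release
    return tagged


def visible_identifications(records, allowed_releases):
    """Keep only the records whose release number is in the allowed set."""
    return [r for r in records if r["release"] in allowed_releases]


# Placeholder identification data selected for transfer.
records = [
    assign_release({"spot": "A1", "accession": "X0001", "score": 95}, release=1),
    assign_release({"spot": "A2", "accession": "X0002", "score": 71}, release=2),
    assign_release({"spot": "A3", "accession": "X0003", "score": 64}, release=2),
]

# Assumed policy: release 1 is public, release 2 is restricted to the intranet.
PUBLIC_RELEASES = {1}
INTRANET_RELEASES = {1, 2}

public_view = visible_identifications(records, PUBLIC_RELEASES)
intranet_view = visible_identifications(records, INTRANET_RELEASES)
print(len(public_view), len(intranet_view))
```

Because visibility is decided by a set of allowed release numbers rather than by a flag on each record, already transferred data can be opened to the public later by changing the policy alone.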

Pre- and post-processing of LC/ESI-MS/MS data

Tandem mass spectrometry has been particularly useful for determining the protein components of complex mixtures. The following strategy was applied to evaluate LC/ESI-MS/MS peak list data: MS/MS spectra were automatically transformed into peak lists (.dta files) by SEQUEST and subsequently imported into MS-Screener to generate data matrices. The binary matrices were subjected to hierarchical agglomerative cluster analyses performed by means of the hclust function within R. As an example of such a cluster analysis, Figure 4 depicts the result for a dataset comprising 873 MS/MS peak lists. The spectra were derived from a comparative analysis of two rat liver proteasome subtypes [15] measured by ICAT/LC/ESI-MS/MS. The cluster analysis was applied to determine which spectra showed some degree of similarity, i.e. common peptide or experiment-specific mass peaks. The analysis resulted in a dendrogram in which mass spectra with high similarity formed branches. The subset of spectra framed in Figure 4 shared numerous polymer contaminant masses, caused by the avidin chromatography that was used for the separation of ICAT-labeled peptides. As a result, more than half of the 125 polymer mass spectra were falsely fitted to a rat-specific peptide, and the Mascot MS/MS ion search result shown in Figure 4 represents a false positive match of an MS/MS spectrum. After removing the polymer-containing mass spectra from the data set, the remaining 748 of 873 mass spectra were searched again. Subsequently, only the spectra of relevance and their SEQUEST or Mascot identification results were stored in SQL*LIMS™. The strategy outlined in this section demonstrated the capability to clean peak list data sets of contaminants in order to improve the reliability of identifications and to reduce the amount of stored data.
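The two steps named above, building a binary matrix from peak lists and clustering it agglomeratively, can be sketched in a few lines of Python. This is not the MS-Screener or R code used in the study: the bin width, the Jaccard distance, the average linkage and the cut height are assumptions chosen for the illustration (the text does not state which distance or linkage was used with hclust), and the peak masses are invented.

```python
from itertools import combinations


def binary_matrix(peak_lists, bin_width=1.0):
    """Return (bins, rows): rows[i][j] is 1 if spectrum i has a peak in mass bin j."""
    binned = [{int(m // bin_width) for m in peaks} for peaks in peak_lists]
    bins = sorted(set().union(*binned))
    rows = [[1 if b in occupied else 0 for b in bins] for occupied in binned]
    return bins, rows


def jaccard_distance(row_a, row_b):
    """1 - (shared occupied bins) / (bins occupied in either spectrum)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 1.0 - both / either if either else 0.0


def average_linkage(rows, cut_height):
    """Merge the closest clusters until their average distance exceeds cut_height."""
    clusters = [[i] for i in range(len(rows))]

    def dist(c1, c2):
        total = sum(jaccard_distance(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), d = min(
            (((i, j), dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cut_height:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Invented peak lists: spectra 0 and 1 share a contaminant-like series of peaks.
spectra = [
    [300.2, 344.2, 388.3, 432.3],
    [300.2, 344.2, 388.3, 476.3],
    [512.3, 645.4, 733.4],
    [512.3, 645.4, 801.5],
]
bins, rows = binary_matrix(spectra)
clusters = average_linkage(rows, cut_height=0.6)
print(clusters)
```

Cutting the tree at a fixed height plays the role of framing a branch in the dendrogram: every spectrum in a flagged cluster can be removed before the remaining spectra are searched again.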

Pre- and post-processing of MALDI-MS data

Proteins separated by 2-DE were identified by peptide mass fingerprinting (PMF) after in-gel digestion. A Voyager Elite MALDI-TOF mass spectrometer and/or a 4700 Proteomics Analyzer MALDI-TOF/TOF instrument were used for this purpose. MS peak lists were generated by the program GRAMS or by the peak-to-mascot script of the program 4700 Explorer™. In addition, the peak lists were evaluated by the program MS-Screener. Experimentally derived contaminant masses, e.g. masses matching matrix, keratins, autolysis products of trypsin or dye, were detected and deleted from the spectra [14]. The simplified peak lists were analyzed by PMF using search algorithms such as Mascot or MS-Fit. Subsequently, the modified peak lists and the identification results were parsed and stored in SQL*LIMS™.
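The deletion of contaminant masses from a peak list amounts to a tolerance-based filter. The Python sketch below illustrates the idea only; it is not the MS-Screener implementation, and the tolerance as well as the contaminant masses are placeholder values standing in for an experimentally derived contaminant list.

```python
def remove_contaminants(peaks, contaminant_masses, tolerance_ppm=50.0):
    """Return the m/z values that match no contaminant mass within tolerance_ppm."""
    kept = []
    for mz in peaks:
        is_contaminant = any(
            abs(mz - ref) / ref * 1e6 <= tolerance_ppm
            for ref in contaminant_masses
        )
        if not is_contaminant:
            kept.append(mz)
    return kept


# Placeholder contaminant list and peak list (assumed values, not study data).
CONTAMINANTS = [842.51, 2211.10]
peak_list = [842.52, 1045.56, 1479.80, 2211.12, 2500.30]

cleaned = remove_contaminants(peak_list, CONTAMINANTS)
print(cleaned)
```

A relative tolerance in ppm is used here because mass error on a TOF instrument tends to scale with m/z; an absolute tolerance in Da would be an equally simple alternative.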


Figure 3: Proteome data transmission. A data transfer tool can be used to extract data from SQL*LIMS™ (A, B). In the first step, the gels of a study were listed (B). In the second step, identified proteins and their main information, e.g. name, accession number, ORF, sequence coverage, score and rank, are shown in a new window (C). Finally, protein identification data of interest can be selected for the transfer to the intranet or public domain of the Proteome Database System (D). Panel labels: SQL*LIMS, Data Transfer Tool.


Experimental

Two-dimensional electrophoresis (2-DE)

Protein samples from microorganisms were subjected to high-resolution 2-DE (gel size: 23 cm × 30 cm) [16]. The resulting 2-DE gels were scanned and subsequently evaluated by the image analysis programs Topspot (Algorithmus, Berlin, Germany) [17] and PDQuest (Bio-Rad Laboratories, Hercules, CA, USA).

Automated 2-DE spot processing

High-throughput MALDI-MS PMF was performed as follows: Spots of interest were excised from 2-DE gels, transferred into 96-well microtiter plates, and digested with trypsin using a spot cutter (Proteome Works, Bio-Rad, Hercules, CA, USA). Subsequently, equal volumes of the resulting peptides and α-cyano-4-hydroxycinnamic acid (CHCA) were mixed and spotted onto MALDI templates by the Ettan spot-handling workstation (Amersham Biosciences, Uppsala, Sweden). MALDI spectra were then internally calibrated and the resulting peak lists were exported using the "Peak-to-Mascot" script of the 4700 Explorer software (Version 2.0) (Applied Biosystems, Foster City, USA). The parameters applied for this process (signal-to-noise ratio, mass range, peak density, etc.) were optimized. Afterwards, the MS-Screener program was used to determine and remove common contaminant masses.

Data analysis by MS-Screener

The program MS-Screener (Version 1.0.1) was applied to evaluate large datasets of peak lists. The program comprised 162 Java classes and was developed for the Java 2 Runtime Environment (Version J2RE 1.4.1; http://java.sun.com/). MS-Screener offers multi-platform support for Linux, Solaris and Microsoft Windows, including a graphical user interface (GUI). Graphical representations of peak lists as plot views have been integrated using the JFreeChart class library (Version 0.9.13; http://www.jfree.org/jfreechart/index.html), published under the GNU Lesser General Public License. MS-Screener facilitates the import and export of ASCII files (.pkm (GRAMS, Applied Biosystems, Framingham, USA), .pkt (Data Explorer, Applied Biosystems, Framingham, USA), .txt (Peak-to-Mascot, 4700 Explorer, Applied Biosystems, Framingham, USA) and .dta (SEQUEST, Thermo Finnigan, San Jose, USA)) and data exchange via other interfaces. MS-Screener was used for many tasks, e.g. the detection of common mass peaks, the elimination of contaminant masses, and the calculation of the half decimal place rule [14]. Furthermore, it was used to generate peak list matrices as a prerequisite for cluster analyses using R. Moreover, the recalibration of binary peak lists and a peak pair comparison tool to determine ICAT ratios were applied. The MS-Screener results were converted into tab-separated files (.txt) to transfer the data into SQL*LIMS™.
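A peak pair comparison for determining ICAT ratios can be thought of as searching a peak list for two signals separated by the mass difference between the light and the heavy label. The Python sketch below shows this idea under loud assumptions: it is not the MS-Screener tool, the mass shift is a placeholder parameter that depends on the reagent, the charge state and the number of labeled cysteines, and the spectrum is invented.

```python
def find_peak_pairs(peaks, mass_shift, tolerance=0.05):
    """Return (light_mz, heavy_mz, heavy/light intensity ratio) for each matched pair.

    peaks is a list of (m/z, intensity) tuples; a pair is accepted when the
    m/z difference equals mass_shift within the absolute tolerance.
    """
    peaks = sorted(peaks)
    pairs = []
    for light_mz, light_int in peaks:
        for heavy_mz, heavy_int in peaks:
            delta = heavy_mz - light_mz
            if abs(delta - mass_shift) <= tolerance and light_int > 0:
                pairs.append((light_mz, heavy_mz, heavy_int / light_int))
    return pairs


# Placeholder shift for a singly charged peptide carrying one label (assumption).
ASSUMED_SHIFT = 8.05
spectrum = [
    (1000.50, 200.0),
    (1008.55, 400.0),
    (1500.75, 120.0),
    (1620.80, 90.0),
]
pairs = find_peak_pairs(spectrum, ASSUMED_SHIFT)
print(pairs)
```

Each resulting row could be written out as one tab-separated line, in keeping with the tab-separated files used to hand results over to the repository.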

Mass spectrometry and protein identification/quantification

For protein identifications, 2-DE spots were analyzed by MALDI-MS, MS/MS or ESI-MS/MS [16,18-20]. In most cases, spots to be identified were digested with trypsin prior to MS analysis [21]. MALDI-MS was carried out using a Voyager Elite MALDI-TOF mass spectrometer or a 4700 Proteomics Analyzer MALDI-TOF/TOF (both from Applied Biosystems, Framingham, USA). Protein identifications were achieved by database comparisons using search algorithms such as Mascot [22] or MS-Fit (http://prospector.ucsf.edu), whereby Mascot was available as an in-house version. Searches were accomplished either individually or in batch mode (analysis of large datasets). In the latter case, Mascot Daemon (http://www.matrixscience.com) was used as the batch interface. Individual searches were performed via the Mascot web front end or the SQL-LIMS™ clients, both of which were connected to the in-house Mascot server. The search parameters applied have been described previously [21]. Moreover, proteins were separated and identified by large-scale on-line LC/ESI-MS/MS. The protein samples were prepared as described [23] and measured with an LCQ ion trap mass spectrometer (Thermo Finnigan, San Jose, USA). For peptide identifications, the generated MS/MS spectra were evaluated using the SEQUEST analysis program and/or Mascot. In order to quantify differences between 20S proteasome subtypes [15,24] and between the proteomes of M. tuberculosis and M. bovis BCG [23], proteins were labelled with the ICAT reagent and analyzed by LC/ESI-MS/MS. To calculate the relative ratios, MS spectra were evaluated by the program Xpress. Furthermore, a complementary approach combining ICAT and 2-DE was used to detect differences in protein abundance; these were quantified by the program MS-Screener [24]. An iterative search procedure was applied for in-depth analysis of large 2-DE/MALDI-MS datasets [14].

Table 1: Overview of the different file formats stored in SQL*LIMS™

Workflow step: Sample Preparation
- Combination of basic sample types
- > 100 sample attributes

Workflow step: Protein Separation
Structured data: Spot data
Unstructured data:
- PDQuest® gel image (.tif files)
- PDQuest® spot reports (text ASCII files)
- Topspot gel image (.jpg files)
- Topspot spot report (text ASCII files)

Workflow step: MS Analysis
Structured data: MS peaklist data
Unstructured data:
- Thermo Finnigan LCQ™ DECA peaklist files (text ASCII files)
- Thermo Finnigan LCQ™ raw data location reference
- MALDI Voyager Elite raw peaklist files (text ASCII files)

Workflow step: Protein Identification
Structured data: Protein data, Peptide data
Unstructured data:
- Matrix Science Mascot® PMF v 1.8 search result (.html files)
- Matrix Science Mascot® MSMS Query v 1.8 search result (.html files)
- Matrix Science Mascot® Sequence v 1.8 search result (.html file)
- ProteinProspector MS-Fit® v 3.2.1, 3.4.1 search result (.html files)
- ProteinProspector MS-Tag® v 3.2.1, 3.4.1 search result (.html files)
- SEQUEST® v 2.0 search result (.xls files)

SQL-LIMS™ Proteomics Solution

The workflow described above requires a suitable system for the integration and management of raw and processed experimental data. These issues were addressed by the Laboratory Information Management System (LIMS) in combination with an implemented SQL*LIMS™ Proteomics Solution, customized for our proteomics research laboratory. The implemented solution was based on the Applied Biosystems™ product suite for life science, including a core application (SQL*LIMS™). The latter was designed for analytical laboratories, Pharma R&D and manufacturing environments. Furthermore, components specifically designed for microtiter plate (SQL*GT™) and proteomics (Proteomics Solution) data management were implemented. The operating flexibility and extensibility of this solution have minimized the requirement for code customization. SQL*LIMS™ users can enter new workflows or amend existing ones, and can open interfaces that provide add-on and built-in mechanisms for the integration of MS instruments and third-party tools. A highly integrated environment was treated from the very beginning as a key factor for enhancing productivity by streamlining time-consuming operations such as MS data exchange (work list uploading and peak list downloading) or the querying of protein search engines.

Figure 4. Post-processing: Clustering of polymer contaminants. A cluster dendrogram comprising 873 MS/MS mass spectra is shown (axis: distance height, 0.0 to 1.0). The data were recorded in a comparative ICAT/LC/ESI-MS study of proteasome subtypes from rat liver. The framed part of the dendrogram shows a cluster of 125 similar mass spectra originating from polymer contaminations due to avidin column purification steps. Half of these mass spectra were falsely matched to proteasome peptides by the Mascot search algorithm. These junk mass spectra were eliminated before the data were stored in SQL*LIMS™.
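The post-processing step summarized in the caption of Figure 4 groups similar MS/MS spectra so that a large cluster of near-identical contaminant spectra can be removed before storage. The Python sketch below shows the general idea under stated assumptions: spectra are binned along m/z, compared by cosine distance, and grouped by cutting a single-linkage clustering at a fixed distance. The bin width, the distance measure, the linkage and the cutoff are all assumptions made for illustration; the source does not specify which were used, and all names and peak lists are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Spectrum = Sequence[Tuple[float, float]]  # (m/z, intensity) peaks


def binned(spectrum: Spectrum, bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    bins: Dict[int, float] = defaultdict(float)
    for mz, intensity in spectrum:
        bins[int(mz // bin_width)] += intensity
    return bins


def cosine_distance(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Return 1 - cosine similarity; 0 means identical peak patterns."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def single_linkage_clusters(spectra: Sequence[Spectrum], cutoff: float) -> List[List[int]]:
    """Cut a single-linkage clustering at `cutoff`; return index groups, largest first."""
    vectors = [binned(s) for s in spectra]
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Spectra closer than the cutoff end up in the same connected component.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= cutoff:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = defaultdict(list)
    for i in range(len(vectors)):
        groups[find(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical toy data: three near-identical spectra and two unrelated ones.
spectra = [
    [(300.2, 10.0), (344.2, 12.0), (388.2, 9.0)],
    [(300.3, 11.0), (344.3, 11.0), (388.3, 10.0)],
    [(300.1, 9.0), (344.1, 13.0), (388.1, 8.0)],
    [(175.1, 20.0), (262.1, 15.0), (433.2, 30.0)],
    [(147.1, 25.0), (510.3, 18.0), (623.4, 22.0)],
]
clusters = single_linkage_clusters(spectra, cutoff=0.2)
suspect = clusters[0]  # largest cluster: candidate contaminant spectra
```

An unusually large cluster of mutually similar spectra, such as the first group here, would then be inspected and, if it turns out to be contamination, excluded before the search results are stored. The pairwise comparison is quadratic in the number of spectra, which is acceptable for a dataset of a few hundred spectra.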

Data transfer tool Java interface (DTT)

The data transfer tool was designed to facilitate the data transfer from SQL*LIMS™ into the public 2-DE database, which is the essential part of our Proteome Database System (http://www.mpiib-berlin.mpg.de/2D-PAGE/). The DTT was developed in Java using J2SE 1.4 (http://java.sun.com/j2se/1.4) and Eclipse (http://www.eclipse.org). The program comprised a graphical user interface (GUI) for selecting the datasets to be transferred. For safety reasons, data transfers out of SQL-LIMS™ were password-protected.
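The transfer logic described above (select datasets, check a password, copy the selected records into the public database) can be sketched as follows. This is a minimal Python illustration, not the Java DTT itself: the table name `identifications`, its columns, the dataset identifiers and the password handling are all hypothetical, and two in-memory SQLite databases stand in for the Oracle-based repository and the public 2-DE database.

```python
import hashlib
import hmac
import sqlite3
from typing import Iterable

# Hypothetical stored credential (salted PBKDF2 digest), for illustration only.
SALT = b"example-salt"
STORED_DIGEST = hashlib.pbkdf2_hmac("sha256", b"example-password", SALT, 100_000)


def is_authorized(password: str) -> bool:
    """Compare the supplied password against the stored digest in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), SALT, 100_000)
    return hmac.compare_digest(digest, STORED_DIGEST)


def transfer(source: sqlite3.Connection, target: sqlite3.Connection,
             dataset_ids: Iterable[str], password: str) -> int:
    """Copy the selected datasets from source to target; return the row count."""
    if not is_authorized(password):
        raise PermissionError("data transfer requires a valid password")
    ids = list(dataset_ids)
    if not ids:
        return 0
    placeholders = ",".join("?" for _ in ids)
    rows = source.execute(
        "SELECT spot_id, dataset_id, protein FROM identifications "
        f"WHERE dataset_id IN ({placeholders})",
        ids,
    ).fetchall()
    with target:  # commit on success, roll back on error
        target.executemany("INSERT INTO identifications VALUES (?, ?, ?)", rows)
    return len(rows)


SCHEMA = "CREATE TABLE identifications (spot_id TEXT, dataset_id TEXT, protein TEXT)"
lims_db = sqlite3.connect(":memory:")    # stands in for the internal repository
public_db = sqlite3.connect(":memory:")  # stands in for the public 2-DE database
lims_db.execute(SCHEMA)
public_db.execute(SCHEMA)
lims_db.executemany(
    "INSERT INTO identifications VALUES (?, ?, ?)",
    [("S1", "gel_A", "protein 1"), ("S2", "gel_A", "protein 2"), ("S3", "gel_B", "protein 3")],
)

copied = transfer(lims_db, public_db, ["gel_A"], "example-password")
```

Only the datasets that were explicitly selected leave the internal repository, and a failed password check aborts the transfer before any query is issued.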

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FS and MS carried out the proteomics studies, participated in the database structure and the template creation, and prepared the manuscript. KPP contributed to the concept and the realization of the 2D-PAGE and the SQL-LIMS database. MB participated in the DTT tool development. BT participated in the realization of the manuscript. PRJ conceived and coordinated the study, and participated in its design. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Luigi Colombo from ABI for his support and the BMBF (031U107A/207A) for funding.

References

1. Jungblut PR, Holzhütter HG, Apweiler R, Schlüter H: The speciation of the proteome. Chemistry Central Journal 2008, 2:1-10.

2. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999, 17:994-999.

3. Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, Mann M: Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics. Mol Cell Proteomics 2002, 1:376-386.

4. Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin DJ: Multiplexed Protein Quantitation in Saccharomyces cerevisiae Using Amine-Reactive Isobaric Tagging Reagents. Mol Cell Proteomics 2004, 3:1154-1169.

5. Gärdén P, Alm R, Häkkinen J: PROTEIOS: an open source proteomics initiative. BMC Bioinformatics 2005, 21:2085-2087.

6. Morisawa H, Hirota M, Toda T: Development of an open source laboratory information management system for 2-D gel electrophoresis-based proteomics workflow. BMC Bioinformatics 2006, 7:430.

7. Piggee C: LIMS and the art of MS proteomics. Anal Chem 2008, 1:4801-4806.

8. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, Roechert B, Poux S, Jung E, Mersch H, Kersey P, Lappe M, Li Y, Zeng R, Rana D, Nikolski M, Husi H, Brun C, Shanker K, Grant SG, Sander C, Bork P, Zhu W, Pandey A, Brazma A, Jacq B, Vidal M, Sherman D, Legrain P, Cesareni G, Xenarios I, Eisenberg D, Steipe B, Hogue C, Apweiler R: The HUPO PSI's Molecular Interaction format - a community standard for the representation of protein interaction data. Nat Biotechnol 2004, 22:177-183.

9. Orchard S, Martens L, Tasman J, Binz BA, Albar JP, Hermjakob H: 6th HUPO Annual World Congress – Proteomics Standards Initiative Workshop 6–10 October 2007, Seoul, Korea. Proteomics 2008, 7:1331-1333.

10. Taylor CF, Paton NW, Garwood KL, Kirby PD, Stead DA, Yin Z, Deutsch EW, Selway L, Walker J, Riba-Garcia I, Mohammed S, Deery MJ, Howard JA, Dunkley T, Aebersold R, Kell DB, Lilley KS, Roepstorff P, Yates JR 3rd, Brass A, Brown AJ, Cash P, Gaskell SJ, Hubbard SJ, Oliver SG: A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat Biotechnol 2003, 21:247-254.

11. Pedrioli PG, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, Cheung K, Costello CE, Hermjakob H, Huang S, Julian RK, Kapp E, McComb ME, Oliver SG, Omenn G, Paton NW, Simpson R, Smith R, Taylor CF, Zhu W, Aebersold R: A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2004, 22:1459-1466.

12. Deutsch E: mzML: a single, unifying data format for mass spectrometer output. Proteomics 2008, 14:2776-2777.

13. Pleissner KP, Eifert T, Buettner S, Schmidt F, Boehme M, Meyer TF, Kaufmann SH, Jungblut PR: Web-accessible proteome databases for microbial research. Proteomics 2004, 5:1305-1313.

14. Schmidt F, Schmid M, Mattow J, Facius A, Pleissner KP, Jungblut PR: Iterative data analysis is the key for exhaustive analysis of peptide mass fingerprints from proteins separated by two-dimensional electrophoresis. J Am Soc Mass Spectrom 2003, 14:943-956.

15. Dahlmann B, Ruppert T, Kuehn L, Merforth S, Kloetzel PM: Different proteasome subtypes in a single tissue exhibit different enzymatic properties. J Mol Biol 2000, 10:643-653.

16. Klose J: Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik 1975, 26:231-243.

17. Prehm J, Jungblut PR, Klose J: Analysis of two dimensional protein patterns using a video camera and a computer. Electrophoresis 1987, 8:562-572.

18. Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM: Electrospray Ionization-Principles and Practice. Science 1989, 246:64-71.

19. Tanaka K, Waki H, Ido Y, Akita S, Yoshida Y, Yoshida T: Protein and polymer analyses up to m/z 100,000 by laser ionization time-of-flight mass spectrometry. Rapid Commun Mass Spectrom 1988, 2:151-153.

20. Karas M, Hillenkamp F: Laser Desorption/Ionization of Proteins with Molecular Masses Exceeding 100,000 Daltons. Anal Chem 1988, 60:2299-2301.

21. Thiede B, Hohenwarter W, Krah A, Mattow J, Schmid M, Schmidt F, Jungblut PR: Peptide mass fingerprinting. Methods 2005, 35:237-247.

22. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20:3551-3567.


23. Schmidt F, Donahoe S, Hagens K, Mattow J, Schaible UE, Kaufmann SH, Aebersold R, Jungblut PR: Complementary analysis of the Mycobacterium tuberculosis proteome by two-dimensional electrophoresis and isotope-coded affinity tag technology. Mol Cell Proteomics 2004, 3:24-42.

24. Schmidt F, Dahlmann B, Janek K, Kloss A, Wacker M, Ackermann R, Thiede B, Jungblut PR: Comprehensive quantitative proteome analysis of 20S proteasome subtypes from rat liver by isotope coded affinity tag and 2-D gel-based approaches. Proteomics 2006, 6:4622-4632.

25. Smolka MB, Zhou H, Purkayastha S, Aebersold R: Quantitative Protein Profiling Using Two-dimensional Gel Electrophoresis, Isotope-coded Affinity Tag Labeling, and Mass Spectrometry. Mol Cell Proteomics 2002, 1:19-29.
