INTRODUCTION



We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, thus providing a complete pipeline for proteomics data analysis. Predictive classification studies on MALDI-TOF data based on this pipeline are presented.

REFERENCES

[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press. http://dx.doi.org/10.1016/j.future.2006.04.011

[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005: 549-554.

[3] C. Furlanello, M. Serafini, S. Merler, G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2):110-118, 2005.

[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006: 941-946.

[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034-3044, 2004.

Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics

Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**. *University Magna Graecia of Catanzaro, Italy, **ITC-irst, Trento, Italy

MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.

DATASETS

D1. MALDI-TOF Ovarian Cancer Dataset, from www-stat.stanford.edu/~tibs/PPC/Rdist [5]

• 49 samples (24 diseased + 25 controls)
• Each raw sample has 56384 m/z measurements (892 KB)
• Each preprocessed sample has 564 m/z measurements (19 KB)
• Preprocessing:
  • Normalization
  • Binning
• Biomarker identification:
  • Baseline subtraction
  • Peak alignment - clustering
  • 67 features identified

D2. (lab calibration sample) MALDI-TOF, human serum, 20 technical replicates, 10 control samples, 10 with 2 proteins, 34671 measurements, 347 m/z after preprocessing, predictive discrimination with 7 peaks

MS-ANALYZER

MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of "in silico" proteomics studies: ontologies are used to model software tools and spectra data, while workflows are used to model applications. MS-Analyzer uses a specialized spectra database and provides a set of pre-processing services:

• Interface to heterogeneous mass spectrometer formats such as MALDI-TOF, SELDI-TOF, ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.

• Acquisition, storage, and management of MS data with the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared). Single spectra, multiple spectra, or portions of spectra can be queried (in-database preprocessing).

• Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].

• Sharing of experimental data, workflows and knowledge
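Three of the preprocessing steps named above (baseline subtraction, normalization, binning) can be illustrated with a minimal Python sketch. This is not MS-Analyzer code: the function names, the sliding-minimum baseline, the total-ion-current normalization and the fixed bin width are all assumptions chosen for brevity. A bin width of 100 readings is likewise an assumption; it happens to reduce 56384 raw measurements to 564 bins, which matches the sizes reported for D1.

```python
from statistics import fmean


def subtract_baseline(intensities, window):
    """Subtract a crude baseline: the minimum intensity in a sliding window.

    Assumed method for illustration; real tools use more robust estimators.
    """
    half = window // 2
    n = len(intensities)
    corrected = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        corrected.append(intensities[i] - min(intensities[lo:hi]))
    return corrected


def normalize_total_ion_current(intensities):
    """Scale a spectrum so that its intensities sum to 1."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)  # nothing to scale in an all-zero spectrum
    return [value / total for value in intensities]


def bin_spectrum(intensities, bin_size):
    """Average consecutive readings in fixed-width bins (last bin may be shorter)."""
    return [
        fmean(intensities[start:start + bin_size])
        for start in range(0, len(intensities), bin_size)
    ]
```

A spectrum would pass through the three functions in the order shown; peak alignment and ARFF conversion are left out because they depend on details the poster does not give.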

[Figure: MS-Analyzer architecture. Components shown: Ontology-based Workflow Designer (Ontology Assistant: browsing, querying; WF Editor: composition, browsing, selection, visualization); Ontology manager and Ontologies; WF Schema (Abstract, Concrete WF); WF Translator, WF Scheduler, WF Monitor; Resource Discovery Services (UDDI/MDS, Metadata, WSDL); Web services (WS1, WS2) reached over the Network and grouped as Spectra Management, Spectra Visualization, Spectra Preparation and Spectra Preprocessing Services; SpecDB APIs over RSR (raw spectra), PPSR (pre-processed spectra) and PSR (prepared spectra).]

[Figure: Web services architecture. Components shown: M-WS; Ontology-based Workflow Designer; BioDCV WS; BioDCV WS front-end server; FTP repository (data, metadata); transmitted items (repository URL, email); DMZ server (Apache, mod_python, ZSI module).]

BIODCV

The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To cope with the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3].

For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering and peak assignment that were adapted from existing R packages and chained to the complete validation system.

BioDCV is also a grid application and it has been used in production within the EGEE Biomed VO [4].
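The idea behind complete validation, namely that feature selection must be repeated inside every resampling split so that test samples never influence which features are kept, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BioDCV code: a difference-of-class-means ranking stands in for E-RFE, a nearest-centroid rule stands in for the SVM, and all function names and default parameters are hypothetical.

```python
import random
from statistics import fmean


def rank_features(rows, labels):
    """Rank feature indices by absolute difference of class means.

    A simple stand-in for E-RFE, used only to keep the sketch short.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(fmean(pos) - fmean(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def centroid_predict(train_rows, train_labels, features, row):
    """Return the label whose class centroid is nearest on the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        dist = sum((row[j] - fmean(m[j] for m in members)) ** 2 for j in features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def complete_validation(rows, labels, n_keep, n_splits=20, test_fraction=0.25, seed=0):
    """Average test error over random splits, selecting features per split.

    Assumes binary labels (0/1) and that every training split contains
    both classes; a production version would stratify the splits.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, int(len(rows) * test_fraction))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Key step: ranking uses training samples only, never the test samples.
        features = rank_features(train_rows, train_labels)[:n_keep]
        wrong = sum(
            centroid_predict(train_rows, train_labels, features, rows[i]) != labels[i]
            for i in test_idx
        )
        errors.append(wrong / n_test)
    return fmean(errors)
```

Ranking the features once on the full dataset and then cross-validating only the classifier would let test samples leak into the selection step, which is the selection bias that complete validation is designed to control.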

[Figure: BioDCV processing chain. Stages shown: Feature extraction (within sample, across samples); Complete Validation; R scripts (visualization: ATE, sample tracking); PHP (biomarker lists, HTML publication). Outputs: biomarkers data, report.]

ACKNOWLEDGMENTS

• ITC-irst: R. Flor, D. Albanese, B. Irler
• UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza

WEB SERVICES ARCHITECTURE

Three Internet Web Services remotely integrate the two main system components.

The BioDCV component is invoked from the MS-Analyzer workflow as a Web Service (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification email address are transmitted to the BioDCV Web Service (biodcv-ws) in a DMZ area of the ITC-irst network.

This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server (server-cz-tn.py) within the firewalled area.

The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and the user is notified by email.
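The request flow described in this section can be summarized as a small Python sketch. It only models the order of the steps; it does not reproduce the SOAP interface of biodcv-ws or any ZSI call. The `JobRequest` type, the `run_job` function and the five callables passed to it are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JobRequest:
    """What the client transmits: where the data are and whom to notify."""
    data_url: str      # location of data and metadata in the FTP repository
    notify_email: str  # address notified once results are published


def run_job(
    request: JobRequest,
    fetch: Callable[[str], bytes],
    extract_features: Callable[[bytes], List[List[float]]],
    validate: Callable[[List[List[float]]], Dict[str, float]],
    publish: Callable[[Dict[str, float]], str],
    notify: Callable[[str, str], None],
) -> str:
    """Run the stages in the order given in the text and return the result URL.

    Each stage is injected as a callable, so the sketch makes no claim
    about how the real front-end implements it.
    """
    raw = fetch(request.data_url)              # transfer incoming data
    features = extract_features(raw)           # feature extraction module
    results = validate(features)               # complete validation
    result_url = publish(results)              # publication on the DMZ server
    notify(request.notify_email, result_url)   # email notification
    return result_url
```

Passing the stages in as callables keeps the ordering explicit and lets each stage be replaced by a stub when the flow is tested without network access.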

[Figure: ATE against number of features. x-axis: number of features (1, 5, 10, 15, 20, 30, 40, 50, 67); y-axis: ATE (10 to 40). Legend: error rate (tumour tissue), error rate (non-tumoural tissue), no-information error rate.]

[Figure: 25 panels of E(S) (0.0 to 1.0) against n (1, 5, 50), titled S0 (26), S1 (28), S2 (27), S3 (25), S4 (26), S5 (35), S6 (19), S7 (32), S8 (31), S9 (30), S10 (24), S11 (22), S12 (22), S13 (24), S14 (20), S15 (27), S16 (24), S17 (22), S18 (26), S19 (18), S20 (27), S21 (25), S22 (19), S23 (21), S24 (23).]

[Figure: The BioDCV system: EGEE BioMed VO. Transfers of 2-50 MB and 50-400 MB, labelled grid-ftp and scp.]

Commands:
1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local

[Figure: D2 mean A and D2 mean B, each with its .95 Student bootstrap CI. x-axis: m/z (9100 to 9200); y-axis: intensity (0 to 4000); annotated value: 9133.17 Da.]
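The poster does not say how the .95 Student bootstrap confidence intervals were computed. One common construction, the bootstrap-t (studentized) interval for a mean, is sketched below in Python; the function name, the number of resamples and the simple index-based quantile rule are assumptions, and the sketch would be applied separately to the replicate intensities at each m/z value.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def bootstrap_t_ci(values, confidence=0.95, n_boot=2000, seed=0):
    """Bootstrap-t confidence interval for the mean of `values`.

    Each resample contributes a studentized statistic
    t* = (mean* - mean) / se*, and the interval is
    [mean - t_hi * se, mean - t_lo * se].
    """
    rng = random.Random(seed)
    n = len(values)
    mean = fmean(values)
    se = stdev(values) / sqrt(n)
    t_stats = []
    for _ in range(n_boot):
        sample = rng.choices(values, k=n)  # resample with replacement
        sample_se = stdev(sample) / sqrt(n)
        if sample_se == 0:
            continue  # degenerate resample: studentized statistic undefined
        t_stats.append((fmean(sample) - mean) / sample_se)
    t_stats.sort()
    alpha = 1.0 - confidence
    t_lo = t_stats[int((alpha / 2) * (len(t_stats) - 1))]
    t_hi = t_stats[int((1 - alpha / 2) * (len(t_stats) - 1))]
    return mean - t_hi * se, mean - t_lo * se
```

With only 10 replicates per group, as in D2, a studentized interval is usually preferred over a plain percentile interval because it accounts for the variability of the standard error itself.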