

THE GALAXY FRAMEWORK AS A UNIFYING BIOINFORMATICS SOLUTION FOR ‘OMICS’ CORE FACILITIES

Pratik Jagtap (1); James Johnson (2); Bart Gottchalk (2); Getiria Onsongo (2); Sricharan Bandhakavi (3); Joel Kooren (4); Brian Sandri (4); Ebbing de Jong (1); Todd Markowski (1); LeeAnn Higgins (1); Chris Wendt (4); Joel Rudney (4) and Timothy Griffin (4)

1. Center for Mass Spectrometry and Proteomics
2. Minnesota Supercomputing Institute
3. Bio-Rad Laboratories, Richmond, CA
4. University of Minnesota, Minneapolis, MN 55455

INTRODUCTION

•  Integrating different ‘omics’ data can reveal novel insights into biological systems.
•  However, the need for multiple, disparate software tools makes the integration of multi-omic data a serious challenge.
•  Extending Galaxy (a web-based bioinformatics data analysis platform) with mass spectrometry-based proteomics software enables advanced multi-omic applications such as proteogenomics, metaproteomics and quantitative proteomics.

GALAXY-P

[Figures: Galaxy-P Workflow; Galaxy-P Tools]

Galaxy-P has multiple software tools, some proteomics-specific and others from the genomics Galaxy framework. Tools can be used in a sequential manner to generate workflows that can be reused, shared and creatively modified for multiple studies.

Benefits of Galaxy / Galaxy-P:
•  Software accessibility and usability.
•  Shareability of tools, workflows and histories.
•  Reproducibility, and the ability to test and compare results after using multiple parameters.

[Poster panel titles: PROTEOGENOMICS (Salivary and OPML datasets); METAPROTEOMICS; QUANTITATIVE PROTEOMICS; CONCLUDING REMARKS; METHODS & DATASETS]

Figure: Overview of modules and analytical workflows for metaproteomic analysis.

[Workflow figure, modules A to E. Text recovered from the figure:]

A  Peak list processing
B  Database generation
C  Two-step Database Search method
D  Identifying peptides from microbial db
E  BLAST-P Analysis

Other labels in the figure: Mass spectra; Peak list (MGF or mzML); msconvert; MGF Formatter; translation; Peptide Summary; Data Processing; Target-Decoy database; Target database; Peptide Summary from Second-Step Search; Microbial Peptides; Short peptides (<30 aas); Long peptides (>30 aas); Host Protein Database; Metagenomic Database; Microbial protein db.

Example entries shown in the figure:

> ENST00000 cdna: TACGGCCGTCGTGCCC
> ENST000000 cdna: TCGTGCCGCTTACGGC

> ENST00000 protein ITSAPRTEINDATASET
> ENST000000 protein INANTHERFRAMETHGH
> sp|Acc No 1|Human MANPRTEINS
> sp|Acc No 2|Human MANYHMANPRTEINS

> ENST00000 protein (decoy) TESATADNIIETRPASTI
> ENST000000 protein (decoy) HGHTEMARFREHTNANI
> sp|Acc No 1|Human (decoy) SNIETRPNAM
> sp|Acc No 2|Human (decoy) SNIETRPNAMHYNAM

PTNTIALNEWPRTEFRM  PEPTIDESINFSTAFRMT  MICRPEPTIDES  ARCHAEALPEPTIDES

> ENST| Peptides 8-30 aas 1  SMALLPEPTIDES
> ENST| Peptides 8-30 aas 2  REALLYSHRTPEPTIDES
> ENST| Peptides 31 and more aas 1  VERYLNGPEPTIDESSPERLNGPEPTIDESFREVER
> ENST| Peptides 8-30 aas 2  HGEPEPTIDESTHTNEEDSEARCHESWTHWRK
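The example entries in the figure suggest that each decoy is simply the reversed target sequence (MANPRTEINS becomes SNIETRPNAM). The following is a minimal sketch of building a target-decoy database on that assumption. The function names and the "(decoy)" header suffix are illustrative choices, not the actual Galaxy-P tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def add_reversed_decoys(records, tag="(decoy)"):
    """Return the target records followed by one reversed decoy per target.

    Assumption: decoys are plain sequence reversals, as the poster's
    example entries suggest.
    """
    decoys = [(f"{header} {tag}", seq[::-1]) for header, seq in records]
    return records + decoys


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Illustrative target entries taken from the figure.
target = """>sp|Acc No 1|Human
MANPRTEINS
>sp|Acc No 2|Human
MANYHMANPRTEINS
"""

target_decoy = add_reversed_decoys(parse_fasta(target))
print(to_fasta(target_decoy))
```

Appending the decoys after the targets keeps the original entries unchanged, so the same file can be used as a target-only database by dropping the tagged records.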

[Panel titles: SECC and Salivary datasets; COPD dataset; WORKFLOW AND TUTORIAL]

Results Summary (SECC and Salivary datasets)

•  20 KEGG pathways.
•  Most prevalent pathway: Carbohydrate metabolism.
•  ‘Best-populated’ pathway: Glycolysis (Carbohydrate metabolism).
•  Protein with highest number of reads: Glyceraldehyde-3-phosphate.

Dataset                            Total spectra   Distinct peptides of microbial origin   Phyla*   Genera*   Species*
Whole human salivary supernatant   988,974         1,926                                   12       65        123
SECC without sucrose               153,019         28,126                                  5        33        56
SECC with sucrose                  139,759         23,029                                  5        13        33

•  Analytical transparency
•  Scalability of data

METHODS & DATASETS

RAW files from multiple datasets (see below) were generated on an Orbitrap Velos instrument. The processed peak lists were searched using ProteinPilot™ version 4.5 (AB Sciex) within Galaxy-P (usegalaxyp.org). After optimization and testing, multiple workflows were run sequentially, each generating the inputs for the subsequent workflow.

METAPROTEOMICS
•  Severe Early Childhood Caries (SECC) dataset for clinical comparison of oral microcosm biofilms grown from plaque in either the presence or absence of sucrose.
•  Salivary supernatant dataset, 3D-fractionated with or without ProteoMiner treatment (Bandhakavi et al 2009). 200 RAW files were acquired on an LTQ/Orbitrap XL.
Both datasets were searched against the human oral microbiome database (HOMD) using the two-step method (Jagtap et al 2013).
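The two-step method is only cited here (Jagtap et al 2013), not described. As a rough sketch of the general idea, assume a first search against the large database reports a set of matched accessions. A much smaller database is then built from those entries for the second search. The accessions, the hit list and the function name below are hypothetical, and the published method may differ in detail.

```python
def refine_database(records, first_step_accessions):
    """Keep only entries whose accession was matched in the first search.

    Database order is preserved, and duplicate hits are collapsed via a set.
    """
    wanted = set(first_step_accessions)
    return [(acc, seq) for acc, seq in records if acc in wanted]


# Hypothetical first-step database: (accession, sequence) pairs.
large_db = [
    ("microbe_0001", "PTNTIALNEWPRTEFRM"),
    ("microbe_0002", "PEPTIDESINFSTAFRMT"),
    ("microbe_0003", "MICRPEPTIDES"),
    ("microbe_0004", "ARCHAEALPEPTIDES"),
]

# Hypothetical accessions reported by the first-step search (with a repeat).
first_step_hits = ["microbe_0003", "microbe_0001", "microbe_0003"]

refined_db = refine_database(large_db, first_step_hits)
# In a second step, refined_db would be combined with decoy entries
# and searched again; that search is outside this sketch.
print([acc for acc, _ in refined_db])
```

Shrinking the search space before the decoy-controlled second search is the point of the sketch. The search engine itself (ProteinPilot in this work) is not modelled.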

PROTEOGENOMICS
•  Salivary supernatant: same as in the metaproteomics study above.
•  Oral pre-malignant lesion (OPML) dataset, collected as brush biopsy samples from six individuals with pre-malignant lesions, together with matched control samples from the adjacent oral cavity (Kooren et al unpublished).
Both datasets were searched against a 3-frame translated cDNA database and the human oral microbial database using the two-step method (Jagtap et al 2013).
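To illustrate what a 3-frame translated cDNA database contains, here is a minimal sketch that translates one cDNA sequence in its three forward reading frames using the standard genetic code. It assumes forward frames only and leaves stop codons as "*". The function names are illustrative, and a real database-generation tool would likely also split or trim at stops.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide sequence from its first base; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(cdna):
    """Return the translations of the three forward reading frames."""
    return [translate(cdna[frame:]) for frame in range(3)]


# Illustrative cDNA string from the poster's workflow figure.
for frame, protein in enumerate(three_frame_translation("TACGGCCGTCGTGCCC"), start=1):
    print(f"frame {frame}: {protein}")
```

Searching all three frames is what allows alternative-frame peptides, such as those reported for PRB1 and PRB2, to be matched at all.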

QUANTITATIVE PROTEOMICS
•  Chronic Obstructive Pulmonary Disease (COPD)-linked lung cancer tissue samples were collected and subjected to iTRAQ labeling and 2D LC-MS. The dataset was searched against the Human UniProt database.

Figure: Overview of modules and analytical workflows for proteogenomic analysis.

* Phyla, genera and species counts: analysis using MEGAN.

CONCLUDING REMARKS

•  Salivary proteogenomics: 52 novel proteoforms identified (notably, alternate frame translation for PRB1 and PRB2 proteins).
•  SECC metaproteomics: analysis of outputs from Galaxy-P workflows and MEGAN is ongoing for three replicates.
•  Quantitative proteomics: workflows have been used on multiple replicates. Reproducibility analysis and RNASeq data integration are in progress.
•  IMMEDIATE PLANS:
   -  Working with the genomics research community and the Genomics Center to integrate RNASeq-derived workflows for database generation.
   -  Working closely with the metagenomics / microbiome research community to develop functional pathway analysis workflows using the metaproteomics data.
   -  Correlating RNASeq quantitative information with quantitative iTRAQ proteomic information for both model and non-model organisms.
•  FUTURE PLANS:
   -  Installation and testing of open-source tools (such as MS-GF+ and PeptideShaker), carried out through an international collaboration between developers and users.
   -  Improving the current metaproteomic and proteogenomic workflows.

Acknowledgements: This work was funded by NSF grant 1147079. Many thanks also to John Chilton (Penn State) for his work in the first year of the project.

Shareable workflows:
z.umn.edu/peaklistconversion
z.umn.edu/dbgen
z.umn.edu/mn2stp
z.umn.edu/peptidesfromcdnadb
z.umn.edu/blastp
z.umn.edu/psme
z.umn.edu/pep2gtf
All together: z.umn.edu/pg140

[Figure: peptide sequences from the PRB1 and PRB2 genes on human chromosome 12 (11,505 kb to 11,549 kb), shown for the normal coding frame and the alternative frame. Sequences recovered from the figure, in extraction order:]

*QPPRSPRGGQ   LHSPLSDSPLDPLDAG
QPQQPNGGAPPPPG
RLSSPIAEQQLHPV
QPPPGQPKGPPS*P   KHPHDKHSEQLLDPV
QPPPGQPKGPPPPPGQPKNGGQPPPG
KHPHDKHSEQLHPVKLNTAEKHPHD
QPQQPNGGAPPPPG
RLSSPIAEQQLHPV
QSKNDGPPPGQPKGPPPPGQ
KPSTTEKHPHDKHSEQLHPV
QPKGPPSRSSRSKNDG   KHSEQLLDLVEPSTTE

Results Summary

Columns: (1) Number of spectra; (2) Novel proteoform peptides; (3) Novel proteoform peptides filtered after PSM evaluation; (4) Number of distinct peptides after visualization and for genome localization.

Dataset                (1)       (2)   (3)   (4)
Salivary supernatant   988,974   254   105   52
OPML Control           156,405   904   34    17
OPML Lesion            157,299   887   29    21
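As a quick reading aid for the table, the sketch below computes what fraction of the initial novel proteoform peptides survives each later step. The counts are copied from the table. The variable and function names are illustrative.

```python
# (novel peptides, after PSM evaluation, distinct after visualization/localization)
results = {
    "Salivary supernatant": (254, 105, 52),
    "OPML Control": (904, 34, 17),
    "OPML Lesion": (887, 29, 21),
}


def retention(counts):
    """Return the fractions of initial peptides kept after PSM evaluation and at the final step."""
    initial, after_psm, final = counts
    return after_psm / initial, final / initial


for dataset, counts in results.items():
    psm_fraction, final_fraction = retention(counts)
    print(f"{dataset}: {psm_fraction:.1%} kept after PSM evaluation, "
          f"{final_fraction:.1%} kept at the final step")
```

The OPML datasets lose a much larger share of candidates at PSM evaluation than the salivary dataset does. The poster does not discuss the reason.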

Figure: Representation of genomic organization of identified novel proteoform-specific peptides from PRB1 and PRB2 genes on chromosome 12.

Shareable workflows:
z.umn.edu/peaklistconversion
z.umn.edu/dbgenmp
z.umn.edu/mn2stp
z.umn.edu/pepfrommicrobialdb
z.umn.edu/blastp
All together: z.umn.edu/mp65

[Workflow figure, panels A to E. Text recovered from the figure:]

Peptides in FASTA format, for example:
> ENST|Potential new Microbe1  PTNTIALNEWPRTEFRM
> ENST|Potential new Microbe2  PEPTIDESINFSTAFRMT

Submit for MEGAN analysis
Submit to UniPept for analysis
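The workflow figures show identified peptides exported in FASTA format, grouped into peptides of 8-30 residues and peptides of 31 or more residues ahead of BLAST-P analysis. The poster does not state the reason. A plausible one is that short queries need different BLAST-P settings. The sketch below applies those length cut-offs. The function names and header prefixes are hypothetical.

```python
def split_by_length(peptides, min_len=8, short_max=30):
    """Split peptides into short (min_len..short_max) and long (> short_max) groups.

    Peptides shorter than min_len are dropped, matching the 8-30 and
    31-and-more groups named in the figure.
    """
    short = [p for p in peptides if min_len <= len(p) <= short_max]
    long_ = [p for p in peptides if len(p) > short_max]
    return short, long_


def peptides_to_fasta(peptides, prefix):
    """Write peptides as FASTA records with numbered headers."""
    return "".join(f">{prefix}_{i}\n{p}\n" for i, p in enumerate(peptides, start=1))


# Illustrative peptide strings from the poster's figure, plus one too-short entry.
peptides = [
    "SMALLPEPTIDES",
    "REALLYSHRTPEPTIDES",
    "VERYLNGPEPTIDESSPERLNGPEPTIDESFREVER",
    "TINY",
]

short, long_ = split_by_length(peptides)
print(peptides_to_fasta(short, "peptides_8_30_aas"))
print(peptides_to_fasta(long_, "peptides_31_and_more_aas"))
```

Each group could then be submitted as its own BLAST-P job, or to MEGAN or UniPept as the figure indicates.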

[Proteogenomic workflow figure, modules A to G. Text recovered from the figure:]

A  Peak list processing
B  Database generation
C  Two-step Database Search method
D  Identifying peptides from translated cDNA db
E  BLAST-P Analysis
F  Peptide Spectral Match Evaluation
G  Peptide to GTF conversion

Other labels in the figure: Filter peptides with mismatches to human NCBI database; Peptide Summary of new proteoform peptides; Peak list (mzML); Peptide Spectral Match Evaluator; Spectral Visualization; Filtering of Peptide Spectral Matching Metrics; Peptide Summary of new proteoform peptides with quality peptide spectral matching characteristics; Peptide to GTF; General Transfer Format file for Human genome; Visualization in genomic context.
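The Peptide to GTF conversion step produces a file that a genome browser can display. As a hedged sketch of the output format only, the function below writes one peptide as a standard nine-column, tab-separated GTF line. The coordinates, strand, source name and use of the "exon" feature type are hypothetical. The real tool must also map each peptide back through its transcript to genomic coordinates, which is not shown.

```python
def peptide_to_gtf_line(chrom, start, end, strand, peptide, source="galaxyp_sketch"):
    """Format one peptide as a single 9-column GTF line (1-based, inclusive coordinates)."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start < 1 or end < start:
        raise ValueError("GTF coordinates are 1-based with start <= end")
    # GTF requires gene_id and transcript_id attributes; the peptide
    # sequence is reused for both here purely for illustration.
    attributes = f'gene_id "{peptide}"; transcript_id "{peptide}";'
    fields = [chrom, source, "exon", str(start), str(end), ".", strand, ".", attributes]
    return "\t".join(fields)


# Hypothetical placement of a 14-residue peptide (42 bases) on chromosome 12.
line = peptide_to_gtf_line("chr12", 11506000, 11506041, "+", "QPQQPNGGAPPPPG")
print(line)
```

A file of such lines can be loaded as an annotation track alongside the reference genome to view peptides in their genomic context.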

Tutorial: z.umn.edu/ppingp

PEPTIDE SPECTRAL MATCH EVALUATION

[Screenshots: Input Parameters; Tabular Output; HTML Output; Spectral Visualization]

GENOMIC CONTEXT VISUALIZATION

[Screenshot of a novel proteoform-specific peptide within the Integrative Genomics Viewer (IGV).]