THE GALAXY FRAMEWORK AS A UNIFYING BIOINFORMATICS...

THE GALAXY FRAMEWORK AS A UNIFYING BIOINFORMATICS SOLUTION FOR ‘OMICS’ CORE FACILITIES

Pratik Jagtap1; James Johnson2; Bart Gottchalk2; Getiria Onsongo2; Sricharan Bandhakavi3; Joel Kooren4; Brian Sandri4; Ebbing de Jong1; Todd Markowski1; LeeAnn Higgins1; Chris Wendt4; Joel Rudney4 and Timothy Griffin4

1. Center for Mass Spectrometry and Proteomics
2. Minnesota Supercomputing Institute
3. Bio-Rad Laboratories, Richmond, CA
4. University of Minnesota, Minneapolis, MN 55455

INTRODUCTION

GALAXY-P

[Figure panels: Galaxy-P Workflow; Galaxy-P Tools]

Galaxy-P provides multiple software tools: some are proteomics-specific, and others come from the genomics Galaxy framework.

•  Integrating different ‘omics’ data can reveal novel insights into biological systems.

•  However, the need for multiple, disparate software tools makes the integration of multi-omic data a serious challenge.

•  Extending Galaxy (a web-based bioinformatics data analysis platform) with mass spectrometry-based proteomics software enables advanced multi-omic applications such as proteogenomics, metaproteomics and quantitative proteomics.

PROTEOGENOMICS METAPROTEOMICS QUANTITATIVE PROTEOMICS

CONCLUDING REMARKS

Salivary and OPML datasets

OVERVIEW OF MODULES AND ANALYTICAL WORKFLOWS FOR METAPROTEOMIC ANALYSIS.

Tools can be used in a sequential manner to generate workflows that can be reused, shared and creatively modified for multiple studies.

Benefits of Galaxy / Galaxy-P:
•  Software accessibility and usability.
•  Shareability of tools, workflows and histories.
•  Reproducibility, and the ability to test and compare results obtained with multiple parameter settings.

METHODS & DATASETS

[Figure panel, database generation: example FASTA entries]

cDNA entries:
> ENST00000 cdna: TACGGCCGTCGTGCCC
> ENST000000 cdna: TCGTGCCGCTTACGGC

Translated protein entries:
> ENST00000 protein ITSAPRTEINDATASET
> ENST000000 protein INANTHERFRAMETHGH

UniProt entries:
> sp|Acc No 1|Human MANPRTEINS
> sp|Acc No 2|Human MANYHMANPRTEINS

Decoy entries:
> ENST00000 protein (decoy) TESATADNIIETRPASTI
> ENST000000 protein (decoy) HGHTEMARFREHTNANI
> sp|Acc No 1|Human (decoy) SNIETRPNAM
> sp|Acc No 2|Human (decoy) SNIETRPNAMHYNAM
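
The decoy entries shown above look like the target sequences written backwards (for example, MANPRTEINS becomes SNIETRPNAM). The following is a minimal sketch of building such a target-decoy database, assuming simple sequence reversal; the function names `parse_fasta` and `add_reversed_decoys` and the "(decoy)" header tag are illustrative and are not taken from the Galaxy-P tools.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def add_reversed_decoys(records, tag="(decoy)"):
    """Append one reversed-sequence decoy per target record (assumed scheme)."""
    decoys = [(f"{header} {tag}", seq[::-1]) for header, seq in records]
    return records + decoys


targets = parse_fasta(""">sp|Acc No 1|Human
MANPRTEINS
>sp|Acc No 2|Human
MANYHMANPRTEINS
""")
target_decoy = add_reversed_decoys(targets)
for header, seq in target_decoy:
    print(f">{header}\n{seq}")
```

Other decoy schemes (shuffling, pseudo-reversal) exist; reversal is used here only because it matches the example entries on the poster.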

[Figure: workflow overview. Panels:]
A. Peak list processing
B. Database generation
C. Two-step Database Search method
D. Identifying peptides from microbial db
E. BLAST-P Analysis

Labels within the panels: Mass spectra; msconvert; MGF Formatter; Peak list (MGF or mzml); translation; Host Protein Database; Metagenomic Database; Microbial protein db; Target database; Target-Decoy database; Peptide Summary Data Processing; Peptide Summary from Second-Step Search; Data Processing; Microbial Peptides; Short peptides (<30 aas); Long peptides (>30 aas).

Example entries shown in the panels:
PTNTIALNEWPRTEFRM
PEPTIDESINFSTAFRMT
MICRPEPTIDES
ARCHAEALPEPTIDES
> ENST| Peptides 8-30 aas 1 SMALLPEPTIDES
> ENST| Peptides 8-30 aas 2 REALLYSHRTPEPTIDES
> ENST| Peptides 31 and more aas 1 VERYLNGPEPTIDESSPERLNGPEPTIDESFREVER
> ENST| Peptides 8-30 aas 2 HGEPEPTIDESTHTNEEDSEARCHESWTHWRK
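
The figure separates peptides into a short group and a long group before BLAST-P analysis. A minimal sketch of that split is shown below, assuming the 8 to 30 and 31-and-more residue ranges that appear in the example entries; the function name `split_by_length` and its parameters are hypothetical, and the reason for the split (short queries typically need different BLAST-P settings) is an assumption rather than something the poster states.

```python
def split_by_length(peptides, min_len=8, short_max=30):
    """Partition peptides into short (min_len..short_max) and long (> short_max).

    Peptides shorter than min_len are dropped. The default cut-offs follow
    the '8-30 aas' and '31 and more aas' labels and are assumptions.
    """
    short = [p for p in peptides if min_len <= len(p) <= short_max]
    long_ = [p for p in peptides if len(p) > short_max]
    return short, long_


peptides = [
    "SMALLPEPTIDES",
    "REALLYSHRTPEPTIDES",
    "VERYLNGPEPTIDESSPERLNGPEPTIDESFREVER",
    "HGEPEPTIDESTHTNEEDSEARCHESWTHWRK",
]
short_peptides, long_peptides = split_by_length(peptides)
print("short:", short_peptides)
print("long:", long_peptides)
```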

SECC and Salivary datasets

Results Summary

COPD dataset WORKFLOW AND TUTORIAL

•  20 KEGG pathways.
•  Most prevalent pathway: Carbohydrate metabolism.
•  ‘Best-populated’ pathway: Glycolysis (Carbohydrate metabolism).
•  Protein with highest number of reads: Glyceraldehyde-3-phosphate.

Dataset | Total spectra | Distinct peptides of microbial origin | Phyla* | Genera* | Species*
Whole human salivary supernatant | 988,974 | 1,926 | 12 | 65 | 123
SECC without sucrose | 153,019 | 28,126 | 5 | 33 | 56
SECC with sucrose | 139,759 | 23,029 | 5 | 13 | 33

•  Analytical transparency.
•  Scalability of data.

RAW files from multiple datasets (see below) were generated on an Orbitrap Velos instrument. The processed peak lists were searched using ProteinPilot™ version 4.5 (AB Sciex) within Galaxy-P (usegalaxyp.org). After optimization and testing, multiple workflows were run sequentially, with each workflow generating the inputs for the next.

METAPROTEOMICS
•  Severe Early Childhood Caries (SECC) dataset for clinical comparison of oral microcosm biofilms grown from plaque in the presence or absence of sucrose.
•  Salivary supernatant dataset: 3D-fractionated with or without ProteoMiner treatment (Bandhakavi et al 2009). 200 RAW files were acquired on an LTQ/Orbitrap XL.

Both datasets were searched against the human oral microbiome database (HOMD) using the two-step method (Jagtap et al 2013).
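
The poster does not spell out the two-step method, so the following is only a rough sketch under a stated assumption: a first search against the large database yields a set of matched accessions, and a second search is then run against a reduced database built from those accessions plus the host proteins. The function name `build_second_step_db`, the accession names and the dictionary layout are all hypothetical; the search engine itself is outside the sketch.

```python
def build_second_step_db(large_db, first_step_hits, host_db):
    """Return a reduced search database for a second search step.

    large_db and host_db map accession -> protein sequence.
    first_step_hits is an iterable of accessions matched in the first search.
    """
    hits = set(first_step_hits)
    # Keep only entries that received at least one match in the first step.
    reduced = {acc: seq for acc, seq in large_db.items() if acc in hits}
    # Assumption: host proteins are always retained in the reduced database.
    reduced.update(host_db)
    return reduced


microbial_db = {
    "MICROBE_1": "PTNTIALNEWPRTEFRM",
    "MICROBE_2": "PEPTIDESINFSTAFRMT",
    "MICROBE_3": "ARCHAEALPEPTIDES",
}
host_db = {"HUMAN_1": "MANPRTEINS"}
second_step_db = build_second_step_db(
    microbial_db, ["MICROBE_1", "MICROBE_3"], host_db
)
print(sorted(second_step_db))
```

A target-decoy version of the reduced database would then be generated before the second search, in line with the "Target-Decoy database" label in the workflow figure.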

PROTEOGENOMICS
•  Salivary supernatant: same as in the metaproteomics study above.
•  Oral pre-malignant lesion (OPML) dataset, collected as brush biopsy samples from six individuals with pre-malignant lesions, with a matched control sample from the adjacent oral cavity (Kooren et al unpublished).

Both datasets were searched against a 3-frame translated cDNA database and the human oral microbial database using the two-step method (Jagtap et al 2013).
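
As an illustration of what a 3-frame translated cDNA database involves, the sketch below translates a cDNA sequence in its three forward reading frames using the standard genetic code. It is a minimal example, not the Galaxy-P tool: the function names and the input sequence are hypothetical, stop codons are written as `*`, and codons containing unexpected characters are written as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna[offset:]) for offset in range(3)]


frames = three_frame_translation("ATGGCCTAA")  # hypothetical cDNA fragment
for number, protein in enumerate(frames, start=1):
    print(f"frame {number}: {protein}")
```

In a database-generation step, each frame would typically be split at stop codons and written out as separate FASTA entries; that step is omitted here.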

QUANTITATIVE PROTEOMICS
•  Chronic Obstructive Pulmonary Disease (COPD)-linked lung cancer tissue samples were collected and subjected to iTRAQ labeling and 2D LC-MS. The dataset was searched against the Human UniProt database.

OVERVIEW OF MODULES AND ANALYTICAL WORKFLOWS FOR PROTEOGENOMIC ANALYSIS.

* Analysis using MEGAN.

•  Salivary proteogenomics: 52 novel proteoforms identified (notably, alternate-frame translations for the PRB1 and PRB2 proteins).

•  SECC metaproteomics: analysis of outputs from Galaxy-P workflows and MEGAN is ongoing for three replicates.

•  Quantitative proteomics: workflows have been used on multiple replicates. Reproducibility analysis and RNASeq data integration are in progress.

•  IMMEDIATE PLANS:
-  Working with the genomics research community and the Genomics Center to integrate RNASeq-derived workflows for database generation.

-  Working closely with the metagenomics / microbiome research community to develop functional pathway analysis workflows using the metaproteomics data.

-  Correlating RNASeq quantitative information with quantitative iTRAQ proteomic information for both model and non-model organisms.

•  FUTURE PLANS:
-  Installation and testing of open-source tools (such as MS-GF+ and PeptideShaker). This work is being carried out through an international collaboration between developers and users.

-  Improving the current metaproteomic and proteogenomic workflows.

Acknowledgements : This work was funded by NSF grant 1147079. Also many thanks to John Chilton (Penn State) for his work in the first year of the project.

Shareable workflows: z.umn.edu/peaklistconversion z.umn.edu/dbgen z.umn.edu/mn2stp z.umn.edu/peptidesfromcdnadb z.umn.edu/blastp z.umn.edu/psme z.umn.edu/pep2gtf All together : z.umn.edu/pg140

[Figure: peptide sequences in the normal coding frame and the alternative frame for PRB1 and PRB2 on human chromosome 12 (11,505 kb to 11,549 kb). Sequence labels, in the order they appear:]
*QPPRSPRGGQ
LHSPLSDSPLDPLDAG
QPQQPNGGAPPPPG
RLSSPIAEQQLHPV
QPPPGQPKGPPS*P
KHPHDKHSEQLLDPV
QPPPGQPKGPPPPPGQPKNGGQPPPG
KHPHDKHSEQLHPVKLNTAEKHPHD
QPQQPNGGAPPPPG
RLSSPIAEQQLHPV
QSKNDGPPPGQPKGPPPPGQ
KPSTTEKHPHDKHSEQLHPV
QPKGPPSRSSRSKNDG
KHSEQLLDLVEPSTTE

Results Summary

Dataset | Number of spectra | Novel proteoform peptides | Novel proteoform peptides filtered after PSM evaluation | Number of distinct peptides after visualization and for genome localization
Salivary supernatant | 988,974 | 254 | 105 | 52
OPML Control | 156,405 | 904 | 34 | 17
OPML Lesion | 157,299 | 887 | 29 | 21


Representation of genomic organization of identified novel proteoform-specific peptides from PRB1 and PRB2 genes on chromosome 12.

Shareable workflows: z.umn.edu/peaklistconversion z.umn.edu/dbgenmp z.umn.edu/mn2stp z.umn.edu/pepfrommicrobialdb z.umn.edu/blastp All together : z.umn.edu/mp65


[Figure labels, metaproteomic workflow outputs: Peptides in FASTA format, submitted for MEGAN analysis and to UniPept for analysis. Example entries:]
> ENST|Potential new Microbe1 PTNTIALNEWPRTEFRM
> ENST|Potential new Microbe2 PEPTIDESINFSTAFRMT

[Figure: proteogenomic workflow overview. Panels:]
A. Peak list processing
B. Database generation
C. Two-step Database Search method
D. Identifying peptides from translated cDNA db
E. BLAST-P Analysis
F. Peptide Spectral Match Evaluation
G. Peptide to GTF conversion

Labels within the panels: Filter peptides with mismatches to human NCBI database; Peptide Summary of new proteoform peptides; Peak list (mzml); Peptide Spectral Match Evaluator; Spectral Visualization; Filtering of Peptide Spectral Matching Metrics; Peptide Summary of new proteoform peptides with quality peptide spectral matching characteristics; General Transfer Format file for Human genome; Peptide to GTF; Visualization in genomic context.
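
The peptide to GTF conversion step produces a file that a genome browser can display. As a hedged illustration of the output format only, the sketch below writes one GTF record (nine tab-separated columns) for a peptide, assuming the peptide maps to a single unspliced region. The function name, the `source` value, the coordinate and the strand are hypothetical and are not taken from the Galaxy-P tool or from the poster's data.

```python
def peptide_to_gtf_line(chrom, start, strand, peptide, source="example"):
    """Format one GTF record for a peptide mapped to an unspliced region.

    start is a 1-based genomic coordinate; the end coordinate is derived
    from the peptide length (3 nucleotides per residue), which ignores
    splicing and is therefore only valid within a single exon.
    """
    end = start + 3 * len(peptide) - 1
    attributes = f'gene_id "{peptide}"; transcript_id "{peptide}";'
    fields = [chrom, source, "CDS", str(start), str(end), ".", strand, "0", attributes]
    return "\t".join(fields)


# Hypothetical coordinate and strand, chosen only to show the layout.
line = peptide_to_gtf_line("chr12", 11505001, "+", "KHPHDKHSEQLHPV")
print(line)
```

Using the peptide sequence as both `gene_id` and `transcript_id` is a convenience for labeling the feature in a browser; a real conversion would also need to handle peptides that span exon boundaries.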

Tutorial: z.umn.edu/ppingp

[Figure: PEPTIDE SPECTRAL MATCH EVALUATION. Panels: Input Parameters; Tabular Output; HTML Output; Spectral Visualization.]

[Figure: GENOMIC CONTEXT VISUALIZATION. Screenshot of a novel proteoform-specific peptide within the Integrative Genomics Viewer.]
