THE GALAXY FRAMEWORK AS A UNIFYING BIOINFORMATICS...
THE GALAXY FRAMEWORK AS A UNIFYING BIOINFORMATICS SOLUTION FOR ‘OMICS’ CORE FACILITIES
Pratik Jagtap1; James Johnson2; Bart Gottchalk2; Getiria Onsongo2; Sricharan Bandhakavi3; Joel Kooren4; Brian Sandri4; Ebbing de Jong1; Todd Markowski1; LeeAnn Higgins1; Chris Wendt4; Joel Rudney4 and Timothy Griffin4
1. Center for Mass Spectrometry and Proteomics
2. Minnesota Supercomputing Institute
3. Bio-Rad Laboratories, Richmond, CA
4. University of Minnesota, Minneapolis, MN 55455
INTRODUCTION
GALAXY-P
[Figure panels: Galaxy-P Workflow; Galaxy-P Tools]
Galaxy-P provides multiple software tools: some are proteomics-specific, and others come from the genomics Galaxy framework.
• Integrating different ‘omics’ data yields novel insights into biological systems.
• However, the need for multiple, disparate software tools makes the integration of multi-omic data a serious challenge.
• Extending Galaxy (a web-based bioinformatics data analysis platform) with mass spectrometry-based proteomics software enables advanced multi-omic applications such as proteogenomics, metaproteomics and quantitative proteomics.
PROTEOGENOMICS METAPROTEOMICS QUANTITATIVE PROTEOMICS
CONCLUDING REMARKS
Salivary and OPML datasets
OVERVIEW OF MODULES AND ANALYTICAL WORKFLOWS FOR METAPROTEOMIC ANALYSIS.
Tools can be used in a sequential manner to generate workflows that can be reused, shared and creatively modified for multiple studies.
Benefits of Galaxy / Galaxy-P:
• Software accessibility and usability.
• Shareability of tools, workflows and histories.
• Reproducibility, and the ability to test and compare results obtained with multiple parameter settings.
METHODS & DATASETS
[Figure: example FASTA entries illustrating database generation]
cDNA entries:
> ENST00000 cdna: TACGGCCGTCGTGCCC
> ENST000000 cdna: TCGTGCCGCTTACGGC
Protein entries:
> ENST00000 protein ITSAPRTEINDATASET
> ENST000000 protein INANTHERFRAMETHGH
> sp|Acc No 1|Human MANPRTEINS
> sp|Acc No 2|Human MANYHMANPRTEINS
Decoy entries:
> ENST00000 protein (decoy) TESATADNIIETRPASTI
> ENST000000 protein (decoy) HGHTEMARFREHTNANI
> sp|Acc No 1|Human (decoy) SNIETRPNAM
> sp|Acc No 2|Human (decoy) SNIETRPNAMHYNAM
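Most of the decoy examples above are the exact reverse of their target sequences (for instance MANPRTEINS and SNIETRPNAM). The following is a minimal Python sketch of building a target-decoy database under that assumption. The function names and the "(decoy)" tag placement are illustrative and are not taken from the Galaxy-P tools.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def add_reversed_decoys(entries, tag="(decoy)"):
    """Return the target entries followed by one reversed decoy per target.

    Assumption: decoys are plain sequence reversals, as the poster's
    examples suggest; other decoy schemes (shuffling, etc.) also exist.
    """
    targets = list(entries)
    decoys = [(f"{header} {tag}", seq[::-1]) for header, seq in targets]
    return targets + decoys


def to_fasta(entries):
    """Serialize (header, sequence) pairs back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in entries)


# Example using the mock entries from the poster.
target_text = ">sp|Acc No 1|Human\nMANPRTEINS\n>sp|Acc No 2|Human\nMANYHMANPRTEINS\n"
combined = add_reversed_decoys(parse_fasta(target_text))
print(to_fasta(combined))
```

Keeping targets and decoys in one file lets a single search report matches to both, which is what a target-decoy false discovery estimate needs.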
[Figure: workflow modules, panels A to E]
A. Peak list processing
B. Database generation
C. Two-step Database Search method
D. Identifying peptides from microbial db
E. BLAST-P Analysis
Labels in the figure: Mass spectra; Peak list (MGF or mzml); msconvert; MGF Formatter; translation; Host Protein Database; Metagenomic Database; Microbial protein db; Target database; Target-Decoy database; Peptide Summary; Data Processing; Peptide Summary from Second-Step Search; Microbial Peptides; Short peptides (<30 aas); Long peptides (>30 aas).
Example entries shown in the figure:
PTNTIALNEWPRTEFRM
PEPTIDESINFSTAFRMT
MICRPEPTIDES
ARCHAEALPEPTIDES
> ENST| Peptides 8-30 aas 1 SMALLPEPTIDES
> ENST| Peptides 8-30 aas 2 REALLYSHRTPEPTIDES
> ENST| Peptides 31 and more aas 1 VERYLNGPEPTIDESSPERLNGPEPTIDESFREVER
> ENST| Peptides 8-30 aas 2 HGEPEPTIDESTHTNEEDSEARCHESWTHWRK
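The figure labels indicate that peptides are separated into a short group (8 to 30 amino acids) and a long group (31 or more) before BLAST-P analysis. A minimal Python sketch of that partitioning step follows. The function name is hypothetical, and discarding peptides shorter than 8 residues is an assumption based only on the "8-30 aas" label.

```python
def split_by_length(peptides, min_len=8, short_max=30):
    """Partition peptides into (short, long) lists by sequence length.

    short: min_len <= length <= short_max
    long:  length > short_max
    Peptides shorter than min_len are dropped (an assumption).
    """
    short, long_ = [], []
    for pep in peptides:
        n = len(pep)
        if n < min_len:
            continue
        if n <= short_max:
            short.append(pep)
        else:
            long_.append(pep)
    return short, long_


# Example with mock sequences from the poster plus one very short peptide.
peptides = [
    "SMALLPEPTIDES",
    "VERYLNGPEPTIDESSPERLNGPEPTIDESFREVER",
    "SHORT",
]
short_peps, long_peps = split_by_length(peptides)
```

Separating the groups allows each to be submitted with search settings suited to its query length.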
SECC and Salivary datasets
Results Summary
COPD dataset WORKFLOW AND TUTORIAL
• 20 KEGG pathways.
• Most prevalent pathway: Carbohydrate metabolism.
• ‘Best-populated’ pathway: Glycolysis (Carbohydrate metabolism).
• Protein with highest number of reads: Glyceraldehyde-3-phosphate.
Dataset                            Total      Distinct peptides      Phyla*   Genera*   Species*
                                   spectra    of microbial origin
Whole human salivary supernatant   988,974    1926                   12       65        123
SECC without sucrose               153,019    28,126                 5        33        56
SECC with sucrose                  139,759    23,029                 5        13        33
• Analytical transparency
• Scalability of data
RAW files from multiple datasets (see below) were generated on an Orbitrap Velos instrument. The processed peak lists were searched using ProteinPilot™ version 4.5 (AB Sciex) within Galaxy-P (usegalaxyp.org). After optimization and testing, multiple workflows were run sequentially, with each workflow generating the inputs for the next.
METAPROTEOMICS
• Severe Early Childhood Caries (SECC) dataset for clinical comparison of oral microcosm biofilms grown from plaque in either the presence or absence of sucrose.
• Salivary supernatant dataset: 3D-fractionated with or without ProteoMiner treatment (Bandhakavi et al 2009). 200 RAW files were acquired on an LTQ/Orbitrap XL.
Both datasets were searched against the human oral microbiome database (HOMD) using the two-step method (Jagtap et al 2013).
PROTEOGENOMICS
• Salivary supernatant: same as in the metaproteomics study above.
• Oral pre-malignant lesion (OPML) dataset, collected as brush biopsy samples from six individuals with pre-malignant lesions, each with a matched control sample from the adjacent oral cavity (Kooren et al unpublished).
Both datasets were searched against a 3-frame translated cDNA database and the human oral microbial database using the two-step method (Jagtap et al 2013).
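As an illustration of what a 3-frame translated cDNA database contains, here is a minimal Python sketch that translates a cDNA sequence in the three forward reading frames using the standard genetic code. The function names are hypothetical, and the poster does not state how stop codons or ambiguous bases are handled in the actual workflow; this sketch keeps stops as "*" and writes "X" for codons it cannot resolve.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide sequence from its first base, codon by codon."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Unknown or ambiguous codons become 'X' (an assumption).
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


def three_frame_translation(cdna):
    """Return the translations of the three forward reading frames."""
    return [translate(cdna[offset:]) for offset in range(3)]


# Example: a short made-up cDNA, not taken from the datasets.
frames = three_frame_translation("ATGGCCTAA")
print(frames)  # ['MA*', 'WP', 'GL']
```

Searching all three frames is what makes it possible to detect peptides from alternate reading frames rather than only the annotated one.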
QUANTITATIVE PROTEOMICS
• Chronic Obstructive Pulmonary Disease (COPD)-linked lung cancer tissue samples were collected and subjected to iTRAQ labeling and 2D LC-MS. The dataset was searched against the human UniProt database.
OVERVIEW OF MODULES AND ANALYTICAL WORKFLOWS FOR PROTEOGENOMIC ANALYSIS.
* Analysis using MEGAN.
• Salivary proteogenomics: 52 novel proteoforms identified (notably, alternate-frame translation for the PRB1 and PRB2 proteins).
• SECC metaproteomics: analysis of outputs from the Galaxy-P workflows and MEGAN is ongoing for three replicates.
• Quantitative proteomics: workflows have been used on multiple replicates. Reproducibility analysis and RNASeq data integration are in the works.
• IMMEDIATE PLANS:
- Working with the genomics research community and the Genomics Center to integrate RNASeq-derived workflows for database generation.
- Working closely with the metagenomics / microbiome research community to develop functional pathway analysis workflows using the metaproteomics data.
- Working on correlating quantitative RNASeq information with quantitative iTRAQ proteomic information for both model and non-model organisms.
• FUTURE PLANS:
- Installation and testing of open-source tools (such as MS-GF+ and PeptideShaker). The installation and testing are being carried out through an international collaboration between developers and users.
- Improving the current metaproteomic and proteogenomic workflows.
Acknowledgements : This work was funded by NSF grant 1147079. Also many thanks to John Chilton (Penn State) for his work in the first year of the project.
Shareable workflows: z.umn.edu/peaklistconversion z.umn.edu/dbgen z.umn.edu/mn2stp z.umn.edu/peptidesfromcdnadb z.umn.edu/blastp z.umn.edu/psme z.umn.edu/pep2gtf All together : z.umn.edu/pg140
Normal coding frame / Alternative frame
*QPPRSPRGGQ LHSPLSDSPLDPLDAG
QPQQPNGGAPPPPG
RLSSPIAEQQLHPV
QPPPGQPKGPPS*P KHPHDKHSEQLLDPV
QPPPGQPKGPPPPPGQPKNGGQPPPG
KHPHDKHSEQLHPVKLNTAEKHPHD
QPQQPNGGAPPPPG
RLSSPIAEQQLHPV
QSKNDGPPPGQPKGPPPPGQ
KPSTTEKHPHDKHSEQLHPV
QPKGPPSRSSRSKNDG KHSEQLLDLVEPSTTE
PRB1 and PRB2, human chromosome 12 (11,505 kb to 11,549 kb)
Results Summary
Dataset                Number of   Novel        Novel proteoform     Number of distinct peptides
                       spectra     proteoform   peptides filtered    after visualization and for
                                   peptides     after PSM            genome localization
                                                evaluation
Salivary supernatant   988,974     254          105                  52
OPML Control           156,405     904          34                   17
OPML Lesion            157,299     887          29                   21
Representation of genomic organization of identified novel proteoform-specific peptides from PRB1 and PRB2 genes on chromosome 12.
Shareable workflows: z.umn.edu/peaklistconversion z.umn.edu/dbgenmp z.umn.edu/mn2stp z.umn.edu/pepfrommicrobialdb z.umn.edu/blastp All together : z.umn.edu/mp65
> ENST|Potential new Microbe1 PTNTIALNEWPRTEFRM
> ENST|Potential new Microbe2 PEPTIDESINFSTAFRMT
Peptides in FASTA format
Submit for MEGAN Analysis
Submit to UniPept for analysis
[Figure: workflow modules, panels A to G]
A. Peak list processing
B. Database generation
C. Two-step Database Search method
D. Identifying peptides from translated cDNA db
E. BLAST-P Analysis
F. Peptide Spectral Match Evaluation
G. Peptide to GTF conversion; Visualization in genomic context
Labels in the figure: Peak list (mzml); Filter peptides with mismatches to human NCBI database; Peptide Summary of new proteoform peptides; Peptide Spectral Match Evaluator; Spectral Visualization; Filtering of Peptide Spectral Matching Metrics; Peptide Summary of new proteoform peptides with quality peptide spectral matching characteristics; Peptide to GTF; General Transfer Format file for Human genome.
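To show what the output of a peptide-to-GTF step might look like, here is a minimal Python sketch that formats one GTF record for a peptide whose genomic coordinates are already known. The function name, the "peptide_search" source label, the choice of the "exon" feature type and the example coordinates are all assumptions for illustration. The mapping from peptide to genome position, which the actual tool performs, is not shown.

```python
def peptide_to_gtf_line(peptide, chrom, start, end, strand, source="peptide_search"):
    """Format one tab-separated GTF record for a peptide.

    GTF has nine fields: seqname, source, feature, start, end,
    score, strand, frame, attributes. Coordinates are 1-based, inclusive.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Using the peptide sequence as the identifier is an illustrative choice.
    attributes = f'gene_id "{peptide}"; transcript_id "{peptide}";'
    fields = [chrom, source, "exon", str(start), str(end), ".", strand, ".", attributes]
    return "\t".join(fields)


# Example: the coordinates below are hypothetical placeholders,
# not the real genomic position of this peptide.
gtf_line = peptide_to_gtf_line("QPQQPNGGAPPPPG", "chr12", 1000, 1041, "+")
print(gtf_line)
```

A file of such records can be loaded as an annotation track in a genome browser, which is how peptides are viewed in their genomic context.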
Tutorial: z.umn.edu/ppingp
[Figure: PEPTIDE SPECTRAL MATCH EVALUATION. Screenshot panels: Input Parameters; Tabular Output; HTML Output; Spectral Visualization]
[Figure: GENOMIC CONTEXT VISUALIZATION. Screenshot of a novel proteoform-specific peptide within the Integrative Genomics Viewer.]