Working towards multi-omics integration: Tools and workflows within Galaxy-P Platform

Pratik Jagtap, Galaxy-P Team

University of Minnesota

December 6, 2019

Acknowledgements and Funding

Minnesota Supercomputing Institute: James Johnson, Thomas McGowan, Michael Milligan
University of Minnesota: Timothy Griffin (PI), Praveen Kumar, Candace Guerrero, Subina Mehta, Adrian Hegeman (Co-I), Art Eschenlauer, Ray Sajulga, Caleb Easterly, Andrew Rajczewski
Biologists / collaborators: Laurie Parker, Joel Rudney, Maneesh Bhargava, Amy Skubitz, Chris Wendt, Brian Crooker, Steven Friedenberg, Kevin Viken, Kristin Boylan, Marnie Peterson, Somiah Afiuni, Brian Sandri, Alexa Pragman, Wanda Weber, Amy Treeful
Melbourne, Australia: Ira Cooke, Maria Doyle
University of Bergen, Norway: Harald Barsnes, Marc Vaudel
University of Freiburg, Freiburg, Germany: Bjoern Gruening, Bérénice Batut
VIB, UGhent, Belgium: Lennart Martens (Co-I), Bart Mesuere, Robbert G Singh
Naval Research Institute, Washington, D.C.: Judson Hervey
Nashville, TN: Matt Chambers
Porto Conte Ricerche, Italy: Alessandro Tanca
University of Helsinki, Finland: Carolin Kolmeder
Robert Koch Institut: Thilo Muth, Bernhard Renard
Indiana University: Thomas Doak, Jeremy Fisher, Haixu Tang, Sujun Li
Stanford University: Josh Elias
U of Washington: Brook Nunn
UW-Madison: Lloyd Smith (Co-I), Michael Shortreed
Persistent Systems Limited: Anamika Krishanpal, Priyabrata Panigrahi
Intero Life Sciences: Stephan Kang
NMBU, Oslo, Norway: Magnus Øverlie Arntzen, Francesco Delogu

galaxyp.org

Proteogenomics: A primer

[Figure: +TOF MS, 24 MCA scans from Myo_tryptic.wiff, max. 5191.0 counts. Axes: m/z (amu), 1000 to 3000; intensity (counts). Panels: MS1 and MS2.]

Matching amino acid sequences to MS/MS data
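As a rough illustration of what matching a sequence to MS/MS data involves, the sketch below computes singly charged b- and y-ion m/z values for an unmodified peptide and counts how many fall within a tolerance of observed peaks. It is not the scoring used by any search engine named in this talk; the function names, the 0.02 tolerance and the peak list are assumptions made for the example.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z lists.

    Index i holds the b ion of the first i+1 residues and the
    complementary y ion made of the remaining residues.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def count_matches(theoretical, observed, tol=0.02):
    """Count theoretical m/z values with an observed peak within tol."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


b, y = fragment_ions("SIRA")
observed_peaks = [b[0], y[1], 500.0]  # hypothetical peak list for the demo
print(count_matches(b + y, observed_peaks))
```

Real search engines add modifications, multiple charge states and a statistical score on top of this kind of mass comparison.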

Detecting protein variants via proteogenomics

Comprehensive Database (sample-specific, all possible sequences)

RNA sequences (e.g. RNA-seq), 3-frame translation: UCGAUCAGGGCAAU

DNA sequences, 6-frame translation: TCGATCAGGGCAAT and its complement AGCTAGTCCCGTTA

In-silico translation
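A minimal sketch of in-silico translation, assuming the standard genetic code and using the DNA fragment TCGATCAGGGCAAT from the slide. The function names are illustrative and are not taken from any Galaxy tool. Stop codons are written as `*`, and incomplete trailing codons are ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_dna(seq):
    """Upper-case the sequence and map RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons give 'X'."""
    seq = to_dna(seq)
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(seq):
    return to_dna(seq).translate(COMPLEMENT)[::-1]


def six_frame_translation(seq):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    frames = {}
    for strand, strand_seq in (("+", to_dna(seq)), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for name, protein in six_frame_translation("TCGATCAGGGCAAT").items():
    print(name, protein)
```

For RNA-seq derived sequences, only the three forward frames (`+1` to `+3`) are needed, which is the 3-frame case on the slide.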

Proteogenomic outcomes

Confirms translation of variants

Direct evidence of potential functional variants

Applications in neoantigen discovery (immuno-oncology)

VOL.11 |NO.11 | NOVEMBER 2014 | NATURE METHODS

Bringing proteogenomics to the masses: informatics challenges

J. Proteome Res., 2014, 13, pp 5898–5908

• Many software tools, integration, automation...
• RNA-Seq assembly and analysis
• Customized protein database generation
• Matching sequences to MS/MS data
• Filtering and QC!
• Interpretation! Beyond a list...

PROTEOGENOMICS & ITS CHALLENGES

Ruggles et al. Mol Cell Proteomics 2017;16:959-981

© 2017 by The American Society for Biochemistry and Molecular Biology, Inc.

Challenges

• Large search database sizes.
• False-positive sources and their elimination.
• Validation of novel peptide identifications.
• PSM quality evaluation.
• Targeted proteomics of identified peptides.
• Genomic localization.
• Disparate tools and numerous processing steps.

Galaxy Platform

• A web-based bioinformatics data analysis platform.
• Software accessibility and usability.
• Share-ability of tools, workflows and histories.
• Reproducibility and ability to test and compare results after using multiple parameters.
• Ability to assimilate disparate software into integrated workflows.

https://galaxyproject.org/

Solution: Galaxy Platform

For example, Protein Database Downloader downloads UniProt protein FASTA databases of various organisms.

Software tools can be used in a sequential manner to generate analytical workflows that can be reused, shared and creatively modified.

Workflow #1: RNA-Seq to Variant FASTA database

Proteogenomics Workflows in Galaxy

RNA-Seq Data

GTF File

• HISAT: alignment tool
• FreeBayes: variant calling
• CustomProDB: variant annotation & genome mapping
• StringTie: RNA-Seq to transcripts
• GFF Compare: compares assembly with annotated transcripts

Genome Mapping Files

PROTEIN SEQUENCE FASTA
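To make the end product of this workflow concrete, here is a small sketch of how a variant protein FASTA entry could be built from a reference protein and a single amino acid substitution. This is an illustration only, not CustomProDB's implementation; the accession, the header format and the function names are hypothetical.

```python
def apply_substitution(protein, position, ref_aa, alt_aa):
    """Return the protein with one residue replaced at a 1-based position."""
    if protein[position - 1] != ref_aa:
        raise ValueError(
            f"expected {ref_aa} at position {position}, found {protein[position - 1]}"
        )
    return protein[:position - 1] + alt_aa + protein[position:]


def variant_fasta(accession, protein, variants):
    """Build FASTA text with one entry per (position, ref_aa, alt_aa) variant."""
    records = []
    for position, ref_aa, alt_aa in variants:
        sequence = apply_substitution(protein, position, ref_aa, alt_aa)
        # Hypothetical header format: accession plus the substitution, e.g. PROT1_T3A.
        records.append(f">{accession}_{ref_aa}{position}{alt_aa}\n{sequence}")
    return "\n".join(records) + "\n"


print(variant_fasta("PROT1", "MKTAYIAKQR", [(3, "T", "A")]))
```

Checking the reference residue before substituting catches coordinate mismatches between the variant call and the protein sequence.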

10th Annual Meeting of Proteomics Society, India, 2018

UniProt FASTA


Workflow #2: Database Searching Using MS/MS Data

RAW Files

SearchGUI and PeptideShaker

Peptides for BLAST Search

PSM Report

mz to SQLite


Workflow #3: Identifying Novel Variants and Visualization

Summary of peptides


PROTEOGENOMICS WORKFLOW

Proteo-transcriptomics workflows within Galaxy are used to determine protein expression and detect variant proteins expressed.

Transcriptomics workflows within Galaxy are used to generate customized protein databases, estimate gene expression, and detect expressed variant genes.

Quantitative proteotranscriptomics

Kumar P, Panigrahi P, Johnson J, Weber WJ, Mehta S, Sajulga R, Easterly C, Crooker BA, Heydarian M, Anamika K, Griffin TJ, Jagtap P. J Proteome Res. 2019, 18:782-790.

Praveen Kumar (Krishanpal Anamika / Priyabrata Panigrahi)

QuanTP: interactive visualization of RNA-protein response

• Distribution: transcriptome data, proteome data
• Differential expression: transcriptome data, proteome data
• Principal component analysis: transcriptome data, proteome data
• Cluster analysis: correlation of RNA-Seq and proteomics data
• Correlation: Cook's distance analysis
• Influential points: correlation of RNA-Seq and proteomics data
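As a sketch of how influential points in an RNA-versus-protein correlation can be flagged, the code below computes Cook's distance for a simple linear regression of protein change on transcript change. It is not QuanTP's code. The example values are invented, and the 4/n cutoff is a common rule of thumb assumed here rather than a setting taken from the tool.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a least-squares line y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point in simple linear regression.
    leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical log fold changes: transcript (x) versus protein (y).
rna = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
protein = [0.1, 0.9, 2.1, 2.9, 4.0, 10.0]

distances = cooks_distance(rna, protein)
cutoff = 4 / len(rna)  # assumed rule-of-thumb threshold
influential = [i for i, d in enumerate(distances) if d > cutoff]
print(influential)
```

Points flagged this way are genes whose protein response departs strongly from the trend set by the transcript data, which makes them candidates for closer inspection.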

Multi-Omics Visualization Platform:

Characterizing the nature of detected variants

• HTML-based Galaxy plugin
• Interactive reading of mzsqlite database

https://www.biorxiv.org/content/10.1101/842856v2.abstract

Tom McGowan

MULTI-OMICS VISUALIZATION PLATFORM FOR VISUALIZING NOVEL PROTEOFORMS

SPECTRAL QUALITY VISUALIZATION (Lorikeet Viewer)

GENOMIC LOCALIZATION (Integrative Genomics Viewer)

https://www.biorxiv.org/content/10.1101/842856v2.abstract

CRAVAT-P: Assessing potential impact of variants

Sajulga R, Mehta S, Kumar P, Johnson JE, Guerrero CR, Ryan MC, Karchin R, Jagtap PD, Griffin TJ. J Proteome Res. 2018, 17:4329-4336.

Cancer-Related Analysis of Variants Toolkit (cravat.us) developed by Rachel Karchin and Michael Ryan

Assessing potential impact of protein-level variants: CRAVAT-P

• Intersection of transcript variants and confirmed protein variants
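The intersection step can be pictured as a set operation over variant keys. In the sketch below, the (chromosome, position, reference, alternate) key format and the example values are assumptions for illustration, not CRAVAT-P's data model.

```python
def confirmed_variants(transcript_variants, peptide_supported_variants):
    """Return variants seen in RNA-Seq that also have peptide-level support."""
    return sorted(set(transcript_variants) & set(peptide_supported_variants))


# Hypothetical variant keys: (chromosome, position, reference, alternate).
transcript_variants = [
    ("chr1", 12345, "A", "G"),
    ("chr2", 67890, "C", "T"),
    ("chr7", 55555, "G", "A"),
]
peptide_supported_variants = [
    ("chr2", 67890, "C", "T"),
    ("chr9", 11111, "T", "C"),
]

print(confirmed_variants(transcript_variants, peptide_supported_variants))
```

Only variants present in both lists survive, so each result has support at the transcript level and at the protein level.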

Ray Sajulga

Unleashing the power of CRAVAT on proteogenomic results


ndexbio.org

https://jraysajulga.github.io/cravatp-galaxy-docker/

• HTML-based Galaxy plugin

• Interactive viewer

COMING SOON

• PepQuery Tool uses a peptide-centric approach for validation by a) competitive filtering; b) statistical evaluation; c) unrestricted modification search and d) visualization of peptides corresponding to novel proteoforms.

Wen et al. Genome Res. 2019; 29(3): 485–493. doi: 10.1101/gr.235028.118

• Extend MVP, QuanTP and CRAVAT-P tools

• Integrate newer tools from our collaborators to extend the existing workflows.

Accessing the Multi-omic Workflows

PUBLIC INSTANCES

Proteogenomics Gateway: z.umn.edu/proteogenomicsgateway

Step-by-step instructions: z.umn.edu/pginnov18

Metaproteomics Gateway: z.umn.edu/metaproteomicsgateway

Step-by-step instructions: z.umn.edu/suppS1

Tools and workflows are also available on: https://proteomics.usegalaxy.eu/

ALSO AVAILABLE ON:

GitHub: https://github.com/galaxyproteomics

Galaxy Toolshed: https://toolshed.g2.bx.psu.edu/

Docker: https://jraysajulga.github.io/cravatp-galaxy-docker/

Training workflows are also available on: https://training.galaxyproject.org


Conclusions

• Proteogenomics workflows that generate quantitative peptide and protein-level values are available within the Galaxy platform.

• Post-search analysis tools such as QuanTP, MVP and CRAVAT-P help understand the biological context of the data. We plan to extend these tools.

• There is a need to integrate statistical tools and methods to offer a much more comprehensive perspective of proteogenomics data.

We can be reached at:

Published Manuscripts: z.umn.edu/galaxypreferences

Galaxy-P Presentations: http://galaxyp.org/conference-presentations

Contact: http://galaxyp.org/contact/

Twitter: twitter.com/usegalaxyp


The Galaxy-P Team at University of Minnesota