Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows



Page 1: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

GigaGalaxy & publishing workflows for publishing workflows

Scott Edmunds

[email protected]

ORCID: 0000-0001-6444-1436

Page 2: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Science & publishing pipelines 1665-2016

[Diagram labels: Idea, Study, Data, Metadata, Methods, software, Analysis (Pipelines), Answer, Narrative, Review, Publisher, Impact?]

Page 3: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

unFAIR things about publishing

• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995)

• Focus only on subjective “impact” rather than reuse.

• Lack of transparency, lack of credit for anything other than dead trees.

Page 4: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

The consequences: growing replication gap

Out of 18 microarray papers, results from 10 could not be reproduced

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 142.
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8).

Page 5: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

On top of availability, data (& ROs) need to be FAIR

http://www.nature.com/articles/sdata201618

Page 6: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows
Page 7: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Science & publishing pipelines >2017?

[Diagram labels: Idea, Study, Data, Metadata, Methods, software, Analysis (Pipelines), Answer; outputs labelled "Publication" and "DOI, etc."; caption fragment "Rewarding the"]

Page 8: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

GigaSolution: deconstructing the paper

www.gigasciencejournal.com
gigadb.org

Utilizes big-data infrastructure and expertise from:

Combines and integrates (with DOIs):
• Open-access journal
• Data Publishing Platform
• Data Analysis Platform
• Open Review Platform

Page 9: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

gigadb.org

Page 10: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Reproducibility spectrum

Publication only (not reproducible)
Publication + data
Publication + code and data
Publication + linked and executable code and data
Full replication (gold standard)

Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.

Page 11: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Reproducibility (FAIR) spectrum

Publication only (not reproducible)
Publication + data
Publication + code and data
Publication + linked and executable code and data
Full replication (gold standard)

Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.

Page 12: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

gigagalaxy.net

Reward Sharing of Workflows

Page 13: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

http://gigatoolshed.net/

Reward Sharing of Workflows

Toolshed

Page 15: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Visualisations & DOIs for workflows

https://academic.oup.com/gigascience/pages/galaxy_series_data_intensive_reproducible_research

Page 16: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Virtual Machines/containers

• Downloadable as virtual hard disk/available as Amazon Machine Image
• Now publishing container (Docker) submissions

https://dx.doi.org/10.1186%2F2047-217X-3-23
https://dx.doi.org/10.1186%2Fs13742-015-0060-y
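The two links on this slide are percent-encoded DOI resolver URLs (the "/" inside each DOI appears as "%2F"). As a minimal sketch, assuming only the Python standard library, the plain DOIs can be recovered as follows; the helper name doi_from_url is hypothetical and not part of any GigaScience tooling.

```python
from urllib.parse import unquote, urlsplit


def doi_from_url(url: str) -> str:
    """Return the DOI carried in a resolver URL, decoding escapes such as %2F.

    Hypothetical helper for illustration only.
    """
    # urlsplit keeps the path percent-encoded, so decode it explicitly.
    path = urlsplit(url).path
    return unquote(path.lstrip("/"))


# URLs exactly as they appear on the slide.
urls = [
    "https://dx.doi.org/10.1186%2F2047-217X-3-23",
    "https://dx.doi.org/10.1186%2Fs13742-015-0060-y",
]

dois = [doi_from_url(u) for u in urls]
for doi in dois:
    print(doi)
```

Keeping the decoded DOI rather than the full URL makes it easier to cite the same object consistently wherever it is linked.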

Page 17: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Not just Genomics: Galaxy-M (Metabolomics)

https://gigascience.biomedcentral.com/articles/10.1186/s13742-016-0115-8

Page 18: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Now including deep integration with protocols.io

Need to capture “wet” workflows (protocols)

• Create, share, modify forkable protocols in repo.
• Download & run on smartphone app.
• Get discoverability, credit, DOIs for sharing methods.
• Create your own, or let us set up & you claim.

https://www.protocols.io/groups/gigascience-journal

Page 19: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Taking a microscope to the publication process

How FAIR/reproducible are GigaScience papers?

Page 20: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612

Pilot project

Page 21: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

How FAIR can we get?

Data sets: linked to DOI
Analyses: linked to DOI

Open-Paper: DOI:10.1186/2047-217X-1-18, >50,000 accesses & >1,000 citations
Open-Review: 7 reviewers tested data in ftp server & named reports published
Open-Data: DOI:10.5524/100038, 78GB CC0 data
Open-Pipelines/Open-Workflows: DOI:10.5524/100044
Open-Code: code in SourceForge & GitHub under GPLv3: http://soapdenovo2.sourceforge.net/ & https://github.com/aquaskyline/SOAPdenovo2 (>40,000 downloads). Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2

Page 22: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

The SOAPdenovo2 Case study

Subject to and test with 3 models:

Types of resources in an RO    Models to describe each resource type
Data                           ISA-TAB/ISA2OWL
Method/Experimental protocol   Wfdesc/ISA-TAB/ISA2OWL
Findings                       Nanopublication

Page 23: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Integration of SOAPdenovo2 into GigaGalaxy

Page 24: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

SOAPdenovo2 S. aureus pipeline

Page 25: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Published and Galaxy-reproduced statistics of genome assemblies of S. aureus and R. sphaeroides

Published

                              Contigs                                       Scaffolds
Species         Tool          Number  N50 (kb)  Errors  N50 corrected (kb)  Number  N50 (kb)  Errors  N50 corrected (kb)
S. aureus       SOAPdenovo1   79      148.6     156     23                  49      342       0       342
                SOAPdenovo2   80      98.6      25      71.5                38      1086      2       1078
                ALL-PATHS-LG  37      149.7     13      119.0               11      1477      1       1093
R. sphaeroides  SOAPdenovo1   2241    3.5       400     2.8                 956     106       24      68
                SOAPdenovo2   721     18        106     14.1                333     2549      4       2540
                ALL-PATHS-LG  190     41.9      30      36.7                32      3191      0       0

Reproduced

                              Contigs                                       Scaffolds
Species         Tool          Number  N50 (kb)  Errors  N50 corrected (kb)  Number  N50 (kb)  Errors  N50 corrected (kb)
S. aureus       SOAPdenovo1   79      148.6     156     23                  49      342       0       342
                SOAPdenovo2   80      98.6      25      71.5                38      1086      2       1078
                ALL-PATHS-LG  37      149.7     13      117.6               10      1477      1       1093
R. sphaeroides  SOAPdenovo1   2242    3.5       392     2.8                 956     105       18      70
                SOAPdenovo2   721     18        106     14.1                333     2549      4       2540
                ALL-PATHS-LG  190     41.9      31      36.7                32      3191      0       3310
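The two tables on this slide can be compared cell by cell to see where the Galaxy re-run departs from the paper. The sketch below is illustrative only: it assumes the first table is the published one and the second the reproduced one (the order in which they appear on the slide), and the column names and the diff_tables helper are hypothetical.

```python
# Column order follows the slide: four contig statistics, then four scaffold statistics.
COLUMNS = [
    "contig_number", "contig_n50_kb", "contig_errors", "contig_n50_corrected_kb",
    "scaffold_number", "scaffold_n50_kb", "scaffold_errors", "scaffold_n50_corrected_kb",
]

# Assumption: first table on the slide = published values.
published = {
    ("S. aureus", "SOAPdenovo1"): [79, 148.6, 156, 23, 49, 342, 0, 342],
    ("S. aureus", "SOAPdenovo2"): [80, 98.6, 25, 71.5, 38, 1086, 2, 1078],
    ("S. aureus", "ALL-PATHS-LG"): [37, 149.7, 13, 119.0, 11, 1477, 1, 1093],
    ("R. sphaeroides", "SOAPdenovo1"): [2241, 3.5, 400, 2.8, 956, 106, 24, 68],
    ("R. sphaeroides", "SOAPdenovo2"): [721, 18, 106, 14.1, 333, 2549, 4, 2540],
    ("R. sphaeroides", "ALL-PATHS-LG"): [190, 41.9, 30, 36.7, 32, 3191, 0, 0],
}

# Assumption: second table on the slide = values reproduced in Galaxy.
reproduced = {
    ("S. aureus", "SOAPdenovo1"): [79, 148.6, 156, 23, 49, 342, 0, 342],
    ("S. aureus", "SOAPdenovo2"): [80, 98.6, 25, 71.5, 38, 1086, 2, 1078],
    ("S. aureus", "ALL-PATHS-LG"): [37, 149.7, 13, 117.6, 10, 1477, 1, 1093],
    ("R. sphaeroides", "SOAPdenovo1"): [2242, 3.5, 392, 2.8, 956, 105, 18, 70],
    ("R. sphaeroides", "SOAPdenovo2"): [721, 18, 106, 14.1, 333, 2549, 4, 2540],
    ("R. sphaeroides", "ALL-PATHS-LG"): [190, 41.9, 31, 36.7, 32, 3191, 0, 3310],
}


def diff_tables(a, b):
    """Return {(species, tool): {column: (a_value, b_value)}} for cells that differ."""
    diffs = {}
    for key, row_a in a.items():
        changed = {
            col: (x, y)
            for col, x, y in zip(COLUMNS, row_a, b[key])
            if x != y
        }
        if changed:
            diffs[key] = changed
    return diffs


differences = diff_tables(published, reproduced)
for (species, tool), cells in differences.items():
    for col, (pub, rep) in cells.items():
        print(f"{species} / {tool} / {col}: published={pub}, reproduced={rep}")
```

With the values as transcribed here, the SOAPdenovo2 rows match exactly for both species, and the discrepancies fall in the ALL-PATHS-LG rows and the R. sphaeroides SOAPdenovo1 row.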

Page 26: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows
Page 27: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

CORRECTION

1. While there are huge improvements to the quality of the resulting assemblies, other than the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.

http://dx.doi.org/10.1186/s13742-015-0069-2

Page 28: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Lessons learned from this

• With enough effort it is possible to recreate a result from a paper

• Most published research findings are false. Or at least have errors

• Complete scientific reproduction is difficult

– Being FAIR can be COSTLY. How much are you willing to spend?

• Much easier to make things FAIR before rather than after publication.

• Finally seeing benefits (re-use/citations) from our “review on reproducibility not impact” approach

Page 29: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

21st Century I4As

• Think beyond narrative to re-use
• Bake in reproducibility
• Embrace new FAIR tools & models
• Disseminate ALL ROs
• Worth investment in moving up reproducibility spectrum
– toolshed, VMs/Docker
• Remember FAIR mantra:

“The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: ‘is it FAIR?’”
http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html

Page 30: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

www.gigasciencejournal.com

Give us your FAIR data, workflows & papers

Help GigaPanda make it happen!

Contact us:

[email protected]
[email protected]
[email protected]

Page 31: Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

Thanks to:

Team: Peter Li, Chris Hunter, Jesse Si Zhe Xiao, Nicole Nogoy, Hans Zauner, Laurie Goodman

Case study: Ruibang Luo (HKU/JH), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Oxford), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

@gigascience
facebook.com/GigaScience
http://gigasciencejournal.com/blog

www.gigadb.org
gigagalaxy.net
www.gigasciencejournal.com

Funding from: