Posted on 13-Apr-2017
Scott Edmunds
Publishing workflows
scott@gigasciencejournal.com
ORCID: 0000-0001-6444-1436
Science & publishing pipelines 1665-2016
[Diagram labels: Idea, Study, Data, Metadata, Methods, software, Analysis (Pipelines), Answer, Narrative, Review, Publisher, Impact?]
unFAIR things about publishing
• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Focus only on subjective “impact” rather than reuse.
• Lack of transparency, lack of credit for anything other than dead trees.
The consequences: growing replication gap
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 142.
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8).
Out of 18 microarray papers, results from 10 could not be reproduced.
On top of availability, data (& ROs) need to be FAIR
http://www.nature.com/articles/sdata201618
Science & publishing pipelines >2017?
[Diagram labels: Idea, Study, Data, Metadata, Methods, software, Analysis (Pipelines), Answer, Publication (three times), DOI, etc., and a truncated label “Rewarding the”]
GigaSolution: deconstructing the paper
www.gigasciencejournal.com
gigadb.org
Utilizes big-data infrastructure and expertise from:
Combines and integrates (with DOIs):
• Open-access journal
• Data Publishing Platform
• Data Analysis Platform
• Open Review Platform
Reproducibility spectrum
Publication only (not reproducible) → Publication + data → Publication + code and data → Publication + linked and executable code and data → Full replication (gold standard)
Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
Reproducibility (FAIR) spectrum
Publication only (not reproducible) → Publication + data → Publication + code and data → Publication + linked and executable code and data → Full replication (gold standard)
Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
gigagalaxy.net
Reward Sharing of Workflows
http://gigatoolshed.net/
Toolshed
https://academic.oup.com/gigascience/pages/galaxy_series_data_intensive_reproducible_research
Visualisations & DOIs for workflows
https://dx.doi.org/10.1186%2F2047-217X-3-23
https://dx.doi.org/10.1186%2Fs13742-015-0060-y
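The workflow links above are DOI resolver URLs in which the slash inside the DOI is percent-encoded as `%2F`. As a minimal sketch (the helper name `doi_from_url` is an assumption made for illustration, not part of any GigaScience tooling), the bare DOIs can be recovered with the Python standard library:

```python
from urllib.parse import unquote, urlparse


def doi_from_url(url: str) -> str:
    """Return the bare DOI from a dx.doi.org style resolver URL.

    Hypothetical helper for illustration only.
    """
    path = urlparse(url).path         # still percent-encoded, e.g. "/10.1186%2F..."
    return unquote(path.lstrip("/"))  # "%2F" decodes to "/"


# The two workflow DOIs linked on the slide.
workflow_dois = [
    doi_from_url("https://dx.doi.org/10.1186%2F2047-217X-3-23"),
    doi_from_url("https://dx.doi.org/10.1186%2Fs13742-015-0060-y"),
]
print(workflow_dois)
```

Storing the decoded form makes it easier to cite a workflow or match it against other records, since the same DOI can appear with or without URL encoding.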
Virtual Machines/containers
• Downloadable as virtual hard disk/available as Amazon Machine Image
• Now publishing container (Docker) submissions
Not just Genomics: Galaxy-M (Metabolomics)
https://gigascience.biomedcentral.com/articles/10.1186/s13742-016-0115-8
Now including deep integration with
Need to capture “wet” workflows (protocols)
• Create, share, modify forkable protocols in repo.
• Download & run on smartphone app.
• Get discoverability, credit, DOIs for sharing methods.
• Create your own, or let us set up & you claim.
https://www.protocols.io/groups/gigascience-journal
Taking a microscope to the publication process
How FAIR/reproducible are GigaScience papers?
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612
Pilot project
How FAIR can we get?
The SOAPdenovo2 case study
• Open-Paper: DOI:10.1186/2047-217X-1-18, >50,000 accesses & >1,000 citations
• Open-Review: 7 reviewers tested data on the FTP server & named reports published
• Open-Data: DOI:10.5524/100038, 78GB CC0 data
• Open-Pipelines/Open-Workflows: DOI:10.5524/100044
• Open-Code: code in SourceForge & GitHub under GPLv3: http://soapdenovo2.sourceforge.net/ & https://github.com/aquaskyline/SOAPdenovo2, >40,000 downloads
• Enabled code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
• Data sets and analyses are each linked to a DOI
Subjected to, and tested with, 3 models:
Types of resources in an RO | Models to describe each resource type
Data | ISA-TAB/ISA2OWL
Method/Experimental protocol | Wfdesc/ISA-TAB/ISA2OWL
Findings | Nanopublication
Integration of SOAPdenovo2 into GigaGalaxy
SOAPdenovo2 S. aureus pipeline
Published and Galaxy-reproduced statistics of genome assemblies of S. aureus and R. sphaeroides

Species | Tool | Contigs: Number | N50 (kb) | Errors | N50 corrected (kb) | Scaffolds: Number | N50 (kb) | Errors | N50 corrected (kb)
S. aureus | SOAPdenovo1 | 79 | 148.6 | 156 | 23 | 49 | 342 | 0 | 342
S. aureus | SOAPdenovo2 | 80 | 98.6 | 25 | 71.5 | 38 | 1086 | 2 | 1078
S. aureus | ALL-PATHS-LG | 37 | 149.7 | 13 | 119.0 | 11 | 1477 | 1 | 1093
R. sphaeroides | SOAPdenovo1 | 2241 | 3.5 | 400 | 2.8 | 956 | 106 | 24 | 68
R. sphaeroides | SOAPdenovo2 | 721 | 18 | 106 | 14.1 | 333 | 2549 | 4 | 2540
R. sphaeroides | ALL-PATHS-LG | 190 | 41.9 | 30 | 36.7 | 32 | 3191 | 0 | 0

Species | Tool | Contigs: Number | N50 (kb) | Errors | N50 corrected (kb) | Scaffolds: Number | N50 (kb) | Errors | N50 corrected (kb)
S. aureus | SOAPdenovo1 | 79 | 148.6 | 156 | 23 | 49 | 342 | 0 | 342
S. aureus | SOAPdenovo2 | 80 | 98.6 | 25 | 71.5 | 38 | 1086 | 2 | 1078
S. aureus | ALL-PATHS-LG | 37 | 149.7 | 13 | 117.6 | 10 | 1477 | 1 | 1093
R. sphaeroides | SOAPdenovo1 | 2242 | 3.5 | 392 | 2.8 | 956 | 105 | 18 | 70
R. sphaeroides | SOAPdenovo2 | 721 | 18 | 106 | 14.1 | 333 | 2549 | 4 | 2540
R. sphaeroides | ALL-PATHS-LG | 190 | 41.9 | 31 | 36.7 | 32 | 3191 | 0 | 3310

Table labels on the slide: Published, Reproduced
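The two tables above differ in only a handful of cells, which is easier to see programmatically than by eye. A minimal sketch, assuming the values are transcribed exactly as they appear on the slide; the names `first_table`, `second_table`, and `differences` are illustrative, and the tables are kept in slide order rather than being labelled published or reproduced:

```python
# Column order follows the slide: four contig statistics, then four scaffold statistics.
COLUMNS = (
    "contig number", "contig N50 (kb)", "contig errors", "contig N50 corrected (kb)",
    "scaffold number", "scaffold N50 (kb)", "scaffold errors", "scaffold N50 corrected (kb)",
)

# Values transcribed from the slide, keyed by (species, tool).
first_table = {
    ("S. aureus", "SOAPdenovo1"): (79, 148.6, 156, 23, 49, 342, 0, 342),
    ("S. aureus", "SOAPdenovo2"): (80, 98.6, 25, 71.5, 38, 1086, 2, 1078),
    ("S. aureus", "ALL-PATHS-LG"): (37, 149.7, 13, 119.0, 11, 1477, 1, 1093),
    ("R. sphaeroides", "SOAPdenovo1"): (2241, 3.5, 400, 2.8, 956, 106, 24, 68),
    ("R. sphaeroides", "SOAPdenovo2"): (721, 18, 106, 14.1, 333, 2549, 4, 2540),
    ("R. sphaeroides", "ALL-PATHS-LG"): (190, 41.9, 30, 36.7, 32, 3191, 0, 0),
}
second_table = {
    ("S. aureus", "SOAPdenovo1"): (79, 148.6, 156, 23, 49, 342, 0, 342),
    ("S. aureus", "SOAPdenovo2"): (80, 98.6, 25, 71.5, 38, 1086, 2, 1078),
    ("S. aureus", "ALL-PATHS-LG"): (37, 149.7, 13, 117.6, 10, 1477, 1, 1093),
    ("R. sphaeroides", "SOAPdenovo1"): (2242, 3.5, 392, 2.8, 956, 105, 18, 70),
    ("R. sphaeroides", "SOAPdenovo2"): (721, 18, 106, 14.1, 333, 2549, 4, 2540),
    ("R. sphaeroides", "ALL-PATHS-LG"): (190, 41.9, 31, 36.7, 32, 3191, 0, 3310),
}


def differences(a, b):
    """List every (row key, column, value in a, value in b) where the tables disagree."""
    diffs = []
    for key, row_a in a.items():
        for column, x, y in zip(COLUMNS, row_a, b[key]):
            if x != y:
                diffs.append((key, column, x, y))
    return diffs


for (species, tool), column, x, y in differences(first_table, second_table):
    print(f"{species} / {tool}: {column}: {x} vs {y}")
```

With the values as transcribed, the SOAPdenovo1 row for S. aureus and both SOAPdenovo2 rows come out identical; the disagreements sit in the two ALL-PATHS-LG rows and the SOAPdenovo1 row for R. sphaeroides.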
1. While there are huge improvements to the quality of the resulting assemblies, other than the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
CORRECTION
http://dx.doi.org/10.1186/s13742-015-0069-2
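Several points in the correction turn on N50 values. As a reminder of what the metric measures, here is a minimal sketch using the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly). The function name and the example lengths are hypothetical and are not taken from the SOAPdenovo2 data:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until they cover at
    least half of the total length; the last length taken is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one sequence length")


# Hypothetical contig lengths in kb, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

Because N50 depends on the whole length distribution, small changes in which contigs are counted or split at errors can shift the reported ratios between assemblers.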
Lessons learned from this
• With enough effort it is possible to recreate a result from a paper
• Most published research findings are false. Or at least have errors
• Complete scientific reproduction is difficult
– Being FAIR can be COSTLY. How much are you willing to spend?
• Much easier to make things FAIR before rather than after publication.
• Finally seeing benefits (re-use/citations) from our “review on reproducibility not impact” approach
21st Century I4As
• Think beyond narrative to re-use
• Bake in reproducibility
• Embrace new FAIR tools & models
• Disseminate ALL ROs
• Worth investment in moving up reproducibility spectrum
– toolshed, VMs/Docker
• Remember FAIR mantra:
“The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: “is it FAIR”?”
http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html
www.gigasciencejournal.com
Give us your FAIR data, workflows & papers
Help GigaPanda make it happen!
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
@gigascience
facebook.com/GigaScience
http://gigasciencejournal.com/blog
www.gigadb.org
gigagalaxy.net
Thanks to:
Team: Peter Li, Chris Hunter, Jesse Si Zhe Xiao, Nicole Nogoy, Hans Zauner, Laurie Goodman
Case study: Ruibang Luo (HKU/JH), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Oxford), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)
Funding from: