Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Transcript of Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
![Page 1: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/1.jpg)
0000-0001-6444-1436
@SCEdmunds
NEW MODEL
Open data
publishing
Scott Edmunds
Balti Bioinformatics
![Page 2: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/2.jpg)
The problems with publishing
• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Lack of transparency, lack of credit for anything other than 350-year-old-style “dead tree” publication
• Traditional publishing policies and practices a hindrance (licensing & access, embargoes, Ingelfinger, closed doors, anti-granularity & forking)
![Page 3: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/3.jpg)
The consequences: growing replication gap
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results from 10 could not be reproduced
![Page 4: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/4.jpg)
Consequences: increasing number of retractions
More than 15X increase in the last decade
At the current rate of increase, by 2045 as many papers will be retracted as published
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
![Page 5: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/5.jpg)
STAP paper demonstrates problems:
Nature Editorial, 2nd July 2014:
“We have concluded that we and the referees could not have detected the problems that fatally undermined the papers. The referees’ rigorous reports quite rightly took on trust what was presented in the papers.”
http://www.nature.com/news/stap-retracted-1.15488
![Page 6: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/6.jpg)
STAP paper demonstrates problems:
Need:
• …to publish protocols BEFORE analysis
• …better access to supporting data
• …more transparent & accountable review
• …to publish replication studies
![Page 7: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/7.jpg)
New incentives/credit
• Review
• Data
• Software
• Models
• Pipelines
• Re-use…
= Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
![Page 8: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/8.jpg)
Not just carrots…
“The data discovery index (DDI) enabled through bioCADDIE is to do for data what PubMed (and PubMed Central) did for the literature.”
![Page 9: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/9.jpg)
Things we need to reward
![Page 10: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/10.jpg)
[Figure: “Rewarding the…” diagram. Labels: Idea; Study; Methods; Data; Metadata; Software; Analysis (Pipelines); Workflows/Environments; Answer; Publication (DOI, etc.)]
![Page 11: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/11.jpg)
1. Transparency: Open peer review
![Page 12: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/12.jpg)
The only drawback?
End reviewer 3 Downfall parody videos, now!
1. Transparency: Open peer review
![Page 13: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/13.jpg)
Publons + AcademicKarma = credit for reviewers' efforts
http://publons.com/
1. Transparency/open peer review
http://academickarma.org/
![Page 14: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/14.jpg)
1. Transparency
Reward pre-prints
![Page 15: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/15.jpg)
http://tmblr.co/ZzXdssfOMJfy
arXiv + blogged reviews = real-time open-review
1. Transparency
![Page 16: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/16.jpg)
![Page 17: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/17.jpg)
2. Data: Reward Open Data
![Page 18: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/18.jpg)
IRRI GALAXY
Rice 3K project: 3,000 rice genomes, 13.4TB public data
2. (Big) Data
![Page 19: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/19.jpg)
2. Data: Reward Intermediate Data
![Page 20: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/20.jpg)
Nanopore MinION E. coli genome released via GigaDB 10-Sep-2014
Curated & converted to ISA-tab, & worked with EBI to get raw data there
Data Note submitted & preprint version out 26th September
Peer reviewed & published 20th October
2. Data: Reward Faster Data Release
http://www.gigasciencejournal.com/content/3/1/22
![Page 21: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/21.jpg)
Real time sequencing era needs real time publication!
• Used as test data for “minoTour”: real time data analysis tools for MinION data
• Nanopore data already used in (CC0 GitHub based) teaching materials
• Next stop…Erratums, Updates & more (see later)
1. minoTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly
2. Data: Reward Faster Data Release
![Page 22: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/22.jpg)
OMERO: providing access to imaging data
Already used by JCB.
View, filter, measure raw images with direct links from journal article.
See all image data, not just cherry picked examples.
Download and reprocess.
2. Data: Reward Imaging Data
![Page 23: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/23.jpg)
The alternative...
...look but don't touch
2. Data: Reward Imaging Data
![Page 24: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/24.jpg)
3. Software
https://www.change.org/p/everyone-in-the-research-community-we-must-accept-that-software-is-fundamental-to-research-or-we-will-lose-our-ability-to-make-groundbreaking-discoveries
![Page 25: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/25.jpg)
galaxy.cbiit.cuhk.edu.hk
4. Workflows: Reward Sharing of Workflows
![Page 26: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/26.jpg)
Visualisations & DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy
![Page 27: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/27.jpg)
• Can facilitate reproducibility, reuse & sharing with tools like: knitr, Sweave, IPython Notebook
5. Open Documents: Reward Open/Dynamic Workbooks
![Page 28: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/28.jpg)
E.g.
![Page 29: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/29.jpg)
E.g.
![Page 30: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/30.jpg)
5. Virtual Machines
?http://ivory.idyll.org/blog/vms-considered-harmful.html
![Page 31: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/31.jpg)
http://dx.doi.org/10.5524/100106
http://www.gigasciencejournal.com/content/3/1/23
5. Virtual Machines
![Page 32: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/32.jpg)
Taking a microscope to the publication process
![Page 33: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/33.jpg)
![Page 34: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/34.jpg)
How reproducible can we get?
[Figure: data sets and analyses, each linked to a DOI]
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
More than 33,000 accesses & 270 citations
Open-Code
7 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/
More than 36,000 downloads
Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
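The DOIs on this slide are what tie the paper, the data and the code together, so it helps to see how a bare DOI becomes a clickable link. A minimal sketch, assuming the public resolver at https://doi.org/ (the slides themselves use the older http://dx.doi.org/ form); the helper name `doi_to_url` is hypothetical and not part of any GigaScience tooling:

```python
from urllib.parse import quote

# Assumption: the public DOI resolver; DOIs are appended to this prefix.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn a DOI string such as 'DOI:10.5524/100044' into a resolver URL."""
    doi = doi.strip()
    # Drop an optional 'doi:' label, in any letter case.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with the '10.' directory code and has a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode anything unusual but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # The three DOIs quoted on the slide.
    for doi in ("DOI:10.1186/2047-217X-1-18", "DOI:10.5524/100044", "DOI:10.5524/100038"):
        print(doi_to_url(doi))
```

The function only builds the link; it makes no network request, so it says nothing about what each DOI resolves to.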
![Page 35: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/35.jpg)
Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
Reward open & transparent review
![Page 36: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/36.jpg)
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
![Page 37: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/37.jpg)
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
![Page 38: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/38.jpg)
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
![Page 39: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/39.jpg)
The SOAPdenovo2 case study
Subject to and test with 3 models:
Types of resources in an RO: Data, Method/Experimental protocol, Findings
Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL
See: http://biorxiv.org/content/early/2014/12/08/011973
![Page 40: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/40.jpg)
![Page 41: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/41.jpg)
1. While there are huge improvements to the quality of the resulting assemblies, it was not stressed in the text (only in the tables) that SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
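Several of these corrections concern the N50 metric: the length L such that contigs (or scaffolds) of length L or longer together hold at least half of all assembled bases. A minimal sketch in Python of how such a value and a fold difference could be computed; the function name and the example lengths are illustrative assumptions, not values from the SOAPdenovo2 paper:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum covers half the bases is the N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical assemblies, for illustration only.
    assembly_a = [80, 70, 50, 40, 30, 20, 10]
    assembly_b = [35, 35, 35, 35, 35, 35, 35, 35, 20]
    print(n50(assembly_a))                    # 70
    print(n50(assembly_a) / n50(assembly_b))  # 2.0, i.e. "2 times longer"
```

A statement such as "N50 is X times longer" is the ratio of two such values, which is why a transcription slip in one table cell changes the headline fold difference.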
![Page 42: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/42.jpg)
Lessons Learned
• Most published research findings are false. Or at least have errors
• It is possible to push button(s) & recreate a result from a paper
• Reproducibility is COSTLY. How much are you willing to spend?
• Much easier to do this before rather than after publication
![Page 43: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/43.jpg)
The cost of staying with the status quo?
• Ioannidis estimates that 85% of research resources are wasted.
• Each retraction estimated to cost $400,000.
![Page 44: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/44.jpg)
In Summary
• Make your data, software & other ROs open (CC0, OSI)
• Get credit for your reviewing
• Publish your research objects (with us!)
www.gigasciencejournal.com
@gigascience
facebook.com/GigaScience
![Page 45: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing](https://reader031.fdocuments.us/reader031/viewer/2022020307/55a44ce51a28ab9c448b4603/html5/thumbnails/45.jpg)
Thanks to:
Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)
Peter Li, Chris Hunter, Jesse Si Zhe, Rob Davidson, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)
Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)
Our collaborators:
team:
Case study:
CBIIT
Funding from:
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com