Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: the reproducibility...

Open Data: the reproducibility crisis, and the need for transparency. Scott Edmunds G3 workshop 19th September 2014 0000-0001-6444-1436

Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: the reproducibility crisis, and the need for transparency. Melbourne University 19th September 2014

Page 1: Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: the reproducibility crisis, and the need for transparency.

Open Data: the reproducibility crisis, and the need for transparency.

Scott Edmunds
G3 workshop
19th September 2014

0000-0001-6444-1436

Page 2

Being able to read things only 1st step
Dead trees not fit for purpose

1812 1665 1869

Page 3

The problems with publishing

• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995)

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication.

• If there is interest in data, it is only to monetise & re-silo it

• Traditional publishing policies and practices a hindrance

Page 4

Growing problem…

…loss of confidence in research

Page 5

The Cost of Scientific Retractions

Page 6

The consequences: growing replication gap

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Page 7

Consequences: increasing number of retractions
>15X increase in last decade

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Page 8

At the current rate of increase, by 2045 as many papers will be retracted as published

Page 9

Consequences: growing replication gap

1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

Insufficient methods

Page 10

STAP paper demonstrates problems:

Need:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies

Page 11

Anatomy of a Dead Tree Publication

Diagram labels: Data, Idea, Study, Analysis, Answer, Metadata

Page 12

Anatomy of an (Open) Data Publication

Diagram labels: Data, Idea, Study, Analysis, Answer, Metadata

Page 13

What is Open (Science) Data?

• Free & open access to data about the world around us:
  o Searchable, findable
  o Machine-readable, app-makeable, Excel-usable
  o Without restrictions/limitations

http://science.okfn.org/

Page 14

Panton Principles

http://pantonprinciples.org/

Page 15

Sharing aids individuals…

Sharing Detailed Research Data Is Associated with Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Every 10 datasets collected contributes to at least 4 papers in the following 3 years.
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285. DOI: 10.1038/473285a

Page 16

Rice v Wheat: consequences of publicly available genome data.

Sharing aids specific communities…

Papers

Page 17

New incentives/credit

• Data
• Software
• Review
• Re-use…
} = Credit

Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)

Page 18

Rewarding open data

Page 19

Cloud solutions?

Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.

Page 20

Lowering barriers: data-athons

DTL/ELIXIR-NL
“Bring Your Own Data Party”

GigaScience/BGI HK

Metabolomics ISA-TAB athon v

Page 21

IRRI GALAXY

Beneficiaries/users of our work

Page 22

IRRI GALAXY
Rice 3K project: 3,000 rice genomes, 13.4TB public data

Beneficiaries/users of our work

Page 23

Our first DOI:

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
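
As a rough illustration of how a dataset citation like the one on this slide can be assembled from its parts, here is a minimal Python sketch. The function names are hypothetical and not part of any GigaScience or DOI tooling; only the DOI, the resolver prefix, and the citation fields come from the slide, and the author list is abbreviated for brevity.

```python
# Sketch only: the function names are hypothetical; the DOI, resolver prefix,
# and citation fields are taken from the slide text.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI such as '10.5524/100001' into a resolver URL."""
    doi = doi.strip()
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + doi


def cite_dataset(authors: str, year: int, title: str, publisher: str, doi: str) -> str:
    """Format a dataset citation in the same shape as the one on the slide."""
    return f"{authors} ({year}) {title}. {publisher}. doi:{doi} {doi_to_url(doi)}"


citation = cite_dataset(
    authors="Li, D; et al.",  # author list abbreviated for the example
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(citation)
```

Keeping the DOI as a separate field, rather than only as a URL, means the same record can be printed as a citation or resolved to the dataset landing page.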

Page 24
Page 25
Page 26
Page 27

Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Citations (~240)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science

Page 28
Page 29

1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

Page 30

http://dx.doi.org/10.5524/100102

The first public Nanopore dataset released 10-Sep-2014

Curated with sample details and converted to ISA-tab

second

Page 31

The other challenge: transparency, accountability, credit

Page 32

More transparency: open peer review

BMC Series Medical Journals

Page 33

Reward open & transparent review

End reviewer 3 Downfall parody videos, now!

Page 34

More transparency (and speed): pre-prints

Page 35

1. http://www.nature.com/news/preprints-come-to-life-1.14140

Page 36

GigaScience + Publons = further credit for reviewers' efforts

Reward open & transparent review

http://publons.com/

Page 37

GigaScience + AcademicKarma = even more credit

Reward faster review

http://academickarma.org/

Page 38

Real-time open-review = paper in arXiv + blogged reviews

Reward open & transparent review

http://tmblr.co/ZzXdssfOMJfy
www.gigasciencejournal.com/content/2/1/10

Page 39
Page 40

(Assemblathon ‘publish for free’ contest: [email protected])

Page 41

In Summary

Make your data open (CC0)

Metadata, metadata, metadata

Get credit for your reviewing

Use pre-prints

Publish your data with us

[email protected]

www.gigasciencejournal.com

@gigascience
facebook.com/GigaScience