Post on 15-Jul-2015
@SCEdmunds @GigaScience
@OpenData_HK
Publishing in the Open Data Era
Hackerspace.sg, 23rd March 2015
Open, Crowdsource and Blockchain Science!
1665-2015 style publication
Buckheit & Donoho: Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.
2015 style publication: problems
• Article structure & journal policies (Ingelfinger, etc.) prevent transparency and dissuade sharing of data & methods
• Lack of reproducibility is the norm
• Ioannidis: “an estimated 85% of research resources are wasted” [1]
• Exponential increase in number of retractions [2]
1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.1001747
2. https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
2015 style publication: publish or perish
1. http://www.dcscience.net/2014/12/01/publish-and-perish-at-imperial-college-london-the-death-of-stefan-grimm/
2. http://www.timeshighereducation.co.uk/news/imperial-college-professor-stefan-grimm-was-given-grant-income-target/2017369.article
2015 style publication: publish or perish
Attempts to “game the peer-review system on an industrial scale”
1. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/
2. http://www.grassley.senate.gov/sites/default/files/about/upload/Senator-Grassley-Report.pdf
Companies offering authorship of papers made to order by “paper mills” [1]
Common ghostwriting of medical papers by pharma [2]
Guaranteed publication in JIF journal, often using fake referees, ID theft, etc.
Publishing: more profitable than gold
See: http://alexholcombe.wordpress.com/2013/01/09/scholarly-publishers-and-their-high-profits/
Increasing strain on library budgets
[Figure: MIT library purchases v inflation 1986-2006. Y-axis: percentage change; x-axis: year. Series: Consumer Price Index, serial expenditures, number of serials purchased, number of books purchased, book expenditures. Annotated lines: journal expenditure, inflation.]
Changing the rules of the game to:
• Credit open data & usable things
• Time to rethink reuse beyond just academia (citation) and industry (IP)
• Time to rethink funding models
• Need different shaped
Data + DOI
Broad definition of open data
http://biology.clc.uc.edu/fankhauser/labs/genetics/dna_isolation/thymus_dna.htm
Open Science Data = Open Data
As is Open Access, Open Hardware, Open Environmental Data, Open Scholarship…
IRRI GALAXY
Beneficiaries of the genomics revolution?
Rice 3K project: 3,000 rice genomes, 13.4TB public data
Why Open Science Data is the most important open data*
*(I may be biased though)
Climate change, global hunger, pollution, radioactivity, cancer, disease outbreaks…
http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
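A citation like the one above can be assembled mechanically from a handful of metadata fields. The sketch below is only an illustration: the `DatasetRecord` class and both helper functions are hypothetical names, the author list is abbreviated to "Li, D; et al." for brevity, and it links through the https://doi.org/ resolver rather than the http://dx.doi.org/ form printed on the slide.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal metadata needed to cite a dataset (hypothetical structure)."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str


def doi_url(doi: str) -> str:
    """Build a resolvable link for a DOI using the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def format_citation(rec: DatasetRecord) -> str:
    """Format a data citation: authors, year, title, publisher, DOI, link."""
    return (
        f"{rec.authors} ({rec.year}) {rec.title}. "
        f"{rec.publisher}. doi:{rec.doi} {doi_url(rec.doi)}"
    )


# Values taken from the citation above; the author list is shortened here.
record = DatasetRecord(
    authors="Li, D; et al.",
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)

print(format_citation(record))
```

Because the DOI is a stable identifier, the same record can be cited before any paper exists, which is the point the dataset notice makes.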
Our crowdsourcing example:
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
Downstream consequences:
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
1. Citations (>200)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
“While the result of this just-for-the-fun-of-it exercise isn’t a cure for the superbug, the neat thing about living here in The Future is that just a few days after an outbreak of a deadly disease halfway across the world, the sequence of the pathogen is available for download — and with free, open tools anyone can perform a simple analysis. This is a nascent, but promising, technology ecosystem.”
1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
The “People’s Parrot”
Puerto Rican Parrot Genome Project (Amazona vittata)
Rarest parrot, national bird of Puerto Rico
Community funded from artworks, fashion shows, beer brands, crowdfunding…
Genome annotated by students in community college as part of bioinformatics education
Paper and Data published in GigaScience and GigaDB
Taras K Oleksyk, et al. (2012): A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14
Steven J. O’Brien (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13
Oleksyk et al. (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039
The next step: home sequencing?
Oxford Nanopore MinION
The Star Trek Tricorder is here!
Build your own genomes
Amplicon sequencing in 6hrs
• Sequencer in my toilet/boat/drone
• “who did I get ebola from”?
Metabarcoding
• eDNA & biodiversity research
• Are Pangolins still in Hong Kong?
• Is this salamander still in this stream?
• Is there a new species in my back garden?
Broad definition of open data
http://biology.clc.uc.edu/fankhauser/labs/genetics/dna_isolation/thymus_dna.htm
Revisited…