Scott Edmunds: Publishing in the Open Data Era, talk at Hackerspace.sg

@SCEdmunds @GigaScience @OpenData_HK Publishing in the Open Data Era Hackerspace.sg, 23rd March 2015 Open, Crowdsource and Blockchain Science!

@SCEdmunds @GigaScience

@OpenData_HK

Publishing in the Open Data Era

Hackerspace.sg, 23rd March 2015

Open, Crowdsource and Blockchain Science!

1665-2015 style publication

Buckheit & Donoho: Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.

2015 style publication: problems

• Article structure & journal policies (Ingelfinger, etc.) prevent transparency and dissuade sharing of data & methods

• Lack of reproducibility is the norm

• Ioannidis: “an estimated 85% of research resources are wasted” (1)

• Exponential increase in number of retractions (2)

1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.1001747
2. https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

2015 style publication: publish or perish

1. http://www.dcscience.net/2014/12/01/publish-and-perish-at-imperial-college-london-the-death-of-stefan-grimm/
2. http://www.timeshighereducation.co.uk/news/imperial-college-professor-stefan-grimm-was-given-grant-income-target/2017369.article

2015 style publication: publish or perish

Attempts to “game the peer-review system on an industrial scale”

1. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/
2. http://www.grassley.senate.gov/sites/default/files/about/upload/Senator-Grassley-Report.pdf

Companies offering authorship of papers made to order by “paper mills” (1). Ghostwriting of medical papers by pharma is common (2).

Guaranteed publication in JIF journal, often using fake referees, ID theft, etc.

Publishing: more profitable than gold

See: http://alexholcombe.wordpress.com/2013/01/09/scholarly-publishers-and-their-high-profits/

Increasing strain on library budgets

[Figure: MIT library purchases v inflation 1986-2006. Y-axis: percentage change (-50% to 400%); x-axis: year. Series: Consumer Price Index, serial expenditures, number of serials purchased, number of books purchased, book expenditures. Annotations: journal expenditure, inflation.]

http://goo.gl/zUDEC9

Changing the rules of the game to:

• Credit open data & usable things

• Time to rethink reuse beyond just academia (citation) and industry (IP)

• Time to rethink funding models

• Need different shaped

Data + DOI

Feeding the Commons: Digitizing the world

Can we make everything open data?

Broad definition of open data

http://biology.clc.uc.edu/fankhauser/labs/genetics/dna_isolation/thymus_dna.htm

Open Science Data = Open Data

As is Open Access, Open Hardware, Open Environmental Data, Open Scholarship…

IRRI GALAXY

Beneficiaries of the genomics revolution?

Rice 3K project: 3,000 rice genomes, 13.4TB public data

Why Open Science Data is the most important open data*

*(I may be biased though)

Climate change, global hunger, pollution, radioactivity, cancer, disease outbreaks…

http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
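Since the citation above gives the dataset both as a bare `doi:` string and as a resolver link, a minimal sketch of how the two forms relate may help. The function name `doi_to_url` and the loose validity pattern are illustrative assumptions, not part of the talk; the only external fact relied on is that `https://doi.org/` followed by a DOI is the standard resolver address.

```python
import re

# Loose shape check for a DOI ("10.", a registrant code, "/", a suffix).
# This is an assumption for illustration, not a full validity test.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that are commonly written in front of a DOI.
PREFIXES = (
    "doi:",
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
)


def doi_to_url(doi: str) -> str:
    """Normalise a DOI string and return its doi.org resolver URL."""
    cleaned = doi.strip()
    for prefix in PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


# The dataset DOI quoted in the citation above.
print(doi_to_url("doi:10.5524/100001"))
```

Both spellings used in the citation normalise to the same link, which is what lets a dataset be cited and counted like a paper.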

Our crowdsourcing example:

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Citations (>200)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science

“While the result of this just-for-the-fun-of-it exercise isn’t a cure for the superbug, the neat thing about living here in The Future is that just a few days after an outbreak of a deadly disease halfway across the world, the sequence of the pathogen is available for download — and with free, open tools anyone can perform a simple analysis. This is a nascent, but promising, technology ecosystem.”
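As a rough idea of what a "simple analysis" with free tools can look like, here is a minimal sketch that reads FASTA-formatted text and reports the length and GC fraction of each sequence. It is not the analysis the quoted author ran; the function names and the toy records are made up for illustration, and a downloaded genome file would be read from disk instead of from the inline string.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous A/C/G/T bases."""
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Made-up toy records, not real outbreak data.
EXAMPLE = """>contig_1 toy example
ATGCGC
GGTA
>contig_2
ATATNN
"""

for name, seq in parse_fasta(EXAMPLE).items():
    print(name, len(seq), round(gc_fraction(seq), 2))
```

Ambiguous bases such as N are counted in the length but left out of the GC fraction, which is one common convention among several.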

1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

https://apps.facebook.com/fraxinusgame/

Biggest crowdfunding successes

The “People’s Parrot”

Puerto Rican Parrot Genome Project (Amazona vittata)

Rarest parrot, national bird of Puerto Rico

Community funded from artworks, fashion shows, beer brands, crowdfunding…

Genome annotated by students in community college as part of bioinformatics education

Paper and Data published in GigaScience and GigaDB

Taras K Oleksyk, et al., (2012) A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14

Steven J. O’Brien. (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13

Oleksyk et al., (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039

The “Cyber Slug”

Community genomics comes of age: “Crowdfernding”, the “Community Cactus” and more…

The next step: home sequencing?

Oxford Nanopore MinION

The Star Trek Tricorder is here!

Build your own genomes

Amplicon sequencing in 6hrs
• Sequencer in my toilet/boat/drone
• “Who did I get Ebola from?”

Metabarcoding
• eDNA & biodiversity research
• Are Pangolins still in Hong Kong?
• Is this salamander still in this stream?
• Is there a new species in my back garden?

“Independent Researchers”: ancestry hacking

Broad definition of open data

http://biology.clc.uc.edu/fankhauser/labs/genetics/dna_isolation/thymus_dna.htm

Revisited…

The (non-) human centipede: first sequence

Volumetric Printing: http://www.lookingglassfactory.com/

http://www.thingiverse.com/thing:25695

Gamification/utilizing students: iGEM

iGEM: http://2011.igem.org/Team:UC_Davis

Open data: reading & writing DNA

What could we be doing with open science data?

Mojave Solar Farms v Desert Tortoise

What could we be doing with open science data?

Hong Kong-Zhuhai-Macau Bridge v Pink Dolphins

To summarize:

• 350-year-old communication systems are no longer fit for purpose

• Need to move from producing dead trees to feeding the commons

• Need to think of new sources of funding to do this

• Open data is more than just government data

• Open data is more than just digital data