Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)



Scott Edmunds announcing BGI's new GigaScience journal at the 1st Earth Microbiome Project meeting in Shenzhen

Transcript of Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Page 1: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Data Dissemination: Difficulties, Data Citation, and DOIs

(“Mo Data Mo Problems”)

Scott Edmunds

Page 2: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Duplicated genes most responsive to ecological challenges

The Ecoresponsive Genome of Daphnia pulex. Colbourne et al., Science, 4 February 2011.

200Mb Genome, 30,907 genes

Page 3: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Daphnia Genome Consortium

wFleabase: Mar 2006

Genome release: July 2007

Genome Published: Feb 2011

https://daphnia.cgb.indiana.edu/Publications

>58 companion papers

Page 4: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Difficulties

Flickr cc: opensourceway

Page 5: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

~100,000X

Sequencing cost ($ per Mbp)

Moore’s Law

Sequencing

Source: E Lander/Broad

Page 6: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Sequencing Output

Data

Moore’s/Kryder’s Law

Storage

Page 7: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Sequencing Output

Data

Dissemination?

Publication

Page 8: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

1 Illumina HiSeq 2000 (+ TruSeq upgrade)

= 600 Gb/run (12 days)

× 128 HiSeq = ~6 Tb/day = >2 Pb/year

= ~2000 human genomes/day

Potential sequencing capacity
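
The figures on this slide follow from simple arithmetic, sketched below in Python. The 3.2 Gb human genome size, and the reading of "human genomes" as 1× genome equivalents rather than deeply sequenced genomes, are assumptions that the slide does not state; the names used are illustrative only.

```python
# Back-of-envelope check of the sequencing-capacity figures on the slide.
GB_PER_RUN = 600        # gigabases per HiSeq 2000 run (from the slide)
DAYS_PER_RUN = 12       # run length in days (from the slide)
MACHINES = 128          # number of HiSeq instruments (from the slide)
HUMAN_GENOME_GB = 3.2   # ASSUMPTION: approximate human genome size in gigabases


def daily_output_gb(gb_per_run, days_per_run, machines):
    """Combined output of all machines in gigabases per day."""
    return gb_per_run / days_per_run * machines


per_day_gb = daily_output_gb(GB_PER_RUN, DAYS_PER_RUN, MACHINES)
per_day_tb = per_day_gb / 1000             # terabases per day
per_year_pb = per_day_tb * 365 / 1000      # petabases per year
genome_equivalents_per_day = per_day_gb / HUMAN_GENOME_GB  # at 1x coverage

print(f"{per_day_tb:.1f} Tb/day")
print(f"{per_year_pb:.2f} Pb/year")
print(f"{genome_equivalents_per_day:.0f} genome equivalents/day")
```

Under these assumptions the result is roughly 6.4 Tb per day and a little over 2.3 Pb per year, consistent with the rounded values on the slide. At a realistic coverage depth (for example 30×) the number of genomes per day would be correspondingly smaller.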

Page 9: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

SRA Closure

Page 10: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Incentives/credit

Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009)

Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)

Page 11: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Data citation: DataCite and DOIs

Digital Object Identifiers (DOIs) offer a solution

Most widely used identifier for scientific articles

Researchers, authors, publishers know how to use them

Put datasets on the same playing field as articles

Dataset

Yancheva et al. (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
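
As a minimal sketch of why a DOI is easy to cite and follow, the Python snippet below normalizes a DOI string and builds a resolver URL from it. The helper names `normalize_doi` and `doi_to_url` are hypothetical, the pattern is a deliberately loose check rather than a full validation, and the `https://doi.org/` resolver prefix is an assumption (the slides themselves use `http://dx.doi.org/`).

```python
import re

# ASSUMPTION: resolver prefix chosen for illustration; the slides use http://dx.doi.org/
DOI_RESOLVER = "https://doi.org/"

# Loose check: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that commonly appear in front of a bare DOI in citations.
_KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw):
    """Strip a leading 'doi:' label or resolver URL and return the bare DOI."""
    doi = raw.strip()
    for prefix in _KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not _DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognizable DOI: {raw!r}")
    return doi


def doi_to_url(raw):
    """Return a resolver URL for the given DOI string."""
    return DOI_RESOLVER + normalize_doi(raw)


print(doi_to_url("doi:10.1594/PANGAEA.587840"))
```

The same bare identifier works whether the thing cited is an article or a dataset, which is the sense in which DOIs put the two on the same footing. A production version would also need to percent-encode any reserved characters in the DOI suffix before building the URL.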

Page 12: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

Data citation: DataCite and DOIs

>1 million DOIs since Dec 2009

Central metadata repository to link with WoS/ISI

- finally can track and credit use!

Page 13: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

www.gigasciencejournal.com

Large-Scale Data Journal/Database

Editor-in-Chief: Laurie Goodman, PhD

Editor: Scott Edmunds, PhD

Assistant Editor: Alexandra Basford, PhD

In conjunction with:

Coming soon…

Page 14: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

www.gigasciencejournal.com

Criteria and Focus of Journal/Database

Reproducibility/Reuse

Utility/Usability

Standards/Searchability/Scale/Sharing

Data publishing/DOI

Page 15: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

www.gigasciencejournal.com

Use of Data = Importance + Usability

easier to assess

subjective?

Page 16: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

www.gigasciencejournal.com

Reproducibility/Reuse

BGI Cloud Computing resources for handling and analyzing large-scale data.

Integrated tools to promote more widespread access, viewing, and analysis of data.

Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).

Page 17: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

www.gigasciencejournal.com

Standards/Searchability/Sharing

ISA-Tab compatibility to aid and promote best practice in metadata reporting.

All supporting data must be publicly available.

Ask for MIBBI compliance and use of reporting checklists.

Part of the BioSharing network.

Page 18: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

www.gigasciencejournal.com

Data publishing/DOI

New journal format combines standard manuscript publication with an extensive database to host all associated data.

Data hosting will follow standard funding agency and community guidelines.

DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.

Page 19: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate, we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

Our first DOI:

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Page 20: Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and GigaScience)

E. coli #crowdsourcing: the first tweenome?