Scott Edmunds: Data Dissemination in the era of "Big-Data"

Scott Edmunds

Data dissemination in the era of “big data”

www.gigasciencejournal.com

William Gibson: “Information is the currency of the future world”

Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves”

Bio-IT World Asia Meeting, 7th June 2012

Is data “the new oil”?

Data Bonanza?

Data Deluge?

1.2 zettabytes (10^21 bytes) of electronic data generated each year [1]

1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.

Global Sequencing Capacity

Data Production 5.6 Tb / day

> 1500X of human genome / day

Multiple Supercomputing Centers: 157 TFlops

20 TB Memory

14.7 PB Storage

BGI Sequencing Capacity

Data Production 5.6 Tb / day

> 1500X of human genome / day

Multiple Supercomputing Centers: 157 TFlops

20 TB Memory

14.7 PB Storage
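As a rough sanity check on the "> 1500X of human genome / day" figure, here is a minimal sketch. It assumes that "Tb" means terabases and that a haploid human genome is about 3.1 billion bases; neither assumption is stated on the slide.

```python
# Rough check of the "> 1500X of human genome / day" figure.
# Assumptions (not stated on the slide): "Tb" = terabases (1e12 bases),
# and one haploid human genome is roughly 3.1e9 bases.
DAILY_OUTPUT_BASES = 5.6e12
HUMAN_GENOME_BASES = 3.1e9


def daily_fold_coverage(output_bases: float, genome_bases: float) -> float:
    """Return how many genome-equivalents one day's output represents."""
    return output_bases / genome_bases


coverage = daily_fold_coverage(DAILY_OUTPUT_BASES, HUMAN_GENOME_BASES)
print(f"about {coverage:.0f}X of a human genome per day")
```

Under these assumptions the result is a little over 1800X, which is consistent with the "> 1500X" claim.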

Sequencers:

137 Illumina HiSeq 2000

27 LifeTech SOLiD 4

1 454 GS FLX+

2 Illumina iScan

1 Illumina MiSeq

1 Ion Torrent


Large-Scale Data: Journal/Database/Platform

Editor-in-Chief: Laurie Goodman, PhD

Editor: Scott Edmunds, PhD

Assistant Editor: Alexandra Basford, PhD

Lead BioCurator: Tam Sneddon, DPhil

Data Platform: Peter Li, PhD

In conjunction with:

Now taking submissions…

Data-data everywhere?

Data Silos

©$

Interoperability

Paywalls

Metadata

There are many hurdles…

Technical:

• volumes too large

• too heterogeneous

• no home for many data types

• too time consuming

Cultural:

• inertia

• no incentives to share

• unaware of how

Technical challenges…

Cloud solutions?

Better handling of metadata…

Novel tools/formats for data interoperability/handling.

Data quality assessment

Tools making work more easily reproducible…

Workflows

Interoperability/Ease of use


Cloud?

More efficient handling of data…

Do we need to keep everything?

Compression?


Cultural challenges…

Data Re-use

($)

Effort

Usability

Need to lower the hurdles…


Better incentives?


Incentives/credit

Credit where credit is overdue:

“One option would be to provide researchers who release data to public repositories with a means of accreditation.”

“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”

Nature Biotechnology 27, 579 (2009)

Prepublication data sharing (Toronto International Data Release Workshop)

“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”

Nature 461, 168-170 (2009)

Data citation: DataCite and DOIs

Digital Object Identifiers (DOIs) offer a solution

Most widely used identifier for scientific articles

Researchers, authors, publishers know how to use them

Put datasets on the same playing field as articles

Dataset: Yancheva et al. (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840

http://dx.doi.org/10.1594/PANGAEA.587840

Aims to:

“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.

“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
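To make the idea of a citable dataset concrete, here is a minimal sketch of how a dataset record could be turned into a reference-list entry and a resolvable link. The `DatasetRecord` class and both helper functions are hypothetical names invented for illustration. The citation layout simply mirrors the PANGAEA example on this slide and is not an official DataCite format.

```python
from dataclasses import dataclass

# Resolver prefix as it appears in the links on the slide.
DOI_RESOLVER = "http://dx.doi.org/"


@dataclass
class DatasetRecord:
    """Hypothetical minimal record for a citable dataset."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str  # bare DOI, e.g. "10.1594/PANGAEA.587840"


def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.x/y' or a bare '10.x/y' into a resolvable URL."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    return DOI_RESOLVER + bare


def format_citation(rec: DatasetRecord) -> str:
    """Format as: creators (year). title. publisher. doi:..."""
    return f"{rec.creators} ({rec.year}). {rec.title}. {rec.publisher}. doi:{rec.doi}"


record = DatasetRecord(
    creators="Yancheva et al.",
    year=2007,
    title="Analyses on sediment of Lake Maar",
    publisher="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)
print(format_citation(record))
print(doi_to_url("doi:" + record.doi))
```

Because the DOI is the only part that must stay stable, the same record can be cited in a reference list and resolved by a reader without knowing where the data is physically hosted.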

Data citation: DataCite and DOIs

Central metadata repository:

• >1 million entries to date

• Stability

• Data discoverability

• Open & harvestable

• Potential to track & credit use


Data publishing/DOI

New journal format combines standard manuscript publication with an extensive database to host all associated data, and integrated tools. Data hosting will follow standard funding agency and community guidelines.

DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.

www.gigaDB.org

Data Publishing

BGI Datasets Get DOI®s

doi:10.5524/100004

Plants: Chinese cabbage, Cucumber, Foxtail millet, Pigeonpea, Potato, Sorghum

Microbe: E. coli O104:H4 TY-2482

Cell line: Chinese Hamster Ovary

H: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Cancer (14TB); Ancient DNA (Saqqaq Eskimo, Aboriginal Australian)

Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Mini-Pig; Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope

Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Schistosoma; Silkworm

Many released pre-publication…

For data citation to work, it needs:

• Proven utility/potential user base.

• Acceptance/inclusion by journals.

• Data+Citation: inclusion in the references.

• Tracking by citation indexes.

• Usage of the metrics by the community…

Data+Citation: inclusion in the references

• Data submitted to NCBI databases:

- Raw data: SRA:SRA046843

- Assemblies of 3 strains: Genbank:AHAO00000000-AHAQ00000000

- SNPs: dbSNP:1056306

- CNVs, InDels, SV: dbVAR:nstd63

• Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).

http://dx.doi.org/10.5524/100012

http://www.ncbi.nlm.nih.gov/sra/?term=SRA046843

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewBatch.cgi?sbid=1056306
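The accessions on this slide are written as database:identifier pairs. Here is a small sketch of how such tagged accessions could be turned into links. The function names are hypothetical, and the URL patterns are copied only from the two NCBI links shown on the slide. Databases with no link on the slide are deliberately left unmapped rather than guessed.

```python
from typing import Optional, Tuple

# URL patterns taken from the links on the slide; nothing else is assumed.
URL_TEMPLATES = {
    "SRA": "http://www.ncbi.nlm.nih.gov/sra/?term={acc}",
    "dbSNP": "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewBatch.cgi?sbid={acc}",
}


def split_accession(tagged: str) -> Tuple[str, str]:
    """Split 'SRA:SRA046843' into ('SRA', 'SRA046843')."""
    database, sep, accession = tagged.partition(":")
    if not sep or not accession.strip():
        raise ValueError(f"expected 'database:accession', got {tagged!r}")
    return database.strip(), accession.strip()


def accession_url(tagged: str) -> Optional[str]:
    """Return a link for a known database, or None if no pattern is known."""
    database, accession = split_accession(tagged)
    template = URL_TEMPLATES.get(database)
    return template.format(acc=accession) if template else None


for item in ["SRA:SRA046843", "dbSNP:1056306", "dbVAR:nstd63"]:
    print(item, "->", accession_url(item))
```

A single dataset DOI sidesteps this per-database bookkeeping: the GigaDB record can list all of the underlying accessions in one citable place.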

In the references…

Is the DOI…

And now in Nature Biotech…

Data citation: tracking?

Plans in 2012 to link central metadata repository with WoS

- Will finally track and credit use!

To be continued…

DataCite metadata in harvestable form (OAI-PMH)
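"Harvestable" here means that a client can page through the metadata records using standard OAI-PMH requests. The sketch below only builds such a request URL and makes no network call. The verb and argument names (`ListRecords`, `metadataPrefix`, `from`, `set`) come from the OAI-PMH protocol, but the endpoint URL is an assumption and should be checked against DataCite's current documentation.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; verify before use.
OAI_BASE_URL = "https://oai.datacite.org/oai"


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec  # selective harvesting by set
    return base_url + "?" + urlencode(params)


url = list_records_url(OAI_BASE_URL, from_date="2012-01-01")
print(url)
```

A harvester would fetch this URL, parse the XML response, and keep requesting with the returned resumption token until the list is exhausted. That step is omitted here.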

Final step: open licensing

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

Our first DOI:

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.


“The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.”

“German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”

Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Therapeutics (primers, antimicrobials)

2. Platform Comparisons (Loman et al., Nature Biotech 2012)

3. Speed/legal-freedom

The era of the data consumer?


Free access to data, but analysis hubs/nodes will form around it

Data Modeling

Pipeline design

Validation

Commercial applications

Genomic Data Submission and Analytical platform

Big data from the “Sequencing Oil Field”

GDSAP:

Data, Data, Data…

Tin-Lap Lee, CUHK

“Apps”


GDSAP: mirror/open platform

Papers in the era of big-data

To review: (>6TBp, >1500 datasets)

S3 = $15,000

EC2 (BLASTx) = $500,000

$1000 genome = million $ peer-review?

Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops

Papers in the era of big-data

Analysis Data

Tools/Workflows

Compute

goal: Executable Research Objects

Citable DOI

Papers in the era of big-data

goal: Executable Research Objects

Stage 1: Wilson GA, Dhami P, Feber A, Cortázar D, Suzuki Y, Schulz R, Schär P, Beck S: Resources for methylome analysis suitable for gene knockout studies of potential epigenome modifiers. GigaScience 2012, 1:3. (in press)

GigaDB hosting all data + tools (84GB total): doi:10.5524/100035+

Partial (~80%) integration of workflow into our data platform (all the data processing steps, but not the enrichment analysis).

Stage 2: Papers fully integrating all data + all workflows in our platform.

http://dx.doi.org/10.5524/100035


Papers in the era of big-data

Interested in Reproducible Research?

Take part in our session on: “Cloud and workflows for reproducible bioinformatics”

• Rapid review/Open Access/High-visibility

• Article Processing Charge covered by BGI

• Hosting of any test datasets/workflows in GigaDB

Submit to:


Thanks to:

Laurie Goodman, Alexandra Basford, Tam Sneddon, Peter Li, Tin-Lap Lee (CUHK), Qiong Luo (HKUST)

Contact us:

[email protected]

[email protected]

Follow us:

@gigascience

facebook.com/GigaScience

blogs.openaccesscentral.com/blogs/gigablog/


