Data enters Scholarly Communication; how publishers can help make things better

Integration of Research Data and Publications

Project ODE – workpackage 4

Eefke Smit

International Association of STM Publishers

Director, Standards and Technology

LONDON, ANNUAL APA CONFERENCE, 9 November 2011

A famous paper in Nature: DNA structure - 1953

• 1 page

• 2 authors

• 1 figure

• no data

Source: V. Kiermer, Nature Publishing Group, 2011

Nature in 2001: The human genome issue

• 62 pages, 49 figures, 27 tables

Source: V. Kiermer, Nature Publishing Group, 2011

The human genome at 10 – 2010

Nature now in an iPad edition:

Source: V. Kiermer, Nature Publishing Group, 2011

A thousand genomes – 2010

http://www.nature.com/nature/journal/v467/n7319/full/nature09534.html

Raw data: 12,145 SRA run ids submitted to Short Read Archive

Source: V. Kiermer, Nature Publishing Group, 2011

• Author information

• Live updates

• Collapsible sections

• Tool box to print, download reference, share: email, social media, bookmark

• Figure previewer

• Related content

• New publishing models

• DOI

• Article-level metrics

Source: V. Kiermer, Nature Publishing Group, 2011

From the Biochemical Journal, Portland Press: Ever wanted to inspect data referenced in articles? Utopia Documents allows you to interact directly with curated database entries. Play with molecular structures; edit sequence and alignment data; even plot curated tabular data yourself. http://www.biochemj.org/bj/semantic_faq.htm

Elsevier offers gene and protein viewers within the article, linking to data stored elsewhere:

How big is the Data Problem?

Depositions of datasets in archives continue to grow, surpassing journal articles in biomedical research

Growth of biomedical research publications (red; current total >19 million), alongside the accumulation of research data, including nucleic acid sequences (black; current total ~163 million), computer-annotated protein sequences (magenta; current total 9 million), manually annotated protein sequences (green; current total 500,000) and protein structures (blue; current total 60,000)

Source: Biochemical Journal 2009 424, 317-333 - Teresa K. Attwood, Douglas B. Kell and others.

How big is the Data Problem? Too big for the Journal of Neuroscience and Cell:

The graph depicts the average size of a Journal of Neuroscience article and its supplemental material in megabytes.

As a consequence, the Journal no longer accepts supplementary files with manuscripts: the supplementary material would soon have outgrown the article volume, and the burden on the peer review process had simply become too large.

Editors suspect researchers of treating supplements as data dumping grounds (Emilie Marcus, Cell).

Publishers cannot guarantee proper preservation and future accessibility of supplementary files.

Source: Maunsell J., J. Neurosci. 2010;30:10599-10600

©2010 by Society for Neuroscience

Researchers foresee higher volumes of data per research project:

Estimated amount of data stored per research project

Data volume    Current    In 2 years    In 5 years
0MB            1%         1%            2%
1-100MB        17%        8%            5%
100MB-1GB      25%        19%           13%
1GB-1TB        40%        41%           36%
1TB-1PB        6%         13%           20%
1PB-10PB       1%         3%            5%
>10PB          0%         0%            2%
Don't Know     11%        14%           17%

Source: PARSE.Insight survey 2008

Data Publications Pyramid: there is data, data and data...

(1) Data contained and explained within the article

(2) Further data explanations in any kind of supplementary files to articles

(3) Data referenced from the article and held in data centers and repositories

(4) Data publications, describing available datasets

(5) Data in drawers and on disks at the institute

The Data Publication Pyramid

The Pyramid’s likely short-term reality:

(1) Top of the pyramid is stable but small

(2) Risk that supplements to articles turn into data dumping places

(3) Too many disciplines lack a community-endorsed data archive

(4) Estimates are that at least 75% of research data is never made openly available

The Ideal Pyramid

(1) More integration of text and data, viewers and seamless links to interactive datasets

(2) Supplementary files only if data cannot be integrated in the article, and only for relevant extra explanations

(3) Seamless (bi-directional) links between publications and data, interactive viewers within the articles

(4) More Data Journals that describe datasets, data management plans and data methods

How can publishers help to make things better

• Stricter editorial policies on the availability of underlying data

• Recommend reliable and trustworthy Data Archives to authors

• Enhance articles for better integration of underlying data

• Endorse guidelines for proper citation of data

• Launch and sponsor Data Journals

• Ensure persistent identifiers and bi-directional linking

• Partner with reliable Data Archives for further integration of Data and Publications, including interactivity for re-use.
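
The point about persistent identifiers and bi-directional linking can be illustrated with a small sketch. It is not taken from the slides: the `LinkRegistry` class, its method names and the `doi:10.1234/...` identifiers are all hypothetical, and a real service would sit on a persistent store rather than in memory. The idea shown is only that each link is recorded once and can then be followed from the article to the data and from the data back to the article.

```python
from collections import defaultdict


class LinkRegistry:
    """In-memory store of article <-> dataset links, keyed by persistent identifier.

    Hypothetical illustration only; not an existing publisher or archive API.
    """

    def __init__(self):
        self._datasets_by_article = defaultdict(set)
        self._articles_by_dataset = defaultdict(set)

    def link(self, article_id, dataset_id):
        # Record the link in both directions in one step so the two views cannot drift apart.
        self._datasets_by_article[article_id].add(dataset_id)
        self._articles_by_dataset[dataset_id].add(article_id)

    def datasets_for(self, article_id):
        # Article -> data: what a reader follows from the publication.
        return sorted(self._datasets_by_article.get(article_id, ()))

    def articles_for(self, dataset_id):
        # Data -> article: what a data archive shows next to a dataset.
        return sorted(self._articles_by_dataset.get(dataset_id, ()))


registry = LinkRegistry()
# Made-up identifiers, for illustration only.
registry.link("doi:10.1234/article.1", "doi:10.1234/dataset.A")
registry.link("doi:10.1234/article.2", "doi:10.1234/dataset.A")
registry.link("doi:10.1234/article.1", "doi:10.1234/dataset.B")
```

Because both lookups are filled by the same `link` call, a publisher and a data archive sharing such a registry would each see the other's side of the relationship without maintaining separate link lists.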

What the Future Article might look like

• Articles will be less linear and more modular, offering layered presentation of different levels of detail and providing multiple entry points to deeper levels for specialists, including to the underlying data.

• Data, multimedia and other original material will become separately citable items and even publishable items in their own right.

• Underlying data will become part of articles, via interactive PDFs, via gene and protein viewers, via semantic links.

• Articles will be interactive; graphs and illustrations will offer click-throughs to deeper information. The same goes for semantically tagged terms.

• Data Archives will ensure links from data to publications, so that all available literature is at hand for those interested in reusing the data.
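
If datasets are to be separately citable items, each one needs a minimal set of metadata from which a citation can be built. The sketch below is an assumption rather than anything the slides prescribe: it follows the common "Creator (Year): Title. Publisher. Identifier" pattern used in data citation guidelines such as DataCite's, and the `DatasetRecord` fields, the function name and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal metadata assumed necessary to cite a dataset (hypothetical field set)."""
    creators: list
    year: int
    title: str
    publisher: str   # for data, typically the archive or repository holding it
    identifier: str  # persistent identifier, e.g. a DOI


def format_data_citation(record):
    # Assumed pattern: Creator (Year): Title. Publisher. Identifier
    creators = "; ".join(record.creators)
    return f"{creators} ({record.year}): {record.title}. {record.publisher}. {record.identifier}"


# Made-up record, for illustration only.
example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2011,
    title="Example survey dataset",
    publisher="Example Data Archive",
    identifier="doi:10.1234/example.dataset",
)
citation = format_data_citation(example)
print(citation)
```

The persistent identifier is what makes the citation actionable: an article's reference list can point to the dataset itself, not only to the paper that first described it.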

Questions?

Eefke Smit

International Association of STM Publishers

Director, Standards and Technology

[email protected]