Post on 28-Mar-2015

Archiving Research Data, Dryad, and Publishers

Neil Beagrie, Charles Beagrie Ltd

Bloomsbury Conference June 2010

With contributions from Julia Chruszcz, Peter Williams, and Todd Vision

Overview

• The Challenge;

• The Dryad Consortium;

• Supplementary Data and Publishers;

• Research Data Preservation Costs (KRDS);

• The Future.

The Challenge

PRC Global Study

[Figure: chart from the PRC global study. Only the sample sizes survive in this transcript: n=3759, 2940, 1262, 1653, 2989, 2118, 1294, 2565, 1868, 2273, 841, 2362.]

Source: PRC global study (forthcoming)

Requesting Data

• Wicherts et al. (2006 Am. Psychol. 61, 726) requested data from the 141 most recent articles in American Psychological Association (APA) journals.

“6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes…”

Only 27% of authors shared their data

The Dryad Consortium of Scholarly Societies and publishers (and libraries)

Archiving at publication

• Avoids loss, corruption, obsolescence of data files;

• The point in time when authors are best able to ensure the correctness of data and metadata;

• Authors have incentive to deposit their data in order to complete the publication process;

• Journals are best able to monitor compliance with policy;

• In short, the “Genbank model” works.

Incentives to authors

• Access to colleagues’ data

• Visibility and citability
– Another way for work to have high impact

• Integration
– Combinability with other data adds value

• Long-term preservation
– Including data format migration

• Ad hoc data sharing can be burdensome
– Deposition to multiple specialized repositories
– Fulfilling individual requests for data takes effort

Joint Data Archiving Policy

• DEPOSIT AT PUBLICATION
– As a condition for publication, all data used in the paper should be archived in an appropriate public archive.

• REPEATABILITY
– Data should be given with sufficient detail so that, together with the paper content, each result in the published paper may be re-created.

• EMBARGO
– Authors may elect to have the data publicly available at time of publication or, if the archive allows, opt to embargo access to the data.

• EXCEPTIONS
– Exceptions may be granted at the discretion of the editor, especially for sensitive information such as the location of endangered species.

• COORDINATION
– The aim is for the Dryad consortium of journals to adopt this policy simultaneously.

That’s all well and good, but where’s this “appropriate public archive”?

A mosaic of specialized databases

• There are a growing number to which deposition is encouraged/required (Genbank, Treebase)
– And others are emerging

• A world in which every datatype had its own required database, each with its own submission system:
– Would be a huge burden on authors
– Would inevitably leave some data orphaned
– Might never be financially possible

Overcoming the submission burden

• Integrating journal submission and data submission
– Prepopulating bibliographic metadata
– “Handshaking” with specialized repositories

• Enhancing low-quality author-provided metadata
– Human curation
– Machine-assisted metadata enhancement

The Dryad Digital Repository

The Repository

• Dryad is a repository (at Duke) for datasets underlying scientific research articles;

• Its initial focus has been evolution and ecology;

• Participating journals subscribe to the Joint Data Archiving Policy;

• Dryad datasets will have DOIs and Creative Commons ‘CC-Zero’ licenses;

• Project funded by the National Science Foundation 2008-2012;

• Sustainability plan a key deliverable.

Supplementary Data and Publishers

Overview

• Consultancy for Dryad Sustainability: covered areas of the draft business plan and sustainability for Dryad

• Presenting one of the contributions (publishers) to the section on Comparators and Costs

• Outcomes from desk research and 12 interviews with publishers/data publishers, plus some additional input drawn from Keeping Research Data Safe

• Very brief presentation: an article is in preparation for the Learned Publishing October 2010 issue; KRDS2 is available from JISC

Interviewees

• Journal of Clinical Investigation

• Journal of the American Medical Association

• Molecular Phylogenetics and Evolution (Elsevier)

• Journal of Heredity (OUP)

• Ecological Society of America

• Wiley-Blackwell + Ecology Letters

• Royal Society

• Federation of American Societies for Experimental Biology

• OECD Publishing

• Internet Archaeology and Archaeology Data Service

• Pangaea: Publishing Network for Geoscientific & Environmental Data

• Dataverse Network (Social Sciences, Harvard)

Some Findings: growth

• Many interviewees stated that supplementary data and materials are showing rapid growth;

• 3 gave figures: from 32 articles in 2000 to 251 in 2009 (a 684% increase, to 784% of the 2000 figure); from 6% in 2005 to 38% in 2009; from 2% a decade ago to 87% in 2009.
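The first growth figure can be checked with a short calculation. The sketch below is illustrative only: the helper name `percent_increase` is an assumption of this example, not something from the study. It shows that going from 32 to 251 articles is a ratio of about 7.84 (784% of the starting figure), which corresponds to an increase of about 684%.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


# Figures quoted by one interviewee: articles with supplementary material.
articles_2000 = 32
articles_2009 = 251

ratio = articles_2009 / articles_2000          # how many times larger
increase = percent_increase(articles_2000, articles_2009)

print(f"2009 is {ratio:.2f}x the 2000 figure ({ratio * 100:.0f}% of it)")
print(f"relative increase: {increase:.0f}%")

# The other two figures are shares of articles, so the natural
# comparison is in percentage points rather than relative growth.
points_a = 38 - 6    # 6% in 2005 to 38% in 2009
points_b = 87 - 2    # 2% a decade ago to 87% in 2009
print(f"share changes: +{points_a} and +{points_b} percentage points")
```

Keeping the ratio and the relative increase separate avoids the common slip of quoting one as the other.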

Some Findings: workflow

• Supplementary data have grown organically at the various journals investigated (author driven);

• Both the work and the costs are being absorbed into the daily running of journals;

• In 4 cases there was minimal impact on work duties; in 5 others there was a significant but often unquantified impact (two of these might be considered data publications, with a focus on publishing data papers or datasets); and in 3 cases the information was not available or unknown;

• The variation can be explained by the level of effort or importance applied: the greatest effort is associated with copy editing, format migration, addition of metadata, etc., whilst the least is required for simply hosting the material; and/or by high levels of automation in the workflow.

Some Findings: costs

• These were in most cases unknown or only partially known;

• Costs mentioned but usually not quantified include: digital storage costs; salary costs of journal staff; and long-term preservation costs;

• Detailed cost information was really only available from Internet Archaeology via the Archaeology Data Service, which had participated in an activity-based costing study (KRDS2);

• Internet Archaeology archiving costs reflect those for a “dataset publisher”, so they are only a comparator for part of Dryad’s content: large datasets.

Some Findings: revenue

• Only author fees and journal subscription fees were mentioned as current revenue sources for the supplementary materials in journals;

• 3 journals interviewed have author charges for supplementary materials (see next slide);

• The data archiving and sharing organisations interviewed relied primarily on (uncertain) research grants and temporary or recurrent core funding, but one had access to a small endowment and another has a charging policy for some depositors.

Some Findings: author charges

• Journal of Clinical Investigation - authors are charged $300 for supplemental data to appear online with accepted articles;

• Ecological Archives - submission of ‘appendices and supplements’ is free up to 10 MB. Above this, there is a fee of $250 for the first 1 GB and $50 for each subsequent GB. The fee for publication of a data paper is $250 for publication of the abstract in the relevant journal plus publication of up to 10 MB in Ecological Archives. An additional $250 is charged for data sets between 10 MB and 1 GB, and for larger datasets there is an additional $50 per GB fee;

• The Federation of American Societies for Experimental Biology (FASEB) charges $100 for each Supplemental file.
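The Ecological Archives schedule for ‘appendices and supplements’ is tiered, so it can be expressed as a small function. This is a minimal sketch under stated assumptions, not the publisher’s actual billing rule: the function name is hypothetical, 1 GB is taken as 1000 MB, and any partial GB beyond the first is assumed to be charged as a full GB. The slide does not say how either case is handled.

```python
import math

# Assumption: 1 GB = 1000 MB (the slide does not specify).
MB_PER_GB = 1000


def supplement_fee_usd(size_mb: int) -> int:
    """Fee for appendices and supplements, following the tiers on the slide.

    Free up to 10 MB; $250 covers anything above that up to 1 GB;
    each subsequent GB costs $50 (partial GBs rounded up, an assumption).
    """
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    if size_mb <= 10:
        return 0
    if size_mb <= MB_PER_GB:
        return 250
    extra_gb = math.ceil((size_mb - MB_PER_GB) / MB_PER_GB)
    return 250 + 50 * extra_gb


for size in (5, 500, 2500):
    print(f"{size} MB -> ${supplement_fee_usd(size)}")
```

The same tier structure, plus the $250 abstract fee, would apply to data papers as the slide describes them.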

Keeping Research Data Safe (KRDS1 & KRDS2):

JISC-funded studies of Research Data Preservation Costs

(separate Dryad costing project by Lori Eakin-Richards based on KRDS approach)

KRDS: what did we learn?

Whole of Service costing / Seeing the “Big Picture”

Selection of 2009 Allocation of UKDA Activity Costs

Acquisition 5.8%

Ingest 21.5%

A. Storage + Pres. Planning 3.1%

Access 16.9%

KRDS: Implications

• Changing view of digital preservation costs:
– “Getting stuff in and out” costs much higher than “keeping it (bit preservation + migration)”;
– Staff costs c.70% of total costs;
– Importance of economies of scale and automation;
– Findings of KRDS and Dryad Repository’s own activity costing projections fed into Dryad sustainability planning.
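The “in and out” versus “keeping it” contrast can be illustrated with the selected 2009 UKDA activity shares quoted earlier. The sketch below is a rough illustration only. Grouping Acquisition, Ingest and Access as “getting stuff in and out”, and treating “A. Storage + Pres. Planning” as “keeping it”, is an assumption of this example, and the four shares are only a selection of the full allocation.

```python
# Selected 2009 UKDA activity cost shares (percent of total), from the slide.
ukda_shares = {
    "Acquisition": 5.8,
    "Ingest": 21.5,
    "A. Storage + Pres. Planning": 3.1,
    "Access": 16.9,
}

# Assumed grouping (not stated in the source).
in_and_out = ("Acquisition", "Ingest", "Access")
keeping_it = ("A. Storage + Pres. Planning",)

in_out_share = sum(ukda_shares[name] for name in in_and_out)
keeping_share = sum(ukda_shares[name] for name in keeping_it)

print(f"getting stuff in and out: {in_out_share:.1f}% of total costs")
print(f"keeping it:               {keeping_share:.1f}% of total costs")
print(f"ratio: about {in_out_share / keeping_share:.0f} to 1")
```

Under this grouping, the activities around deposit and access account for 44.2% of costs against 3.1% for storage and preservation planning.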

Future Plans

• Dryad sustainability plan being put to Dryad member societies and publishers;

• Dryad extending consortium to new members – achieving economies of scale;

• Bid to JISC to establish Dryad-UK;

• Extending KRDS research and implementations.

Further Information

Dryad: see www.datadryad.org

Keeping Research Data Safe2 (KRDS2) webpage at www.beagrie.com/jisc.php

KRDS2 report available from JISC website http://www.jisc.ac.uk/publications/reports/2010/keepingresearchdatasafe2.aspx#downloads

Email: neil@beagrie.com