Research data and scholarly publications: going from casual acquaintances to something more
Research data and scholarly publications:
Going from casual acquaintances to something more
Todd Vision
Dept of Biology, University of North Carolina at Chapel Hill
and the U.S. National Evolutionary Synthesis Center
ALPSP, September 2011
Abort, Retry, Fail? Data and the scholarly literature
Peer-to-peer ‘sharing’ fails
Wicherts and colleagues requested data from 141 articles in American Psychological Association journals.
“6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes…” only 27% of authors complied.
Wicherts, J.M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61, 726-728.
[Figure: information content of data plotted against time, starting at the time of publication. Labels: specific details, general details, accident, retirement or career change, death. (Michener et al. 1997)]
Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.
Source: Publishing Research Consortium, http://publishingresearch.net
n=3824
Taxonomy of data archiving benefits
Modified from Beagrie et al. (2009) Keeping Research Data Safe 2
Direct: Verification of published research; Preserving accessibility to data; Allowing reuse and repurposing of data; Discoverability of data
Indirect (costs avoided): Redundant data collection; Inefficient legacy data curation; Burden of sharing-upon-request; Opportunity cost of science not done
Near term: Protection against personnel turnover; Availability for review and validation
Long term: Secure long-term stewardship; Increased impact per publication
Private: Increased citations; New collaborations; New research opportunities; Fulfilling funding mandates
Public: More efficient use of research dollars; Public trust in science; Educational opportunities; Improved methodologies; More informed policy
Joint Data Archiving Policy (JDAP)
Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future.
As a condition for publication, data supporting the results in the article should be deposited in an appropriate public archive.
Authors may elect to embargo access to the data for a period up to a year after publication.
Exceptions may be granted at the discretion of the editor, especially for sensitive information.
Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data Archiving. American Naturalist. 175(2):145-146.
The long tail of orphan data in “small science”
[Figure: volume plotted against rank frequency of datatype. Labels: specialized repositories (e.g. GenBank, PDB), orphan data. After B. Heidorn.]
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
Smit E (2011) Abelard and Héloise: Why Data and Publications Belong Together. D-Lib Magazine doi:10.1045/january2011-smit
• The End
  - To make data archiving and reuse a standard part of research and publishing.
• The Means
  - Enable low-burden data archiving at the time of manuscript submission.
  - Promote researcher benefits from data archiving.
  - Promote responsible data reuse.
  - Empower journals, societies & publishers in shared governance.
  - Ensure sustainability and long-term preservation.
• The Scope
  - Data underlying peer-reviewed articles in basic and applied biosciences.
[Figure: submission workflow diagram, built up over several slides. Labels are listed in order of appearance.
Integrated: Submit manuscript; Manuscript metadata; Prompt author; Submit data; Peer review; Review passcode; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article publication; Data publication; Article DOI/final metadata harvested.
Non-integrated: Submit data; Data DOI; Author includes data DOI; Article publication; DOI/final metadata harvested.]
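One way to read the integrated workflow in the diagram is as an ordered sequence of hand-offs between the journal and the repository. The following is a minimal sketch, not Dryad's actual implementation: the stage names are adapted from the diagram labels, while the linear ordering, the `Submission` class, its `advance` method, and the identifier `MS-0001` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    # Names adapted from the diagram labels; the strict order is an assumption.
    MANUSCRIPT_SUBMITTED = auto()
    AUTHOR_PROMPTED = auto()
    DATA_SUBMITTED = auto()
    PEER_REVIEW = auto()
    ACCEPTED = auto()
    DATA_CURATED = auto()
    DATA_DOI_ASSIGNED = auto()
    PUBLISHED = auto()


# Enum members iterate in definition order.
ORDER = list(Stage)


@dataclass
class Submission:
    manuscript_id: str  # hypothetical identifier
    stage: Stage = Stage.MANUSCRIPT_SUBMITTED
    history: list = field(default_factory=list)

    def advance(self) -> Stage:
        """Move to the next stage; refuse to go past publication."""
        position = ORDER.index(self.stage)
        if position == len(ORDER) - 1:
            raise ValueError("submission is already published")
        self.history.append(self.stage)
        self.stage = ORDER[position + 1]
        return self.stage


submission = Submission("MS-0001")
while submission.stage is not Stage.PUBLISHED:
    submission.advance()
print([stage.name for stage in submission.history])
```

A real integration would branch (for example, on rejection after peer review) rather than follow a single path; the linear model only illustrates that data deposit sits between manuscript submission and review.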
Dryad relative to Supplementary Online Materials
Criterion | Dryad | SOM
Discoverable: indexed and exposed to both web and bibliographic search engines | ✔ | ✗
Identifiable: DataCite DOIs within articles serve as permanent, resolvable identifiers | ✔ | ✗*
Permanent: processes in place to promote preservation (incl. format migration) | ✔ | ✔/✗**
Curated: quality control by both automated processes and human inspection | ✔ | ✗*
Ease of deposit: streamlined deposit, allowance for large and complex datasets | ✔ | ✔/✗**
Formatted for reuse: support for non-PDF file formats | ✔ | ✔/✗**
Updatable: new versions of data files can be added, metadata can be enhanced | ✔ | ✗
Support for embargoes: can delay release of data in accordance with journal policy | ✔ | ✗
Free reuse: no paywall, clear terms of reuse (all data released under CC Zero) | ✔ | ✔/✗**
Economy of scale: cost efficiency from shared infrastructure | ✔ | ✔/✗**
Alignment to organizational mission: focus on archiving and reuse of scientific data | ✔ | ✗
* A few publisher SOM sites are exceptions to the general rule
** Practices differ among publishers, see Smit (2011), doi:10.1045/january2011-smit
Article citation
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Data citation
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
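The data citation mirrors the article citation: same authors and year, the article title prefixed with "Data from:", the repository name in place of the journal, and the data DOI. The sketch below reproduces that pattern. It is inferred from this single example rather than from any official citation specification, and the function name `format_data_citation` is hypothetical.

```python
def format_data_citation(authors: str, year: int, article_title: str, data_doi: str) -> str:
    """Build a data citation in the 'Data from: <article title>' pattern.

    The pattern is inferred from the one example on the slide; it is an
    assumption, not an official citation format.
    """
    # Drop a trailing period so the title is not followed by two periods.
    title = article_title.strip().rstrip(".")
    return (
        f"{authors} ({year}) Data from: {title}. "
        f"Dryad Digital Repository. doi:{data_doi}"
    )


citation = format_data_citation(
    authors="Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA",
    year=2011,
    article_title=(
        "Stalking the fourth domain in metagenomic data: searching for, "
        "discovering, and interpreting novel, deep branches in phylogenetic "
        "trees of phylogenetic marker genes."
    ),
    data_doi="10.5061/dryad.8384",
)
print(citation)
```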
Rebbeck CA, Leroi AM, Burt A (2011) Mitochondrial capture by a transmissible cancer. Science 331, 303
[Figure: distribution of data packages. Axis labels: number of data packages, number of files, total data package size (bytes).]
20 papers from Delsuc and Douzery, going back to 2002
Downloaded more than 1,000 times so far
Fulfilling the role of a journal
Journal Dryad
Registration ✓ ✓
Certification ✓ (peer review) ✓ (curation)
Awareness ✓ ✓ (distribution)
Archiving ✓
Rewarding ✓ ✓
Does sharing imply that it need be altruistic?
Piwowar H, et al. (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308.
• For a set of 85 cancer microarray clinical trials
  - 48% had publicly available data
  - These received 85% of the article citations
  - Independent of journal impact factor, publication date, author nationality
Piwowar HA, Chapman WW (2008) A review of journal policies for sharing research data. Presented at ELPUB2008, Nature Precedings hdl:10101/npre.2008.1700.1
Data policies among bioscience journals
[Figure: n=70 journals. Group labels: IF=3.6, IF=4.5, IF=6.0.]
The value proposition
• For researchers
  - Increase the impact of, and citations to, published research.
  - Preserve and make data available to verify published results, to refine methodologies, and to repurpose.
  - Free researchers from the burden of data preservation and access.
• For journals, publishers and societies
  - Free journals from the burden of managing supplemental data
  - Increase the discoverability, impact, and integrity of articles
  - Increase their value to the community they serve.
• For funders
  - A cost-effective mechanism to make research more accessible
  - Leverage existing investments in order to enable new science
Sustainability and governance
• Business model
  - Long-term preservation requires a long-term organization
  - In Dryad’s case, a membership-based nonprofit
  - Revenue received from a broad array of ‘customers’, including journals, societies, publishers, and researchers
• Deposit charges
  - Paid upfront, when the majority of costs are incurred
  - Ensure free access to the data in perpetuity
  - Allow revenue to naturally scale with costs (i.e. volume of deposits)
  - Distribute costs fairly among stakeholders
• Governance
  - 12-member Board of Directors, nominated and elected by the Membership
  - Membership serves in an advisory capacity, and is a community of practice
Costs
• Moderate economies of scale are required
  - At 10K packages/yr, <$50/deposit, depending on curation
• What are the costs for SOM?
  - Journal of Clinical Investigation: $300 flat fee
  - Ecological Archives: $250 for <10 MB, more fees beyond that
  - FASEB: $100 per file
Beagrie N, Eakin-Richards L, Vision TJ (2009) Business models and cost estimation: Dryad repository case study. iPRES 2010
Proposed payment plans
1. Journal-based: annual fee based on all research articles published per year (~$25 per article*); covers any deposits from the journal (even from prior years)
2. Voucher-based: pay in advance for some number of deposits (<$50 per deposit)
3. Pay-as-you-go: be invoiced retrospectively for deposits (>$50 per deposit)
4. Author-pays: author pays online at time of deposit; journal can still facilitate archiving through submission integration
* These are rates for Members, which include a 10% discount
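A journal choosing between the journal-based plan and a per-deposit plan can compare the two with simple arithmetic. The sketch below uses only the approximate figures quoted in the plans: about $25 per published article, and $50 per deposit as the boundary between voucher and pay-as-you-go rates. The function names and the example journal (200 articles a year, 80 with deposits) are hypothetical, and the sketch ignores the fact that the journal-based plan also covers deposits from prior years.

```python
JOURNAL_RATE_PER_ARTICLE = 25.0  # approximate member rate quoted in plan 1
PER_DEPOSIT_BOUND = 50.0         # vouchers cost less than this, pay-as-you-go more


def journal_plan_cost(articles_per_year: int) -> float:
    """Annual cost when the fee is based on all articles published."""
    return JOURNAL_RATE_PER_ARTICLE * articles_per_year


def per_deposit_cost(deposits_per_year: int, rate: float = PER_DEPOSIT_BOUND) -> float:
    """Annual cost when each deposit is charged separately."""
    return rate * deposits_per_year


def break_even_deposit_fraction(rate: float = PER_DEPOSIT_BOUND) -> float:
    """Fraction of articles with deposits at which both plans cost the same."""
    return JOURNAL_RATE_PER_ARTICLE / rate


# Hypothetical journal: 200 articles a year, 80 of them with data deposits.
print(journal_plan_cost(200))         # 5000.0
print(per_deposit_cost(80))           # 4000.0
print(break_even_deposit_fraction())  # 0.5
```

Under these assumed rates, the journal-based plan becomes the cheaper option once more than about half of a journal's articles come with a deposit.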
What is the return on investment?
• A rigorous framework is lacking
  - But we can look at comparators
• Marginal cost of data archiving
  - $50/article is <2% of publication costs (>$2.5K)
  - And 0.2% of grant costs/article (~$25K)
• Is the data worth 2% of the research investment?
  - Using DNA microarray data in GEO as a model
  - 2,711 submissions in 2007
  - Data reused by 3rd parties in >1,150 articles
Vision (2011) Open data and social contract of scientific publishing. BioScience, 60(5):330-330
Piwowar H, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473:285
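The percentages on this slide follow directly from the quoted dollar figures. The short sketch below reproduces the arithmetic; the helper name `share_of` is hypothetical, and the inputs are the slide's approximate or bounding values, so the results are bounds rather than exact costs.

```python
def share_of(cost: float, total: float) -> float:
    """Return cost as a percentage of total."""
    return 100.0 * cost / total


archiving_cost = 50.0       # $ per article, from the slide
publication_cost = 2_500.0  # $ per article, the slide's lower bound (>$2.5K)
grant_cost = 25_000.0       # $ per article, the slide's approximate figure

print(f"{share_of(archiving_cost, publication_cost):.1f}% of publication costs")  # 2.0%
print(f"{share_of(archiving_cost, grant_cost):.1f}% of grant costs")              # 0.2%

# GEO comparator: reuse articles per 2007 submission, using the slide's
# lower bound of 1,150 third-party articles.
geo_submissions = 2_711
reuse_articles = 1_150
print(f"{reuse_articles / geo_submissions:.2f} reuse articles per submission")    # 0.42
```

Because publication costs are given as a lower bound, the 2% figure is an upper bound on archiving's share, which matches the slide's "<2%".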
• http://datadryad.org
• http://blog.datadryad.org
• http://datadryad.org/wiki
• http://code.google.com/p/dryad
• [email protected]
• @datadryad
• Dryad
A very incomplete list of contributors
JDAP: M. Whitlock
DryadUS: R. Scherle, E. Feinstein, J. Greenberg, H. Piwowar, P. Schaeffer
DryadUK: B. Hole, Max Wilkinson, D. Shotton
Sustainability planning: N. Beagrie, L. Eakin-Richards