Software Repositories for Research -- An Environmental Scan

Repositories for Research -- An environmental scan Micah Altman MIT Libraries Prepared for Software Preservation Network Forum Atlanta August 2016


Page 1: Software Repositories for Research -- An Environmental Scan

Software Repositories for Research -- An Environmental Scan

Micah Altman

MIT Libraries

Prepared for Software Preservation Network Forum

Atlanta

August 2016

Page 2: Software Repositories for Research -- An Environmental Scan

Disclaimer

These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.


Page 3: Software Repositories for Research -- An Environmental Scan

Related Publications

• Altman M, Jackman S. “Nineteen Ways of Looking at Statistical Software”. Journal of Statistical Software. 2011;42.

• Altman, Micah, and Gary King. "A proposed standard for the scholarly citation of quantitative data." D-lib 13, no. 3 (2007):

• Altman, M., Gill, J. and McDonald, M.P., 2004. Numerical issues in statistical computing for the social scientist. John Wiley & Sons.

Reprints available from: informatics.mit.edu


Page 4: Software Repositories for Research -- An Environmental Scan

Today’s Perspectives

* Methods

* Measures

* Merit


Page 5: Software Repositories for Research -- An Environmental Scan

Methods


Page 6: Software Repositories for Research -- An Environmental Scan

Literature Review

Data Curation, Publication and Citation

Software significant properties, use cases

Software repositories

Software & scientific reproducibility


Page 7: Software Repositories for Research -- An Environmental Scan

Web Research - Practice

Review of research repositories

Sources: OpenDOAR, Re3Data, SHERPA/JULIET

Goals: Estimate prevalence of repositories that accept research software; identify exemplar repositories; characterize feature sets by repository category

Methods: term-based queries; descriptive statistics; stratified content case studies

Review of Software Directories

Sources: OpenHub, OSDir, DMOZ

Goals: Identify additional software repositories used in research

Methods: qualitative text analysis; descriptive statistics
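The term-based queries and descriptive statistics named in the methods above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the record fields, the term list, the sample records, and the function names are hypothetical and are not taken from OpenDOAR, Re3Data, or the scan itself. It shows only the general shape of the technique: flag directory records whose descriptions mention a software-related term, then report the share flagged in each repository category.

```python
import re
from collections import Counter

# Hypothetical search terms; the scan's actual query terms are not given in the slides.
SOFTWARE_TERMS = ("software", "source code", "code")

# Made-up records standing in for entries exported from a repository directory.
SAMPLE_RECORDS = [
    {"name": "Repo A", "category": "institutional", "description": "Theses, articles, and datasets."},
    {"name": "Repo B", "category": "disciplinary", "description": "Datasets and research software."},
    {"name": "Repo C", "category": "institutional", "description": "Articles and source code deposits."},
    {"name": "Repo D", "category": "disciplinary", "description": "Survey data only."},
]


def mentions_software(description, terms=SOFTWARE_TERMS):
    """Return True if any term appears as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", description, re.IGNORECASE)
        for term in terms
    )


def prevalence_by_category(records):
    """Return {category: share of records whose description mentions software}."""
    totals, hits = Counter(), Counter()
    for rec in records:
        totals[rec["category"]] += 1
        if mentions_software(rec["description"]):
            hits[rec["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    for category, share in sorted(prevalence_by_category(SAMPLE_RECORDS).items()):
        print(f"{category}: {share:.0%} of sampled records mention software")
```

A keyword match like this only estimates nominal acceptance of software; it says nothing about how well a repository supports it, which is why the methods pair such queries with content case studies.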

Page 8: Software Repositories for Research -- An Environmental Scan

Web Research - Policies

Review of funder policies

Sources: ROARMAP; US Federal Agency Websites

Goals: Estimate prevalence of funder policies on software curation; identify exemplar policies; identify recommended repositories

Methods: qualitative text analysis; descriptive statistics

Review of Journal Policies

Sources: Google Scholar, WoS, DOAJ, Software Sustainability Institute Index

Goals: Estimate prevalence of journals that publish research software; estimate prevalence of software policies at journals; identify exemplar policies; identify recommended repositories

Methods: qualitative text analysis; descriptive statistics

Page 9: Software Repositories for Research -- An Environmental Scan

Measures


Page 10: Software Repositories for Research -- An Environmental Scan

Typical Prevalence of Software Repositories


Page 11: Software Repositories for Research -- An Environmental Scan

Some Exemplars and Promising Initiatives

• Citation and publisher policies:
- FORCE 11 Software Citation Principles: www.force11.org/software-citation-principles
- ACM New Publication Policies on Software Reproducibility and Contributorship: www.acm.org/publications/policies
- PLOS: http://journals.plos.org/plosone/s/materials-and-software-sharing

• Long Term Access:
- www.softwareheritage.org
- guides.github.com/activities/citable-code/
- archive.org/details/softwarelibrary

• Software Journals:
- www.journals.elsevier.com/softwarex/
- www.jstatsoft.org/
- http://openresearchsoftware.metajnl.com/

Page 12: Software Repositories for Research -- An Environmental Scan

Use Cases and Motivating Value

Historic / cultural
- historical scholarship
- “intrinsic value”

Replication and reproducibility
- check claims made in research
- reduced deliberate research fraud
- check reliability (robustness) of results
- check validity (accuracy)

Reuse
- efficiency
- increase speed of development
- standards compliance
- apply methodology to a different corpus
- increased quality and dependability

Render other digital objects
- renders other objects meaningful
- see digital preservation use cases

Legal
- record of licensing, ownership, copyright
- manage legal risks/accountability
- compliance with laws/funding mandates
- reduce barriers to long-term access for other historic use, replication, reuse, rendering

Citation and attribution
- track individual academic career
- track software development/history
- track institutional outputs
- track funder outputs

Page 13: Software Repositories for Research -- An Environmental Scan

Repository Affordances

Affordance areas: Authoring/Development; Discovery/Access; Collection; Preservation; Legal

creator
- Authoring/Development: Language-specific authoring tools; Build environment integration; Versioning; Documentation; Project management; Collaboration
- Attribution
- Preservation: Backups; Commitment to long-term access
- Legal: Access control; License templating

curator
- Authoring/Development: Project management; License template; Monitoring; Collaboration
- Discovery/Access: Browsing; Searching; Persistent Identifiers; Version Ids
- Collection: Collection Policy; Peer Review; Selection; Annotation; Metadata
- Preservation: Preservation policy; Documentation; Format management
- Legal: Access control; License standardization; Legal guidance

institution
- Authoring/Development: Author, Funder Identifiers; Metrics
- Discovery/Access: Author, Funder Identifiers; Metrics
- Collection: Author, Funder Identifiers; Metrics; Compliance; Attribution
- Preservation: Preservation Policy; Preservation replication; Auditability; Certification
- Legal: License standardization; Privacy Management

end-user
- Discovery/Access: Browsing; Searching; Search engine integration; Persistent Identifiers; Version Ids
- Collection: Selection criteria; Annotation; Quality Measures
- Preservation: Documentation
- Legal: Open licensing; License discoverability

Page 14: Software Repositories for Research -- An Environmental Scan

Merit


Page 15: Software Repositories for Research -- An Environmental Scan

State of Software Curation

1. No comprehensive indices of software archives

2. Orders of magnitude fewer software archives than data archives.

(Corollary: Institutional repositories offer little functionality for software archiving, even when nominally supported)

3. Very small proportion of funders have policies addressing software curation

4. There is little available advice for researchers who wish to curate, cite, & preserve software

5. Substantial reproducibility failures related to software continue to be reported

Page 16: Software Repositories for Research -- An Environmental Scan

Contrast with Data Curation -- Lack of Progress

• Compliance
– Funder: data management plans, open data
– Publishers: data access/archiving/citation

• Norms & practices
– Joint data citation principles
– Recognition of data in funder biosketches
– Increased recognition of reproducibility gaps
– Increased recognition of open data/open science

• Technical infrastructure
– Data repository directories
– Data citation indices
– ORCID researcher identifier and registry

• Recognition
– Data citation indices
– Virtual branded archives
– High-profile data publications

Page 17: Software Repositories for Research -- An Environmental Scan

Summing it all up…

Software curation looks a lot like data curation a decade ago…

“How much slower would scientific progress be if the near universal standards for scholarly citation of articles and books had never been developed? Suppose shortly after publication only some printed works could be reliably found by other scholars; or if researchers were only permitted to read an article if they first committed not to criticize it, or were required to coauthor with the original author any work that built on the original. How many discoveries would never have been made if the titles of books and articles in libraries changed unpredictably, with no link back to the old title; if printed works existed in different libraries under different titles; if researchers routinely redistributed modified versions of other authors' works without changing the title or author listed; or if publishing new editions of books meant that earlier editions were destroyed? …

“Unfortunately, no such universal standards exist for citing quantitative data [software], and so all the problems listed above exist now. Practices vary from field to field, archive to archive, and often from article to article.

The data [software] cited may no longer exist, may not be available publicly, or may have never been held by anyone but the investigator. Data [software] listed as available from the author are unlikely to be available for long and will not be available after the author retires or dies. … Data [software] are sometimes listed in the bibliography, sometimes in the text, sometimes not at all, and rarely with enough information to guarantee future access to the identical data set. Replicating published tables and figures even without having to rerun the original experiment, is often difficult or impossible”

-- Altman & King 2007

Page 18: Software Repositories for Research -- An Environmental Scan

Questions?

Web: Informatics.mit.edu

Email: [email protected]