20171003 lancaster data conversations Chue-Hong

www.software.ac.uk
Software – a different kind of research object?
http://dx.doi.org/10.6084/m9.figshare.5459542
3rd October 2017, Lancaster Data Conversations, Lancaster
Neil Chue Hong (@npch), Software Sustainability Institute
ORCID: 0000-0002-8876-7606 | [email protected]
Slides licensed under CC-BY where indicated: Supported by Project funding from

Transcript of 20171003 lancaster data conversations Chue-Hong

Page 1: 20171003 lancaster data conversations Chue-Hong

Software Sustainability Institute

www.software.ac.uk

Software – a different kind of research object?

http://dx.doi.org/10.6084/m9.figshare.5459542

3rd October 2017, Lancaster Data Conversations, Lancaster

Neil Chue Hong (@npch), Software Sustainability Institute

ORCID: 0000-0002-8876-7606 | [email protected]

Slides licensed under CC-BY where indicated:

Supported by Project funding from

Page 2: 20171003 lancaster data conversations Chue-Hong

What’s software got to do with my research?

Page 3: 20171003 lancaster data conversations Chue-Hong

The research community relies on software

Do you use research software?

What would happen to your research without software?

Survey of researchers from 15 Russell Group universities, conducted by SSI between August and October 2014.

406 respondents covering a representative range of funders, disciplines and seniority.

56% develop their own software

71% have no formal software training

Page 4: 20171003 lancaster data conversations Chue-Hong

Software in Nature

Nangia and Katz: https://arxiv.org/pdf/1706.06527.pdf

Page 5: 20171003 lancaster data conversations Chue-Hong

Raise standards for preclinical cancer research

47 out of 53 “landmark” publications could not be replicated

Begley, Ellis. Nature, 483, 2012 doi:10.1038/483531a

Page 6: 20171003 lancaster data conversations Chue-Hong

Repeatability of published microarray gene expression analyses

56% of analyses could not be repeated, of which 30% were because of software issues. 50% did not state software version, 39% did not provide raw data.

Only 11% could be reproduced satisfactorily.

Ioannidis et al. Nature Genetics, 41, 2009 doi:10.1038/ng.295
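
The slide above reports that half of the analyses did not state a software version. As a minimal sketch, assuming a Python-based analysis, the interpreter and package versions can be written out alongside the results. The function name `record_environment` and the package name used in the example are hypothetical; list your own dependencies instead.

```python
import json
import platform
import sys
from importlib import metadata


def record_environment(packages):
    """Return interpreter and package versions to store next to results."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly rather than failing the analysis.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" is only an illustrative package name (an assumption).
    print(json.dumps(record_environment(["pip"]), indent=2))
```

Saving this output with every set of results is one low-effort way to make the version information available to anyone trying to repeat the analysis later.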

Page 7: 20171003 lancaster data conversations Chue-Hong

Repeatability in Computer Science

Of 401 papers in ACM Computer Science journals and proceedings, only 85 provided a link to software. For 176 the software could not be obtained.

Collberg, Proebsting, Warren, University of Arizona TR 14-04, 2015 http://reproducibility.cs.arizona.edu/v2/RepeatabilityTR.pdf

Page 8: 20171003 lancaster data conversations Chue-Hong

Errors due to bioinformatics pipeline

The results presented in the Report “Ancient Ethiopian genome reveals extensive Eurasian admixture throughout the African continent” were affected by a bioinformatics error, identified because of open science.

Llorente et al. Science, 350, 6262 doi:10.1126/science.aad2879

Page 9: 20171003 lancaster data conversations Chue-Hong

Page 10: 20171003 lancaster data conversations Chue-Hong

Isn’t software just a type of data?

Page 11: 20171003 lancaster data conversations Chue-Hong

Authorship Lifecycle

Identify

Cite

Reuse

Research

Index

Papers, data and software are all research outputs of a continuous cycle.

With software, technology makes it easier to track, but not reward.

We cannot separate papers, data and software when we release research.

http://openresearchsoftware.metajnl.com

Page 12: 20171003 lancaster data conversations Chue-Hong

The current process

Start research
Write software
Use software
Produce results
Publish research paper
Release data
Release software
Which mentions software and data

This process is simple but does not reward production or reuse of good software and data.

It also has a long contribution cycle.

Page 13: 20171003 lancaster data conversations Chue-Hong

A better process?

Write software
Start research
Identify existing software
Use software
Produce results
Publish research paper
Adapt/extend software
Release data
Release software
Publish software paper
Publish data paper
Which references software and data papers

Software and data papers are needed as proxies for rewarding reuse.

But it enables a shorter contribution cycle for data and software.

Page 14: 20171003 lancaster data conversations Chue-Hong

Boundary

What do we choose to identify:
- Workflow?
- Software that runs workflow?
- Software referenced by workflow?
- Software dependencies?

What’s the minimum citable part?

http://dx.doi.org/10.6084/m9.figshare.1497930

Page 15: 20171003 lancaster data conversations Chue-Hong

Granularity

Algorithm
Function
Program
Library / Suite / Package

http://dx.doi.org/10.6084/m9.figshare.1497930

Page 16: 20171003 lancaster data conversations Chue-Hong

Versioning

Personal v1
Personal v2
Personal v3
Personal v2a
Public v1
Personal v3a
Personal v2a
Public v2
Public v3

Why do we version?
- To indicate a change
- To allow sharing
- To confer special status

http://dx.doi.org/10.6084/m9.figshare.1497930
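
The version labels on this slide combine a status ("Personal" or "Public") with a version number and an optional letter. As a rough sketch, such labels can be parsed into comparable values so that changes can be ordered and publicly shared versions picked out. The label grammar below is an assumption read off the slide labels, not a versioning scheme the talk prescribes.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    number: int   # main version number, e.g. 2 in "v2a"
    suffix: str   # optional letter marking a variant, e.g. "a"
    status: str   # "Personal" or "Public"


# Assumed grammar: status, optional space, "v", digits, optional letter.
LABEL = re.compile(r"^(Personal|Public)\s*v(\d+)([a-z]?)$")


def parse_label(label: str) -> Version:
    """Parse a label such as 'Personal v2a' into a comparable Version."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised version label: {label!r}")
    status, number, suffix = match.groups()
    return Version(int(number), suffix, status)


labels = ["Personal v3", "Public v1", "Personal v2a", "Personal v1", "Public v2"]
versions = sorted(parse_label(label) for label in labels)
public_releases = [v for v in versions if v.status == "Public"]

for v in versions:
    print(f"v{v.number}{v.suffix} ({v.status})")
```

Sorting on the number and suffix covers "indicating a change", while the status field separates versions meant for sharing from private working copies.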

Page 17: 20171003 lancaster data conversations Chue-Hong

Authorship

• Which authors have had what impact on each version of the software?

• Who had the largest contribution to the scientific results in a paper?

http://beyond-impact.org/?p=175

OGSA-DAI projects statistics from Ohloh

http://dx.doi.org/10.6084/m9.figshare.1497930
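
One crude way to approach the first authorship question is to count commits per author for a given version. The sketch below works on a hypothetical list of author names; in a real Git repository the same lines could come from `git log --format=%an` restricted to the commits between two release tags. Commit counts are only a proxy: they say nothing about who contributed most to the scientific results.

```python
from collections import Counter


def commits_per_author(author_lines):
    """Count commits per author from input with one author name per line."""
    names = (line.strip() for line in author_lines)
    return Counter(name for name in names if name)


# Hypothetical sample data standing in for version-control history.
sample_log = """\
Alice
Bob
Alice
Carol
Alice
"""

counts = commits_per_author(sample_log.splitlines())
total = sum(counts.values())
for author, n in counts.most_common():
    print(f"{author}: {n} commits ({n / total:.0%})")
```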

Page 18: 20171003 lancaster data conversations Chue-Hong

If software is so important, why is most of it hard to reuse?

Page 19: 20171003 lancaster data conversations Chue-Hong

The Software Sustainability Institute

A national facility for cultivating better, more sustainable, research software to enable world-class research

• Software reaches boundaries in its development cycle that prevent improvement, growth and adoption

• Providing the expertise and services needed to negotiate to the next stage

• Developing the policy and tools to support the community developing and using research software

Supported by EPSRC Grant EP/H043160/1 + EPSRC/ESRC/BBSRC grant EP/N006410/1

Page 20: 20171003 lancaster data conversations Chue-Hong

Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/

Special Issue: Reproducible Research, Computing in Science and Engineering, July/August 2012, 14(4)

Howison and Herbsleb (2013) "Incentives and Integration In Scientific Software Production" CSCW 2013.

Page 21: 20171003 lancaster data conversations Chue-Hong

Research Culture Needs Changing

“This particular project was something I wrote a couple years ago to help me out with a workflow… I’d put it up on Github, so that others could potentially use it or use the code. So I went to see what people were saying about this project. It seemed like I’d done something fundamentally wrong, so stupid that it flabbergasts someone... So of course I start sobbing. Then I see these people’s follower count, and I sob harder. I can’t help but think of potential future employers that are no longer potential.”

http://www.software.ac.uk/blog/2013-01-25-haters-gonna-hate-why-you-shouldnt-be-ashamed-releasing-your-code

Page 22: 20171003 lancaster data conversations Chue-Hong

Research Culture Needs Changing

Our research culture presents barriers but few incentives to sharing code

• There is a fear of being “found out” for poor code, but no encouragement or resources to improve software engineering skills

• There is no reward for publishing code in the current system of metrics. Researchers fear being “scooped” or losing ability to publish.

• Many organisations do not understand how to exploit open source licenses

Page 23: 20171003 lancaster data conversations Chue-Hong

Never be ashamed of making your software available

Page 24: 20171003 lancaster data conversations Chue-Hong

Vandewalle (2012) DOI: 10.1109/MCSE.2012.63

Page 25: 20171003 lancaster data conversations Chue-Hong

Research Software Workflow

Develop: developed and versioned using a code repository

Share: published via a code repository or website

Preserve: deposited in a digital repository with paper / for preservation

Page 26: 20171003 lancaster data conversations Chue-Hong

Good Enough Practices To Please Your Future Self

• Data:

Save and back up raw data

Create analysis-friendly data

Record your processing steps

Anticipate the need to use multiple tables, and use a unique identifier for each record

Submit data to a repository and get a DOI
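
The advice on multiple tables and unique identifiers can be illustrated with a small sketch using only the Python standard library. The two tables, their column names and their values are hypothetical; the point is that a unique `sample_id` lets measurements in one table be linked back to exactly one record in the other.

```python
import csv
import io

# Hypothetical tables: one row per sample, and many measurements per sample.
samples_csv = """sample_id,site,collected
S001,Lancaster,2017-09-01
S002,Edinburgh,2017-09-03
"""

measurements_csv = """sample_id,variable,value
S001,temperature,14.2
S001,ph,7.1
S002,temperature,12.8
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


sample_rows = read_rows(samples_csv)
# Index samples by identifier; a duplicate id would silently overwrite a row,
# so check that every identifier really is unique.
samples = {row["sample_id"]: row for row in sample_rows}
if len(samples) != len(sample_rows):
    raise ValueError("sample_id values must be unique")

# Attach the sample details to each measurement via the shared identifier.
joined = [{**samples[m["sample_id"]], **m} for m in read_rows(measurements_csv)]

for row in joined:
    print(row["sample_id"], row["site"], row["variable"], row["value"])
```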

Page 27: 20171003 lancaster data conversations Chue-Hong

Good Enough Practices To Please Your Future Self

• Software:

Document for your future self
• Brief descriptive comment at the start of your code
• Provide a simple example or test data set
• Give functions and variables meaningful names
• Make dependencies and requirements explicit

Learn to be modular
• Break programs into functions
• Don’t duplicate functionality
• Search for well-maintained libraries that do what you need

Make it accessible in the future
• Make the license explicit
• Keep track of changes
• Submit code to a reputable DOI-issuing repository

Good Enough Practices in Scientific Computing: https://doi.org/10.1371/journal.pcbi.1005510
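
A short illustration of several of these practices in one file: a descriptive comment at the start, explicit requirements, meaningful names, small functions, a reused library routine and a simple example. The task (temperature conversion) and all names are hypothetical and chosen only to keep the sketch small.

```python
"""Convert temperature readings from Fahrenheit to Celsius and summarise them.

Requirements: Python 3.8+ (standard library only).
Example: mean_celsius([32.0, 212.0]) returns 50.0
"""
from statistics import mean  # well-maintained library instead of a hand-rolled mean


def fahrenheit_to_celsius(temp_f):
    """Convert one temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def mean_celsius(readings_f):
    """Return the mean of Fahrenheit readings, expressed in Celsius."""
    # Reuse the single conversion function rather than repeating the formula.
    return mean(fahrenheit_to_celsius(reading) for reading in readings_f)


if __name__ == "__main__":
    # Simple example data set that doubles as a quick check.
    example_readings_f = [32.0, 212.0]
    print(mean_celsius(example_readings_f))
```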

Page 28: 20171003 lancaster data conversations Chue-Hong

What you can do now

• Make sure you’re using version control

• Write a README file that describes how you can get your code up and running, and give it to a colleague to try out

What it does, requirements / dependencies, simple example of use and input + output data

• Ask a collaborator to contribute a new piece of functionality, and get feedback on the process

• Talk to your library / IT services about the services they offer
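
The README contents listed above (what it does, requirements / dependencies, a simple example with input and output) can be captured in a small template. The sketch below generates such a file as plain text; the function `render_readme`, the project name and every value passed to it are hypothetical placeholders.

```python
def render_readme(name, what_it_does, requirements, example):
    """Return plain-text README content for the sections named on the slide."""
    lines = [
        name,
        "=" * len(name),
        "",
        "What it does",
        what_it_does,
        "",
        "Requirements / dependencies",
    ]
    lines += [f"- {item}" for item in requirements]
    lines += [
        "",
        "Simple example",
        f"Command: {example['command']}",
        f"Input:   {example['input']}",
        f"Output:  {example['output']}",
        "",
    ]
    return "\n".join(lines)


# All values below are hypothetical placeholders.
readme = render_readme(
    name="tempstats",
    what_it_does="Summarises temperature readings from a CSV file.",
    requirements=["Python 3.8 or later"],
    example={
        "command": "python tempstats.py readings.csv",
        "input": "readings.csv (one reading per row)",
        "output": "mean temperature printed to the terminal",
    },
)
print(readme)
```

The real test of the file is the one on the slide: hand it to a colleague and see whether they can get the code running from it.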

Page 29: 20171003 lancaster data conversations Chue-Hong

Get some training

Teach basic lab skills for scientific computing so that researchers can do more in less time and with less pain.

Teach basic concepts, skills and tools for working more effectively with data. Workshops are designed for people with little to no prior computational experience.

[email protected]

[email protected]

Open source learning that can be tailored to disciplines.

“Train the trainers”: building a capable base of instructors.

Page 30: 20171003 lancaster data conversations Chue-Hong

Three months from now, you will thank yourself!

Page 31: 20171003 lancaster data conversations Chue-Hong

Interested in more?

• Publish a software paper

http://bit.ly/softwarejournals

• Easily archive your GitHub code and make it citable

GitHub to Zenodo

GitHub to FigShare

• Software Citation Implementation WG

https://www.force11.org/group/software-citation-implementation-working-group
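
Once a release has been archived and has a DOI, it can be cited like any other output. The sketch below assembles a plain-text citation from authors, name, version, year and identifier. The layout is illustrative rather than an official citation style, and the release details, including the DOI, are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoftwareRelease:
    authors: List[str]
    name: str
    version: str
    year: int
    doi: str


def format_citation(release: SoftwareRelease) -> str:
    """Build a plain-text citation; the layout is illustrative, not a standard."""
    authors = ", ".join(release.authors)
    return (
        f"{authors} ({release.year}). {release.name} "
        f"(version {release.version}) [Software]. "
        f"https://doi.org/{release.doi}"
    )


# Hypothetical release; the DOI is a placeholder, not a real record.
release = SoftwareRelease(
    authors=["Researcher, A.", "Colleague, B."],
    name="tempstats",
    version="1.2.0",
    year=2017,
    doi="10.5281/zenodo.XXXXXXX",
)
print(format_citation(release))
```

Citing the specific version, rather than only the project, is what lets a reader retrieve the code that actually produced a result.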

Page 32: 20171003 lancaster data conversations Chue-Hong

Literate Programming

• Traditional papers are just advertisements

A literate computing document is the research

• The technology is out there

Jupyter notebooks

Mathematica

R Markdown

knitr

MATLAB Live scripts
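
To make the idea of a literate computing document concrete, the sketch below builds a minimal Jupyter notebook as plain JSON: prose cells and code cells interleaved in one file. It assumes the version 4 notebook layout and uses only the standard library; the cell contents are hypothetical.

```python
import json


def markdown_cell(text):
    """Return a notebook cell holding narrative text."""
    return {
        "cell_type": "markdown",
        "metadata": {},
        "source": text.splitlines(keepends=True),
    }


def code_cell(code):
    """Return an unexecuted notebook cell holding code."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": code.splitlines(keepends=True),
    }


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("# Mean temperature\nWe average two example readings."),
        code_cell("readings = [14.2, 12.8]\nsum(readings) / len(readings)"),
    ],
}

# Saving this text to a file ending in .ipynb lets Jupyter open it.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```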

Page 33: 20171003 lancaster data conversations Chue-Hong

LIGO Example

• LIGO Paper: http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.061102

• LIGO Notebook: https://losc.ligo.org/s/events/GW150914/GW150914_tutorial.ipynb

Page 34: 20171003 lancaster data conversations Chue-Hong

SSI Fellows 2018 / CW18

• SSI Fellowships

Deadline: 9th October 2017

£3000 bursary to be a research software advocate

Join a network of great people working to improve

• Collaborations Workshop 2018

Cardiff, 26-28th March 2018

Theme: “Culture Change and Productivity”

The un-conference that most participants would recommend to their colleagues

Page 35: 20171003 lancaster data conversations Chue-Hong

Without data it’s difficult to validate results. But without code, we waste the opportunity to advance science.

These slides: http://dx.doi.org/10.6084/m9.figshare.5459542

“The only way to publish software in a scientifically robust manner is to share source code, and that means publishing via the internet in an open-access/open-source fashion.”
Warren Lyford DeLano, creator of PyMOL, 2005

Page 36: 20171003 lancaster data conversations Chue-Hong

The Software Sustainability Institute

A national facility for cultivating better, more sustainable, research software to enable world-class research

• Software reaches boundaries in its development cycle that prevent improvement, growth and adoption

• Providing the expertise and services needed to negotiate to the next stage

• Developing the policy and tools to support the community developing and using research software

Supported by EPSRC Grant EP/H043160/1 + EPSRC/ESRC/BBSRC grant EP/N006410/1

Page 37: 20171003 lancaster data conversations Chue-Hong

Software: Helping the community to develop software that meets the needs of reliable, reproducible, and reusable research

Policy: Collecting evidence on the community’s software use & sharing with stakeholders

Training: Delivering essential software skills to researchers via CDTs, institutions & doctoral schools

Community: Bringing together the right people to understand and address topical issues

Outreach: Exploiting our platform to enable engagement, delivery & uptake

Page 38: 20171003 lancaster data conversations Chue-Hong

Website & blog

Campaigns

Advice

Guides

Courses

Workshops

Fellowship

Research

Software

Policy

Training

Community

Consultancy

50+ projects
130+ evaluations
4 surgeries
35+ UK SWC workshops
1000+ learners
80+ guides
50,000 readers
61 domain ambassadors
20+ workshops organised
740 researchers
50,000 grants analysed
150+ contributed articles
20,000 unique visitors per month
3,000 Twitter followers
300+ RSEs engaged
2100 signatures
13 issues highlighted

Outreach

Page 39: 20171003 lancaster data conversations Chue-Hong

Find out more about the SSI

• Community Engagement (Lead: Shoaib Sufi): Fellowship Programme; Events and Workshops

• Consultancy (Lead: Steve Crouch): Open Call for Projects / Collaborations; Software Evaluation

• Policy and Publicity (Lead: Simon Hettrick): Case Studies / Policy Campaigns; Software and Research Blog

• Training (Lead: Aleksandra Nenadic): Software Carpentry and Data Carpentry (300+ students/year); Guides and Top Tips

• Journal of Open Research Software (Editor: Neil Chue Hong)

• Collaboration between universities of Edinburgh, Manchester, Oxford and Southampton

Supported by EPSRC Grant EP/H043160/1 + EPSRC/ESRC/BBSRC grant EP/N006410/1

Page 40: 20171003 lancaster data conversations Chue-Hong

Page 41: 20171003 lancaster data conversations Chue-Hong

Research Culture Needs Changing

But there’s still a lot to be done

• Software Assessment

• Software Management Plans

• Group Identifiers
Software project teams – encompassing contributors
Software products – across versions

• Machine readable references
Software papers solve the credit problem
The reference problem is still hard

• Where is software mentioned and can we find it

Page 42: 20171003 lancaster data conversations Chue-Hong

Research Culture Needs Changing

Mechanisms are becoming available

• Roles
Project Credit http://credit.casrai.org/
Transitive Credit http://doi.org/10.5334/jors.be

• Mechanisms
Software papers http://bit.ly/softwarejournals
Software citation https://doi.org/10.7717/peerj-cs.86

• Tools
Researcher Identifiers e.g. ORCID http://orcid.org/
Alt-Metrics e.g. ImpactStory http://impactstory.org/

• Metadata
CodeMeta http://codemeta.github.io/
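
As a rough illustration of machine-readable software metadata, the sketch below writes a minimal CodeMeta-style description as JSON-LD. The property names are taken from the CodeMeta 2.0 vocabulary; the project, author, licence choice and repository URL are hypothetical placeholders.

```python
import json

# Minimal CodeMeta-style record; all values are hypothetical placeholders.
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "tempstats",
    "version": "1.2.0",
    "description": "Summarises temperature readings from a CSV file.",
    "license": "https://spdx.org/licenses/MIT",
    "codeRepository": "https://example.org/tempstats.git",
    "programmingLanguage": "Python",
    "author": [
        {"@type": "Person", "givenName": "A.", "familyName": "Researcher"}
    ],
}

# By convention the text is saved as codemeta.json at the top of the repository.
codemeta_json = json.dumps(codemeta, indent=2)
print(codemeta_json)
```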

Page 43: 20171003 lancaster data conversations Chue-Hong

Research Culture Needs Changing

Software Referencing needs

• Where is software referenced in publications?

• How can we understand its influence?

• How can we choose between software?

Howison, Bullard 2015. DOI: 10.1002/asi.23538