Software workflows as research objects Rob L Davidson, Chris I Hunter ISI CODATA International...

Software workflows as research objects Rob L Davidson, Chris I Hunter ISI CODATA International Training Workshop on Big Data 11th March 2015 Slideshow-URL


Software workflows as research objects

Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data
11th March 2015
Slideshow-URL

Article: http://econ.st/1o12gCN

• Big data! (The new oil)

Article: http://bit.ly/1AN8ysJ

Source: @flowchainsensei

Analysis

Software

Article: http://bit.ly/1xdCxbY

Article: http://bit.ly/1Mdll03

Yay, we’re all unicorns!

Are you recruiting a data scientist or a unicorn? http://ubm.io/1Gpxizh

Source: http://bit.ly/1MdA8rI

But why are we sad unicorns?

Measuring software reproducibility

• 515 papers (429 conference, 86 journal)
• <30% reproducible
http://reproducibility.cs.arizona.edu

Measuring software reproducibility

Reasons for failure

“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

Cost of failure

• Waste time
• Waste money
• Frustrating
• Distrust

How to fix it

The path to enlightenment

• A word from the experts (4 x 10 simple rules)
• Share code
– Licenses
• Share environment
– Codify the environment
• Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
• Share outputs
– Share intermediate results
– Share code for figures
– Codify publications

Slideshow URL……………..

A word from the experts

A word from the experts

A word from the experts: 1

• Keep it simple
– Don’t be a perfectionist
– Aim for multiple versions
– Optimise/improve later
– Get feedback/help from community
• Hastings #1 + Prlic #5

A word from the experts: 2

• Versioning
– Use a versioning system (e.g. GitHub)
– Allow others to know what version they use
– Release early, release often (Linus Torvalds)
– Get help from community
• Seemann #3, Hastings #10, Sandve #3/4
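One way to "allow others to know what version they use" is to have the tool report its own version. The sketch below is a minimal illustration using Python's standard `argparse`; the tool name `mytool` and the version string are hypothetical placeholders, not taken from the talk.

```python
import argparse

# Hypothetical version string; in practice it would match a tagged release.
__version__ = "0.1.0"


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser that can report the tool's version."""
    parser = argparse.ArgumentParser(prog="mytool", description="Example analysis tool")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument("--version", action="version", version=f"%(prog)s {__version__}")
    parser.add_argument("input", help="path to the input file")
    return parser
```

Calling `build_parser().parse_args()` from a script's entry point would then let a user run the script with `--version` and quote the result alongside their analysis.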

A word from the experts: 3

• Use good coding practice
– You don’t have to be the best
– Learn from others
– Become involved in a community
– Write as though others will be watching
• Prlic #2 + all of Seemann and Hastings

A word from the experts: highlight

• Start simple
• Release early
• Use versioning
• Build a community
• Get community feedback, testing, support

• …but wait, won’t that mean???

Sharing code

Sharing code

• “Scientific software…public release is then only considered around the time of publication” – Prlic #4

• “the fear of getting scooped”
– Reality: “staking a claim in the field”

Sharing code: don’t worry
• Share early
– Be simple
– Don’t be a perfectionist
• CRAPL license

Source: http://matt.might.net/articles/crapl/

Sharing code: licenses
• Know your licenses
– Apache License 2.0
– BSD 3-Clause “New” or “Revised”
– BSD 2-Clause “Simplified” or “FreeBSD”
– GNU General Public License (GPL)
– MIT
– Mozilla Public License 2.0
– etc.

Source: http://opensource.org/licenses

Sharing code: repositories

• GitHub
• SourceForge
• Zenodo
• GigaDB/GigaGalaxy

• Versioning, sharing, collaboration, community feedback

Sharing environment

Your environment

• How hard would it be to start from scratch?
• What if you move from Ubuntu to CentOS?
• If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
– Create a virtual machine or ‘docker’ image that can be shared whole.
– Time-stamp of working system

Share your environment

• Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time

Share your environment

• Docker
– ‘Light’ VM
– Discrete unit of code+environment
– Can be called like a compiled tool
• New possibilities e.g. nucleotid.es benchmarking
– Data-driven peer-review

Share your environment

• VM = black box?
• Docker == black box!
• http://ivory.idyll.org/blog/vms-considered-harmful.html

Codify your environment

• Provisioning scripts are ‘research objects’
• Improves adaptability (easier to recode for alternative OS etc.)
• Builds in extra documentation
• Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
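A provisioning script describes how to build the environment. A complementary step is to record what the working environment actually contained. The sketch below is one possible approach for a Python-based analysis, using only the standard library. The function names and JSON layout are assumptions for illustration, and it does not replace a Vagrant, Chef, Salt or Ansible script.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect a minimal description of the current Python environment."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries without a usable name
    )
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": packages,
    }


def write_snapshot(path: str) -> None:
    """Write the snapshot as JSON so it can be shared next to the code."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot_environment(), handle, indent=2)
```

A file produced by `write_snapshot("environment.json")` (a hypothetical file name) could be deposited with the code as a time-stamped record of a working system.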

List of provisioning systems

• Vagrant
• Chef
• Salt
• Ansible

Sharing pipelines

Share your pipeline

• Any analysis is a string of tools with a great many parameters

• The order of the sequence, the version of each part and the inputs and outputs are never fully explained

• These should be shared!
• Help is at hand: there are many ‘workflow’ systems for this
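To make the point concrete, the order of steps, tool versions, parameters, inputs and outputs can be written down as data. The sketch below is a hand-rolled illustration, not how any particular workflow system stores its workflows. The tool names, versions and file names are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One tool invocation: what ran, which version, with what settings."""
    tool: str
    version: str
    parameters: dict
    inputs: list
    outputs: list


@dataclass
class Workflow:
    """An ordered list of steps; list order is the execution order."""
    name: str
    steps: list = field(default_factory=list)

    def add(self, step: Step) -> "Workflow":
        self.steps.append(step)
        return self

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Hypothetical two-step analysis, recorded in the order it was run.
workflow = (
    Workflow("example-analysis")
    .add(Step("trim_reads", "1.2", {"min_quality": 20}, ["reads.fastq"], ["trimmed.fastq"]))
    .add(Step("align_reads", "0.7", {"threads": 4}, ["trimmed.fastq"], ["aligned.bam"]))
)
print(workflow.to_json())
```

The resulting JSON is small enough to share as a supplemental file. A workflow system does this bookkeeping automatically.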

List of workflow systems

• Galaxy
• Knime
• Taverna

Galaxy

Over 36,000 main Galaxy server users

Over 1,000 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

http://galaxyproject.org

Galaxy User Interface

Tool List Tool Parameters History/results

Galaxy: Under the hood

<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>

Basic xml 'wrapper'

Describe inputs and outputs

Calls command

Monitors for output

Logs/returns to 'history'
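The four annotations above (describe inputs and outputs, call the command, monitor for output, log to history) can be imitated in a few lines. This is only a sketch of the idea, not Galaxy's implementation. `run_tool`, the `history` list and the stand-in command are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(command: list, output_path: Path, history: list) -> bool:
    """Run a command, check the declared output appeared, and log the result."""
    result = subprocess.run(command, capture_output=True, text=True)
    succeeded = result.returncode == 0 and output_path.exists()
    history.append({
        "command": command,
        "output": str(output_path),
        "status": "ok" if succeeded else "error",
    })
    return succeeded


history = []
with tempfile.TemporaryDirectory() as workdir:
    output = Path(workdir) / "output1.csv"
    # Stand-in for "python myfunction input1": a one-line script that writes a CSV.
    script = "import sys; open(sys.argv[1], 'w').write('a,b\\n1,2\\n')"
    ok = run_tool([sys.executable, "-c", script, str(output)], output, history)
```

After the call, `history` holds one entry describing the command, the expected output file and whether it was produced.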

Galaxy Workflow: visualise

Galaxy Workflow: visualise

Galaxy Workflow: visualise

Galaxy Workflow: export

Citable workflow
Add as supplemental files or publish with distinct DOI via GigaDB or FigShare

Galaxy Toolshed

https://toolshed.g2.bx.psu.edu/

Many 'omics, stats, visualisations

2700+ tools!

Download; run instantly

Sharing outputs

Share outputs – intermediate results

• Workflow systems help with this
• If a part of your analysis can’t be replicated
– Requires a license
– Is no longer compatible
– Just plain won’t work
• The rest of the analysis can still be used
• (show diagram)
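A small sketch of why shared intermediate results help: if a step cannot be rerun, a later step can still start from the shared file. The helper `run_or_reuse`, the step functions and the file names are hypothetical. The checksum is recorded so a reader can confirm which file was used.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_or_reuse(step_name: str, func, output: Path, manifest: dict) -> Path:
    """Run a step unless its shared intermediate output is already present."""
    if output.exists():
        status = "reused"
    else:
        func(output)
        status = "computed"
    manifest[step_name] = {"file": output.name, "sha256": sha256_of(output), "status": status}
    return output


def licensed_step(output: Path) -> None:
    # Stands for a step that others cannot replicate.
    raise RuntimeError("requires a license that is not available")


def summarise(source: Path, output: Path) -> None:
    total = sum(int(value) for value in source.read_text().split())
    output.write_text(f"{total}\n")


manifest = {}
with tempfile.TemporaryDirectory() as workdir:
    shared = Path(workdir) / "step1.txt"
    shared.write_text("1\n2\n3\n")  # intermediate result shared by the authors
    step1 = run_or_reuse("step1", licensed_step, shared, manifest)
    step2_path = Path(workdir) / "step2.txt"
    step2 = run_or_reuse("step2", lambda out: summarise(step1, out), step2_path, manifest)
    result = step2.read_text()
```

Here the first step is never executed because its output was shared, and the second step still runs to completion.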

Share outputs – code for figures

Share outputs – codify publication

• KnitR e.g. http://www.gigasciencejournal.com/content/3/1/3

• Options given here: http://www.gigasciencejournal.com/content/3/1/19
– R: KnitR, Sweave, R-Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy

Research objects

• Project proposal
• Project experimental SOPs
• Images of equipment, subjects, conditions
• Raw data
• Metadata
• Analysis code, parameters, pipelines
• Analysis environment, VM or provisioning script
• Intermediate results
• Publication figures/images/tables: codify
• Publication text

Share early
Share widely
Share openly

Slideshow URL