Software workflows as research objects
Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data
11th March 2015
Slideshow-URL
Measuring software reproducibility
• 515 papers (429 conference, 86 journal)
• <30% reproducible
http://reproducibility.cs.arizona.edu
Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
The path to enlightenment
• A word from the experts (4 x 10 simple rules)
• Share code
– Licenses
• Share environment
– Codify the environment
• Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
• Share outputs
– Share intermediate results
– Share code for figures
– Codify publications
A word from the experts: 1
• Keep it simple
– Don’t be a perfectionist
– Aim for multiple versions
– Optimise/improve later
– Get feedback/help from community
• Hastings #1 + Prlic #5
A word from the experts: 2
• Versioning
– Use a version control system (e.g. Git, hosted on GitHub)
– Let others know which version they are using
– Release early, release often (Linus Torvalds)
– Get help from community
• Seemann #3, Hastings #10, Sandve #3/4
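One way to let others know which version they are using is to have the tool report it. The sketch below is only an illustration: the tool name `mytool` and the version string are hypothetical, and in practice the string would match a tagged release in the repository.

```python
import argparse

# Hypothetical version string; in practice it would match a tagged release.
__version__ = "0.1.0"


def build_parser():
    """Return a command-line parser that can report the tool's version."""
    parser = argparse.ArgumentParser(prog="mytool")
    parser.add_argument(
        "--version",
        action="version",
        version="%(prog)s " + __version__,
    )
    return parser


def version_banner():
    """Return a line suitable for the top of every output or log file."""
    return "mytool version " + __version__
```

Writing the banner into each output file means a result can later be traced back to the release that produced it.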
A word from the experts: 3
• Use good coding practice
– You don’t have to be the best
– Learn from others
– Become involved in a community
– Write as though others will be watching
• Prlic #2 + all of Seemann and Hastings
A word from the experts: highlight
• Start simple
• Release early
• Use versioning
• Build a community
• Get community feedback, testing, support
• …but wait, won’t that mean???
Sharing code
• “Scientific software…public release is then only considered around the time of publication” – Prlic #4
• “the fear of getting scooped”
– Reality: “staking a claim in the field”
Sharing code: don’t worry
• Share early
– Be simple
– Don’t be a perfectionist
• CRAPL license
Source: http://matt.might.net/articles/crapl/
Sharing code: licenses
• Know your licenses
– Apache License 2.0
– BSD 3-Clause “New” or “Revised”
– BSD 2-Clause “Simplified” or “FreeBSD”
– GNU General Public License (GPL)
– MIT
– Mozilla Public License 2.0
– etc.
Source: http://opensource.org/licenses
Sharing code: repositories
• GitHub
• SourceForge
• Zenodo
• GigaDB/GigaGalaxy
• Versioning, sharing, collaboration, community feedback
Your environment
• How hard would it be to start from scratch?
• What if you move from Ubuntu to CentOS?
• If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
– Create a virtual machine or Docker image that can be shared whole
– Time-stamp of working system
Share your environment
• Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time
Share your environment
• Docker
– ‘Light’ VM
– Discrete unit of code + environment
– Can be called like a compiled tool
• New possibilities, e.g. nucleotid.es benchmarking
– Data-driven peer review
Share your environment
• VM = black box?
• Docker == black box!
• http://ivory.idyll.org/blog/vms-considered-harmful.html
Codify your environment
• Provisioning scripts are ‘research objects’
• Improves adaptability (easier to recode for alternative OS etc.)
• Builds in extra documentation
• Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
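A first step towards codifying an environment is simply recording what it contains. The following is a minimal sketch, not a provisioning script: it assumes the analysis runs under Python 3.8 or later and only captures the operating system, the interpreter version and the installed Python packages. The function name `describe_environment` is illustrative.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Collect OS, Python and installed package versions in a plain dictionary."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the record so it can be saved alongside the analysis outputs.
    print(json.dumps(describe_environment(), indent=2))
```

A record like this does not rebuild the machine, but it documents the working system at a point in time and gives a starting point for writing a real provisioning script.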
Share your pipeline
• Any analysis is a string of tools with a great many parameters
• The order of the sequence, the version of each part and the inputs and outputs are never fully explained
• These should be shared!
• Help is at hand: there are many ‘workflow’ systems for this
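Even without a workflow system, the order of steps, tool versions, parameters, inputs and outputs can be written down in a machine-readable form. The sketch below is one possible layout, not the format of any particular workflow system; the `Step` class, the tool names, versions and file names are all made up for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    """One tool invocation in the analysis (all field names are illustrative)."""
    tool: str
    version: str
    parameters: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def workflow_to_json(steps):
    """Serialise the steps in execution order, numbering them from 1."""
    record = [dict(order=i, **asdict(step)) for i, step in enumerate(steps, start=1)]
    return json.dumps(record, indent=2)


# Hypothetical two-step analysis: tool names, versions and files are made up.
steps = [
    Step("trim_reads", "1.2.0", {"min_quality": 20}, ["reads.fastq"], ["trimmed.fastq"]),
    Step("align_reads", "0.7.1", {"threads": 4}, ["trimmed.fastq"], ["aligned.sam"]),
]
```

Calling `workflow_to_json(steps)` gives a JSON document that can be shared with the code and data, so a reader can see exactly what was run and in which order.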
Galaxy
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source
http://galaxyproject.org
Galaxy: Under the hood
<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>
• Basic XML 'wrapper'
– Describes inputs and outputs
– Calls command
– Monitors for output
– Logs/returns to 'history'
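Because the wrapper is plain XML, any program can read the declared command, inputs and outputs from it. The sketch below uses Python's standard library on a well-formed copy of the simplified wrapper from the slide. It is not how Galaxy itself is implemented, and `read_wrapper` is a hypothetical helper name.

```python
import shlex
import xml.etree.ElementTree as ET

# A well-formed copy of the simplified wrapper shown on the slide.
TOOL_XML = """
<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>
"""


def read_wrapper(xml_text):
    """Extract the tool name, command line and declared inputs/outputs."""
    tool = ET.fromstring(xml_text)
    return {
        "name": tool.get("name"),
        # Split the command text into an argument list, as a shell would.
        "command": shlex.split(tool.findtext("command").strip()),
        "inputs": {p.get("name"): p.get("format") for p in tool.iter("param")},
        "outputs": {d.get("name"): d.get("format") for d in tool.iter("data")},
    }
```

The result of `read_wrapper(TOOL_XML)` holds everything a workflow system needs to call the tool and know which output file to watch for.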
Galaxy Toolshed
https://toolshed.g2.bx.psu.edu/
Many 'omics, stats, visualisations
2700+ tools!
Download; run instantly
Share outputs – intermediate results
• Workflow systems help with this
• If a part of your analysis can’t be replicated because it
– Requires a license
– Is no longer compatible
– Just plain won’t work
• The rest of the analysis can still be used
• (show diagram)
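When intermediate results are shared, a checksum manifest lets others confirm they have the same files before picking up the analysis from a later step. This is a minimal sketch under assumptions: the function names and the manifest file name `MANIFEST.json` are illustrative and not part of any workflow system.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(result_dir, manifest_name="MANIFEST.json"):
    """List every file under result_dir with its checksum and save the list."""
    result_dir = Path(result_dir)
    manifest = {
        path.relative_to(result_dir).as_posix(): sha256_of(path)
        for path in sorted(result_dir.rglob("*"))
        if path.is_file() and path.name != manifest_name
    }
    (result_dir / manifest_name).write_text(json.dumps(manifest, indent=2))
    return manifest
```

If one step can no longer be rerun, its recorded outputs can still be verified against the manifest and fed into the remaining steps.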
Share outputs – codify publication
• knitr, e.g. http://www.gigasciencejournal.com/content/3/1/3
• Options given here: http://www.gigasciencejournal.com/content/3/1/19
– R: knitr, Sweave, R Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy
Research objects
• Project proposal
• Project experimental SOPs
• Images of equipment, subjects, conditions
• Raw data
• Metadata
• Analysis code, parameters, pipelines
• Analysis environment, VM or provisioning script
• Intermediate results
• Publication figures/images/tables: codify
• Publication text