The beauty of workflows and models
Workflows for research. Reproducible research.
Professor Carole Goble
The University of Manchester, UK
The Software Sustainability Institute
[email protected]
@caroleannegoble
RDMF Meeting, Westminster, 20 June 2014
Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct
… papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension
Jill Mesirov, Accessible Reproducible Research, Science 22 Jan 2010: 327(5964): 415-416, DOI: 10.1126/science.1179653
Virtual Witnessing*
*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995
datasets, data collections, standard operating procedures, software, algorithms, configurations, tools and apps, codes, workflows, scripts, code libraries, services, system software, infrastructure, compilers, hardware
Morin et al, Shining Light into Black Boxes, Science 13 April 2012: 336(6078) 159-160
Ince et al The case for open computer programs, Nature 482, 2012
“Executable Data”
Biodiversity marine monitoring and health assessment
ecological niche modelling
Data Intensive Science
Collaborative Science
Pilumnus hirtellus
Enclosed sea problem (Ready et al., 2010)
Sarah Bourlat http://www.biovel.eu
Analytical cycle
Data collection
Data discovery
Data assembly, cleaning, and refinement
Ecological Niche Modeling
Statistical analysis
Insights
Scholarly Communication & Reporting
BioSTIF
method
instruments and laboratory
Workflows: capture the steps
assembly & interoperability
shielding & optimising
flexible variant reuse
pipelines & exploration
repetition & comparison
record & set-up provenance collection
report & embed
multi-code and multi-resource experiments
in-house and external
workflow management systems
materials
http://www.taverna.org.uk
Application
Generalist
Specialist
Infrastructure
Scientific Workflow Management Systems
Systems Biology
Modelling Cycle
Virtual Physiological Human Morphology
Microbiology Metabolic Pathways
http://www.vph-share.eu/
standards, standards, standards
Data
Models
Articles
External Databases
http://www.seek4science.org
Metadata
http://www.isatools.org
Aggregated Content Infrastructure
sharing and interlinking multi-stewarded, mixed, methods, models, data, samples…
Preservation Planning & Watch
continuous preservation management
Environment and users
Repository
access, ingest, harvest
Monitored environment and users
Watch
Planning
Operations
create/ reevaluate plans
deploy plan
monitored actions
Monitored content and events
execute action plan
policies
SCOUT
c3po
PLATO
Taverna workflows
RODA
Long term preservation of digital data. Maintaining scans of newspapers, books, records of data; metadata maintenance; large and automated.
Preservation Policy: Collection level
Control Policy: Low level actions & constraints
http://www.scape-project.eu/
Merge a Preservation Action Plan….
… with an Access Workflow
Execution Workflow
Preservation Planning & Watch
Publish
Use
Components
Workflows (and Scripts and Models) are…
… provenance of data
… general technique for describing and enacting a process
… precise, unambiguous, transparent protocols and records.
… often complex, so they need explaining.
… often challenging and expensive to develop.
… know-how and best practice.
… collaborations.
… first class citizens of research
… support the process of research
Workflow publishing
[Scott Edmunds]
Publishing
Journals
Portals
Integrative Frameworks
galaxyproject.org/
Reproducibility = Hard Work
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
>5000 downloads
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Data sets
Analyses
Linked to
DOI
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>11000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Enabled code to be picked apart by bloggers in wiki
[Scott Edmunds]
http://ropensci.org
Repositories, Libraries, Registries
Archiving, Publishing, Component Libraries, Preserving
Recording, Storing, Exchanging, Versioning, Sharing
PACKS
Repositories
Data Operations in Workflows in the Wild
Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails
Garijo et al, Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351
Research Method Stewardship
Management, Publishing, Preservation
Workflows & Scripts
Services & Codes
Standard Operating Procedures
Descriptions
Standards
Portal
Different systems
Formats
Web Services
Code Libraries
Executables
Models & Algorithms
Mark-up Languages, Mathematical descriptions
Standards
BioModels
Reuse
Organised Groups
Trust
Reciprocity
Visibility
Roll your own from standard parts
Complementarity
Design and Instruction
Curation
Non-intrusive, Non-invasive, Not invisible
Enclaves
Specialist Flirts
Blue collar
Incremental
JIT not JIC
Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/
Special Issue Reproducible Research, Computing in Science and Engineering July/August 2012, 14(4)
Howison and Herbsleb (2013) "Incentives and Integration In Scientific Software Production" CSCW 2013.
Data Stewardship
Making practices, Sustainability, Management planning, Deposition, Long term access, Credit, Journals, Licensing, Open source / access
Best Practices for Scientific Computing http://arxiv.org/abs/1210.0530
1st Workshop on Maintainable Software Practices in e-Science – e-Science 2012
Stodden, Reproducible Research Standard, Intl J Comm Law & Policy, 13 2009
Software
Models
Workflows
Services
http://sciencecodemanifesto.org/
http://matt.might.net/articles/crapl/
Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
Software release paradigm
Some of your data isn’t data
Not a static document paradigm
• Release research
• Methods in motion.
• Versioning
• Forks & merges
• F1000, PeerJ, GitHub….
Pivot around method / software / data rather than paper
Citation semantics: software as was? software as is?
The multi-dimensional paper
methods, reproducibility
what does it mean for content managers and the research workflow?
Replication Gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
Out of 18 microarray papers, results from 10 could not be reproduced
re-compute
replicate
rerun
repeat
re-examine
repurpose
recreate
reuse
restore
reconstruct
review
regenerate
revise
recycle
regenerate the figure
redo
“When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean - neither more nor less.”*
*Lewis Carroll, Through the Looking-Glass, and What Alice Found There (1871)
reuse
reproduce
repeat
replicate
Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online
Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
Methods (techniques, algorithms, spec. of the steps)
Materials (datasets, parameters, algorithm seeds)
Experiment
Instruments (codes, services, scripts, underlying libraries)
Laboratory (sw and hw infrastructure, systems software, integrative platforms)
Setup
same experiment, same set up, same lab
same experiment, same set up, different lab
same experiment, different set up
different experiment, some of same
validate
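The four combinations of experiment, set up, and lab listed above can be read as a small decision rule. The sketch below assumes they pair with the terms repeat, replicate, reproduce, and reuse in that order; the transcript lists the terms next to the combinations but does not state the pairing explicitly, so treat it as an interpretation. `Attempt` and `classify` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """How a new run relates to the original study (illustrative fields)."""
    same_experiment: bool
    same_setup: bool
    same_lab: bool


def classify(attempt: Attempt) -> str:
    """Map an attempt to one of the four terms (assumed pairing)."""
    if not attempt.same_experiment:
        return "reuse"       # different experiment, some parts shared
    if not attempt.same_setup:
        return "reproduce"   # same experiment, different set up
    if not attempt.same_lab:
        return "replicate"   # same experiment and set up, different lab
    return "repeat"          # same experiment, set up, and lab


print(classify(Attempt(same_experiment=True, same_setup=True, same_lab=False)))
```

Checking the experiment first, then the set up, then the lab mirrors the order in which the list relaxes its constraints.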
Design
Execution
Result Analysis
Collection
Publish / Report
Peer Review
Peer Reuse
Modelling
Can I repeat & defend my method?
Can I review / reproduce and compare my results / method with your results / method?
Can I review / replicate and certify your method?
Can I transfer your results into my research and reuse this method?
* Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)
Research Report
Prediction
Monitoring
Cleaning
Record Everything
Automate Everything
recomputation.org
sciencecodemanifesto.org
[Adapted Freire, 2013]
Authoring
Exec. Papers
Link docs to experiment
Sweave
Provenance
Tracking, Versioning
Replay, Record, Repair
Workflows, makefiles
ProvStore
open, accessible, available
description: intelligible, machine-readable
provenance: gather dependencies, capture steps, track & keep results
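The three provenance activities named above (gather dependencies, capture steps, track and keep results) can be illustrated with a very small recorder. This is a minimal sketch under stated assumptions, not how Taverna, ProvStore, or any tool mentioned in the talk works: `ProvenanceLog`, `run_step`, and the record layout are hypothetical, and the "dependencies" gathered are limited to the Python version and operating system.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Fingerprint a result so it can be tracked and compared later."""
    return hashlib.sha256(data).hexdigest()


class ProvenanceLog:
    """Hypothetical recorder: dependencies once, then one entry per step."""

    def __init__(self):
        self.record = {
            # Gather dependencies (only interpreter and OS in this sketch).
            "dependencies": {
                "python": sys.version.split()[0],
                "platform": platform.system(),
            },
            "steps": [],
        }

    def run_step(self, name, func, data: bytes) -> bytes:
        """Run one step and capture what went in and what came out."""
        result = func(data)
        self.record["steps"].append({
            "step": name,
            "recorded": datetime.now(timezone.utc).isoformat(),
            "input_sha256": sha256_of(data),
            "output_sha256": sha256_of(result),
        })
        return result

    def to_json(self) -> str:
        """Serialise the record so it can be kept alongside the results."""
        return json.dumps(self.record, indent=2)


log = ProvenanceLog()
raw = b"  a,b,c\n"
cleaned = log.run_step("strip whitespace", bytes.strip, raw)
shouted = log.run_step("upper-case", bytes.upper, cleaned)
print(log.to_json())
```

Because each step stores the hash of its input and output, a later rerun can check whether the chain of intermediate results still matches.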
http://nbviewer.ipython.org/github/myGrid/DataHackLeiden/blob/alan/Player_example.ipynb
https://www.youtube.com/watch?v=QVQwSOX5S08 ?
notebooks
Build into the workflows of research….
RDataTracker and DDG Explorer
[Barbara S. Lerner and Emery R. Boose]
Components
Dependencies
Change
• 35 kinds of annotations
• 5 Main Workflows
• 14 Nested Workflows
• 25 Scripts
• 11 Configuration files
• 10 Software dependencies
• 1 Web Service
• Dataset: 90 galaxies observed in 3 bands
• Multiple platforms
• Multiple systems
José Enrique Ruiz (IAA-CSIC)
Galaxy Luminosity Profiling
specialist codes libraries, platforms, tools
services
(cloud) hosted services
commodity platforms
data collections
catalogues
software repositories
my data
my process
my codes
integrative frameworks
gateways
Document vs Instrument
Reproducibility by Inspection: Read It
Reproducibility by Invocation: Run It
Instrument Entropy: all experiments become less reproducible
Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble. Why workflows break - Understanding and combating decay in Taverna workflows, 8th Intl Conf e-Science 2012
Mitigate
Detect, Repair
Preserve
Partial replication
Approx reproduce
Verification
Benchmarks
Environmental Ecosystem
Joppa et al SCIENCE 340 May 2013; Morin et al Science 336 2012
Black boxes
Mixed systems, mixed stewardship
Distributed, hosted systems
Workflow Planning & Watch of Workflows
Watch
Operations
Planning
Env & Users
Repository
plan
deploy
monitor monitor
monitor
access, ingest, harvest
Decay, Service Deprecation, Data source monitoring, Checklists, Minimal Models
Workflows, myExperiment
Workflows for managing workflows
portability
variability tolerance
[Adapted Freire, 2013]
preservation
packaging
provenance: gather dependencies, capture steps, track & keep results
versioning
host
service
Open Source/Store
Sci as a Service
Integrative fws
Virtual Machines
Recompute, limited installation, Black Box
Byte execution, copies
Descriptive read, White Box
Archived record
Read & Run, Co-location
No installation
Portable Package
White Box, Installation
Archived record
ReproZip
Levels of Reproducibility
Coverage: how much of an experiment is reproducible
Original Experiment
Similar Experiment
Different Experiment
Portability
Depth: how much of an experiment is available
Binaries + Data
Source Code / Workflow + Data
Binaries + Data + Dependencies
Source Code / Workflow + Data + Dependencies
Virtual Machine: Binaries + Data + Dependencies
Virtual Machine: Source Code / Workflow + Data + Dependencies
Figures + Data
[Freire, 2014]
Findable
Accessible
Interoperable
Reusable
http://datafairport.org/
Packs
(Dynamic) Research Objects
• Bundle and relate multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms, Currency of exchange
• Exchange, Releasing paradigm for publishing
http://www.researchobject.org/
OAI-ORE
W3C OAM
H. Van de Sompel et al., Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping, 9th International Digital Curation Conference; Trusty URLs
Machine readable metadata*
Machine actionable systems**
* Especially Linked Data and RDF
** Especially REST APIs
Sys Bio Research Object
Adobe UCF
Research Object Bundle
ORE, PROV, ODF
• Aggregation
• Annotations/provenance
• Ad-hoc domain-specific specification
OMEX archive
Systems Biology: Needed a common archive format for reuse across tools
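The idea of a bundle that aggregates the resources of an experiment and carries annotations about them can be sketched with a ZIP archive plus a manifest. The layout below is an illustrative assumption only: the `manifest.json` name and its `aggregates` and `annotations` keys are chosen for this example and are not claimed to match the Research Object Bundle or OMEX specifications. `build_bundle` and `read_manifest` are hypothetical helpers.

```python
import io
import json
import zipfile


def build_bundle(resources: dict, annotations: list) -> bytes:
    """Pack resources and a manifest describing them into one ZIP archive."""
    manifest = {
        # Aggregation: every resource the bundle contains.
        "aggregates": [{"path": path} for path in sorted(resources)],
        # Annotations: statements about the aggregated resources.
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, content in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()


def read_manifest(bundle: bytes) -> dict:
    """Open a bundle and return its manifest without unpacking the rest."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return json.loads(archive.read("manifest.json"))


bundle = build_bundle(
    resources={
        "workflow/steps.txt": b"1. clean\n2. model\n",
        "data/input.csv": b"x,y\n1,2\n",
    },
    annotations=[
        {"about": "data/input.csv", "note": "input to workflow/steps.txt"},
    ],
)
manifest = read_manifest(bundle)
print([entry["path"] for entry in manifest["aggregates"]])
```

Keeping the manifest as a plain JSON file inside the archive means a tool can discover what the bundle holds without understanding any of the individual resources.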
The research lifecycle
IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
Authoring Tools
Lab Notebooks
Data Capture
Software Repositories
Analysis Tools
Visualization
Scholarly Communication
Commercial & Public Tools
Git-like Resources
By Discipline
Data Journals
Discipline-Based Metadata Standards
Community Portals
Institutional Repositories
New Reward Systems
Commercial Repositories
Training
[Phil Bourne]
productivity
reproducibility
personal side effect
public side effect
The Cameron Neylon Equation
Steps towards born reproducible
“may all your problems be technical” ...Jim Gray
Social Matters
Organisation
Metrics
Culture
Process
[Adapted, Daron Green]
Summary
• Workflow & modelling models in Science
• Software-style Stewardship
• Born reproducible
• Collective cost & responsibility
• Social factors dominate
http://www.force11.org
Force2015
12 - 13 January, 2015
Oxford University
• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• Biodiversity Catalogue – http://www.biodiversitycatalogue.org
• Seek – http://www.seek4science.org
• Rightfield – http://www.rightfield.org.uk
• VPH-Share – http://www.vph-share.eu/
• Wf4ever – http://www.wf4ever-project.org
• Software Sustainability Institute – http://www.software.ac.uk
• BioVeL – http://www.biovel.eu
• Force11 – http://www.force11.org
• SCAPE – http://www.scape-project.eu/
Acknowledgements
• David De Roure
• Tim Clark
• Sean Bechhofer
• Robert Stevens
• Christine Borgman
• Victoria Stodden
• Marco Roos
• Jose Enrique Ruiz del Mazo
• Oscar Corcho
• Ian Cottam
• Steve Pettifer
• Magnus Rattray
• Chris Evelo
• Katy Wolstencroft
• Robin Williams
• Pinar Alper
• C. Titus Brown
• Greg Wilson
• Kristian Garza
• Donal Fellows
• Wf4ever, SysMO, BioVeL, UTOPIA and myGrid teams
• Juliana Freire
• Jill Mesirov
• Simon Cockell
• Paolo Missier
• Paul Watson
• Gerhard Klimeck
• Matthias Obst
• Jun Zhao
• Daniel Garijo
• Yolanda Gil
• James Taylor
• Alex Pico
• Sean Eddy
• Cameron Neylon
• Barend Mons
• Kristina Hettne
• Stian Soiland-Reyes
• Rebecca Lawrence
• Alan Williams