The beauty of workflows and models

The beauty of workflows and models. Workflows for research. Reproducible research. Professor Carole Goble, The University of Manchester, UK, The Software Sustainability Institute. @caroleannegoble. RDMF Meeting, Westminster, 20 June 2014.

Carole Goble at the RDMF Meeting, Westminster, 2014-06-20. Workflows for research. Reproducible research.

Transcript of The beauty of workflows and models

Page 1: The beauty of workflows and models

The beauty of workflows and models

Workflows for research. Reproducible research.

Professor Carole Goble

The University of Manchester, UK

The Software Sustainability Institute

@caroleannegoble

RDMF Meeting, Westminster, 20 June 2014

Page 2: The beauty of workflows and models

Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct

… papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension

Jill Mesirov, Accessible Reproducible Research, Science 22 Jan 2010: 327(5964): 415-416, DOI: 10.1126/science.1179653

Virtual Witnessing*

*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.

Page 3: The beauty of workflows and models

Page 4: The beauty of workflows and models

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995

datasets, data collections, standard operating procedures, software, algorithms, configurations, tools and apps, codes, workflows, scripts, code libraries, services, system software, infrastructure, compilers, hardware

Morin et al., Shining Light into Black Boxes, Science 13 April 2012: 336(6078) 159-160

Ince et al., The case for open computer programs, Nature 482, 2012

Page 5: The beauty of workflows and models

“Executable Data”

Page 6: The beauty of workflows and models

Biodiversity marine monitoring and health assessment

ecological niche modelling

Data Intensive Science

Collaborative Science

Pilumnus hirtellus

Enclosed sea problem (Ready et al., 2010)

Sarah Bourlat http://www.biovel.eu

Page 7: The beauty of workflows and models
Page 8: The beauty of workflows and models

Analytical cycle

• Data discovery
• Data assembly, cleaning, and refinement
• Ecological Niche Modeling
• Statistical analysis
• Data collection
• Insights
• Scholarly Communication & Reporting

Page 9: The beauty of workflows and models

BioSTIF

method

instruments and laboratory

Workflows: capture the steps

• assembly & interoperability
• shielding & optimising
• flexible variant reuse
• pipelines & exploration
• repetition & comparison
• record & set-up
• provenance collection
• report & embed

multi-code and multi-resource experiments

in-house and external

workflow management systems

materials

http://www.taverna.org.uk

Page 10: The beauty of workflows and models

Application

Generalist

Specialist

Infrastructure

Scientific Workflow Management Systems

Page 11: The beauty of workflows and models

Systems Biology

Modelling Cycle

Page 12: The beauty of workflows and models

Virtual Physiological Human Morphology

Microbiology Metabolic Pathways

http://www.vph-share.eu/

Page 13: The beauty of workflows and models

standards, standards, standards

Page 14: The beauty of workflows and models

Data

Models

Articles

External Databases

http://www.seek4science.org

Metadata

http://www.isatools.org

Aggregated Content Infrastructure

sharing and interlinking of multi-stewarded, mixed methods, models, data, samples…

Page 15: The beauty of workflows and models

Preservation Planning & Watch

continuous preservation management

Environment and users

Repository

access, ingest, harvest

Monitored environment and users

Watch

Planning

Operations

create/ reevaluate plans

deploy plan

monitored actions

Monitored content and events

execute action plan

policies

SCOUT

c3po

PLATO

Taverna workflows

RODA

Long-term preservation of digital data: maintaining scans of newspapers, books, and records of data; metadata maintenance; large and automated.

Preservation Policy: collection level

Control Policy: low-level actions & constraints

http://www.scape-project.eu/

Page 16: The beauty of workflows and models

Merge a Preservation Action Plan….

… with an Access Workflow

Execution Workflow

Preservation Planning & Watch

Publish

Use

Components

Page 17: The beauty of workflows and models

Workflows (and Scripts and Models) are…

• provenance of data
• a general technique for describing and enacting a process
• precise, unambiguous, transparent protocols and records
• often complex, so they need explaining
• often challenging and expensive to develop
• know-how and best practice
• collaborations
• first class citizens of research
• support the process of research

Page 18: The beauty of workflows and models

Workflow publishing

[Scott Edmunds]

Publishing

Journals

Portals

Integrative Frameworks

galaxyproject.org/

Page 19: The beauty of workflows and models

Reproducibility = Hard Work

Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/

>5000 downloads

http://homolog.us/wiki/index.php?title=SOAPdenovo2

Data sets: linked to DOI

Analyses: linked to DOI

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18

>11000 accesses

Open-Code

8 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines

Open-Workflows

DOI:10.5524/100038

Open-Data

78GB CC0 data

Enabled code to be picked apart by bloggers in a wiki

[Scott Edmunds]

Page 20: The beauty of workflows and models

http://ropensci.org

Repositories, Libraries, Registries

Page 21: The beauty of workflows and models

Archiving, Publishing, Component Libraries, Preserving

Recording, Storing, Exchanging, Versioning, Sharing, PACKS

Repositories

Page 22: The beauty of workflows and models

Data Operations in Workflows in the Wild

Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails

Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351

Page 23: The beauty of workflows and models

Research Method Stewardship: Management, Publishing, Preservation

Workflows & Scripts

Services & Codes

Standard Operating Procedures

Descriptions, Standards

Portal

Different systems, Formats

Web Services, Code Libraries, Executables

Models & Algorithms

Mark-up Languages, Mathematical descriptions, Standards

BioModels

Page 24: The beauty of workflows and models

Reuse

Organised Groups

Trust

Reciprocity

Visibility

Roll your own from standard parts

Complementarity

Design and Instruction

Page 25: The beauty of workflows and models

Curation

Non-intrusive, Non-invasive, Not invisible

Enclaves

Specialist Flirts

Blue collar

Incremental

JIT not JIC

Page 26: The beauty of workflows and models

Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/

Special Issue Reproducible Research, Computing in Science and Engineering July/August 2012, 14(4)

Howison and Herbsleb (2013) "Incentives and Integration In Scientific Software Production" CSCW 2013.

Page 27: The beauty of workflows and models

Data Stewardship

Making practices, Sustainability, Management planning, Deposition, Long term access, Credit, Journals, Licensing, Open source / access

Best Practices for Scientific Computing http://arxiv.org/abs/1210.0530

1st Workshop on Maintainable Software Practices in e-Science – e-Science 2012

Stodden, Reproducible Research Standard, Intl J Comm Law & Policy, 13 2009

Software

Models

Workflows

Services

Page 28: The beauty of workflows and models

Page 29: The beauty of workflows and models

http://sciencecodemanifesto.org/

http://matt.might.net/articles/crapl/

Page 30: The beauty of workflows and models

Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012

Software release paradigm

Some of your data isn’t data

Not a static document paradigm

• Release research
• Methods in motion
• Versioning
• Forks & merges
• F1000, PeerJ

GitHub….

Page 31: The beauty of workflows and models

Pivot around method / software / data rather than paper

Citation semantics: software as was? software as is?

The multi-dimensional paper

Page 32: The beauty of workflows and models

methods, reproducibility

what does it mean for content managers and the research workflow?

Page 33: The beauty of workflows and models

Replication Gap

1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

Out of 18 microarray papers, results from 10 could not be reproduced

Page 34: The beauty of workflows and models

re-compute

replicate

rerun

repeat

re-examine

repurpose

recreate

reuse

restore

reconstruct

review

regenerate

revise

recycle

regenerate the figure

redo

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.”*

*Lewis Carroll, Through the Looking-Glass, and What Alice Found There (1871)

Page 35: The beauty of workflows and models

reuse, reproduce, repeat, replicate

Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online

Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.

Methods (techniques, algorithms, spec. of the steps)

Materials (datasets, parameters, algorithm seeds)

Experiment

Instruments (codes, services, scripts, underlying libraries)

Laboratory (sw and hw infrastructure, systems software, integrative platforms)

Setup

Page 36: The beauty of workflows and models

same experiment, same set up, same lab

same experiment, same set up, different lab

same experiment, different set up

different experiment, some of same

validate

Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online

Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.

reuse, reproduce, repeat, replicate
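
The four conditions and four terms on this slide can be read as a decision rule. The sketch below is one such reading, not something the slide states outright: it assumes the labels pair up as repeat (same experiment, set up and lab), replicate (different lab), reproduce (different set up) and reuse (different experiment). The names `Rerun` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rerun:
    """How a second run relates to the original (hypothetical model)."""
    same_experiment: bool
    same_setup: bool
    same_lab: bool


def classify(rerun: Rerun) -> str:
    # Assumed pairing of slide labels to conditions; see the lead-in.
    if not rerun.same_experiment:
        return "reuse"       # different experiment, some of same
    if not rerun.same_setup:
        return "reproduce"   # same experiment, different set up
    if not rerun.same_lab:
        return "replicate"   # same experiment, same set up, different lab
    return "repeat"          # same experiment, same set up, same lab


print(classify(Rerun(same_experiment=True, same_setup=True, same_lab=False)))
```

Ordering the checks from the loosest condition to the strictest means each term is only reached when everything before it matched.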

Page 37: The beauty of workflows and models

Design

Execution

Result Analysis

Collection

Publish / Report

Peer Review

Peer Reuse

Modelling

Can I repeat & defend my method?

Can I review / reproduce and compare my results / method with your results / method?

Can I review / replicate and certify your method?

Can I transfer your results into my research and reuse this method?

* Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)

Research Report

Prediction

Monitoring

Cleaning

Page 38: The beauty of workflows and models

Record Everything

Automate Everything

recomputation.org

sciencecodemanifesto.org
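
As a rough illustration of "Record Everything", the sketch below wraps a single processing step so that each run leaves behind a small provenance record: parameters, content hashes of input and output, and the software environment. It is a minimal sketch, not how any tool named in this talk works; `run_step`, `uppercase` and the record fields are all hypothetical.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Content hash, so a later run can check it saw the same bytes."""
    return hashlib.sha256(data).hexdigest()


def run_step(name, func, input_bytes, params):
    """Run one step and return (output, provenance record)."""
    output = func(input_bytes, **params)
    record = {
        "step": name,
        "params": params,
        "input_sha256": sha256_of(input_bytes),
        "output_sha256": sha256_of(output),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }
    return output, record


def uppercase(data: bytes, repeat: int = 1) -> bytes:
    # Toy stand-in for a real analysis step.
    return data.upper() * repeat


output, record = run_step("uppercase", uppercase, b"acgt", {"repeat": 2})
print(json.dumps(record, indent=2))
```

Because the record is produced by the wrapper rather than written by hand, recording happens as a side effect of running the step, which is the point of pairing "Record Everything" with "Automate Everything".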

Page 39: The beauty of workflows and models

[Adapted Freire, 2013]

Authoring, Exec. Papers, Link docs to experiment

Sweave

Provenance, Tracking, Versioning

Replay, Record, Repair

Workflows, makefiles

ProvStore

open, accessible, available

description, intelligible, machine-readable

provenance, gather dependencies, capture steps, track & keep results

Page 40: The beauty of workflows and models

http://nbviewer.ipython.org/github/myGrid/DataHackLeiden/blob/alan/Player_example.ipynb

https://www.youtube.com/watch?v=QVQwSOX5S08

notebooks

Build into the workflows of research….

Page 41: The beauty of workflows and models

RDataTracker and DDG Explorer

Build into the workflows of research….

[Barbara S. Lerner and Emery R. Boose]

Page 42: The beauty of workflows and models

Components, Dependencies, Change

• 35 kinds of annotations
• 5 Main Workflows
• 14 Nested Workflows
• 25 Scripts
• 11 Configuration files
• 10 Software dependencies
• 1 Web Service
• Dataset: 90 galaxies observed in 3 bands
• Multiple platforms
• Multiple systems

José Enrique Ruiz (IAA-CSIC)

Galaxy Luminosity Profiling

Page 43: The beauty of workflows and models

specialist codes, libraries, platforms, tools

services

(cloud) hosted services

commodity platforms

data collections, catalogues

software repositories

my data, my process, my codes

integrative frameworks

gateways

Page 44: The beauty of workflows and models

Document vs Instrument

Reproducibility by Inspection: Read It

Reproducibility by Invocation: Run It

Page 45: The beauty of workflows and models

Instrument Entropy all experiments become less reproducible

Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble. Why workflows break - Understanding and combating decay in Taverna workflows, 8th Intl Conf e-Science 2012

Mitigate, Detect, Repair, Preserve, Partial replication, Approx reproduce, Verification, Benchmarks

Page 46: The beauty of workflows and models

Environmental Ecosystem

Joppa et al SCIENCE 340 May 2013; Morin et al Science 336 2012

Black boxes

Mixed systems, mixed stewardship

Distributed, hosted systems

Page 47: The beauty of workflows and models

Workflow Planning & Watch of Workflows

Watch

Operations

Planning

Env & Users

Repository

plan

deploy

monitor

access, ingest, harvest

Decay, Service Deprecation, Data source monitoring, Checklists, Minimal Models

Workflows, myExperiment

Workflows for managing workflows

Page 48: The beauty of workflows and models

portability

variability tolerance

[Adapted Freire, 2013]

preservation, packaging

provenance, gather dependencies, capture steps, track & keep results

versioning

host

service

Open Source/Store

Sci as a Service, Integrative fws

Virtual Machines, Recompute, limited installation, Black Box

Byte execution, copies

Descriptive read, White Box, Archived record

Read & Run, Co-location, No installation

Portable Package, White Box, Installation, Archived record

Page 49: The beauty of workflows and models

portability

[Adapted Freire, 2013]

preservation, packaging

provenance, gather dependencies, capture steps, track & keep results

host

service

ReproZip

variability tolerance

versioning

Page 50: The beauty of workflows and models

Levels of Reproducibility

Coverage: how much of an experiment is reproducible

Original Experiment

Similar Experiment

Different Experiment

Portability

Depth: how much of an experiment is available

Binaries + Data

Source Code / Workflow + Data

Binaries + Data + Dependencies

Source Code / Workflow + Data + Dependencies

Virtual Machine: Binaries + Data + Dependencies

Virtual Machine: Source Code / Workflow + Data + Dependencies

Figures + Data

[Freire, 2014]

Page 51: The beauty of workflows and models

Findable

Accessible

Interoperable

Reusable

http://datafairport.org/

Packs

Page 52: The beauty of workflows and models

(Dynamic) Research Objects

• Bundle and relate multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms, Currency of exchange
• Exchange, Releasing paradigm for publishing

http://www.researchobject.org/
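
To make "bundle and relate multi-hosted digital resources" concrete, the sketch below builds a tiny manifest that aggregates resources by URI and attaches annotations to them. It is an illustration only and does not follow the actual Research Object specifications; the class names, field names and example.org URIs are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Resource:
    uri: str   # where the resource is hosted (may be anywhere on the web)
    role: str  # e.g. "data", "workflow", "paper"


@dataclass
class ResearchObject:
    identifier: str
    aggregates: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def aggregate(self, uri: str, role: str) -> None:
        """Add a resource to the bundle by reference, not by copy."""
        self.aggregates.append(Resource(uri, role))

    def annotate(self, about: str, content: str) -> None:
        """Attach a note to a resource that is already aggregated."""
        if about not in {r.uri for r in self.aggregates}:
            raise ValueError(f"{about} is not aggregated in this object")
        self.annotations.append({"about": about, "content": content})

    def to_manifest(self) -> str:
        return json.dumps(asdict(self), indent=2)


ro = ResearchObject("https://example.org/ro/1")
ro.aggregate("https://example.org/data/input.csv", "data")
ro.aggregate("https://example.org/workflows/analysis", "workflow")
ro.annotate("https://example.org/workflows/analysis",
            "consumes https://example.org/data/input.csv")
print(ro.to_manifest())
```

Keeping resources as references rather than copies is what lets one object span several hosts; the manifest itself is the unit that gets exchanged.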

Page 53: The beauty of workflows and models

Page 54: The beauty of workflows and models

OAI-ORE

W3C OAM

H. Van de Sompel et al., Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping, 9th International Digital Curation Conference; Trusty URLs

Page 55: The beauty of workflows and models

Machine readable metadata*

Machine actionable systems**

* Especially Linked Data and RDF ** Especially REST APIs

Page 56: The beauty of workflows and models
Page 57: The beauty of workflows and models

Sys Bio Research Object

Adobe UCF

Research Object Bundle

ORE, PROV, ODF

• Aggregation
• Annotations/provenance
• Ad-hoc domain-specific specification

OMEX archive

Systems Biology: needed a common archive format for reuse across tools

Page 58: The beauty of workflows and models

The research lifecycle

IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

Authoring Tools

Lab Notebooks

Data Capture

Software Repositories

Analysis Tools

Visualization

Scholarly Communication

Commercial & Public Tools

Git-like Resources

By Discipline

Data Journals

Discipline-Based Metadata Standards

Community Portals

Institutional Repositories

New Reward Systems

Commercial Repositories

Training

[Phil Bourne]

Page 59: The beauty of workflows and models

productivity

reproducibility

personal side effect

public side effect

The Cameron Neylon Equation

Steps towards born reproducible

Page 60: The beauty of workflows and models

“may all your problems be technical” ...Jim Gray

Social Matters

Organisation

Metrics

Culture

Process

[Adapted, Daron Green]

Page 61: The beauty of workflows and models

Summary

• Workflow & modelling models in Science
• Software-style Stewardship
• Born reproducible
• Collective cost & responsibility
• Social factors dominate

http://www.force11.org

Force2015, 12 - 13 January, 2015, Oxford University

Page 62: The beauty of workflows and models

• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• Biodiversity Catalogue – http://www.biodiversitycatalogue.org
• Seek – http://www.seek4science.org
• Rightfield – http://www.rightfield.org.uk
• VPH-Share – http://www.vph-share.eu/
• Wf4ever – http://www.wf4ever-project.org
• Software Sustainability Institute – http://www.software.ac.uk
• BioVeL – http://www.biovel.eu
• Force11 – http://www.force11.org
• SCAPE – http://www.scape-project.eu/

Page 63: The beauty of workflows and models

Acknowledgements

• David De Roure
• Tim Clark
• Sean Bechhofer
• Robert Stevens
• Christine Borgman
• Victoria Stodden
• Marco Roos
• Jose Enrique Ruiz del Mazo
• Oscar Corcho
• Ian Cottam
• Steve Pettifer
• Magnus Rattray
• Chris Evelo
• Katy Wolstencroft
• Robin Williams
• Pinar Alper
• C. Titus Brown
• Greg Wilson
• Kristian Garza
• Donal Fellows
• Wf4ever, SysMO, BioVeL, UTOPIA and myGrid teams
• Juliana Freire
• Jill Mesirov
• Simon Cockell
• Paolo Missier
• Paul Watson
• Gerhard Klimeck
• Matthias Obst
• Jun Zhao
• Daniel Garijo
• Yolanda Gil
• James Taylor
• Alex Pico
• Sean Eddy
• Cameron Neylon
• Barend Mons
• Kristina Hettne
• Stian Soiland-Reyes
• Rebecca Lawrence
• Alan Williams