Converting scripts into reproducible workflow research objects

Post on 13-Apr-2017

55 views 0 download

Transcript of Converting scripts into reproducible workflow research objects

Converting Scripts into Reproducible Workflow Research Objects

Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros

lucas.carvalho@ic.unicamp.br

Baltimore, Maryland, USA

October 23-26, 2016

2

Background and Motivation

● Data-Intensive Experiments– Collection of scripts, programs and (big) data

Papers

3

Background and Motivation

● Data-Intensive Experiments– Collection of scripts, programs and (big) data

Papers

How to understand, reproduce or reuse data and models of experiments?

4

Background and Motivation

● Data-Intensive Experiments– Collection of scripts, programs and (big) data

Manual collection and organization of data provenance

Papers

How to understand, reproduce or reuse data and models of experiments?

5

Background and Motivation

● Script-based experiments

What are the inputs and outputs?

How to change this local program for a similar web service?

Example of script code.

Difficult to understand, to reuse, and to reproduce.

6

Background and Motivation

● Scientific Workflows

Example of Scientific Workflow Management System.

7

CreateUnderstandReuseReproduce

Overview

8

CreateUnderstandReuseReproduce

Overview

+

9

CreateUnderstandReuseReproduce

Overview

+

Step 2

Step 1

Step 3

Step 4

Step 5

Methodology

10

Related Work

● Script-language specific.● Workflow-engine specific.● A new language is needed.● Outcome is not an executable workflow.● Do not collect provenance data of the

conversion process.

11

Two Kind of Experts

● Scientists– Domain experts who understand the experiment, and

the script (sometimes called user);

● Curators: – Scientists who are also familiar with workflow and

script programming or;

– Computer scientists who are familiar enough with the domain to be able to implement our methodology;

– Responsible for authoring, documenting and publishing workflows and associated resources.

12

Requirements

● Produce workflow-like view of the script.● Create an executable workflow and compare

execution of workflow and script.● Modify the workflow resources.● Record provenance data.● Aggregate all resources to support

Reproducibility and Reuse.

1

2

3

4

5

13

Requirements

● Produce workflow-like view of the script.1

Activity 1

Port 1 Port 2 Port 3

Port 1 Port 2

Activity 2

Port 3

Port 3

Activity n

Port n

Script-based experiment.

Abstract workflow.

14

Requirements

● Create executable workflow and compare execution of workflow and script.

2

Executable workflow. Script-based experiment.

15

Requirements

● Modify the workflow resources.3

Local

(a)

(b)

Algorithm A Algorithm B

16

Requirements

● Record provenance data4

Activity 1

Output 1 Output 2

wasGeneratedBy wasGeneratedBy

Sampleused

“2012-06-01”

wasStartedAt

Activity 2used

LucasWorkflowRun

wasAssociatedWith

used

17

Requirements

● Aggregate all resources to support Reproducibility and Reuse.

5

Abstract workflows

Concrete workflows

Annotations

Papers and Reports

Provenance

Authors

Scripts

Data

18

Script

Generate Abstract Workflow

Generate Abstract Workflow

Create an executable workflow

Create an executable workflow

Refine workflowRefine workflow

Bundle Resources into a Research Object

Bundle Resources into a Research Object

Annotate and check qualityAnnotate and check quality

Abstract workflow

Concrete workflow

2

1

3

4

5

Methodology

19

Workflow Research Object (WRO)

● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.

● WROs encapsulate scientific workflows and additional information regarding their context and resources. Research Object Model

20

Running Example

● Molecular Dynamics Simulations– Many branches of material sciences, computational

engineering, physics and chemistry.

– Scripts (shell script), programs (NAMD, VMD, Fortran)

– Phases: set up, simulation and analysis of trajectories.

– Inputs: protein structure, simulation parameters and force field files.

– Output: trajectories and analysis results.

21

StepGenerate Abstract Workflow

1

Script code.

22

StepGenerate Abstract Workflow

1

Manuallyannotate

Script code.Annotated script code.

23

StepGenerate Abstract Workflow

1

Manuallyannotate

Createworkflow-like

view

Script code.Annotated script code.

Abstract workflow.

24

StepGenerate Abstract Workflow

1

code blocks

Input/ouput

YesWorkflowMcPhillips et. al, 2015

- Code comments- Tags: ● @begin● @end● @desc● @in● @out● ...

T. McPhillips et al. (2015), “Yesworkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.

CreateWorkflow-like

view

Abstract workflow.

Annotated script code.

25

StepGenerate Abstract Workflow

1

CreateWorkflow-like

view

Abstract workflow.

Annotated script code.

26

StepCreate an executable workflow

2

Abstract workflow.

27

StepCreate an executable workflow

2

Create implementationof activities

Copy code blocks from the script.

Abstract workflow.

Executable workflow.

28

StepCreate an executable workflow

2

Create implementationof activities

Copy code blocks from the script.

Abstract workflow.

Executable workflow.

29

StepCreate an executable workflow

2

Create implementationof activities

Copy code blocks from the script.

Abstract workflow.

Executable workflow.Script code.

30

StepRefine executable workflow

3

Modify resources:● Algorithms● Data Sets● Parallelization● Web Services● ...

Executable workflow.New workflow version.

31

StepRefine executable workflow

3

Create newversion

Modify resources:● Algorithms● Data Sets● Parallelization● Web Services● ...

Executable workflow.New workflow version.

32

Steps

Record provenance data: execution traces.

2 3

wasEnactedBy

split

Output 1 Output 2wasGeneratedBy wasGeneratedBy

Sampleused

“2012-06-01”

wasStartedAt

psgen

used

LucasWorkflowRun

wasAssociatedWith

usedhasSpecification

W3C PROV

Executable workflow.

33

Steps

Record provenance data: conversion process.

2 3

wasDerivedFrom

wasDerivedFrom

wasDerivedFrom

wasAssociatedWith

CuratorCurator

W3C PROV

Executable workflow.New workflow version.

Script code.

34

StepAnnotate and check quality

● Annotations describing the workflow.● Use provenance data

– To check the quality of the conversion process.

● Run checks to verify the soundness of the workflow.

4

35

StepAnnotate and check quality

4

Script code.

Executable workflow.

36

StepAnnotate and check quality

4

Workflow version.

Initial Executable workflow.

37

StepAnnotate and check quality

● Common mistakes during the conversion:– not clearly identified the main logical processing

units in the script;

– a mistake when migrating script code into the corresponding activity;

– not provided the correct input files and parameters;

– the coding of the workflow itself contained errors.

4

38

StepBundle Resources into a Research Object

5

Script Abstract workflow

Concrete workflow(s)

Annotations

Paper

ProvenanceData

Attributions

39

Contributions

● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WRO;

● This addresses an important issue in the area of script provenance;

40

Conclusions

● We addressed issues wrt understanding, reuse and reproducibility of script-based experiments.

● The methodology created was:– elaborated based on requirements;

– showcased via a real world use case from the field of Molecular Dynamics;

● We exploited tools and standards from the scientific community:– Scientific Workflows, YesWorkflow, Research Objects, the W3C

PROV recommendations and the Web Annotation Data Model.

● The bundle is available at http://w3id.org/w2share/s2rwro/

41

Next Steps

● Evaluation using other case studies;● Evaluation of the cost of the effectiveness of

our methodology;● Extension of YesWorkflow to support the

semantic annotation of blocks;● Implementation of tools.

42

Acknowledgments

● FAPESP (grant # 2014/23861-4)● CCES/CEPID (grant # 2013/08293-7)

– Center for Computational Engineering & Sciences

● LIS (Laboratory of Information Systems)● Prof. Munir Skaf and his group from Institute of

Chemistry - Unicamp.

Converting Scripts into Reproducible Workflow Research Objects

Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros

lucas.carvalho@ic.unicamp.br

Baltimore, Maryland, USA

October 23-26, 2016