Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Date: 21/06/2013 Detecting common scientific workflow fragments using templates and execution provenance Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ * Ontology Engineering Group Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute K-CAP 2013. Banff, Canada

Transcript of Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Page 1: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Date: 21/06/2013

Detecting common scientific workflow fragments using templates and execution provenance

Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ

* Ontology Engineering Group, Universidad Politécnica de Madrid,

Ŧ USC Information Sciences Institute

K-CAP 2013. Banff, Canada

Page 2: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Overview

• Creation of abstractions from low level and high level tasks in scientific workflows.

• Approach for detecting common groups of tasks among scientific workflows.

• Discoverability, understandability, reuse and design

[Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]

Page 3: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Background

• Workflows as software artifacts that capture the scientific method

• Addition to paper publication

• Reuse

• Existing repositories of workflows (myExperiment)

• Sharing workflows

• Exploring existing workflows

• PROBLEMS to address:

• Sometimes workflows are difficult to understand

• Provenance is captured at too low a level. How can it be generalized?

• Workflow descriptions are hard to relate to each other.

• What are the common fragments shared among workflow templates?

http://www.myexperiment.org


Page 4: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Terminology: workflow templates

“A workflow template connects the steps of the workflow together, its inputs, intermediate results and expected outputs, and defines their types and dependencies”.

• Abstract workflow template: Template with some unbound steps

• Specific workflow template: Template in which all the steps are bound to a specific service, tool or code.

[Figure labels: Abstract, Specific, Taxonomy of components]
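The abstract/specific distinction can be made concrete with a small data model. The sketch below is illustrative only: the class names `Step` and `WorkflowTemplate`, the `binding` field and the example step names are assumptions, not something shown in the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    name: str                      # label of the step in the template
    component_type: str            # class in the component taxonomy (hypothetical names)
    binding: Optional[str] = None  # concrete service, tool or code; None means unbound


@dataclass
class WorkflowTemplate:
    name: str
    steps: List[Step] = field(default_factory=list)

    def is_specific(self) -> bool:
        # Specific template: every step is bound to a concrete implementation.
        return all(step.binding is not None for step in self.steps)

    def is_abstract(self) -> bool:
        # Abstract template: at least one step is still unbound.
        return not self.is_specific()


template = WorkflowTemplate(
    "text-analysis",
    [Step("stem", "Stemmer"), Step("weight", "TermWeighting", binding="TF")],
)
```

Binding the remaining `stem` step (for example to a Porter stemmer implementation) would turn this abstract template into a specific one.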

Page 5: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

[Figure labels: Abstract, Specific, Taxonomy of components, Problem Solving Methods]

Page 6: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Terminology: Workflow execution provenance traces

Workflow execution provenance trace: structured log of the workflow execution results.

• Inputs of the run

• Outputs of the run

• Intermediate steps resulting from the run

• Software codes used by the steps

[Figure labels: Porter Stemmer, Result, TF, Output, Dataset, ReutersTrainTestDataset, A12314, TFResult, Run 21-06-2013, Data, Template, Execution, Process P1, Process P2]

Page 7: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Internal Macro

• Same sequence of steps in different parts of the workflow.

• Types of data and steps are the same.

• May or may not be found among other workflows; local to a workflow.

Page 8: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Composite Workflows

• Same sequence of steps among different workflows.

• Types of data and steps are the same.

Page 9: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Background: Motifs

• Workflow motifs catalogue [Garijo et al. 2012]: Domain-independent conceptual abstractions on the workflow steps.

1. Data-oriented motifs: What kind of manipulations does the workflow have?

2. Workflow-oriented motifs: How does the workflow perform its operations?

• We aim to automatically detect two types of motifs:

• Internal Macro (common sequences of steps within a workflow)

• Composite workflows (common sequences of steps among workflows)
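To illustrate the difference between the two motifs, here is a minimal sketch that counts repeated sequences of step types. It is a deliberate simplification (linear branches and fixed-length sequences rather than general sub-graphs), and the function names, workflow names and step types are hypothetical.

```python
from collections import defaultdict


def ngrams(sequence, n):
    """All contiguous runs of n step types in a branch."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def classify_sequences(workflows, n=2):
    """workflows maps a workflow name to its branches (lists of step types).

    Returns (internal, composite):
      internal  - sequences repeated inside a single workflow
      composite - sequences that appear in more than one workflow
    """
    counts = defaultdict(lambda: defaultdict(int))
    for name, branches in workflows.items():
        for branch in branches:
            for gram in ngrams(branch, n):
                counts[gram][name] += 1

    internal, composite = set(), set()
    for gram, per_workflow in counts.items():
        if any(c > 1 for c in per_workflow.values()):
            internal.add(gram)
        if len(per_workflow) > 1:
            composite.add(gram)
    return internal, composite


workflows = {
    "wf1": [["Filter", "Sort", "Stemmer", "TermWeighting"],
            ["Filter", "Sort", "Merge"]],
    "wf2": [["Tokenizer", "Stemmer", "TermWeighting"]],
}
internal, composite = classify_sequences(workflows)
```

In this toy input, `Filter, Sort` repeats inside `wf1` only, while `Stemmer, TermWeighting` is shared by `wf1` and `wf2`.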


[Garijo et al. 2012] Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. IEEE 8th International Conference on eScience 2012.

Page 10: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Motifs: Summary

Most popular HOW motifs: Atomic workflows, Composite Workflows and Internal Macro

Page 11: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Approach

Workflow Retrieval

Common fragment detection

Result analysis


1. Retrieval of workflow templates and execution provenance traces from a repository of workflows.

2. Algorithms to obtain the most common fragments among the workflow dataset.

3. Derivation of statistics and annotation of workflows.

Page 12: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Workflow representation

• Workflows are labeled DAGs (Directed Acyclic Graphs)

• Representation for both templates and workflow execution provenance traces.

• No loops

• No conditionals

• Popular representation in data-oriented scientific workflows (supported by many workflow engines).
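A labeled DAG of this kind can be sketched as a label map plus an edge list, with a check that no loop is present. The node identifiers, the edge direction (dataset to process to dataset) and the helper name `is_dag` are assumptions made for illustration.

```python
from collections import defaultdict, deque


def is_dag(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle.

    Uses Kahn's algorithm: repeatedly remove nodes with no incoming edges;
    if every node can be removed, the graph is acyclic.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    ready = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed == len(nodes)


# Node labels carry the type of each dataset or step.
labels = {"d1": "DatasetT1", "p1": "ProcessType1", "d2": "DatasetT2",
          "p2": "ProcessType2", "d3": "DatasetT3"}
edges = [("d1", "p1"), ("p1", "d2"), ("d2", "p2"), ("p2", "d3")]
```

Adding an edge from `d3` back to `p1` would introduce a loop, which this representation rules out.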


Page 13: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Challenges: Common workflow fragment detection

• Given a collection of workflows, which are the most common fragments?

• Common sub-graphs among the collection

• Sub-graph isomorphism (NP-complete)

• We use the SUBDUE algorithm [Holder et al 1994]

• Graph grammar learning

• The rules of the grammar are the workflow fragments

• Graph-based hierarchical clustering

• Each cluster corresponds to a workflow fragment

• Iterative algorithm with two measures for compressing the graph:

• Minimum Description Length (MDL)

• Size

[Holder et al 1994] L. B. Holder, D. J. Cook, and S. Djoko. Substructure Discovery in the SUBDUE System. AAAI Workshop on Knowledge Discovery, pages 169-180, 1994.
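As a hedged aside, the size measure is usually documented for SUBDUE as the ratio below, where S is a candidate fragment, G is the input graph and G|S is the graph after each instance of S has been replaced by a single node. The MDL measure has the same shape, with description length in bits in place of size. The notation is an assumption of this note and does not appear on the slides.

```latex
\mathrm{value}(S, G) = \frac{\mathrm{size}(G)}{\mathrm{size}(S) + \mathrm{size}(G \mid S)},
\qquad
\mathrm{size}(g) = |V(g)| + |E(g)|
```

A higher value means the fragment compresses the graph better, so fragments that are both large and frequent tend to score well.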

Page 14: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

How does SUBDUE work?

[Figure: Input Graph. Node labels: ProcessType1, DatasetT1, DatasetT2, ProcessType2, DatasetT3, ProcessType3, DatasetT3, ProcessType1, DatasetT1, DatasetT2, ProcessType2, DatasetT3, DatasetT2, ProcessType2, DatasetT3]

Page 15: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

How does SUBDUE work?

[Figure: Iteration 1, Fragment1. Node labels as in the input graph.]

Page 16: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

How does SUBDUE work?

[Figure: Iteration 1 result. Node labels: ProcessType1, DatasetT1, FRAG1, ProcessType3, DatasetT3, ProcessType1, DatasetT1, FRAG1, FRAG1]

Page 17: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

How does SUBDUE work?

[Figure: Iteration 2, Fragment2. Node labels as in the Iteration 1 result.]

Page 18: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

How does SUBDUE work?

[Figure: Iteration 2 result (STOP). Node labels: FRAG2, ProcessType3, DatasetT3, FRAG2, FRAG1]

Page 19: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

How does SUBDUE work?

Results:

Fragment 1 (FRAG1): DatasetT2, ProcessType2, DatasetT3. Occurrences: 3 times

Fragment 2 (FRAG2): ProcessType1, DatasetT1, FRAG1. Occurrences: 2 times
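The two iterations of the walkthrough can be replayed with a small sketch. It is not SUBDUE itself: the fragments are supplied by hand rather than discovered, and the input is assumed to be three linear chains whose node counts match the figures (the chain layout and edge direction are guesses, since only the node labels survive in this transcript). The helper names `compress` and `size` are hypothetical.

```python
def compress(chains, fragment, name):
    """Replace each non-overlapping occurrence of `fragment` in every chain by `name`.

    Returns the compressed chains and the number of occurrences replaced.
    """
    fragment = list(fragment)
    k = len(fragment)
    compressed, occurrences = [], 0
    for chain in chains:
        out, i = [], 0
        while i < len(chain):
            if chain[i:i + k] == fragment:
                out.append(name)
                occurrences += 1
                i += k
            else:
                out.append(chain[i])
                i += 1
        compressed.append(out)
    return compressed, occurrences


def size(chains):
    """Vertices plus edges; a chain of n nodes has n - 1 edges."""
    return sum(2 * len(chain) - 1 for chain in chains if chain)


graph = [
    ["DatasetT1", "ProcessType1", "DatasetT2", "ProcessType2", "DatasetT3",
     "ProcessType3", "DatasetT3"],
    ["DatasetT1", "ProcessType1", "DatasetT2", "ProcessType2", "DatasetT3"],
    ["DatasetT2", "ProcessType2", "DatasetT3"],
]

# Iteration 1: collapse FRAG1 everywhere it occurs.
step1, n1 = compress(graph, ["DatasetT2", "ProcessType2", "DatasetT3"], "FRAG1")
# Iteration 2: collapse FRAG2, which is defined in terms of FRAG1.
step2, n2 = compress(step1, ["DatasetT1", "ProcessType1", "FRAG1"], "FRAG2")
```

Under these assumptions FRAG1 is replaced 3 times and FRAG2 twice, and the node count shrinks from 15 to 9 to 5, matching the figures above.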

Page 20: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Challenges: Generalization of workflows

[Figure labels: Workflow Retrieval, Common fragment detection, Result analysis, Workflow Generalization]

[Figure labels: Porter Stemmer, Lovins Stemmer, Term Weighting, DF, TF, Stemmer, CF]
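Generalization can be pictured as replacing each concrete step type by its parent class in the component taxonomy, so that runs using different tools map to the same abstract sequence. The sketch below assumes a one-level taxonomy; the grouping of the labels (stemmers under Stemmer; TF, DF and CF under Term Weighting) is read off the figure labels and is an assumption, as are the identifier spellings.

```python
# Hypothetical one-level taxonomy: concrete component -> abstract class.
TAXONOMY = {
    "PorterStemmer": "Stemmer",
    "LovinsStemmer": "Stemmer",
    "TF": "TermWeighting",
    "DF": "TermWeighting",
    "CF": "TermWeighting",
}


def generalize(step_types, taxonomy=TAXONOMY):
    """Map each step type to its parent class; unknown types are left unchanged."""
    return [taxonomy.get(step, step) for step in step_types]


run_a = ["PorterStemmer", "TF"]
run_b = ["LovinsStemmer", "CF"]
generalized_a = generalize(run_a)
generalized_b = generalize(run_b)
```

After generalization the two runs share the same sequence of step types, so a fragment detector sees them as occurrences of one fragment.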

Page 21: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Analysis setup

Analysis performed on 22 workflow templates with 30 workflow execution provenance traces.

• Abstract and specific workflow templates

• Several workflow executions belong to the same template

• Some workflow executions had errors during the execution.

• Workflows have been manually analyzed to find motifs.

• Internal Macros

• SubWorkflows


Page 22: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Evaluation results. Internal Macro

• Our goal is to maximize the filtered multi-step fragments.

• The algorithm finds more multi-step fragments due to the way it operates.

• A step for filtering the multi-step fragments must be applied on the obtained results (some are part of others).
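One possible reading of the filtering step is to drop any fragment that is contained in a larger fragment. The sketch below implements that reading for fragments written as label sequences; treating "part of others" as contiguous containment is an assumption, and the function names are hypothetical.

```python
def contains(big, small):
    """True if `small` appears as a contiguous run inside `big`."""
    k = len(small)
    return any(big[i:i + k] == small for i in range(len(big) - k + 1))


def filter_fragments(fragments):
    """Keep only the fragments that are not part of another, different fragment."""
    return [
        f for f in fragments
        if not any(f != g and contains(g, f) for g in fragments)
    ]


fragments = [
    ("DatasetT2", "ProcessType2", "DatasetT3"),
    ("ProcessType2", "DatasetT3"),        # part of the first fragment
    ("DatasetT1", "ProcessType1"),
]
filtered = filter_fragments(fragments)
```

Here the two-node fragment is removed because it is fully covered by the three-node one, while the unrelated fragment is kept.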

Page 23: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Evaluation results. Composite workflows

• More filtered multi-step fragments are found automatically than manually.

• Manual analysis affects sub-workflows.

• More than 50% of the filtered multi-step fragments overlap with the manual ones.

• The fragments found automatically have more occurrences than those found manually.

Page 24: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Limitations

Overlapping fragments may not be fully detected!
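A toy example shows why overlap can hide occurrences when each match is consumed once found. It assumes a greedy, left-to-right matcher over a single chain of labels, which is a simplification and not a description of SUBDUE's internals; the function names are hypothetical.

```python
def count_overlapping(chain, fragment):
    """Count every position where the fragment matches, overlaps included."""
    k = len(fragment)
    return sum(1 for i in range(len(chain) - k + 1) if chain[i:i + k] == fragment)


def count_non_overlapping(chain, fragment):
    """Count matches greedily from the left, consuming matched nodes."""
    k = len(fragment)
    count, i = 0, 0
    while i <= len(chain) - k:
        if chain[i:i + k] == fragment:
            count += 1
            i += k  # matched nodes cannot be reused by a later match
        else:
            i += 1
    return count


chain = ["Dataset", "Process", "Dataset", "Process", "Dataset"]
fragment = ["Dataset", "Process", "Dataset"]
with_overlap = count_overlapping(chain, fragment)
without_overlap = count_non_overlapping(chain, fragment)
```

The middle `Dataset` is shared by two candidate matches, so the greedy matcher reports one occurrence where two exist.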

Page 25: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Conclusions & future work

• Approach for detecting commonalities among scientific workflows.

• Workflow execution provenance traces

• Workflow templates

• Detection of the most common workflow fragments.

• Generalization of the datasets.

Future work

• Expand analysis to other domains.

• Add support for other workflow systems: Taverna, Knime, GenePattern, Galaxy, Vistrails, etc.

• Test other graph matching algorithms.

• Optimize the algorithm by reducing the search space.

• All inputs and results are available here: http://www.oeg-upm.net/files/dgarijo/kcap2013Eval


Page 26: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Towards automatic annotation of workflows

• Ontology for describing workflow motifs

• The Workflow Motif Ontology

• URL: http://purl.org/net/wf-motifs

• Ontology for linking fragments to the workflows of the dataset (work in progress)

• The Workflow Fragment Description Ontology

• URL: To be announced

Page 27: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)

Current improvements

• Testing other domains.

• Expanding the compatible workflow systems: Taverna.

• Improving the workflow representation to reduce the graph size.

Page 28: Detecting common scientific workflow fragments using templates and execution provenance (K-CAP 2013)


Who are we?

• Daniel Garijo, Ontology Engineering Group, UPM

• Oscar Corcho, Ontology Engineering Group, UPM

• Yolanda Gil, Information Sciences Institute, USC

EU Wf4Ever project (270129) funded under EU FP7 (ICT-2009.4.1). (http://www.wf4ever-project.org)
