Workflow discovery in e-science

Post on 15-Jan-2016

Antoon Goderis Peter Li Carole Goble

University of Manchester, UK

www.cs.man.ac.uk/~goderisa

Agenda

• Web services in science

• Workflow re-use

• Workflow discovery

– Is workflow discovery a new problem?

– How do people match up workflows?

– Can we replicate the behaviour with tools?

• Conclusions

Workflows                                  Web services
BPEL, SCUFL, MOML, VDL … descriptions      SOAP, WSDL description
Workflow engine                            Readily invoked
Orchestrates (Web-) services
Can be published as Web service

Science is highly distributed and connected

The Web has revolutionised science

Web services about to do the same?

Scientific workflows

• e-science = supporting scientists to encode, enact, explain and share experimental procedures featuring lots of specialised data

• Case study: bioinformatics

– Understanding the DNA to behaviour link

– 3000 bio-services via the Taverna workflow editor http://mygrid.org.uk/taverna

– Re-use and repurposing of workflows

– About 200 Taverna workflows shared at www.myExperiment.org

[Figure labels: Manchester, CS dept; Manchester, Biology dept; Newcastle, CS dept]

Scientific workflows

• e-science = supporting scientists to encode, enact, explain and share experimental procedures

• Case study: bioinformatics

– Understanding the DNA to life link

– 3000 bio-services via the Taverna workflow editor http://mygrid.org.uk/taverna

– Re-use and repurposing of workflow fragments

– About 200 Taverna workflows shared at www.myExperiment.org

One + Three questions

1. Can’t we just do it with [logo missing]?

• Keyword search doesn’t seem to cut it

1. Is workflow discovery a new problem?

2. How do people match up workflows?

3. Can we replicate the behaviour with tools?

[Figure: my current workflow matched against myExperiment.org?]

1. Is workflow discovery a new problem?

                    Service discovery           Workflow discovery
Discovery goal      Encapsulate found service   Edit found workflow
Matching process    Match over signature        Match over signature and content (data and service flow)
Starting context    Service or data             Service or data or workflow

Source: survey of 21 myGrid/Taverna users

1. Is workflow discovery a new problem? Yes

Workflow discovery subsumes service discovery

2. How do people match up workflows?

3. Can we replicate the behaviour with tools?

A user experiment with bioinformatics workflows

Workflow discovery task

• Can I sensibly adapt an existing experimental procedure (workflow) with another one?

– Extend

– Replace

Workflow corpus

• 66 similar workflows for Graves’ disease, built by a single author

• 1 + 5 workflows: one input workflow to match against five others

• Workflow diagram

• No documentation

• No annotation

By the experts, for the experts

• 9 bioinformaticians and 4 developers at a Taverna training day

Matching strategies

• Matching input workflow with 5 others

Human on-line matching strategies!

• Traits

• Scores of attraction

• Yes or no

Matching strategy: traits

Men want..                                     Women want..
Short term relationship                        Long term relationship
Slim                                           Tall
Students, artists, musicians, veterinarians    Lawyers, financial execs, firemen
Blonde                                         Hair or shaved
Medium income                                  High income

From an analysis of 30 000 profiles

Matching strategy: scoring

Confidence level

Score

Percentile

www.AmIHotOrNot.com

Matching strategy: yes or no

Traits

• Predicted trait

– Biological subtask

– Biological supertask

– Shared inputs + outputs

– Same service type

– Shared service compositions

– Shared path between intermediary input and output

Traits and score

• Predicted trait

– Biological subtask

– Biological supertask

– Shared inputs + outputs

– Same service type

– Shared service compositions

– Shared path between intermediary input and output

• Score of similarity, usefulness and confidence

– E.g. 1 (Identical) to 9 (Not similar)

The gold standard

• The collection of workflow similarity assessments

• Predictive traits, possibly interacting

[Figure: 1 + 5 workflows; traits/score]

2. How do people match up workflows?

• Difficulty of task

– Biological relationship very difficult for 6 out of 9

– Shape similarity difficult for 4 out of 13

– Medium confidence

• Consistency

– Inter-participant disagreement on how to order biological similarity and shape similarity [Spearman rank order test]

• Predictive traits

– No single trait is dominant, either between or within participants [Levene homogeneity of variance test]
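
To make the consistency check concrete, here is a minimal sketch of a Spearman rank-order correlation between two participants' scores for the same five candidate workflows. It is an illustration only: the function names and the two score lists are hypothetical, and the study's own data are not reproduced here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical similarity scores (1 = identical, 9 = not similar) that two
# participants gave to the same five candidate workflows.
participant_a = [2, 5, 7, 3, 9]
participant_b = [3, 4, 8, 2, 9]
rho = spearman_rho(participant_a, participant_b)
```

A value of rho near 1 means the two participants order the workflows alike; values near 0 or below indicate the kind of disagreement reported above.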

Can we do better?

• Simpler tasks and workflows

• Taverna experienced users

• Workflow documentation and annotation

• Other factors in use, e.g. size difference

– Fix allowed factors

– Adopt black box approach: yes/no matching

Automated discovery technique

• Unattributed graph matcher implementation by Messmer and Bunke

– Sub-isomorphism detection; exponential time complexity

– DAGs and optimization for repository of graphs

• Workflows parsed as graphs

– Workflow input, workflow output and intermediate services as nodes

– Data links as edges

– Example nodes: probeSetid, AffyMapper_seq, databaseid, Blastx, Results_Blastx

• Ranking based on

– shared nodes

– difference in size between input graph and repository graphs
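
The parsing and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the `Workflow` class, the scoring formula (shared node count minus a size-difference penalty), the repository contents and the edges between the example nodes are all hypothetical. The slides state only that ranking is based on shared nodes and on the size difference between graphs. Nodes are compared by name.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """A workflow as an unattributed directed graph.

    Nodes are workflow inputs, workflow outputs and intermediate services;
    edges are data links between them.
    """
    name: str
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # set of (source, target) pairs

    def add_link(self, source, target):
        self.nodes.update((source, target))
        self.edges.add((source, target))


def rank(query, repository):
    """Order repository workflows by a hypothetical similarity score.

    Score = number of shared node names minus the absolute difference in
    node count. The weighting is an assumption made for this sketch.
    """
    def score(candidate):
        shared = len(query.nodes & candidate.nodes)
        size_gap = abs(len(query.nodes) - len(candidate.nodes))
        return shared - size_gap

    return sorted(repository, key=score, reverse=True)


# Query workflow built from the node names on the slide (edges are assumed).
query = Workflow("query")
query.add_link("probeSetid", "AffyMapper_seq")
query.add_link("AffyMapper_seq", "Blastx")
query.add_link("databaseid", "Blastx")
query.add_link("Blastx", "Results_Blastx")

# Two hypothetical repository workflows.
close = Workflow("close")
close.add_link("AffyMapper_seq", "Blastx")
close.add_link("databaseid", "Blastx")
close.add_link("Blastx", "Results_Blastx")

distant = Workflow("distant")
distant.add_link("geneId", "SomeOtherService")

ranked = [w.name for w in rank(query, [distant, close])]
```

Here `close` shares four node names with the query and differs by one node in size, so it ranks ahead of `distant`, which shares none.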

Automated discovery technique
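
Sub-isomorphism detection asks whether one graph occurs inside another. The brute-force check below is a generic backtracking sketch, not the Messmer and Bunke algorithm; it is included only to show why the worst-case cost grows exponentially. Graphs are assumed to be plain dicts mapping each node to the set of its successors, and all names are hypothetical.

```python
def is_subgraph_isomorphic(pattern, target):
    """Return True if `pattern` maps injectively into `target` such that
    every pattern edge is also a target edge (node labels are ignored).

    Both graphs are dicts: node -> set of successor nodes. Every node must
    appear as a key, even if it has no successors.
    """
    p_nodes = list(pattern)

    def consistent(node, mapping):
        # Check every pattern edge between `node` and the mapped nodes.
        for other in mapping:
            if other in pattern[node] and mapping[other] not in target[mapping[node]]:
                return False
            if node in pattern[other] and mapping[node] not in target[mapping[other]]:
                return False
        return True

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return True  # every pattern node has been placed
        node = p_nodes[len(mapping)]
        used = set(mapping.values())
        for candidate in target:
            if candidate in used:
                continue
            mapping[node] = candidate
            if consistent(node, mapping) and extend(mapping):
                return True
            del mapping[node]  # backtrack and try the next candidate
        return False

    return extend({})


# Hypothetical graphs: a three-step chain, a fork, and a four-step workflow.
chain = {"a": {"b"}, "b": {"c"}, "c": set()}
fork = {"x": {"y", "z"}, "y": set(), "z": set()}
workflow = {"in": {"s1"}, "s1": {"s2"}, "s2": {"out"}, "out": set()}

chain_found = is_subgraph_isomorphic(chain, workflow)
fork_found = is_subgraph_isomorphic(fork, workflow)
```

The chain fits inside the linear workflow, while the fork does not, because no workflow node has two outgoing data links.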

3. Can we replicate the behaviour with tools? Kind of..

Average similarity assessments across participants

[Figure: 1 + 66 workflows; traits/score]

Current work

12 + 21

Yes/no

Text clustering

OWL workflow ontology

Precision / recall

Graph matching
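
For yes/no matching, precision and recall can be computed against a gold standard of human judgements. The sketch below assumes each side is a set of workflow identifiers marked as a match; the identifiers and the function name are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for sets of workflow identifiers.

    retrieved: workflows the tool answered "yes" for.
    relevant:  workflows the gold standard marks as matches.
    Returns (precision, recall); a ratio with an empty denominator is 0.0.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the tool says yes to four workflows, three of which
# are among the five matches in the gold standard.
tool_yes = {"wf1", "wf2", "wf3", "wf7"}
gold_yes = {"wf1", "wf2", "wf3", "wf4", "wf5"}
precision, recall = precision_recall(tool_yes, gold_yes)
```

Precision is the share of the tool's "yes" answers that the gold standard confirms; recall is the share of gold-standard matches that the tool found.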

Take home

• Scientists compose Web services for real, and share their results

• Workflow discovery is a real problem, which subsumes service discovery

• A range of matching strategies and techniques apply

• Evaluation is a challenge - gold standards hard to build

• Come and play at myExperiment.org

• References at www.cs.man.ac.uk/~goderisa