Post on 15-Jan-2016
Workflow discovery in e-science
Antoon Goderis, Peter Li, Carole Goble
University of Manchester, UK
www.cs.man.ac.uk/~goderisa
Agenda
• Web services in science
• Workflow re-use
• Workflow discovery
– Is workflow discovery a new problem?
– How do people match up workflows?
– Can we replicate the behaviour with tools?
• Conclusions
Workflows
• BPEL, SCUFL, MOML, VDL … descriptions
• Workflow engine
• Orchestrates (Web-) services
• Can be published as Web service
Web services
• SOAP, WSDL description
• Readily invoked
Science is highly distributed and connected
The Web has revolutionised science
Web services about to do the same?
Scientific workflows
• e-science = supporting scientists to encode, enact, explain and share experimental procedures featuring lots of specialised data
• Case study: bioinformatics
– Understanding the DNA to behaviour link
– 3000 bio-services via the Taverna workflow editor http://mygrid.org.uk/taverna
– Re-use and repurposing of workflows
– +/- 200 Taverna workflows shared at www.myExperiment.org
Manchester, CS dept
Manchester Biology dept
Newcastle, CS dept
Scientific workflows
• e-science = supporting scientists to encode, enact, explain and share experimental procedures
• Case study: bioinformatics
– Understanding the DNA to life link
– 3000 bio-services via the Taverna workflow editor http://mygrid.org.uk/taverna
– Re-use and repurposing of workflow fragments
– +/- 200 Taverna workflows shared at www.myExperiment.org
One + Three questions
1. Can’t we just do it with ?
• Keyword search doesn’t seem to cut it
1. Is workflow discovery a new problem?
2. How do people match up workflows?
3. Can we replicate the behaviour with tools?
[Figure: matching my current workflow against myExperiment.org]
1. Is workflow discovery a new problem?
• Discovery goal
– Service discovery: encapsulate found service
– Workflow discovery: edit found workflow
• Matching process
– Service discovery: match over signature
– Workflow discovery: match over signature and content (data and service flow)
• Starting context
– Service discovery: service or data
– Workflow discovery: service or data or workflow
Source: survey of 21 myGrid/Taverna users
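The difference between matching over signature and matching over content can be sketched in a few lines. This is an illustration only, not the procedure used in the survey or in Taverna: the `Workflow` class, its fields, the Jaccard overlap measure and all example names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Workflow:
    """Hypothetical, simplified workflow description."""
    inputs: frozenset
    outputs: frozenset
    # Data links between steps, as (source, target) pairs.
    links: frozenset = field(default_factory=frozenset)


def signature_match(a: Workflow, b: Workflow) -> bool:
    """Service-discovery style: compare only inputs and outputs."""
    return a.inputs == b.inputs and a.outputs == b.outputs


def content_overlap(a: Workflow, b: Workflow) -> float:
    """Workflow-discovery style: also compare the internal data flow.

    Returns the Jaccard overlap of the two link sets (1.0 = identical).
    """
    union = a.links | b.links
    if not union:
        return 1.0
    return len(a.links & b.links) / len(union)


# Two invented workflows with the same signature but different content.
wf_a = Workflow(
    inputs=frozenset({"sequence"}),
    outputs=frozenset({"report"}),
    links=frozenset({("sequence", "align"), ("align", "report")}),
)
wf_b = Workflow(
    inputs=frozenset({"sequence"}),
    outputs=frozenset({"report"}),
    links=frozenset({("sequence", "search"), ("search", "filter"),
                     ("filter", "report")}),
)
```

Here `signature_match(wf_a, wf_b)` holds while `content_overlap(wf_a, wf_b)` is zero, which is the case a signature-only matcher cannot tell apart.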
1. Is workflow discovery a new problem? Yes
Workflow discovery subsumes service discovery
2. How do people match up workflows?
3. Can we replicate the behaviour with tools?
A user experiment with bioinformatics workflows
Workflow discovery task
• Can I sensibly adapt an existing experimental procedure (workflow) using another one?
• Extend or replace
Workflow corpus
• 66 similar workflows for Graves’ disease, done by a single author
• 1 + 5 workflows (one input workflow, five others to match against)
• Workflow diagram
• No documentation
• No annotation
By the experts, for the experts
• 9 bioinformaticians and 4 developers at a Taverna training day
Matching strategies
• Matching input workflow with 5 others
Human on-line matching strategies!
• Traits
• Scores of attraction
• Yes or no
Matching strategy: traits
Men want..
• Short term relationship
• Slim
• Students, artists, musicians, veterinarians
• Blonde
• Medium income
Women want..
• Long term relationship
• Tall
• Lawyers, financial execs, firemen
• Hair or shaved
• High income
From an analysis of 30 000 profiles
Matching strategy: scoring
Confidence level
Score
Percentile
www.AmIHotOrNot.com
Matching strategy: yes or no
Traits
• Predicted trait
– Biological subtask
– Biological supertask
– Shared inputs + outputs
– Same service type
– Shared service compositions
– Shared path between intermediary input and output
Traits and score
• Predicted trait
• Score of similarity, usefulness and confidence, e.g. [1 Identical – 9 Not similar]
The gold standard
• The collection of workflow similarity assessments
• Predictive traits, possibly interacting
2. How do people match up workflows?
• Difficulty of task
– Biological relationship very difficult for 6 out of 9
– Shape similarity difficult for 4 out of 13
– Medium confidence
• Consistency
– Inter-participant disagreement on how to order biological similarity and shape similarity [Spearman rank order test]
• Predictive traits
– No one trait dominant between and within participants [Levene homogeneity of variance test]
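As an illustration of the consistency check named above, the following sketch computes Spearman's rank correlation between two participants' orderings. The rankings are invented for the example, not data from the experiment, and the function assumes no tied ranks.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between the two ranks given to the same item.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must have the same length")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two ranked items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented example: two participants rank the same five workflows
# from most similar (1) to least similar (5).
participant_1 = [1, 2, 3, 4, 5]
participant_2 = [2, 1, 3, 5, 4]
rho = spearman_rho(participant_1, participant_2)
```

A value near 1 means the two participants order the workflows alike, a value near -1 means they order them in reverse, and values near 0 indicate the kind of disagreement reported above.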
Can we do better?
• Simpler tasks and workflows
• Experienced Taverna users
• Workflow documentation and annotation
• Other factors in use, e.g. size difference
– Fix allowed factors
– Adopt black box approach: yes/no matching
Automated discovery technique
• Unattributed graph matcher implementation by Messmer and Bunke
– Sub-isomorphism detection; exponential time complexity
– DAGs and optimization for repository of graphs
• Workflows parsed as graphs
– Workflow input, workflow output and intermediate services as nodes
– Data links as edges
– Example nodes: probeSetid, AffyMapper_seq, databaseid, Blastx, Results_Blastx
• Ranking based on
– shared nodes
– difference in size between input graph and repository graphs
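The parsing and ranking steps above can be sketched as follows. This is a simplified stand-in, not the Messmer and Bunke matcher: it ranks repository workflows by the number of shared nodes and breaks ties by the size difference. The node names come from the slide's example, but the edges between them, the repository entries `wf_a`, `wf_b`, `wf_c` and the function names are assumptions.

```python
def workflow_nodes(links):
    """Collect the nodes (inputs, outputs, services) from data links."""
    nodes = set()
    for source, target in links:
        nodes.add(source)
        nodes.add(target)
    return nodes


def rank_workflows(query_links, repository):
    """Rank repository workflows against a query workflow.

    Returns (name, shared_nodes, size_difference) tuples, most shared
    nodes first, then smallest size difference, then name.
    """
    query_nodes = workflow_nodes(query_links)
    scored = []
    for name, links in repository.items():
        nodes = workflow_nodes(links)
        shared = len(query_nodes & nodes)
        size_diff = abs(len(query_nodes) - len(nodes))
        scored.append((name, shared, size_diff))
    return sorted(scored, key=lambda row: (-row[1], row[2], row[0]))


# Query graph: node names from the slide, edges assumed for illustration.
query = [
    ("probeSetid", "AffyMapper_seq"),
    ("AffyMapper_seq", "Blastx"),
    ("databaseid", "Blastx"),
    ("Blastx", "Results_Blastx"),
]

# Hypothetical repository of workflows, each a list of data links.
repository = {
    "wf_a": [("probeSetid", "AffyMapper_seq"),
             ("AffyMapper_seq", "Blastx"),
             ("Blastx", "Results_Blastx")],
    "wf_b": [("probeSetid", "AffyMapper_seq")],
    "wf_c": [("geneId", "Lookup"), ("Lookup", "Report")],
}

ranking = rank_workflows(query, repository)
```

Treating shared nodes as the primary key and size difference as the tie-breaker is one possible reading of the two ranking criteria; the slides do not say how they are combined.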
Automated discovery technique
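To show why sub-isomorphism detection has exponential worst-case cost, here is a naive backtracking search over unattributed directed graphs. It is a textbook-style sketch, not the optimised Messmer and Bunke implementation, and it checks the non-induced variant: every edge of the pattern must map onto an edge of the target. The function name and example graphs are hypothetical.

```python
def find_embedding(pattern_edges, target_edges):
    """Return a node mapping embedding the pattern in the target, or None.

    Tries target nodes for each pattern node in turn and backtracks as
    soon as a mapped pattern edge has no counterpart in the target.
    """
    pattern_nodes = sorted({n for edge in pattern_edges for n in edge})
    target_nodes = sorted({n for edge in target_edges for n in edge})
    target_edge_set = set(target_edges)

    def extend(mapping):
        if len(mapping) == len(pattern_nodes):
            return dict(mapping)
        node = pattern_nodes[len(mapping)]
        for candidate in target_nodes:
            if candidate in mapping.values():
                continue  # keep the mapping injective
            mapping[node] = candidate
            consistent = all(
                (mapping[s], mapping[t]) in target_edge_set
                for s, t in pattern_edges
                if s in mapping and t in mapping
            )
            if consistent:
                result = extend(mapping)
                if result is not None:
                    return result
            del mapping[node]
        return None

    return extend({})


# A three-step chain is contained in a four-step chain.
chain3 = [("a", "b"), ("b", "c")]
chain4 = [("x", "y"), ("y", "z"), ("z", "w")]
embedding = find_embedding(chain3, chain4)

# A two-node cycle cannot be embedded in an acyclic graph.
cycle = [("a", "b"), ("b", "a")]
no_embedding = find_embedding(cycle, chain4)
```

In the worst case the search tries every injective assignment of pattern nodes to target nodes, which is where the exponential time complexity comes from.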
3. Can we replicate the behaviour with tools? Kind of..
Average similarity assessments across participants
[Figure: 1 + 66 workflows, traits/score]
Current work
• 12 + 21
• Yes/no
• Text clustering
• OWL workflow ontology
• Precision / recall
• Graph matching
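Precision and recall, listed above, can be computed for yes/no matching once a set of "yes" judgements is available as a gold standard. The sketch below assumes such a set exists; the workflow identifiers and the function name are invented for illustration.

```python
def precision_recall(predicted, relevant):
    """Precision and recall of a tool's "yes" matches.

    predicted: workflows the tool reports as matches.
    relevant:  workflows the gold standard marks as matches.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the tool says yes to four workflows,
# the gold standard says yes to three.
tool_yes = {"wf1", "wf2", "wf3", "wf4"}
gold_yes = {"wf2", "wf3", "wf5"}
precision, recall = precision_recall(tool_yes, gold_yes)
```

In this example two of the four reported matches are correct, and two of the three gold-standard matches are found.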
Take home
• Scientists compose Web services for real – and share their results
• Workflow discovery is a real problem, which subsumes service discovery
• A range of matching strategies and techniques apply
• Evaluation is a challenge: gold standards are hard to build
• Come and play at myExperiment.org
• References at www.cs.man.ac.uk/~goderisa