Invited talk @ DCC09 workshop
![Page 1: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/1.jpg)
IDCC’09, London - P.Missier
Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester
Scientific Workflow Management System
Research objects, myExperiment, and Open Provenance for collaborative E-science
REPRISE workshop, IDCC’09
JanusProvenance
![Page 4: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/4.jpg)
Momentum on sharing and collaboration
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
Prepublication data sharing: Nature 461, 168-170 (10 September 2009) | doi:10.1038/461168a; Published online 9 September 2009
The Toronto group: Toronto International Data Release Workshop Authors, Nature 461, 168–169 (2009)
Special issue of Nature on Data Sharing (Sept. 2009)
• Ongoing debate in several communities
– Clinical trials [1]
– Earth Sciences: ESIP, data preservation / stewardship, 2009
– Long established in some communities: Atmospheric sciences, 1998 [2]
• Science Commons recommendations for Open Science
– Open Science recommendations from Science Commons (July 2008) [link]
![Page 11: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/11.jpg)
Reference scenario
[Diagram labels: workflow + input dataset specification; workflow execution; outcome (data); outcome (provenance); Packaging; Research Object; browse, query, unbundle, reuse; data-mediated implicit collaboration; "?"]
![Page 15: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/15.jpg)
Collaboration through data
What is needed for B to make sense of A’s data?
1. Packaging: ①
– standards for self-descriptive data + metadata bundles: Research Objects
2. Content: ③
– data format standardization efforts
– metadata representation
• process provenance
– workflow provenance
3. Container: ②
– a repository for Research Objects
![Page 22: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/22.jpg)
Common pathways ①
QTL, Paul’s Pack, Paul’s Research Object
[Diagram: the Research Object contains Workflow 16, Workflow 13, Results, Logs, Metadata, Paper, and Slides, linked by the relations "feeds into", "produces", "included in", and "published in"; three layers are labelled Representation, Domain Relations, and Aggregation]
![Page 23: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/23.jpg)
ORE: representing generic aggregations
Resource Map (descriptor)
Data structure
http://www.openarchives.org/ore/1.0/primer.html section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
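The aggregation idea can be sketched in a few lines of Python. This is only an illustration of the shape of a Resource Map, not an ORE serialization: the `example.org` URIs and the `resource_map` helper are hypothetical, and only the `describes` and `aggregates` terms are taken from the ORE vocabulary.

```python
# Sketch of an ORE-style aggregation as plain triples (no RDF library).
# All example.org URIs below are hypothetical placeholders.
ORE = "http://www.openarchives.org/ore/terms/"


def resource_map(rem_uri, aggregation_uri, resource_uris):
    """Return (subject, predicate, object) triples for a Resource Map."""
    # The Resource Map is the descriptor; it describes exactly one aggregation.
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    # The aggregation lists the resources that belong to the bundle.
    for uri in resource_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


pack = resource_map(
    "http://example.org/pack/rem",
    "http://example.org/pack/aggregation",
    [
        "http://example.org/workflows/16",
        "http://example.org/workflows/13",
        "http://example.org/results/1",
        "http://example.org/paper",
    ],
)

for s, p, o in pack:
    print(s, p, o)
```

Domain relations such as "produces" or "published in" would be further triples between the aggregated resources; they are left out here to keep the sketch short.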
![Page 24: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/24.jpg)
②
![Page 27: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/27.jpg)
Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Workflow diagram labels: lister, gene_id, get pathways by genes1, merge pathways, concat gene pathway ids, pathway_genes, output]
③
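A minimal sketch of what such a trace might look like when captured in code. It assumes each task is a plain Python function; the `traced` decorator, the step implementations, and the gene-to-pathway table are invented for illustration and do not reflect how Taverna records provenance.

```python
# Sketch: record a provenance trace while running workflow steps.
# The step functions and the gene-to-pathway table are made-up placeholders.
trace = []  # one event per task invocation


def traced(task_name):
    """Decorator that logs the inputs used and the output produced by a task."""
    def wrap(fn):
        def run(*inputs):
            output = fn(*inputs)
            trace.append({"task": task_name, "inputs": inputs, "output": output})
            return output
        return run
    return wrap


FAKE_PATHWAYS = {"gene:1": ["path:a", "path:b"], "gene:2": ["path:b"]}


@traced("get_pathways_by_genes")
def get_pathways_by_genes(gene_id):
    return tuple(FAKE_PATHWAYS.get(gene_id, ()))


@traced("merge_pathways")
def merge_pathways(*pathway_lists):
    merged = []
    for pathways in pathway_lists:
        for p in pathways:
            if p not in merged:
                merged.append(p)
    return tuple(merged)


per_gene = [get_pathways_by_genes(g) for g in ("gene:1", "gene:2")]
result = merge_pathways(*per_gene)

for event in trace:
    print(event["task"], event["inputs"], "->", event["output"])
```

Each event names the task performed, the inputs it used, and the output it produced, which is the information the slide lists.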
![Page 28: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/28.jpg)
Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
![Page 29: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/29.jpg)
What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
[Workflow diagram labels: lister, gene_id, get pathways by genes1, merge pathways, concat gene pathway ids, pathway_genes, output]
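Two of these questions can be sketched against a list of run records. The record layout and the data are assumptions made for the example; a real system would query its own provenance store, and the causal answer here is only as fine-grained as a whole run.

```python
from collections import Counter

# Sketch: answer two of the questions above from a list of run records.
# The run records are made-up stand-ins for a provenance store.
runs = [
    {"input_genes": ["gene:1"], "pathways": ["path:a", "path:b"]},
    {"input_genes": ["gene:2"], "pathways": ["path:b"]},
    {"input_genes": ["gene:3"], "pathways": ["path:c"]},
]


def pathway_frequency(runs, gene_set):
    """Count how often each pathway appears in runs whose inputs all lie in gene_set."""
    counts = Counter()
    for run in runs:
        if set(run["input_genes"]) <= gene_set:
            counts.update(set(run["pathways"]))
    return counts


def genes_for_pathway(runs, pathway):
    """Run-level causal question: which input genes were used in runs that produced this pathway?"""
    genes = set()
    for run in runs:
        if pathway in run["pathways"]:
            genes.update(run["input_genes"])
    return genes


freq = pathway_frequency(runs, {"gene:1", "gene:2"})
print(freq["path:b"])
print(sorted(genes_for_pathway(runs, "path:b")))
```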
![Page 30: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/30.jpg)
Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments
Goal: standardize causal dependencies to enable provenance metadata exchange
[Diagram: edge types "A wasGeneratedBy P (role R)" and "P used A (role R)"; example graph with artifacts A1, A2, A3, A4 and processes P1, P2, P3, with edges labelled wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6)]
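A small sketch of such a graph in Python, assuming edges are stored as tuples pointing from effect to cause. The topology is invented for the example and is not the graph drawn on the slide; plain reachability over `used` and `wasGeneratedBy` edges is also a simplification of the inference rules the OPM specification defines.

```python
# Sketch of an OPM-style graph: edges point from effect to cause.
# The topology below is a made-up example, not the graph drawn on the slide.
edges = [
    # (effect, relation, cause, role)
    ("A3", "wasGeneratedBy", "P1", "R1"),
    ("P1", "used", "A1", "R2"),
    ("P1", "used", "A2", "R3"),
    ("A4", "wasGeneratedBy", "P2", "R4"),
    ("P2", "used", "A3", "R5"),
]


def causes(node):
    """Direct causes of a node: follow its outgoing edges."""
    return [cause for effect, _, cause, _ in edges if effect == node]


def upstream(node):
    """All nodes reachable by following edges from effect to cause."""
    seen = []
    stack = [node]
    while stack:
        current = stack.pop()
        for cause in causes(current):
            if cause not in seen:
                seen.append(cause)
                stack.append(cause)
    return seen


print(sorted(upstream("A4")))
```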
![Page 31: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/31.jpg)
The 3rd provenance challenge
• Chosen workflow from the Pan-STARRS project
– Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal:
– demonstrate “provenance interoperability” at query level
![Page 32: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/32.jpg)
The 3rd provenance challenge workflow
[Workflow diagram labels: read input file, load database, verify]
![Page 36: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/36.jpg)
OPM and query-interoperability
Team A:
– encode W as WA
– run WA to obtain prov(WA)
– execute query Q: Q(prov(WA))
– export prov(WA) as OPM(prov(WA))
Team B:
– import: PWA = import(OPM(prov(WA)))
– execute query Q: Q(PWA)
? Does Q(PWA) agree with Q(prov(WA))?
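The round trip can be sketched as follows. Everything here is a stand-in: Team A's row format, the JSON edge list used in place of a real OPM serialization, and Team B's adjacency structure are all assumptions chosen to keep the example small.

```python
import json

# Team A's native store: (output, task, input) rows. Made-up stand-in data.
prov_wa = [
    ("db_rows", "load database", "input_file"),
    ("report", "verify", "db_rows"),
]


def export_opm(rows):
    """Team A: serialize native rows into a neutral edge list (stand-in for OPM)."""
    edges = []
    for output, task, used in rows:
        edges.append({"effect": output, "relation": "wasGeneratedBy", "cause": task})
        edges.append({"effect": task, "relation": "used", "cause": used})
    return json.dumps(edges)


def import_opm(document):
    """Team B: rebuild its own structures from the exchanged edge list."""
    graph, processes = {}, set()
    for edge in json.loads(document):
        graph.setdefault(edge["effect"], []).append(edge["cause"])
        if edge["relation"] == "wasGeneratedBy":
            processes.add(edge["cause"])
    return graph, processes


def query_a(rows, item):
    """Q on Team A's store: which tasks contributed to item?"""
    tasks, frontier = set(), [item]
    while frontier:
        current = frontier.pop()
        for output, task, used in rows:
            if output == current and task not in tasks:
                tasks.add(task)
                frontier.append(used)
    return tasks


def query_b(store, item):
    """Q on Team B's store: the same question over the imported graph."""
    graph, processes = store
    tasks, stack, seen = set(), [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in processes:
            tasks.add(current)
        stack.extend(graph.get(current, []))
    return tasks


pwa = import_opm(export_opm(prov_wa))
same_answer = query_a(prov_wa, "report") == query_b(pwa, "report")
print(same_answer)
```

In this toy case the answers agree because both teams share the same identifiers and the exchange format preserves every dependency; neither is guaranteed between independently built systems.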
![Page 39: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/39.jpg)
OPM in Taverna
➡ the answer to any TP query can be viewed as an OPM graph
➡ encoded as RDF/XML (using the Tupelo provenance API)
![Page 43: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/43.jpg)
Additional requirements
• Artifact values require uniform common identifier scheme
– each group used artifacts to refer to its own data results
– but those results were expressed using proprietary naming conventions
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
![Page 48: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/48.jpg)
Query results as OPM graphs
– encode W as WA
– run WA to obtain prov(WA)
– execute query Q: Q(prov(WA))
– export Q(prov(WA)) as OPM(Q(prov(WA)))
- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
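The size argument can be illustrated with a toy graph. The edge list and the `lineage_edges` query are hypothetical; they stand in for the internal provenance database and its query language, which are not shown in the slides.

```python
# Sketch: export only the part of the provenance graph that answers a query,
# i.e. OPM(Q(prov(WA))) rather than OPM(prov(WA)). The graph is a made-up example.
edges = [
    ("out1", "wasGeneratedBy", "step2"),
    ("step2", "used", "mid1"),
    ("mid1", "wasGeneratedBy", "step1"),
    ("step1", "used", "in1"),
    ("out2", "wasGeneratedBy", "step3"),
    ("step3", "used", "in2"),
]


def lineage_edges(edges, target):
    """Q: the edges reachable from target, following effect -> cause."""
    selected, frontier, visited = [], [target], set()
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        for edge in edges:
            if edge[0] == node:
                selected.append(edge)
                frontier.append(edge[2])
    return selected


full_export = edges                           # what OPM(prov(WA)) would carry
query_export = lineage_edges(edges, "out1")   # what OPM(Q(prov(WA))) would carry
print(len(full_export), len(query_export))
```

Only the lineage of `out1` is exported; the unrelated `out2` branch stays behind, which is where the reduction in size comes from.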
![Page 52: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/52.jpg)
Full-fledged data-mediated collaborations
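[Diagram: exp. A takes workflow A + input A and yields result datasets A and result provenance A, packaged as Research Object A. exp. B takes workflow B + input B and yields result datasets B and result provenance B, packaged as Research Object B. Link between the two: result A → input B]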
![Page 55: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/55.jpg)
Full-fledged data-mediated collaborations
[Diagram: a single Research Object A+B holding workflow A + input A, result datasets A, workflow B + input B, result datasets B, and result provenance A + B; link: result A → input B]
Provenance composition accounts for implicit collaboration
Aligned with focus of upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
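A minimal sketch of provenance composition, assuming both experiments record edges in the same effect-to-cause form and, crucially, refer to the shared dataset by the same identifier. The identifiers and helper names are invented for the example.

```python
# Sketch: compose two provenance graphs when result A is reused as input B.
# Identifiers are made-up; the join only works because both graphs use the
# same identifier ("data:resultA") for the shared item.
prov_a = [
    ("data:resultA", "wasGeneratedBy", "proc:workflowA"),
    ("proc:workflowA", "used", "data:inputA"),
]
prov_b = [
    ("data:resultB", "wasGeneratedBy", "proc:workflowB"),
    ("proc:workflowB", "used", "data:resultA"),
]


def compose(*graphs):
    """Union of edge lists, dropping exact duplicates but keeping order."""
    combined = []
    for graph in graphs:
        for edge in graph:
            if edge not in combined:
                combined.append(edge)
    return combined


def upstream(edges, node):
    """Everything the node causally depends on, following effect -> cause."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for effect, _, cause in edges:
            if effect == current and cause not in found:
                found.append(cause)
                stack.append(cause)
    return found


prov_ab = compose(prov_a, prov_b)
print(upstream(prov_b, "data:resultB"))   # B's own view stops at result A
print(upstream(prov_ab, "data:resultB"))  # the composed view reaches input A
```

If the two groups named the shared dataset differently, the graphs would not join, which is why a uniform identifier scheme matters for this kind of collaboration.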
![Page 56: Invited talk @ DCC09 workshop](https://reader033.fdocuments.us/reader033/viewer/2022060108/555065b3b4c905ae3f8b55fc/html5/thumbnails/56.jpg)
Contacts
The myGrid Consortium (Manchester, Southampton)
JanusProvenance
http://www.myexperiment.org
http://mygrid.org.uk