Report on Provenance Challenge 3, at the PC3 meeting, 2009
-
Upload
paolo-missier -
Category
Technology
-
view
267 -
download
2
description
Transcript of Report on Provenance Challenge 3, at the PC3 meeting, 2009
Paolo MissierInformation Management Group
School of Computer Science, University of Manchester, UK
Provenance Challenge 3 meetingAmsterdam, June 10-11, 2009
1
Provenance challenge 3and Taverna
1. Implement the challenge workflow
2
Interpreting the challenge
2. Answer the core (+optional) challenge queries
3. Produce and export OPM
4. Import and consume OPM
1. Implement the challenge workflow
2
Interpreting the challenge
➡ Expose the same amount of data that is visible in the Trident version of the workflow
2. Answer the core (+optional) challenge queries
3. Produce and export OPM
4. Import and consume OPM
1. Implement the challenge workflow
2
Interpreting the challenge
➡ Expose the same amount of data that is visible in the Trident version of the workflow
➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?
2. Answer the core (+optional) challenge queries
3. Produce and export OPM
4. Import and consume OPM
1. Implement the challenge workflow
2
Interpreting the challenge
➡ Expose the same amount of data that is visible in the Trident version of the workflow
➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?
➡ Export the smallest OPM graph that contains the query answer➡(graph for the entire run is a special “degenerate” case)
2. Answer the core (+optional) challenge queries
3. Produce and export OPM
4. Import and consume OPM
1. Implement the challenge workflow
2
Interpreting the challenge
➡ Expose the same amount of data that is visible in the Trident version of the workflow
➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?
➡ Export the smallest OPM graph that contains the query answer➡(graph for the entire run is a special “degenerate” case)
➡Map an OPM graph to an instance of a TP causal graph➡Use TP queries to answer the same challenge queries as
above
2. Answer the core (+optional) challenge queries
3. Produce and export OPM
4. Import and consume OPM
Our challenges
3
Our challenges
3
➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model
Our challenges
3
➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model
➡Is the current Taverna Provenance (TP) query model sufficient to answer the queries?➡ challenge queries gave us good new requirements
Our challenges
3
➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model
➡Is the current Taverna Provenance (TP) query model sufficient to answer the queries?➡ challenge queries gave us good new requirements
➡export the smallest OPM graph that contains the query answer➡ OPM generation is now fully integrated into TP query
answering
Our challenges
3
➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model
➡Is the current Taverna Provenance (TP) query model sufficient to answer the queries?➡ challenge queries gave us good new requirements
➡export the smallest OPM graph that contains the query answer➡ OPM generation is now fully integrated into TP query
answering
➡Map an OPM graph to an instance of a TP causal graph➡Use TP queries to answer the same challenge queries as
above➡ TP query processing requires a representation of the
originating workflow➡ This required generating the “minimal plausible
originating dataflow” (MPOD) for the OPM graph
The challenge workflow as a Taverna dataflow
4
control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”
data links
The challenge workflow as a Taverna dataflow
4
control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”
data links
this produces a list...
The challenge workflow as a Taverna dataflow
4
control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”
data links
this produces a list...
...these consume one item of the list at a time....
The challenge workflow as a Taverna dataflow
4
control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”
data links
this produces a list...
...these consume one item of the list at a time....
the resulting iteration over the lists occurs automatically
TP query model and new requirements
5
- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow
TP query model and new requirements
5
trace the lineage of values observed here...
- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow
TP query model and new requirements
5
trace the lineage of values observed here...
...at these points in the workflow
- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow
TP query model and new requirements
5
trace the lineage of values observed here...
...at these points in the workflow
- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow
- challenge query 1 “detailed version” seems to require more knowledge on data dependencies than those obtained from this structural model
- level of detection ID not available unless query black box in LoadCSVFileIntoTable is opened up- i.e.,CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (?,?,?,?,?,?,?)
Example: query 3
6
Example: query 3query.vars= LoadCSVFileIntoTable / LoadCSVFileIntoTableOutput / 1
query.processors=ALL
Example: query 3
6
Example: query 3query.vars= LoadCSVFileIntoTable / LoadCSVFileIntoTableOutput / 1
query.processors=ALL
Integrated OPM generation as part of TP query
7
Integrated OPM generation as part of TP query
7
Integrated OPM generation as part of TP query
7
➡ the answer to any TP query can be viewed as an OPM graph
➡ encoded as RDF/XML using the Tupelo provenance API.
MPOD generation rules - I
8
MPOD = Minimal Plausible Originating Dataflow- induced from an OPM graph- The TP query models requires a workflow structure- this is a first approximation...subject to refinement
A PwasGeneratedBy (R)
AP used (R)
RP
R P
output port R
input port R
MPOD generation rules - I
8
MPOD = Minimal Plausible Originating Dataflow- induced from an OPM graph- The TP query models requires a workflow structure- this is a first approximation...subject to refinement
A PwasGeneratedBy (R)
AP used (R)
RP
R P
A1
P3
A2
A3
A4
wgb(R1)
wgb(R2)
used(R3)
used(R4)
P1wgb(R5)
P2wgb(R6)
R5
P1
R6
P2
R1
P3
R3 R4
R2
output port R
input port R
MPOD generation rules - II
9
A2wasDeterminedFrom A1
A2 PwasGeneratedBy (R2)
This is usually inferred, i.e. there exist P, R1, R2 such that:
A1P used (R1) R2
PR1
Note 1: if the corresponding “wgby” and “used” edges are not found, then new P, R1, R2 are created and added to the graph ‣ however, in all cases encountered so far, wasDeterminedFrom was inferred: P, R1, R2 appear in existing wgb(R) and used(R) edges
Note 2: wasControlledBy and wasTriggeredBy ignored for now
Derived property:
Note 3: a separate MPOD is created for each account in the OPM graph
10
• OPM contributions successfully imported so far:– UC Davis– NCSA– SOTON
• Example (UC Davis)
Importing OPM for the challenge
Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
query.variables: LoadCSVFileIntoTable:2 / out query.processors=ALL
(links to PC3 / UoM wiki page)
MPODexample!
Example -- query 3 result• Ideally, the imported graph + MPOD allow provenance queries
to be submitted to the imported graph just as if it were a native TP graph
• The answer is viewed as a new OPM graph itself
11
Lossless mappings using OPM and MPOD
12
f exec → trace(f)
〈f, trace(f)〉TP
TP = Taverna Provenance model
Q OPMQ(trace(f))
subgraph with query answer only
Provenance Query:
Export to OPM:export is just a query exp(trace(f)) that returns the entire trace
Import from OPM:
OPMimport
〈f, trace(f)〉TP
export OPMexp(trace(f))
MPOD
Lossless mappings using OPM and MPOD
12
f exec → trace(f)
〈f, trace(f)〉TP
TP = Taverna Provenance model
Q OPMQ(trace(f))
subgraph with query answer only
Provenance Query:
Export to OPM:export is just a query exp(trace(f)) that returns the entire trace
Import from OPM:
OPMimport
〈f, trace(f)〉TP
export OPMexp(trace(f))
MPOD
when is this transformation loss-less?
More on lossless-ness and OPM
13
f exec → trace(f)
〈f, trace(f)〉TP
export OPMexp(trace(f))
〈f’, trace(f’)〉TP
import
Tavernadataflow
More on lossless-ness and OPM
13
f exec → trace(f)
〈f, trace(f)〉TP
export OPMexp(trace(f))
〈f’, trace(f’)〉TP
import export
Tavernadataflow
More on lossless-ness and OPM
13
f exec → trace(f)
〈f, trace(f)〉TP
export OPMexp(trace(f))
〈f’, trace(f’)〉TP
import export
Tavernadataflow
=?=
More on lossless-ness and OPM
13
f exec → trace(f)
〈f, trace(f)〉TP
export OPMexp(trace(f))
〈f’, trace(f’)〉TP
import export
This is indeed lossless when f is itself a Taverna dataflow:
export ( import (export (trace(f)))) =?= export (trace(f))
(requires proof)
Tavernadataflow
=?=