Report on Provenance Challenge 3, at the PC3 meeting, 2009

35
Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK Provenance Challenge 3 meeting Amsterdam, June 10-11, 2009 1 Provenance challenge 3 and Taverna

description

Amsterdam, June 10-11, 2009

Transcript of Report on Provenance Challenge 3, at the PC3 meeting, 2009

Page 1: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Paolo MissierInformation Management Group

School of Computer Science, University of Manchester, UK

Provenance Challenge 3 meetingAmsterdam, June 10-11, 2009

1

Provenance challenge 3and Taverna

Page 2: Report on Provenance Challenge 3, at the PC3 meeting, 2009

1. Implement the challenge workflow

2

Interpreting the challenge

2. Answer the core (+optional) challenge queries

3. Produce and export OPM

4. Import and consume OPM

Page 3: Report on Provenance Challenge 3, at the PC3 meeting, 2009

1. Implement the challenge workflow

2

Interpreting the challenge

➡ Expose the same amount of data that is visible in the Trident version of the workflow

2. Answer the core (+optional) challenge queries

3. Produce and export OPM

4. Import and consume OPM

Page 4: Report on Provenance Challenge 3, at the PC3 meeting, 2009

1. Implement the challenge workflow

2

Interpreting the challenge

➡ Expose the same amount of data that is visible in the Trident version of the workflow

➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?

2. Answer the core (+optional) challenge queries

3. Produce and export OPM

4. Import and consume OPM

Page 5: Report on Provenance Challenge 3, at the PC3 meeting, 2009

1. Implement the challenge workflow

2

Interpreting the challenge

➡ Expose the same amount of data that is visible in the Trident version of the workflow

➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?

➡ Export the smallest OPM graph that contains the query answer➡(graph for the entire run is a special “degenerate” case)

2. Answer the core (+optional) challenge queries

3. Produce and export OPM

4. Import and consume OPM

Page 6: Report on Provenance Challenge 3, at the PC3 meeting, 2009

1. Implement the challenge workflow

2

Interpreting the challenge

➡ Expose the same amount of data that is visible in the Trident version of the workflow

➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?

➡ Export the smallest OPM graph that contains the query answer➡(graph for the entire run is a special “degenerate” case)

➡Map an OPM graph to an instance of a TP causal graph➡Use TP queries to answer the same challenge queries as

above

2. Answer the core (+optional) challenge queries

3. Produce and export OPM

4. Import and consume OPM

Page 7: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Our challenges

3

Page 8: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Our challenges

3

➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model

Page 9: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Our challenges

3

➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model

➡Is the current Taverna Provenance (TP) query model sufficient to answer the queries?➡ challenge queries gave us good new requirements

Page 10: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Our challenges

3

➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model

➡Is the current Taverna Provenance (TP) query model sufficient to answer the queries?➡ challenge queries gave us good new requirements

➡export the smallest OPM graph that contains the query answer➡ OPM generation is now fully integrated into TP query

answering

Page 11: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Our challenges

3

➡Expose the same amount of data that is visible in the Trident version of the workflow:➡ map the control flow to a pure dataflow model

➡Is the current Taverna Provenance (TP) query model sufficient to answer the queries?➡ challenge queries gave us good new requirements

➡export the smallest OPM graph that contains the query answer➡ OPM generation is now fully integrated into TP query

answering

➡Map an OPM graph to an instance of a TP causal graph➡Use TP queries to answer the same challenge queries as

above➡ TP query processing requires a representation of the

originating workflow➡ This required generating the “minimal plausible

originating dataflow” (MPOD) for the OPM graph

Page 12: Report on Provenance Challenge 3, at the PC3 meeting, 2009

The challenge workflow as a Taverna dataflow

4

control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”

data links

Page 13: Report on Provenance Challenge 3, at the PC3 meeting, 2009

The challenge workflow as a Taverna dataflow

4

control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”

data links

this produces a list...

Page 14: Report on Provenance Challenge 3, at the PC3 meeting, 2009

The challenge workflow as a Taverna dataflow

4

control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”

data links

this produces a list...

...these consume one item of the list at a time....

Page 15: Report on Provenance Challenge 3, at the PC3 meeting, 2009

The challenge workflow as a Taverna dataflow

4

control links:“first LoadCSVFileIntoTable, then UpdateComputedColumns”

data links

this produces a list...

...these consume one item of the list at a time....

the resulting iteration over the lists occurs automatically

Page 16: Report on Provenance Challenge 3, at the PC3 meeting, 2009

TP query model and new requirements

5

- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow

Page 17: Report on Provenance Challenge 3, at the PC3 meeting, 2009

TP query model and new requirements

5

trace the lineage of values observed here...

- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow

Page 18: Report on Provenance Challenge 3, at the PC3 meeting, 2009

TP query model and new requirements

5

trace the lineage of values observed here...

...at these points in the workflow

- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow

Page 19: Report on Provenance Challenge 3, at the PC3 meeting, 2009

TP query model and new requirements

5

trace the lineage of values observed here...

...at these points in the workflow

- queries are purely structural, no semantics- queries can be answered only at same level of granularity as that of the service incapsulation of the data in the workflow

- challenge query 1 “detailed version” seems to require more knowledge on data dependencies than those obtained from this structural model

- level of detection ID not available unless query black box in LoadCSVFileIntoTable is opened up- i.e.,CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (?,?,?,?,?,?,?)

Page 20: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Example: query 3

6

Example: query 3query.vars= LoadCSVFileIntoTable / LoadCSVFileIntoTableOutput / 1

query.processors=ALL

Page 21: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Example: query 3

6

Example: query 3query.vars= LoadCSVFileIntoTable / LoadCSVFileIntoTableOutput / 1

query.processors=ALL

Page 22: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Integrated OPM generation as part of TP query

7

Page 24: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Integrated OPM generation as part of TP query

7

➡ the answer to any TP query can be viewed as an OPM graph

➡ encoded as RDF/XML using the Tupelo provenance API.

Page 25: Report on Provenance Challenge 3, at the PC3 meeting, 2009

MPOD generation rules - I

8

MPOD = Minimal Plausible Originating Dataflow- induced from an OPM graph- The TP query models requires a workflow structure- this is a first approximation...subject to refinement

A PwasGeneratedBy (R)

AP used (R)

RP

R P

output port R

input port R

Page 26: Report on Provenance Challenge 3, at the PC3 meeting, 2009

MPOD generation rules - I

8

MPOD = Minimal Plausible Originating Dataflow- induced from an OPM graph- The TP query models requires a workflow structure- this is a first approximation...subject to refinement

A PwasGeneratedBy (R)

AP used (R)

RP

R P

A1

P3

A2

A3

A4

wgb(R1)

wgb(R2)

used(R3)

used(R4)

P1wgb(R5)

P2wgb(R6)

R5

P1

R6

P2

R1

P3

R3 R4

R2

output port R

input port R

Page 27: Report on Provenance Challenge 3, at the PC3 meeting, 2009

MPOD generation rules - II

9

A2wasDeterminedFrom A1

A2 PwasGeneratedBy (R2)

This is usually inferred, i.e. there exist P, R1, R2 such that:

A1P used (R1) R2

PR1

Note 1: if the corresponding “wgby” and “used” edges are not found, then new P, R1, R2 are created and added to the graph ‣ however, in all cases encountered so far, wasDeterminedFrom was inferred: P, R1, R2 appear in existing wgb(R) and used(R) edges

Note 2: wasControlledBy and wasTriggeredBy ignored for now

Derived property:

Note 3: a separate MPOD is created for each account in the OPM graph

Page 28: Report on Provenance Challenge 3, at the PC3 meeting, 2009

10

• OPM contributions successfully imported so far:– UC Davis– NCSA– SOTON

• Example (UC Davis)

Importing OPM for the challenge

Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

query.variables: LoadCSVFileIntoTable:2 / out query.processors=ALL

(links to PC3 / UoM wiki page)

MPODexample!

Page 29: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Example -- query 3 result• Ideally, the imported graph + MPOD allow provenance queries

to be submitted to the imported graph just as if it were a native TP graph

• The answer is viewed as a new OPM graph itself

11

Page 30: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Lossless mappings using OPM and MPOD

12

f exec → trace(f)

〈f, trace(f)〉TP

TP = Taverna Provenance model

Q OPMQ(trace(f))

subgraph with query answer only

Provenance Query:

Export to OPM:export is just a query exp(trace(f)) that returns the entire trace

Import from OPM:

OPMimport

〈f, trace(f)〉TP

export OPMexp(trace(f))

MPOD

Page 31: Report on Provenance Challenge 3, at the PC3 meeting, 2009

Lossless mappings using OPM and MPOD

12

f exec → trace(f)

〈f, trace(f)〉TP

TP = Taverna Provenance model

Q OPMQ(trace(f))

subgraph with query answer only

Provenance Query:

Export to OPM:export is just a query exp(trace(f)) that returns the entire trace

Import from OPM:

OPMimport

〈f, trace(f)〉TP

export OPMexp(trace(f))

MPOD

when is this transformation loss-less?

Page 32: Report on Provenance Challenge 3, at the PC3 meeting, 2009

More on lossless-ness and OPM

13

f exec → trace(f)

〈f, trace(f)〉TP

export OPMexp(trace(f))

〈f’, trace(f’)〉TP

import

Tavernadataflow

Page 33: Report on Provenance Challenge 3, at the PC3 meeting, 2009

More on lossless-ness and OPM

13

f exec → trace(f)

〈f, trace(f)〉TP

export OPMexp(trace(f))

〈f’, trace(f’)〉TP

import export

Tavernadataflow

Page 34: Report on Provenance Challenge 3, at the PC3 meeting, 2009

More on lossless-ness and OPM

13

f exec → trace(f)

〈f, trace(f)〉TP

export OPMexp(trace(f))

〈f’, trace(f’)〉TP

import export

Tavernadataflow

=?=

Page 35: Report on Provenance Challenge 3, at the PC3 meeting, 2009

More on lossless-ness and OPM

13

f exec → trace(f)

〈f, trace(f)〉TP

export OPMexp(trace(f))

〈f’, trace(f’)〉TP

import export

This is indeed lossless when f is itself a Taverna dataflow:

export ( import (export (trace(f)))) =?= export (trace(f))

(requires proof)

Tavernadataflow

=?=