Automatic vs Manual Provenance Abstractions: Mind the Gap
Pınar Alper
TAPP
9 June 2016
Carole A. Goble, University of Manchester
Khalid Belhajjame, Université Paris Dauphine
Pınar Alper, University of Manchester
Transparency brought by Provenance can be a double-edged sword
Too revealing
[Figure: provenance graph with activities validation and sanitisation and entities formData1, formData2 and formData3, linked by usd, wgb and wib edges.]
Secured Provenance
Transparency brought by Provenance can be a double-edged sword
Too complex
[Figure: provenance graph with labels input, result, simulation, adapt 1, adapt 2 and adapt 3, linked by usd and wgb edges.]
Simplified Provenance
– Reports are experimental metadata on:
• Method
• Data
“Toward interoperable bioscience data.” Nature Genetics, 44(2):121–126, February 2012.
“Scientific Data”, Open-Access Journal. Nature Publishing Group, 2015, http://scientificdata.isa-explorer.org
“Best Practices for Workflow Design: How to Prevent Workflow Decay.” In Proceedings of SWAT4LS, November 2012.
[Figure labels: Data, Annotations, Data Bundle, Workflow & PROV-O]
Compare manual and semi-automated abstractions that simplify workflows in the context of reporting data-oriented experiments.
Our Goal & Context
Workflow complexity necessitates abstraction
• Up to 50+ data processing tasks.
• Majority (70%) dedicated to data adaptation.
• Leads to complex provenance.
“Common Motifs in Scientific Workflows: An empirical analysis.” Future Generation Comp. Sys. 36: 338-351 (2014).
• Manual, Design Abstractions observable in existing workflows
• Embedded into design, static
Current Approach
Bookmarked Intermediaries
Design Abstractions
Alternative Approach
Semi-automated abstraction systems:
– dynamic
– Workflow Summaries & ZOOM UserViews
“Small is beautiful: Summarizing scientific workflows using semantic annotations.” BigData 2013, pages 318-325.
“Querying and Managing Provenance through User Views in Scientific Workflows.” ICDE 2008, pages 1072–1081.
• ProvAbs, IPAW 2014
• Propub, SSDBM 2011
• TACLP, FGCS 2015
• SecProv, WAIM 2008
• Provenance Redaction, SACMAT 2011
• Surrogates, VLDB 2011
Workflow Abstraction Primitives
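Task grouping
[Figure: Task A (ports inA, outA) and Task B (ports inB, outB) connected by dataflow (df) links are replaced by a single task "Task A & B" with ports inA and outB.]
Task elimination
[Figure: Task A (inA, outA), Task B (inB, outB) and Task C (inC, outC) connected by df links; after elimination, Task A and Task C remain, joined by an indirect df link.]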
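The two primitives can be sketched on a toy workflow. This is only an illustration under assumptions: the workflow is a set of task names plus a set of directed dataflow edges between tasks, and the function names `group_tasks` and `eliminate_task` are hypothetical, not the API of any system discussed here.

```python
def group_tasks(tasks, edges, members, name):
    """Collapse the tasks in `members` into one composite task called `name`."""
    members = set(members)
    new_tasks = (set(tasks) - members) | {name}
    new_edges = set()
    for src, dst in edges:
        src2 = name if src in members else src
        dst2 = name if dst in members else dst
        if src2 != dst2:  # dataflow links internal to the group disappear
            new_edges.add((src2, dst2))
    return new_tasks, new_edges


def eliminate_task(tasks, edges, victim):
    """Remove `victim` and bridge its predecessors to its successors."""
    preds = {s for s, d in edges if d == victim and s != victim}
    succs = {d for s, d in edges if s == victim and d != victim}
    kept = {(s, d) for s, d in edges if victim not in (s, d)}
    indirect = {(p, s) for p in preds for s in succs if p != s}
    return set(tasks) - {victim}, kept | indirect


# Hypothetical three-task pipeline: A -> B -> C
tasks = {"A", "B", "C"}
edges = {("A", "B"), ("B", "C")}

grouped = group_tasks(tasks, edges, {"A", "B"}, "A&B")
eliminated = eliminate_task(tasks, edges, "B")
```

Grouping keeps a node standing in for the hidden tasks, whereas elimination leaves only an indirect link between the surviving neighbours.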
Integrity Policy: Soundness
[Figure: tasks a and b with data d1 to d5 are grouped into a&b, which retains d1, d2, d4 and d5; edges are labelled usd and wgb.]
In the context of reporting, soundness can be compromised.
Sub-workflow based design abstractions do not necessarily preserve soundness.
Integrity Policy: Acyclicity
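[Figure: tasks a, b and c with data d1 to d5; a and c are grouped into "a & c" while b stays outside the group, with d2 and d4 between them; edges are labelled usd and wgb.]
From a modelling perspective, cycles are allowed in provenance.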
Cyclic dataflows unfold into Acyclic Lineage
[Figure: a cyclic dataflow between tasks a and b (df links) unfolds into a chain of invocations (a.1, b.1, a.2, ...) over data d1 to d4, linked by usd and wgb edges.]
In the context of reporting (e.g. data tables, design abstractions), cycles are not observed.
Raw workflow provenance is acyclic.
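A small sketch of the unfolding idea, assuming a loop over tasks is recorded as one invocation per task per iteration (named `a.1`, `b.1`, `a.2`, ... as in the figure). Data items are omitted and edges run task to task for brevity; the helper names `unfold` and `is_acyclic` are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError


def is_acyclic(edges):
    """Return True if the directed graph given as (src, dst) pairs has no cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(dst, set()).add(src)  # dst depends on src
        graph.setdefault(src, set())
    try:
        TopologicalSorter(graph).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False


def unfold(loop_tasks, iterations):
    """Unfold a loop over `loop_tasks` into a chain of per-invocation edges."""
    invocations = [f"{t}.{i}" for i in range(1, iterations + 1) for t in loop_tasks]
    return list(zip(invocations, invocations[1:]))


design = [("a", "b"), ("b", "a")]  # cyclic dataflow at design level
trace = unfold(["a", "b"], 2)      # a.1 -> b.1 -> a.2 -> b.2
```

The design-level graph contains a cycle, while the unfolded trace is a simple chain, so an acyclicity check passes on the trace but not on the design.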
Integrity Policy: Bipartiteness
[Figure: data d1 and d2 linked through task a by usd and wgb edges.]
Workflow provenance is bipartite.
Design abstractions preserve bipartiteness.
In the context of reporting, bipartiteness is advised but not always achieved.
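Bipartiteness can be checked mechanically. The sketch below assumes provenance is given as a set of data nodes, a set of task nodes, and directed edges between them (usd from task to data, wgb from data to task); the function name and the sample graphs are hypothetical.

```python
def is_bipartite_provenance(data, tasks, edges):
    """True if every edge links a data node with a task node, never two of a kind."""
    for src, dst in edges:
        data_to_task = src in data and dst in tasks
        task_to_data = src in tasks and dst in data
        if not (data_to_task or task_to_data):
            return False
    return True


data = {"d1", "d2"}
tasks = {"a"}

raw = [("a", "d1"), ("d2", "a")]  # a usd d1, d2 wgb a
shortcut = [("d2", "d1")]         # a direct data-to-data link, as a report might draw it

raw_ok = is_bipartite_provenance(data, tasks, raw)
shortcut_ok = is_bipartite_provenance(data, tasks, shortcut)
```

A report that links one data item directly to another, skipping the task in between, fails this check.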
Integrity Policy: Validity
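[Figure: tasks a and b with data d1, d2, d3 and d5; in the abstraction only a, d1, d2 and d5 remain, and two edges are marked "?"; the other edges are labelled usd and wgb.]
Design abstractions preserve validity.
In the context of reporting, validity is a necessary property.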
Integrity Policy: Completeness
[Figure: tasks a and b with data d1 to d5 are grouped into a&b, which retains d1, d2, d4 and d5; edges are labelled usd and wgb.]
Determined by preservation of lineage relations.
Design abstractions preserve completeness.
In the context of reporting, completeness is a necessary property.
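One way to read "preservation of lineage relations" is sketched below, as an assumption rather than the definition used by any of the systems compared: every lineage relation between two retained data items in the original trace must still be derivable in the abstraction. Edges point from the dependent node to what it depends on (usd and wgb direction); all names and sample traces are hypothetical.

```python
def lineage_pairs(edges, data):
    """All (derived, source) pairs of nodes in `data` connected by a directed path."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
    pairs = set()
    for start in data:
        stack, seen = list(succ.get(start, ())), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in data:
                pairs.add((start, node))
            stack.extend(succ.get(node, ()))
    return pairs


def is_complete(original_edges, abstract_edges, retained_data):
    """True if lineage among retained data in the original survives abstraction."""
    retained = set(retained_data)
    expected = lineage_pairs(original_edges, retained)
    return expected <= lineage_pairs(abstract_edges, retained)


# Hypothetical trace: a uses d1 and d2 and generates d3; b uses d3 and generates d5.
original = [("a", "d1"), ("a", "d2"), ("d3", "a"), ("b", "d3"), ("d5", "b")]
grouped = [("a&b", "d1"), ("a&b", "d2"), ("d5", "a&b")]  # d3 hidden inside a&b
lossy = [("a&b", "d1"), ("d5", "a&b")]                   # the link to d2 was dropped

grouped_ok = is_complete(original, grouped, {"d1", "d2", "d5"})
lossy_ok = is_complete(original, lossy, {"d1", "d2", "d5"})
```

Hiding the intermediate d3 inside the group keeps the lineage from d5 back to d1 and d2 intact, while dropping the d2 link loses a relation.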
Comparison of Systems
System: Workflow Summaries | ZOOM | Design Abstractions
Abstraction Policy: Annotation-Primitive pairs | Important Task Ids | --
Primitive: Grouping, Elimination | Grouping | Grouping
Integrity Policy
Validity ✓ ✓ ✓ ✓
Soundness ✓ ✓
Bipartiteness ✓ ✓ ✓
Completeness ✓ ✓ ✓ ✓
Acyclicity ✓ ✓ ✓
Retrieve Data
• Design Abstractions as ground truth
– Ports on main data derivation path
– Sub-workflow tasks.
• Workflow Summaries
– Eliminate All Adapters
– Collapse All Adapters
– Collapse
• ZOOM
– Non-Adapter tasks designated as significant
[Figure labels: Pack parameters, Unpack results, Change format, Perform Analysis]
Comparison Against Design Abstraction
Task Elimination
Process Precision = (# of processes in the abstracted account overlapping with the user's abstraction) / (total # of processes in the abstracted account)
Data Precision
A significant task and its report-worthy output are not necessarily co-located.
Hopping over traces does not simplify the account data-wise as it does process-wise.
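Process precision, as described on this slide, is the number of processes in the abstracted account that overlap with the user's abstraction, divided by the total number of processes in the abstracted account. A minimal sketch follows; the function name and the sample task sets are hypothetical, and applying the same formula to data items for data precision is an assumption.

```python
def precision(abstracted, reference):
    """Fraction of items in the abstracted account that also appear in the reference."""
    abstracted, reference = set(abstracted), set(reference)
    if not abstracted:
        return 0.0  # convention chosen here for an empty abstracted account
    return len(abstracted & reference) / len(abstracted)


# Hypothetical example: processes kept by a system vs. kept by the user
system_processes = {"Retrieve Data", "Change format", "Perform Analysis"}
user_processes = {"Retrieve Data", "Perform Analysis"}

process_precision = precision(system_processes, user_processes)
```

Here two of the three processes kept by the system also appear in the user's abstraction, giving a process precision of 2/3.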
Task Grouping
[Chart panels: Process, Data]
ZOOM’s soundness policy creates two extra groups.
Where you put the group boundaries matters for data abstraction!
Task Grouping
[Chart panels: Process, Data]
Abstracting selectively (less aggressively) by taking activity function (hence I/O characteristics) into account.
Conclusions
• Abstraction systems focus on the process and do not directly cater for data significance.
• Are these observations generalisable?
– Elimination unsuited for WFs with pack/unpack steps
– Sweeping style grouping isn’t helpful for data abstraction
– Selective grouping relies on domain-specific policies.
• End-use informs suitable integrity policies.
• Scientists are accountable for reports. They are likely to favor having final say on the abstraction.
• Rethink abstraction as a prehoc process supporting workflow design.
“Towards Harnessing Computational Workflow Provenance for Experiment Reporting”, PhD Dissertation, University of Manchester E-Scholar Repository. https://www.escholar.manchester.ac.uk/uk-ac-man-scw:300560
Thank You!