Provenance for Generalized Map and Reduce Workflows
Robert Ikeda, Hyunjung Park, Jennifer Widom
Stanford University
Pei Zhang, Yue Lu
Provenance
Where data came from
How it was derived, manipulated, combined, processed, …
How it has evolved over time
Uses: explanation, debugging and verification, recomputation
The Panda Environment
Data-oriented workflows
  Graph of processing nodes
  Data sets on edges
  Statically defined; batch execution; acyclic
[Figure: workflow with input data sets I1, …, In and output data set O]
Provenance
Backward tracing: find the input subsets that contributed to a given output element
Forward tracing: determine which output elements were derived from a particular input element
[Figure: example with data sets TwitterPosts and MovieSentiments]
Provenance
Basic idea
  Capture provenance one node at a time (lazy or eager)
  Use it for backward and forward tracing
  Handle processing nodes of all types
Generalized Map and Reduce Workflows
What if every node were a Map or Reduce function?
Provenance is easier to define, capture, and exploit than in the general case
Transparent provenance capture in Hadoop
  Doesn’t interfere with parallelism or fault tolerance
[Figure: workflow whose nodes are labeled M (Map) and R (Reduce)]
Remainder of Talk
Defining Map and Reduce provenance
Recursive workflow provenance
Capturing and tracing provenance
System description and performance
Future work
Surprising theoretical result
Implementation details
Example
[Figure: example workflow. Inputs: Tweets, Diggs. Processing nodes: TweetScan, DiggScan, Aggregate, Filter. Outputs: GoodMovies, BadMovies. Other labels: TM, DM, AM]
Transformation Properties
Deterministic functions
Multiplicity for Map functions
Multiplicity for Reduce functions
Monotonicity
Map and Reduce Provenance
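Map functions
  $M(I) = \bigcup_{i \in I} M(\{i\})$
  Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$
Reduce functions
  $R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \ldots, I_n$ partition $I$ on reduce-key
  Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$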
Workflow Provenance
Intuitive recursive definition
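Desirable “replay” property: $o \in W(I^*_1, \ldots, I^*_n)$, where $I^*_1, \ldots, I^*_n$ is the provenance of output element $o$ in the workflow inputs $I_1, \ldots, I_n$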
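Usually holds, but not always
[Figure: workflow W of Map (M) and Reduce (R) nodes with inputs I1, …, In, intermediate data sets E1, E2, …, and output O; an output element o ∈ O traces back to subsets I*1, …, I*n of the inputs]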
Replay Property Example
Workflow: TwitterPosts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating

TwitterPosts:
“Avatar was great”
“I hated Twilight”
“Twilight was pretty bad”
“I enjoyed Avatar”
“I loved Twilight”
“Avatar was okay”

Inferred Movie Ratings:
Movie     Rating
Avatar    8
Twilight  0
Twilight  2
Avatar    7
Twilight  7
Avatar    4

Rating Medians:
Movie     Median
Avatar    7
Twilight  2

#Movies Per Rating:
Median  #Movies
2       1
7       1
Replay Property Example
Workflow: TwitterPosts → TweetScan (M, one-many function) → Inferred Movie Ratings → Summarize (R, nonmonotonic Reduce) → Rating Medians → Count (R, nonmonotonic Reduce) → #Movies Per Rating

TwitterPosts (one tweet now mentions two movies):
“Avatar was great”
“I hated Twilight”
“Twilight was pretty bad”
“I enjoyed Avatar And Twilight too”
“Avatar was okay”

Inferred Movie Ratings:
Movie     Rating
Avatar    8
Twilight  0
Twilight  2
Avatar    7
Twilight  7
Avatar    4

Rating Medians:
Movie     Median
Avatar    7
Twilight  2

#Movies Per Rating:
Median  #Movies
2       1
7       1

Replaying the workflow on the provenance of output element (7, 1) produces (7, 2) instead, so the replay property does not hold here.
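The replay failure in this example can be checked with a small Python sketch. It is an illustration under stated assumptions, not code from the talk: `RATINGS` is a hard-coded stand-in for TweetScan (the ratings are copied from the slide), and the function names `tweet_scan`, `summarize`, `count`, and `backward_trace` are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Ratings the stand-in TweetScan infers per tweet (values copied from the slide).
RATINGS = {
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar And Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}
TWEETS = list(RATINGS)

def tweet_scan(tweets):
    """Map (one-many): a tweet yields one rating per movie it mentions."""
    return [pair for tweet in tweets for pair in RATINGS[tweet]]

def summarize(ratings):
    """Reduce keyed on movie: median rating per movie."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    return sorted((movie, median(rs)) for movie, rs in by_movie.items())

def count(medians):
    """Reduce keyed on median: number of movies per median rating."""
    by_median = defaultdict(int)
    for _, med in medians:
        by_median[med] += 1
    return sorted(by_median.items())

def workflow(tweets):
    return count(summarize(tweet_scan(tweets)))

def backward_trace(output, tweets):
    """Trace one output of count() back to tweets, one node at a time."""
    med_key, _ = output
    ratings = tweet_scan(tweets)
    # Through Count: medians sharing the output's reduce-key.
    movies = {movie for movie, med in summarize(ratings) if med == med_key}
    # Through Summarize: all ratings in those movies' groups.
    rating_prov = {pair for pair in ratings if pair[0] in movies}
    # Through TweetScan: tweets that produce at least one of those ratings.
    return [t for t in tweets if rating_prov & set(tweet_scan([t]))]

full = workflow(TWEETS)
provenance = backward_trace((7, 1), TWEETS)
replay = workflow(provenance)
print(full)
print(provenance)
print(replay)
```

The traced tweets include the one that mentions both movies, so the replay run also sees a Twilight rating of 7. With the other Twilight tweets absent, Twilight's median becomes 7 as well, and Count reports two movies for that median.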
Capturing and Tracing Provenance
Map functions: add the input ID to each of the output elements
Reduce functions: add the input reduce-key to each of the output elements
Tracing: straightforward recursive algorithms
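The slide only states that tracing is recursive, so the following Python sketch is one possible reading rather than the algorithm from the talk. It assumes a hypothetical provenance table `prov` that maps each derived element ID to the IDs of the elements it was directly derived from, with IDs unique across all data sets.

```python
from collections import defaultdict

def backward_trace(element, prov):
    """Return the initial-input elements that `element` was derived from."""
    sources = prov.get(element)
    if not sources:  # no recorded provenance: element is an initial input
        return {element}
    result = set()
    for src in sources:
        result |= backward_trace(src, prov)
    return result

def invert(prov):
    """Build the forward table: element ID -> IDs directly derived from it."""
    derived = defaultdict(set)
    for out, sources in prov.items():
        for src in sources:
            derived[src].add(out)
    return derived

def forward_trace(element, derived):
    """Return the final-output elements derived from `element`."""
    outs = derived.get(element)
    if not outs:  # nothing derived from it: element is a final output
        return {element}
    result = set()
    for out in outs:
        result |= forward_trace(out, derived)
    return result

# Hypothetical two-node workflow: tweets t* -> ratings r* -> medians m_*.
prov = {
    "r1": ["t1"], "r2": ["t2"], "r3": ["t3"],
    "m_avatar": ["r1", "r3"],
    "m_twilight": ["r2"],
}
print(backward_trace("m_avatar", prov))
print(forward_trace("t2", invert(prov)))
```

Because workflows are acyclic, the recursion terminates without any visited-set bookkeeping.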
RAMP System
Built as an extension to Hadoop
Supports MapReduce workflows
  Each node is a MapReduce job
Provenance capture is transparent
  Retains Hadoop’s parallel execution and fault tolerance
Users need not be aware of provenance capture
  Wrapping is automatic
  RAMP stores provenance separately from the input and output data
RAMP System: Provenance Capture
Hadoop components: record-reader, mapper, combiner (optional), reducer, record-writer
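[Figure: map-side provenance capture
  Without wrappers: Input → RecordReader → (ki, vi) → Mapper → (km, vm) → Map Output
  With wrappers: Input → wrapped RecordReader → (ki, 〈vi, p〉) → wrapped Mapper → (km, 〈vm, p〉) → Map Output
  Inside its wrapper the Mapper still sees (ki, vi) and emits (km, vm); p is the provenance attached to each input element]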
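[Figure: reduce-side provenance capture
  Without wrappers: Map Output → (km, [vm1,…,vmn]) → Reducer → (ko, vo) → RecordWriter → Output
  With wrappers: Map Output → (km, [〈vm1, p1〉,…,〈vmn, pn〉]) → wrapped Reducer → (ko, 〈vo, kmID〉) → wrapped RecordWriter → (ko, vo) → Output
  Inside its wrapper the Reducer still sees (km, [vm1,…,vmn]) and emits (ko, vo)
  Provenance written separately: (kmID, pj) and (q, kmID), where q identifies the output element]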
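The wrapping shown in the two capture figures can be imitated in plain Python. This sketch does not use the Hadoop API and is not RAMP's code; the function names, the use of list positions as IDs for p, kmID, and q, and the choice to emit both provenance tables from the reducer wrapper are all assumptions made for illustration.

```python
from collections import defaultdict

def wrapped_record_reader(records):
    """Attach an input ID p to every (ki, vi) pair: (ki, <vi, p>)."""
    for p, (ki, vi) in enumerate(records):
        yield ki, (vi, p)

def wrapped_mapper(mapper, annotated_inputs):
    """Run the unmodified mapper on (ki, vi); tag each output with p."""
    for ki, (vi, p) in annotated_inputs:
        for km, vm in mapper(ki, vi):
            yield km, (vm, p)

def wrapped_reducer(reducer, map_output):
    """Strip provenance, run the unmodified reducer, record provenance."""
    groups = defaultdict(list)
    for km, tagged in map_output:
        groups[km].append(tagged)
    output, key_to_inputs, output_to_key = [], [], []
    for km_id, (km, tagged) in enumerate(sorted(groups.items())):
        key_to_inputs += [(km_id, p) for _, p in tagged]   # (kmID, pj)
        values = [vm for vm, _ in tagged]
        for ko, vo in reducer(km, values):
            q = len(output)                                # output ID q
            output.append((ko, vo))
            output_to_key.append((q, km_id))               # (q, kmID)
    return output, key_to_inputs, output_to_key

# User code stays provenance-unaware (a word count, for illustration).
def mapper(ki, vi):
    for word in vi.split():
        yield word, 1

def reducer(km, values):
    yield km, sum(values)

records = [(0, "a b"), (1, "b")]
map_output = wrapped_mapper(mapper, wrapped_record_reader(records))
output, key_to_inputs, output_to_key = wrapped_reducer(reducer, map_output)
print(output)
print(key_to_inputs)
print(output_to_key)
```

Joining `output_to_key` with `key_to_inputs` on kmID gives the input IDs behind any output element, while `output` itself contains no provenance, matching the slide's point that provenance is stored separately from the data.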
Experiments
51 large EC2 instances (Thank you, Amazon!)
Two MapReduce “workflows”
  Wordcount: many-one with large fan-in; input sizes: 100, 300, 500 GB
  Terasort: one-one; input sizes: 93, 279, 466 GB
Results: Wordcount
Results: Terasort
Summary of Results
Overhead of provenance capture
  Terasort: 20% time overhead, 21% space overhead
  Wordcount: 76% time overhead; space overhead depends directly on fan-in
Backward tracing
  Terasort: 1.5 seconds for one element
  Wordcount: time depends directly on fan-in
Future Work
RAMP
  Selective provenance capture
  More efficient backward and forward tracing
  Indexing
General
  Incorporating SQL processing nodes
PANDA
A System for Provenance and Data
“stanford panda”