Page 1: Provenance for Generalized Map and Reduce Workflows

Robert Ikeda, Hyunjung Park, Jennifer Widom
Stanford University

Pei Zhang, Yue Lu

Page 2: Provenance

Where data came from

How it was derived, manipulated, combined, processed, …

How it has evolved over time

Uses: explanation, debugging and verification, recomputation

Page 3: The Panda Environment

Data-oriented workflows
  Graph of processing nodes
  Data sets on edges
  Statically defined; batch execution; acyclic

[Figure: workflow graph with inputs I1, …, In and output O]

Page 4: Provenance

Backward tracing
  Find the input subsets that contributed to a given output element

Forward tracing
  Determine which output elements were derived from a particular input element

[Figure: example data sets TwitterPosts and MovieSentiments]
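The two tracing directions can be illustrated with a minimal sketch for a single transformation. It assumes provenance is already available as a mapping from each output element to the input elements it was derived from; the element names (`post_1`, `sentiment_a`, and so on) are hypothetical placeholders, not data from the talk.

```python
from collections import defaultdict

# Hypothetical per-element provenance for one transformation:
# each output element maps to the set of input elements it was derived from.
provenance = {
    "sentiment_a": {"post_1", "post_3"},
    "sentiment_b": {"post_2"},
}


def backward_trace(output_element):
    """Return the input elements that contributed to output_element."""
    return set(provenance.get(output_element, set()))


def forward_trace(input_element):
    """Return the output elements derived from input_element."""
    # Invert the output -> inputs mapping into an input -> outputs mapping.
    inverted = defaultdict(set)
    for out, inputs in provenance.items():
        for inp in inputs:
            inverted[inp].add(out)
    return set(inverted.get(input_element, set()))
```

Backward tracing is a direct lookup, while forward tracing needs the inverted mapping; a real system would likely store or index both directions rather than invert on every call.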

Page 5: Provenance

Basic idea
  Capture provenance one node at a time (lazy or eager)
  Use it for backward and forward tracing
  Handle processing nodes of all types

Page 6: Generalized Map and Reduce Workflows

What if every node were a Map or Reduce function?

Provenance easier to define, capture, and exploit than in the general case

Transparent provenance capture in Hadoop
  Doesn’t interfere with parallelism or fault tolerance

[Figure: workflow built from Map (M) and Reduce (R) nodes]

Page 7: Remainder of Talk

Defining Map and Reduce provenance

Recursive workflow provenance

Capturing and tracing provenance

System description and performance

Future work

Page 8: Remainder of Talk

Defining Map and Reduce provenance

Recursive workflow provenance

Capturing and tracing provenance

System description and performance

Future work

Callouts: Surprising theoretical result; Implementation details

Page 9: Example

[Figure: example workflow. Data sets: Tweets, Diggs, TM, DM, AM, GoodMovies, BadMovies. Transformations: TweetScan, DiggScan, Aggregate, Filter.]

Page 10: Transformation Properties

Deterministic functions

Multiplicity for Map functions

Multiplicity for Reduce functions

Monotonicity

Page 11: Map and Reduce Provenance

Map functions
  $M(I) = \bigcup_{i \in I} M(\{i\})$
  Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$

Reduce functions
  $R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \ldots, I_n$ partition $I$ on reduce-key
  Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$
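The two definitions above translate almost literally into code. The following is a minimal sketch that recomputes provenance by re-running the function on single elements (for Map) or on single reduce-key groups (for Reduce); the helper names and the `tokenize` and `count_words` functions are hypothetical illustrations, and a practical system would record provenance during execution instead of recomputing it.

```python
from collections import defaultdict


def map_provenance(map_fn, inputs, o):
    """Inputs i such that o is in M({i})."""
    return [i for i in inputs if o in map_fn([i])]


def reduce_provenance(reduce_fn, inputs, key_of, o):
    """Groups I_k (the partition of inputs on reduce-key) such that o is in R(I_k)."""
    groups = defaultdict(list)
    for i in inputs:
        groups[key_of(i)].append(i)
    return [group for group in groups.values() if o in reduce_fn(group)]


# Hypothetical Map function: splits each line into words.
def tokenize(lines):
    return [word for line in lines for word in line.split()]


# Hypothetical Reduce function: counts occurrences of each word.
def count_words(words):
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return list(counts.items())
```

For example, `map_provenance(tokenize, ["a b", "b c"], "b")` selects both lines, and `reduce_provenance(count_words, ["a", "b", "b"], lambda w: w, ("b", 2))` selects the single group containing the two occurrences of `"b"`.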

Page 12: Workflow Provenance

Intuitive recursive definition

Desirable “replay” property
  $o \in W(I^*_1, \ldots, I^*_n)$
  Usually holds, but not always

[Figure: workflow $W$ with inputs $I_1, \ldots, I_n$, intermediate data sets $E_1, E_2, \ldots$, and output $O$; an output element $o \in O$ is traced back to input subsets $I^*_1, \ldots, I^*_n$]

Page 13: Replay Property Example

Workflow: TwitterPosts -> TweetScan (M) -> Inferred Movie Ratings -> Summarize (R) -> Rating Medians -> Count (R) -> #Movies Per Rating

TwitterPosts:
  “Avatar was great”
  “I hated Twilight”
  “Twilight was pretty bad”
  “I enjoyed Avatar”
  “I loved Twilight”
  “Avatar was okay”

Inferred Movie Ratings:
  Movie     Rating
  Avatar    8
  Twilight  0
  Twilight  2
  Avatar    7
  Twilight  7
  Avatar    4

Rating Medians:
  Movie     Median
  Avatar    7
  Twilight  2

#Movies Per Rating:
  Median  #Movies
  2       1
  7       1

Page 15: Replay Property Example

Workflow: TwitterPosts -> TweetScan (M) -> Inferred Movie Ratings -> Summarize (R) -> Rating Medians -> Count (R) -> #Movies Per Rating

TwitterPosts:
  “Avatar was great”
  “I hated Twilight”
  “Twilight was pretty bad”
  “I enjoyed Avatar and Twilight too”
  “Avatar was okay”

Inferred Movie Ratings:
  Movie     Rating
  Avatar    8
  Twilight  0
  Twilight  2
  Avatar    7
  Twilight  7
  Avatar    4

Rating Medians:
  Movie     Median
  Avatar    7
  Twilight  2

#Movies Per Rating:
  Median  #Movies
  2       1
  7       1

Page 17: Replay Property Example

Workflow: TwitterPosts -> TweetScan (M) -> Inferred Movie Ratings -> Summarize (R) -> Rating Medians -> Count (R) -> #Movies Per Rating

TwitterPosts:
  “Avatar was great”
  “I hated Twilight”
  “Twilight was pretty bad”
  “I enjoyed Avatar and Twilight too”
  “Avatar was okay”

Inferred Movie Ratings:
  Movie     Rating
  Avatar    8
  Twilight  0
  Twilight  2
  Avatar    7
  Twilight  7
  Avatar    4

Rating Medians:
  Movie     Median
  Avatar    7
  Twilight  2

#Movies Per Rating:
  Median  #Movies
  2       1
  7       1

Replaying the workflow on the provenance of (7, 1) yields (7, 2) instead.

Annotations: One-Many Function (TweetScan); Nonmonotonic Reduce
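The failure of the replay property in this example can be reproduced with a short sketch. The sentiment inference inside TweetScan is not specified in the slides, so the code below replaces it with a hypothetical lookup table (`RATINGS`) that simply returns the ratings shown in the tables; the function names mirror the node names but are otherwise assumptions.

```python
from collections import defaultdict
from statistics import median_low

# Hypothetical stand-in for TweetScan's sentiment inference: a lookup table
# that reproduces the ratings shown on the slide.
RATINGS = {
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar and Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}


def tweet_scan(posts):
    """Map (one-many): a single post may yield several (movie, rating) pairs."""
    return [pair for post in posts for pair in RATINGS[post]]


def summarize(ratings):
    """Reduce on movie: median rating per movie."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    return [(movie, median_low(rs)) for movie, rs in by_movie.items()]


def count(medians):
    """Reduce on median: number of movies per median rating."""
    by_median = defaultdict(int)
    for _, med in medians:
        by_median[med] += 1
    return sorted(by_median.items())


def workflow(posts):
    return count(summarize(tweet_scan(posts)))


posts = list(RATINGS)
full_output = workflow(posts)  # [(2, 1), (7, 1)]

# Backward-trace the output element (7, 1): the movies whose median is 7,
# then every post that produced a rating for one of those movies.
target = (7, 1)
movies = {m for m, med in summarize(tweet_scan(posts)) if med == target[0]}
traced_posts = [p for p in posts if any(m in movies for m, _ in RATINGS[p])]

# Replay the workflow on the traced posts only.
replayed = workflow(traced_posts)  # [(7, 2)]
```

The traced posts include “I enjoyed Avatar and Twilight too”, which also emits a Twilight rating of 7. On replay, Twilight's only rating is 7, so its median becomes 7 and the count for median 7 rises to 2, which means `target` no longer appears in `replayed`.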

Page 18: Capturing and Tracing Provenance

Map functions
  Add the input ID to each of the output elements

Reduce functions
  Add the input reduce-key to each of the output elements

Tracing
  Straightforward recursive algorithms
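A minimal sketch of this capture scheme, for one Map function followed by one Reduce function, might look as follows. All function names and the word-count example are hypothetical; the sketch only illustrates the idea of tagging Map outputs with an input ID and Reduce outputs with a reduce-key, then following those tags in either direction. A longer workflow would repeat the same step node by node.

```python
def run_map_with_provenance(map_fn, inputs):
    """inputs: dict of input ID -> element. Returns (output, input ID) pairs."""
    return [(out, in_id) for in_id, elem in inputs.items() for out in map_fn(elem)]


def run_reduce_with_provenance(reduce_fn, key_of, elements):
    """Groups elements on reduce-key. Returns (output, reduce-key) pairs."""
    groups = {}
    for elem in elements:
        groups.setdefault(key_of(elem), []).append(elem)
    return [(out, key) for key, group in groups.items() for out in reduce_fn(key, group)]


def backward_trace(output, reduced, mapped, key_of):
    """Final output -> reduce-keys -> Map outputs with those keys -> input IDs."""
    keys = {key for out, key in reduced if out == output}
    return {in_id for out, in_id in mapped if key_of(out) in keys}


def forward_trace(in_id, reduced, mapped, key_of):
    """Input ID -> Map outputs -> their reduce-keys -> final outputs."""
    keys = {key_of(out) for out, source in mapped if source == in_id}
    return {out for out, key in reduced if key in keys}


# Hypothetical word-count example.
def split_words(line):
    return [(word, 1) for word in line.split()]


def word_key(pair):
    return pair[0]


def sum_counts(key, group):
    return [(key, sum(n for _, n in group))]


lines = {0: "a b", 1: "b c"}
mapped = run_map_with_provenance(split_words, lines)
reduced = run_reduce_with_provenance(sum_counts, word_key, [out for out, _ in mapped])
```

Here `backward_trace(("b", 2), reduced, mapped, word_key)` returns the IDs of both lines, and `forward_trace(0, reduced, mapped, word_key)` returns the counts for `"a"` and `"b"`.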

Page 19: RAMP System

Built as an extension to Hadoop

Supports MapReduce workflows
  Each node is a MapReduce job

Provenance capture is transparent
  Retains Hadoop’s parallel execution and fault tolerance

Users need not be aware of provenance capture
  Wrapping is automatic
  RAMP stores provenance separately from the input and output data

Page 20: RAMP System: Provenance Capture

Hadoop components
  Record-reader
  Mapper
  Combiner (optional)
  Reducer
  Record-writer

Page 21: RAMP System: Provenance Capture

[Figure: map-side provenance capture.
  Without wrappers: Input -> RecordReader -> (ki, vi) -> Mapper -> (km, vm) -> Map Output.
  With wrappers: the RecordReader wrapper turns (ki, vi) into (ki, 〈vi, p〉); the Mapper wrapper passes (ki, vi) to the Mapper and turns its output (km, vm) into (km, 〈vm, p〉).]
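The map-side wrapping in the figure can be mimicked in plain Python, without any Hadoop API. This is only a sketch of the data flow: the generator names are hypothetical, and the format of `p` (a source name plus record key) is an assumption made for illustration.

```python
def record_reader(lines):
    """Plain reader: yields (ki, vi) pairs, here (line number, line text)."""
    for ki, vi in enumerate(lines):
        yield ki, vi


def wrapped_record_reader(lines, source="input"):
    """Reader wrapper: yields (ki, (vi, p)), where p identifies the input record."""
    for ki, vi in record_reader(lines):
        p = (source, ki)  # assumed ID format, chosen for this sketch only
        yield ki, (vi, p)


def mapper(ki, vi):
    """User mapper, unaware of provenance: emits (km, vm) pairs."""
    for word in vi.split():
        yield word, 1


def wrapped_mapper(ki, wrapped_value):
    """Mapper wrapper: strips p, runs the user mapper, re-attaches p to each output."""
    vi, p = wrapped_value
    for km, vm in mapper(ki, vi):
        yield km, (vm, p)


def run_map(lines):
    """Drive the wrapped reader and mapper and collect the map output."""
    return [
        pair
        for ki, wrapped_value in wrapped_record_reader(lines)
        for pair in wrapped_mapper(ki, wrapped_value)
    ]
```

The user-supplied `mapper` only ever sees `(ki, vi)`, which is the sense in which capture is transparent: the wrappers add and remove `p` around it.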

Page 22: RAMP System: Provenance Capture

[Figure: reduce-side provenance capture.
  Without wrappers: Map Output -> (km, [vm1,…,vmn]) -> Reducer -> (ko, vo) -> RecordWriter -> Output.
  With wrappers: the Reducer wrapper receives (km, [〈vm1, p1〉,…,〈vmn, pn〉]), passes (km, [vm1,…,vmn]) to the Reducer, and turns its output (ko, vo) into (ko, 〈vo, kmID〉); the RecordWriter wrapper writes (ko, vo) to Output.
  Provenance stored: (kmID, pj) and (q, kmID).]
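The reduce-side flow can be sketched the same way. The code below is a plain-Python illustration, not Hadoop code: the function names are hypothetical, `kmID` is assumed to be a running index over reduce-keys, and `q` is assumed to be the position of the record in the output.

```python
def reducer(km, values):
    """User reducer, unaware of provenance: emits (ko, vo) pairs."""
    yield km, sum(values)


def wrapped_reducer(km, wrapped_values, km_id, key_inputs):
    """Reducer wrapper: strips each p_j, records (kmID, p_j), tags outputs with kmID."""
    values = []
    for vm, p in wrapped_values:
        values.append(vm)
        key_inputs.append((km_id, p))
    for ko, vo in reducer(km, values):
        yield ko, (vo, km_id)


def wrapped_record_writer(wrapped_outputs, output, output_keys):
    """Writer wrapper: writes (ko, vo) to the output and records (q, kmID)."""
    for ko, (vo, km_id) in wrapped_outputs:
        q = len(output)  # assumed output ID: position in the output list
        output.append((ko, vo))
        output_keys.append((q, km_id))


def run_reduce(grouped):
    """grouped: dict of km -> list of (vm, p). Returns output and both provenance lists."""
    output, key_inputs, output_keys = [], [], []
    for km_id, (km, wrapped_values) in enumerate(grouped.items()):
        wrapped_outputs = wrapped_reducer(km, wrapped_values, km_id, key_inputs)
        wrapped_record_writer(wrapped_outputs, output, output_keys)
    return output, key_inputs, output_keys
```

Backward tracing an output record `q` then amounts to two lookups: `q` to `kmID` in `output_keys`, and `kmID` to the `p` values in `key_inputs`. The output itself contains only `(ko, vo)` pairs, consistent with storing provenance separately from the data.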

Page 23: Experiments

51 large EC2 instances (Thank you, Amazon!)

Two MapReduce “workflows”
  Wordcount
    Many-one with large fan-in
    Input sizes: 100, 300, 500 GB
  Terasort
    One-one
    Input sizes: 93, 279, 466 GB

Page 24: Results: Wordcount

Page 25: Results: Terasort

Page 26: Summary of Results

Overhead of provenance capture
  Terasort: 20% time overhead, 21% space overhead
  Wordcount: 76% time overhead; space overhead depends directly on fan-in

Backward tracing
  Terasort: 1.5 seconds for one element
  Wordcount: time directly dependent on fan-in

Page 27: Future Work

RAMP
  Selective provenance capture
  More efficient backward and forward tracing
  Indexing

General
  Incorporating SQL processing nodes

Page 28: PANDA: A System for Provenance and Data

“stanford panda”