CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11....

18
CS 744: Big Data Systems Shivaram Venkataraman Fall 2018

Transcript of CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11....

Page 1: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

CS 744: Big Data Systems

Shivaram Venkataraman Fall 2018

Page 2: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

ADMINISTRIVIA

-  Assignment 2 grades up, Midterm grades this week -  Course Projects: round 2 meetings (Sign up!) -  Next Tuesday: Guest speaker for first part

Page 3: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

DATAFLOW MODEL (?)

Page 4: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

MOTIVATION

Streaming Video Provider - How much to bill each advertiser ? - Need per-user, per-video viewing sessions - Handle out of order data

Goals

- Easy to program - Balance correctness, latency and cost

Page 5: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

APPROACH

API Design - Separate user-facing model from execution - Decompose queries into - What is being computed - Where in time is it computed - When is it materialized - How does it relate to earlier results

Page 6: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

TERMINOLOGY

Unbounded/bounded data Streaming/Batch execution e.g., Flink vs. Spark Streaming

Timestamps

Event time: Processing time:

Window types

Page 7: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

WATERMARK or SKEW

System has processed all events up to 12:02:30

Page 8: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

API

ParDo: Parallel Do (very similar to map or flatMap) GroupByKey: Group values for a key Windowing

AssignWindow – Bucket tuple as it arrives

MergeWindow – Merge buckets based on grouping strategy

Page 9: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

EXAMPLE

GroupByKey

Page 10: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

TRIGGERS AND INCREMENTAL PROCESSING

Windowing: where in event time data are grouped Triggering: when in processing time groups are emitted Strategies

Discarding Accumulating Accumulating & Retracting

Details with running example

Page 11: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

RUNNING EXAMPLE PCollection<KV<String,Integer>>input=IO.read(...);PCollection<KV<String,Integer>>output=

input.apply(Sum.integersPerKey());

Page 12: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

GLOBAL WINDOWS, ACCUMULATE PCollection<KV<String,Integer>>output=input

.apply(Window.trigger(Repeat(AtPeriod(1,MINUTE))) .accumulating()).apply(Sum.integersPerKey());

Page 13: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

GLOBAL WINDOWS, COUNT, DISCARDING PCollection<KV<String,Integer>>output=input

.apply(Window.trigger(Repeat(AtCount(2))) .discarding()).apply(Sum.integersPerKey());

Page 14: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

FiXED WINDOWS, MICRO BATCH PCollection<KV<String,Integer>>output=input

.apply(Window.into(FixedWindows.of(2,MINUTES)) .trigger(Repeat(AtWatermark()))) .accumulating())

Page 15: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

FIXED WINDOWS, STREAMING PCollection<KV<String,Integer>>output=input

.apply(Window.into(FixedWindows.of(2,MINUTES)) .trigger(Repeat(AtWatermark()))) .accumulating())

Option to repeat at processing time intervals

Page 16: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

SESSIONS, RETRACTING

output=input.apply(Window.into(Sessions.gap(1,Min)).trigger(SequenceOf(RepeatUntil(AtPeriod(1,MINUTE),AtWatermark()),Repeat(AtWatermark()))).accumulatingAndRetracting()).apply(Sum.integersPerKey());

Page 17: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

LESSONS / EXPERIENCES

-  Don’t rely on completeness

-  Be flexible, diverse use cases -  Billing -  Recommendation -  Anomaly detection

-  Support analysis in context of events

Page 18: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-dataflow.pdf · 2018. 11. 8. · CS 744: Big Data Systems Shivaram Venkataraman Fall 2018 . ADMINISTRIVIA ...

SUMMARY

- Model of streaming window triggers, processing -  Separate user-level view from implementation

-  Motivated by real-world use cases, Cloud Dataflow SDK