Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for...

56
1 © DIMA 2018 Scaling Stream Processing Out and Up Tilmann Rabl www.dima.tu-berlin.de | bbdc.berlin | [email protected] International Workshop on Web Data Processing & Reasoning (WDPAR 2018)

Transcript of Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for...

Page 1: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

1 © DIMA 2018© 2013 Berlin Big Data Center • All Rights Reserved1 © DIMA 2018

Scaling Stream Processing Out and Up

Tilmann Rablwww.dima.tu-berlin.de | bbdc.berlin | [email protected]

International Workshop on Web Data Processing & Reasoning (WDPAR 2018)

Page 2: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

2 © DIMA 20182

2 © DIMA 2018

Big Fast Data• Data is growing and can be evaluated

– Tweets, social networks (statuses, check-ins, shared content), blogs, click streams, various logs, …

– Facebook: > 845M active users, > 8B messages/day

– Twitter: > 140M active users, > 340M tweets/day

• Everyone is interested!

Image: Michael Carey

Page 3: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

3 © DIMA 20183

3 © DIMA 2018

But there is so much more…• Autonomous Driving

– Requires rich navigation info– Rich data sensor readings– 1GB data per minute per car (all sensors)1

• Traffic Monitoring– High event rates: millions events / sec– High query rates: thousands queries / sec– Queries: filtering, notifications, analytical

• Pre-processing of sensor data– CERN experiments generate ~1PB of measurements per second.– Unfeasible to store or process directly, fast preprocessing is a must.

1Cobb: http://www.hybridcars.com/tech-experts-put-the-brakes-on-autonomous-cars/

Source: http://theroadtochangeindia.wordpress.com/2011/01/13/better-roads/

Page 4: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

4 © DIMA 20184

4 © DIMA 2018

Stream Processing

Interesting streams– Many different queries– Continuous results

Stream ProcessorData Stream Result Stream

Page 5: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

5 © DIMA 20185

5 © DIMA 2018

Why is this hard?

Tension between performance and algorithmic expressiveness

Image: Peter Pietzuch

Page 6: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

6 © DIMA 20186

6 © DIMA 2018

AgendaIntroduction to Streams• Stream processing 101• Efficient aggregation

Scale-Out Stream Processing Systems• Ingredients of a stream processing system• More details on Flink

Scale-Up Stream Processing• New hardware

With slides from Data Artisans, Volker Markl, and Sebastian Bress

Page 7: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

7 © DIMA 20187 © DIMA 2018

Stream Processing 101

Based on the Data Flow Model

Page 8: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

8 © DIMA 20188

8 © DIMA 2018

What is a Stream?• Unbounded data

– Conceptually infinite, ever growing set of data items / events– Practically continuous stream of data, which needs to be processed / analyzed

• Push model– Data production and procession is controlled by the source– Publish / subscribe model

• Concept of time– Often need to reason about when data is produced and when processed data should be

output– Time agnostic, processing time, ingestion time, event time

This part is largely based on Tyler Akidau‘s great blog on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Page 9: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

9 © DIMA 20189

9 © DIMA 2018

Event Time

• Event time– Data item production time

• Ingestion time– System time when data item is received

• Processing time– System time when data item is processed

• Typically, these do not match!• In practice, streams are unordered!

Event Time

Proc

essin

g Ti

me

Ideal

RealSkew

Page 10: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

10 © DIMA 201810

10 © DIMA 2018

Windows• Fixed

– Also tumbling• Sliding

– Also hopping• Session

– Based on activity

• Triggered by – Event time, processing time, count, watermark

• Eviction policy– Window width / size

Page 11: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

11 © DIMA 201811

11 © DIMA 2018

Processing Time Windows

• System waits for x time units– System decides on stream partitioning– Simple, easy to implement– Ignores any time information in the stream -> any aggregation can be arbitrary

• Similar: Counting Windows

Image: Tyler Akidau

Page 12: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

12 © DIMA 201812

12 © DIMA 2018

Event Time Windows

• Windows based on the time information in stream– Adheres to stream semantic– Correct calculations – Buffering required, potentially unordered (more on this later)

Images: Tyler Akidau

Page 13: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

13 © DIMA 201813

13 © DIMA 2018

• Windowed Aggregation– E.g., average speed– Sum of URL accesses– Daily highscore

• Windowed Join– Correlated observations in timeframe– E.g., temperature in time

Aggregate

Basic Stream Operators

91210

Page 14: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

14 © DIMA 201814

14 © DIMA 2018

Efficient Window AggregationStream processing on overlapping windowsAggregate computation is redundant Partial aggregates can be shared Challenge: session windows, user defined windows, out of order tuples

7

Event Time

5 3 6 5 8 8 2 4 3 4 6 0Raw Stream

Windows

1215

Partial Aggregates

ValueSlices

1116

187

11 15

Aggregate Sharing

18 11 6+ + = 35Final Aggregation:

Page 15: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

15 © DIMA 201815

15 © DIMA 2018

Session Window Observations

Windows with different gaps share partial aggregatesSession windows can share aggregates with sliding and tumbling windowsSlice on session and gap is equivalent to session sliceSlicing depends on session window with smallest gap

Stream Slicing Example:Concurrent Session Windowswith gaps 3,5,6, and 7

Page 16: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

16 © DIMA 201816

16 © DIMA 2018

Generalized Stream Slicing*

Stream Slicer for non overlapping slicesSlice Manager for slice updates (out of order tuples) and window bordersAggregate Store computes and stores partial aggregates (eager and lazy)Window Manager combines aggregates and outputs windows

* Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez Cuellar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl. ICDE 2018.

Page 17: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

17 © DIMA 201817

17 © DIMA 2018

Out-of-Order Tuple Processing • Slice Manager keeps minimum

number of slices for out-of-order tuples

• Out-of-order tuple lead to updates• Sufficient to store one partial

aggregate per slice• Reduced memory footprint

Page 18: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

18 © DIMA 201818 © DIMA 2018

Stream Processing Systems

What makes a system a stream processing system?

Page 19: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

19 © DIMA 201819

19 © DIMA 2018

8 Requirements of Big Streaming• Keep the data moving

– Streaming architecture

• Declarative access– E.g. StreamSQL, CQL

• Handle imperfections– Late, missing, unordered items

• Predictable outcomes– Consistency, event time

• Integrate stored and streaming data– Hybrid stream and batch

• Data safety and availability– Fault tolerance, durable state

• Automatic partitioning and scaling– Distributed processing

• Instantaneous processing and response

The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005

Page 20: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

20 © DIMA 201820

20 © DIMA 2018

8 Requirements of Big Streaming• Keep the data moving

– Streaming architecture

• Declarative access– E.g. StreamSQL, CQL

• Handle imperfections– Late, missing, unordered items

• Predictable outcomes– Consistency, event time

• Integrate stored and streaming data– Hybrid stream and batch

• Data safety and availability– Fault tolerance, durable state

• Automatic partitioning and scaling– Distributed processing

• Instantaneous processing and response

The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005

Page 21: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

21 © DIMA 201821

21 © DIMA 2018

Big Data Processing• Databases can process very large data since forever (see VLDB)

– Why not use those?

• Big data is not (fully) structured – No good for database

• We want to learn more from data than just– Select, project, join

• First solution: MapReduce

Page 22: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

22 © DIMA 201822

22 © DIMA 2018

How to keep data moving?

Streamdiscretizer

Job Job Job Jobwhile (true) {// get next few records// issue batch computation

}

while (true) {// process next record

}

Long-standing operators

Discretized Streams (mini-batch)

Native streaming

Page 23: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

23 © DIMA 201823

23 © DIMA 2018

Discussion of Mini-Batch• Easy to implement• Easy consistency and fault-tolerance• Hard to do event time and sessions

Image: Tyler Akidau

Page 24: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

24 © DIMA 201824

24 © DIMA 2018

True Streaming Architecture

• Program = DAG* of operators and intermediate streams

• Operator = computation + state• Intermediate streams = logical stream of

records

• Stream transformations• Basic transformations: Map, Reduce, Filter,

Aggregations…• Binary stream transformations: CoMap, CoReduce…• Windowing semantics: Policy based flexible windowing

(Time, Count, Delta…)• Temporal binary stream operators: Joins, Crosses…• Native support for iterations

Page 25: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

25 © DIMA 201825

25 © DIMA 2018

Handle Imperfections – Watermarks• Data items arrive early, on-time, or late• Solution: Watermarks

– Perfect or heuristic measure on when window is complete

Image: Tyler Akidau

Page 26: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

26 © DIMA 201826

26 © DIMA 2018

Handle Imperfections – Watermarks• Data items arrive early, on-time, or late• Solution: Watermarks

– Perfect or heuristic measure on when window is complete

Image: Tyler Akidau

Image: Tyler Akidau

Page 27: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

27 © DIMA 201827

27 © DIMA 2018

Data Safety and Availability

• Ensure that operators see all events– “At least once”– Solved by replaying a stream from a checkpoint– No good for correct results

• Ensure that operators do not perform duplicate updates to their state– “Exactly once”– Several solutions

• Ensure the job can survive failure

27

Page 28: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

28 © DIMA 201828

28 © DIMA 2018

Lessons Learned from Batch

• If a batch computation fails, simply repeat computation as a transaction• Transaction rate is constant• Can we apply these principles to a true streaming execution?

batch-1batch-2

Page 29: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

29 © DIMA 201829

29 © DIMA 2018

Taking Snapshots – the naïve way

Initial approach (e.g., Naiad)• Pause execution on t1,t2,..• Collect state• Restore execution

t2t1

execution snapshots

Page 30: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

30 © DIMA 201830

30 © DIMA 2018

Asynchronous Snapshots in Flinkt2t1

snap - t1 snap - t2

snapshotting snapshotting

Propagating markers/barriers

[Carbone et. al. 2015] “Lightweight Asynchronous Snapshots for Distributed Dataflows”, Tech. Report. http://arxiv.org/abs/1506.08603

Page 31: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

31 © DIMA 201831

31 © DIMA 2018

Automatic partitioning and scaling• 3 Types of Parallelization

• Big streaming systems should support all three

Page 32: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

32 © DIMA 201832 © DIMA 2018

Apache Flink–A Success Story created in Berlin

Page 33: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

33 © DIMA 2018

• Relational Algebra• Declarativity• Query Optimization• Robust Out-of-core

• Scalability• User-defined

Functions • Complex Data Types• Schema on Read

• Iterations• Advanced Dataflows• General APIs• Native Streaming

33

Draws onDatabase Technology

Draws onMapReduce Technology

Adds

Stratosphere: General Purpose Programming + Database Execution

Page 34: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

34 © DIMA 201834

34 © DIMA 2018

Timeline

Page 35: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

35 © DIMA 201835

35 © DIMA 2018

What is Apache Flink?

Apache Flink is an open source platform for scalable batch and stream data processing.

http://flink.apache.org

• The core of Flink is a distributed streaming dataflow engine.

• Executing dataflows in parallel on clusters

• Providing a reliable foundation for various workloads

• DataSet and DataStream programming abstractions are the foundation for user programs and higher layers

Page 36: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

36 © DIMA 201836 © 2013 Berlin Big Data Center • All Rights Reserved

36 © DIMA 2018

What can I do with it?

A big data processing system that can natively support all these workloads.

Flink

Stream processing

Batchprocessing

Machine Learning at scale

Graph Analysis

Page 37: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

37 © DIMA 201837 © 2013 Berlin Big Data Center • All Rights Reserved

37 © DIMA 2018

Big Data Analytics Ecosystem

37

MapReduce

Hive

Flink

Spark Storm

Yarn Mesos

HDFS

Mahout

Cascading

Tez

Pig

Data processing engines

App and resource management

Applications &Languages

Storage, streams KafkaHBase

Crunch

Giraph

Page 38: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

38 © DIMA 201838

38 © DIMA 2018

Architecture• Hybrid MapReduce and MPP database runtime

• Pipelined/Streaming engine– Complete DAG deployed

Worker 1

Worker 3 Worker 4

Worker 2

Job Manager

Page 39: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

39 © DIMA 201839

39 © DIMA 2018

Sneak peak: Two of Flink’s APIs

39

case class Word (word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap {line => line.split(" ").map(word => Word(word,1))}

.keyBy("word")

.window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))

.sum("frequency”)

.print()

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap {line => line.split(" ").map(word => Word(word,1))}

.groupBy("word").sum("frequency")

.print()

DataSet API (batch):

DataStream API (streaming):

Page 40: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

40 © DIMA 201840 © 2013 Berlin Big Data Center • All Rights Reserved

40 © DIMA 2018

Yahoo! Benchmark ResultsPerformed by Yahoo! Engineering, Dec 16, 2015

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and Flink 0.10.1 show sub- second latencies at relatively high

throughputs[..]. Spark streaming 1.5.1 supports high throughputs, but at a relatively higher latency.

Flink achieves highest throughput with competitive low latency!

Source: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Page 41: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

41 © DIMA 201841 © 2013 Berlin Big Data Center • All Rights Reserved

41 © DIMA 2018

Our benchmarks*

Streaming

Windowed Aggregations

* Benchmarking Distributed Stream Data Processing Systems. Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. ICDE 2018

Page 42: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

42 © DIMA 201842 © DIMA 2018

Stream Processing on Modern Hardware

Page 43: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

43 © DIMA 201843

43 © DIMA 2018

Modern Hardware

Non-Volatile MemoryMulti-Core CPUs Fast Networks

Page 44: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

44 © DIMA 201844

44 © DIMA 2018

Scale Out vs. Scale Up Stream Processing

Scale Up SystemsScale Out Systems

Scale-Up: Operate a small cluster of nodes, keep all data in distributed main memory

Page 45: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

45 © DIMA 201845

45 © DIMA 2018

Modern Multi-Core CPUs• High Parallelism:

– Multiple cores (task parallelism): Multiple threads can perform different tasks at the same time

– Vector units (data parallelism): The same instruction is performed on multiple data items at once

• High Memory Bandwidth:– Aggregated memory bandwidth of 51.2GB/s per CPU (DDR3-

1600 memory with four channels, 12.8GB/s per channel)– Multiple processors are organized in NUMA (Non-Uniform

Memory Access) architecture– Cache coherent memory across all CPUs

Page 46: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

46 © DIMA 201846

46 © DIMA 2018

Modern Multi-Core CPUsTwo principle resource limitations:• Computation Bound:

– Executing many instructions per input tuple– Performing many function calls– Encountering many branch mispredictions

• Memory Bound:– Bound by Memory Latency:

• Random Memory Accesses (e.g., hash table operations)– Bound by Memory Bandwidth:

• Executing few instructions per input tuple• Reading input tuples sequentially with maximal memory speed

Page 47: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

47 © DIMA 201847

47 © DIMA 2018

Fast Networks• Infiniband:

– A new generation network protocol, native support for RDMA– Very high bandwidth (currently ~100Gbit per port)– Very small access latency to memory of remote machine

(~1 microsecond for InfiniBand FDR 4x)

• RDMA (Remote Direct Memory Access):– Network adapter can directly read or write to application memory of remote machine→ Avoids the overhead of copying data into OS buffers→Can access remote memory without consuming any CPU time in the remote machine

Page 48: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

48 © DIMA 201848

48 © DIMA 2018

Bandwidth of Different Network Technologies

Source: Following Binning et al. The End of Slow Networks: It’s Time for a Redesign. VLDB 2016.

New network technologies have similar bandwidth as main memory!

Page 49: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

49 © DIMA 201849

49 © DIMA 2018

Infiniband Future

Bandwidth of networks is going to be even larger than memory bandwidth

New streaming systems need to process streams with memory bandwidth to keep up

Page 50: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

50 © DIMA 201850

50 © DIMA 2018

Scale Up vs. Scale Out Stream Processing

Current streaming systems cannot saturate memory bandwidth, but hand optimized

implementations can!

Page 51: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

51 © DIMA 201851

51 © DIMA 2018

Non-Volatile Memory• Also called Storage Class Memory (SCM)

• Blurs the distinction between– Memory (= fast, expensive, volatile )– Storage (= slow, cheap, non-volatile)

• Byte-addressable; accessing NVRAM is similar to accessing DRAM

• Latencies are within the same order of magnitude as DRAM

• 10x higher density than DRAM, allows to keep more data (state) in-memory

Page 52: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

52 © DIMA 201852

52 © DIMA 2018

Non-Volatile Memory: Use Cases• Accelerate Checkpointing

– Use NVRAM to store checkpoints– Reduces checkpointing overhead during run-time– Accelerates starting time when a node comes up again

• New system architectures:– Keep all data in NVRAM, no redo recovery needed!– Very fast startup times compared to checkpointing-based systems– Cache frequently accessed data in RAM for fast access

Page 53: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

53 © DIMA 201853

53 © DIMA 2018

Non-Volatile Memory: Challenges• Any point crash recovery: byte-addressable persistency makes any write to

memory persistent→ System may crash at any time and writes (log file) may be incomplete→ Classic recovery techniques assume block-wise atomic writes for blocks on disk

• Hole detection: when a transaction just allocates chunks in NVRAM but has not written anything yet, there can be empty log records (holes) in the NVRAM log space

• Partial write detection: detect during recovery that transaction has not fully finished writing log data to NVRAM

Page 54: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

54 © DIMA 201854

54 © DIMA 2018

Towards Scale Up Streaming SystemsModern hardware allows us to built even faster streaming systems:

• Scale-Up architecture: operate a small cluster of nodes, which can keep all data and state in main memory

• Fast Networks: offer low latency and high bandwidth communication between nodes

• Reduced Logging Overhead: checkpoint application data in NVRAM

Page 55: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

55 © DIMA 201855

55 © DIMA 2018

ConclusionIntroduction to Streams• How to do real streaming

Stream Processing Systems• Ingredients of a stream processing system• Flink

Streaming on Modern Hardware• How to optimize

Future Work• Edge and fog• Geodistribution

Page 56: Scaling Stream Processing Out and Upgroppe/wdpar/... · * Scotty: Efficient Window Aggregation for out-of-order Stream Processing. Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez

56 © DIMA 2018© 2013 Berlin Big Data Center • All Rights Reserved56 © DIMA 2018

Thank You

Contact:Tilmann [email protected] We are hiring!