Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing...

31
Data Management in Large-Scale Distributed Systems Stream processing Thomas Ropars [email protected] http://tropars.github.io/ 2019 1

Transcript of Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing...

Page 1: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Data Management in Large-Scale DistributedSystems

Stream processing

Thomas Ropars

[email protected]

http://tropars.github.io/

2019

1

Page 2: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

References

• The lecture notes of V. Leroy

• Designing Data-Intensive Applications by Martin KleppmannI Chapter 11

2

Page 3: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

In this lecture

• Introduction to Stream Processing

• Transmitting data streams

• An overview of Stream Processing engines

3

Page 4: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Agenda

Introduction

Transmitting event streams: Message brokers

Stream processing

The lambda architecture

4

Page 5: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

An unbounded dataset

Batch processing

• Data are stored in files

• Process the whole data at once

• Examples: Hadoop MapReduce, Spark, etc.

In many use-cases, new data are generated continuously

• Data from sensors

• Data from social networks

• Web traffic

• Etc.

5

Page 6: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Near real-time processing

In many cases, data should be processed as early as possible:

• Detecting fraudulent behavior

• Identifying malfunctioning systems

• Monitoring trends (social networks, system load)

Adapting batch processing systems?

• Processing all the data of the day at the end of each day

I High latency §

• We need to process data more frequentlyI This is what stream processing engines do

6

Page 7: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Near real-time processing

In many cases, data should be processed as early as possible:

• Detecting fraudulent behavior

• Identifying malfunctioning systems

• Monitoring trends (social networks, system load)

Adapting batch processing systems?

• Processing all the data of the day at the end of each day

I High latency §

• We need to process data more frequentlyI This is what stream processing engines do

6

Page 8: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Near real-time processing

In many cases, data should be processed as early as possible:

• Detecting fraudulent behavior

• Identifying malfunctioning systems

• Monitoring trends (social networks, system load)

Adapting batch processing systems?

• Processing all the data of the day at the end of each day

I High latency §

• We need to process data more frequentlyI This is what stream processing engines do

6

Page 9: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Stream vs Batch processing

Batch processing

• Good for analyzing a static dataset

• Focuses on throughput

• Allows running complex analysis requiring multiple iterationson the data

Stream processing

• Good to analyze live data

• Continuously updates results based on new data

• Focuses on latency (between data production and update ofthe results)

• Processes data once

7

Page 10: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Stream processing computation

Stream analytics

• Measuring event rates

• Computing rolling statistics (average, histograms, etc)

• Comparing statistics to previous values (detecting trends)

• Sampling data

• Filtering data

• Applying basic machine learning algorithms

8

Page 11: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Questions related to stream processing

• How to transmit data from the producers to the consumers?I With multiple producers and/or consumers?

• How to process events in a distributed way?

• How to deal with failures?

• How to reason about time?

• How to maintain a state over time?

9

Page 12: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Agenda

Introduction

Transmitting event streams: Message brokers

Stream processing

The lambda architecture

10

Page 13: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Motivation

sensor

web traffic

sensor

Producers Message broker Consumers

11

Page 14: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Challenges

Routing messages

• Some consumers are only interested in some messages

• Some messages are useful for multiple consumers

Performance• Amount of produced data might be huge

• Data might me produced faster than they are processed

Fault tolerance• Clients might connect/disconnect at any time

12

Page 15: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Log-based message broker

Main principles

• Maintain a log of all the messages receivedI Append-only sequence of records on disk

• Each record is identified with a sequence number

• The offset of each client in the log can be stored

Existing systems

• Apache Kafka

• Amazon Kinesis Data Streams

13

Page 16: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Kafkahttps://kafka.apache.org/

• Originally developed at LinkedIn

• Open-source

• Used by many companies

14

Page 17: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Kafka main principles

A partitioned log

The log is divided into multiple partitions

• Each partition has its own monotonically increasing sequencenumber

• Partitions can be hosted on different machines

Advantages of logs

• Old records can be replayedI Clients can arrive late or disconnectI Question: How to do garbage collection?

• Note that filling a 6TB disk takes half a day

• Data are buffered in the logI Deal with the case where the consumers are slower than the

producers

15

Page 18: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Kafka communication abstractions

Topics

• Partitions are grouped into logical topics

• A producer can send to multiple topics

Consumer groups

• Consumers gather themselves into consumer groups

• A consumer group can register to one or several topics

• Multiple groups can register to the same topic• Each record is delivered to one consumer in each registered

groupI Records from one partition are send to exactly one consumer in

a groupI Total-order delivery is only ensured inside partitions

16

Page 19: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Kafka consumer groupssource https://kafka.apache.org/documentation/#gettingStarted

2 types of communication patterns

• Load balancing

• Broadcasting

17

Page 20: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Kafka fault tolerance

Data availability

• A Kafka cluster spans multiple nodes

• Partitions are replicated on multiple nodes

Dealing with consumer disconnections/failures

• Offset of the consumer in the log partition is recordedpermanently

• The same/another consumer can start processing records fromthis point

• Provided delivery semantics:I At-least-onceI At-most-once

18

Page 21: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Agenda

Introduction

Transmitting event streams: Message brokers

Stream processing

The lambda architecture

19

Page 22: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Stream processing engines

Description

• A set of transformations is applied to a stream of recordsI A program is a graph of transformations (Directed acyclic

graph)I Transformations are the same operations as in batch

processing systems

Examples

• Storm

• Flink

• Samza

• Spark streaming

20

Page 23: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Graph of transformations (Flink)source https://ci.apache.org/projects/flink/flink-docs-release-1.6/

concepts/programming-model.html

SourceDataStream<String> lines = env.addSource(

new FlinkKafkaConsumer<>(…));

DataStream<Event> events = lines.map((line) -> parse(line));

DataStream<Statistics> stats = events

.keyBy("id")

.timeWindow(Time.seconds(10))

.apply(new MyWindowAggregationFunction());

stats.addSink(new RollingSink(path));

Source map()

Transformation

Transformation

SourceOperator

keyBy()/window()/apply()

Sink

TransformationOperators

SinkOperator

Stream

Sink

Streaming Dataflow

21

Page 24: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Parallel Dataflow (Flink)

Source map()keyBy()/window()/apply()

Sink

OperatorSubtask

Source[1]

map()[1]

keyBy()/window()/apply()

[1]

Sink[1]

Source[2]

map()[2]

keyBy()/window()/apply()

[2]

StreamPartition

Operator Stream

Streaming Dataflow(parallelized view)

Streaming Dataflow(condensed view)

parallelism = 1

parallelism = 2

22

Page 25: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

About the notion of time

To run computations on a continuous stream, it has to be splitinto windows.

Size of the windows• 1 event: Each event is processed separately (Storm)• Windows limits is defined based on:

I Amount of data receivedI TimeI Activity (concept of sessions)

2 reference times co-exists in the system

• Event time: time at which the events happened• Processing time: time at which the events are processed

I Most systems build windows based on the processing time

23

Page 26: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

About the notion of time

Window type

• Tumbling window: Fixed size window, each event belong toone window

• Hopping window: Fixed size window, windows overlapI hop size = time between the generation of two windowsI hop size < window size

• Sliding window: Fixed size windowI A new window is considered at each time step

• Session window: No fix size, group together events thathappened closely together in time

24

Page 27: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Spark Streaming

Based on micro-batches• The data stream is divided into micro-batches

I Tumbling windowsI Typically 1 to 4 seconds

• Each micro-batch is a RDD• Multiple receivers can be created to manipulate multiple data

streams in parallelI The receiver tasks are distributed over the workers

25

Page 28: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Agenda

Introduction

Transmitting event streams: Message brokers

Stream processing

The lambda architecture

26

Page 29: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

The Lambda architecturesource https://www.oreilly.com/ideas/questioning-the-lambda-architecture

27

Page 30: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

The Lambda architecture

Motivations• Combination of batch processing and stream processing in a

single architectureI Stream processing allows building fast (approximate) views of

the dataI Batch processing is used for more complex (and accurate) data

analysis

LimitsArchitecture becoming less popular (lambda-less architecture)

• Maintaining two code bases is costly• Processing engines start allowing doing both (Spark, Flink)

I Stream processing engines are becoming more mature, theyallow running more complex computations

I Log-based message brokers allow processing the same recordmultiple times

28

Page 31: Data Management in Large-Scale Distributed Systems - Stream … · 2020-03-23 · Designing Data-Intensive Applications by Martin Kleppmann I Chapter 11 2. ... I Most systems build

Additional references

Mandatory reading

• https://www.oreilly.com/ideas/

the-world-beyond-batch-streaming-101, T. Akidau,2015.

Suggested reading

• Apache Flink: Stream and Batch Processing in a SingleEngine., P. Carbone et al., IEEE, 2015.

• https://www.oreilly.com/ideas/

questioning-the-lambda-architecture, J. Kreps, 2014.

• https://data-artisans.com/blog/

batch-is-a-special-case-of-streaming

29