SICS: Apache Flink Streaming

Introduction to stream processing with Apache Flink, Seif Haridi, KTH/SICS


Page 1: SICS: Apache Flink Streaming

Introduction to stream processing with Apache Flink

Seif Haridi, KTH/SICS

Page 2: SICS: Apache Flink Streaming

Stream processing

Data Science Summit 2015

Page 3: SICS: Apache Flink Streaming

Why streaming

[Figure: timeline of data availability (2000, 2008, 2015): Data Warehouse, Batch, Streaming]

- Which data?
- When?
- Who?

Page 4: SICS: Apache Flink Streaming

3 Parts of a Streaming Infrastructure

Gathering → Broker → Analysis

Sources: sensors, transaction logs, server logs, …

Page 5: SICS: Apache Flink Streaming

Example: Bouygues Telecom

• Network and subscriber data gathered

• Added to Broker in raw format

• Transformed and analyzed by streaming engine

• Stored back for further processing

http://data-artisans.com/flink-at-bouygues.html

Page 6: SICS: Apache Flink Streaming

What is Apache Flink?


Page 7: SICS: Apache Flink Streaming

1 year of Flink - code

[Figure: code base, April 2014 vs. April 2015]

Page 8: SICS: Apache Flink Streaming

What is Apache Flink


Distributed Data Flow Processing System

▪ Focused on large-scale data analytics

▪ Unified real-time stream and batch processing

▪ Expressive and rich APIs in Java / Scala (+ Python)

▪ Robust and fast execution backend

[Figure: example dataflow graph with Source, Map, Filter, Join, Reduce, Iterate, and Sink operators]

Page 9: SICS: Apache Flink Streaming

Flink Stack

[Figure: Flink stack]

Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Storm, Zeppelin

APIs: DataSet (Java/Scala), DataStream (Java/Scala)

Runtime: Streaming dataflow runtime

Deployment: Local, Cluster, Yarn, Tez, Embedded

Page 10: SICS: Apache Flink Streaming

Stream Processing with Flink


Page 11: SICS: Apache Flink Streaming

What is Flink Streaming

• Native, low-latency stream processor

• Expressive functional API

• Flexible operator state, iterations, windows

• Exactly-once processing semantics

Page 12: SICS: Apache Flink Streaming

Native vs non-native streaming

Non-native streaming

A stream discretizer cuts the stream into a sequence of batch jobs (Job, Job, Job, Job):

while (true) {
  // get next few records
  // issue batch computation
}

Native streaming

Long-standing operators:

while (true) {
  // process next record
}
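The two loops on this slide can be contrasted in a few lines of plain Scala. This is only a sketch over an in-memory iterator, not Flink code; the object and method names (`StreamingModels`, `microBatchSums`, `runningSums`) are made up for illustration.

```scala
object StreamingModels {
  // Non-native (micro-batch): discretize the stream into small batches,
  // then run one batch computation (here: a sum) per group of records.
  def microBatchSums(records: Iterator[Int], batchSize: Int): List[Int] =
    records.grouped(batchSize).map(_.sum).toList

  // Native: a long-standing operator updates its state once per record
  // and can emit a result right after each update.
  def runningSums(records: Iterator[Int]): List[Int] =
    records.scanLeft(0)(_ + _).drop(1).toList
}
```

With the micro-batch variant a result only appears once a whole batch has been collected, while the per-record variant produces one output per input record.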

Page 13: SICS: Apache Flink Streaming

Stream processing in Flink

• Continuous streaming model

• Low processing latency

• O(1) state updates per operator

• Exactly-once semantics for stateful operators

Page 14: SICS: Apache Flink Streaming

DataStream API


Page 15: SICS: Apache Flink Streaming

Overview of the API

Page 16: SICS: Apache Flink Streaming

Windowing Semantics

• Trigger and eviction policies

• window(<eviction>).every(<trigger>)

• Built-in policies:

– Time: Time.of(length, TimeUnit/Custom timestamp)

    window(Time.of(20, SECONDS))

– Count: Count.of(windowSize)

    window(Count.of(20)).every(Count.of(10))

– Delta: Delta.of(Threshold, Distance function, Start value)

    window(Delta.of(0.1, priceDistanceFun, initPrice))
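To make the eviction/trigger split concrete, here is a rough emulation of window(Count.of(size)).every(Count.of(slide)) over a finite sequence, written in plain Scala without Flink. The names `CountWindowSketch` and `countWindows` are hypothetical, and emitting partially filled windows at the start of the stream is an assumption of this sketch, not a statement about Flink's behavior.

```scala
object CountWindowSketch {
  // After every `slide` elements (the trigger), emit the most recent
  // `size` elements (what the eviction policy keeps in the buffer).
  def countWindows[A](input: Seq[A], size: Int, slide: Int): List[Seq[A]] = {
    require(size > 0 && slide > 0, "size and slide must be positive")
    (slide to input.length by slide).map { end =>
      input.slice(math.max(0, end - size), end)
    }.toList
  }
}
```

For example, `CountWindowSketch.countWindows(Vector(1, 2, 3, 4, 5, 6), 4, 2)` fires three times, after the second, fourth, and sixth element, each time with at most the last four elements.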

Page 17: SICS: Apache Flink Streaming

Word count in Batch and Streaming

case class Word (word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .keyBy("word").window(Time.of(5,SECONDS))
  .every(Time.of(1,SECONDS)).sum("frequency")
  .print()
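The difference between the two programs is the window: the batch job counts all words once, the streaming job re-counts the last 5 seconds every second. A minimal sketch of that semantics on plain Scala collections follows. It assumes lines carry non-negative timestamps in seconds and that a window ending at time t covers [t - windowSec, t); `WindowedWordCount`, `countWords`, and `slidingCounts` are illustrative names, not Flink API.

```scala
object WindowedWordCount {
  // Batch: count every word in a finite collection of lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Streaming: every `slideSec` seconds, count the words of the lines whose
  // timestamp (in seconds) falls in [t - windowSec, t).
  def slidingCounts(lines: Seq[(Long, String)],
                    windowSec: Long,
                    slideSec: Long): List[(Long, Map[String, Int])] =
    if (lines.isEmpty) Nil
    else {
      // First trigger time strictly after the last timestamp.
      val lastTrigger = (lines.map(_._1).max / slideSec + 1) * slideSec
      (slideSec to lastTrigger by slideSec).toList.map { t =>
        val inWindow = lines.collect {
          case (ts, text) if ts >= t - windowSec && ts < t => text
        }
        t -> countWords(inWindow)
      }
    }
}
```

Calling `slidingCounts(lines, 5L, 1L)` mirrors window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)): one result per second, each over the preceding five seconds of input.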

Page 18: SICS: Apache Flink Streaming

Flexible windows

More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

• Stream of stocks

• Trigger warning if price fluctuates by 5%

• Count the number of warnings per stock in 30 second (tumbling) window

• Do it continuously

[Figure: StockStream → keyBy symbol → Delta 5% of price → Warning → keyBy symbol → Count, 30 sec window → Sum. Stream types along the pipeline: Data Stream, Keyed Stream, Windowed Stream, Keyed Stream, Windowed Stream]

Page 19: SICS: Apache Flink Streaming

Flexible windows

case class Count(symbol: String, count: Int)
val defaultPrice = StockPrice("", 1000)

Use delta policy to create change warnings:

val priceWarnings = stockStream.keyBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

Count number of warnings per stock every half a minute:

val warningPerStock = priceWarnings.flatten()
  .map(Count(_, 1))
  .keyBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
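The idea behind the delta policy can be sketched without Flink: keep a reference data point, and fire when the distance to the current element exceeds the threshold. The code below is a simplified single-symbol illustration in plain Scala. The `StockPrice` fields, the body of `priceChange`, and the choice to reset the reference after each warning are assumptions made for this sketch; the slide does not show them.

```scala
object DeltaSketch {
  case class StockPrice(symbol: String, price: Double)

  // Relative change between two prices (assumed distance function).
  def priceChange(from: StockPrice, to: StockPrice): Double =
    math.abs(to.price - from.price) / from.price

  // Emit the symbol each time the price has moved more than `threshold`
  // away from the reference point, then use that price as the new reference.
  def warnings(prices: Seq[StockPrice],
               threshold: Double,
               start: StockPrice): List[String] = {
    val (_, out) = prices.foldLeft((start, List.empty[String])) {
      case ((ref, acc), p) =>
        if (priceChange(ref, p) > threshold) (p, p.symbol :: acc)
        else (ref, acc)
    }
    out.reverse
  }
}
```

Starting from a reference price of 1000 with a 0.05 threshold, the sequence 1010, 1060, 1070, 1000 yields two warnings: one at 1060 (6% above 1000) and one at 1000 (more than 5% below 1060).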

Page 20: SICS: Apache Flink Streaming

Iterative stream processing

Motivation

• Many applications require cyclic streams

• Machine learning applications (parallel model training, evaluation)

Iterations in Flink Streaming

• Native support for cyclic dataflows

• Integrated with functional API

• High performance and expressivity

[Figure: Input, Train, Evaluate]

Page 21: SICS: Apache Flink Streaming

Fault tolerance


Page 22: SICS: Apache Flink Streaming

Exactly-once processing for operator state

• Based on consistent global snapshots

• Low runtime overhead, stateful exactly-once semantics

Page 23: SICS: Apache Flink Streaming

Checkpointing / Recovery

Detailed algorithm: Lightweight Asynchronous Snapshots for Distributed Dataflows
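Why a snapshot gives exactly-once state updates can be shown with a toy single-operator simulation in plain Scala. This is not the asynchronous, barrier-based algorithm named above; it only assumes that a snapshot stores the operator state together with the input position it reflects and that the input can be replayed from that position. All names (`CheckpointSketch`, `Snapshot`, `run`, `failAt`) are hypothetical.

```scala
object CheckpointSketch {
  // A snapshot pairs the operator state with the input offset it reflects.
  case class Snapshot(offset: Int, sum: Long)

  // Sums the input and takes a snapshot every `interval` records.
  // If `failAt` is set, the operator crashes once, just before processing
  // that offset, rolls back to the latest snapshot, and replays from there.
  def run(input: Vector[Int], interval: Int, failAt: Option[Int]): Long = {
    var snapshot = Snapshot(0, 0L)
    var offset = 0
    var sum = 0L
    var pendingFailure = failAt
    while (offset < input.length) {
      if (pendingFailure.contains(offset)) {
        pendingFailure = None      // fail only once
        offset = snapshot.offset   // rewind the input
        sum = snapshot.sum         // restore the operator state
      } else {
        sum += input(offset)
        offset += 1
        if (offset % interval == 0) snapshot = Snapshot(offset, sum)
      }
    }
    sum
  }
}
```

Because state and input position are restored together, records processed after the last snapshot are replayed against the rolled-back state, so the final sum is the same with and without the failure.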

Page 24: SICS: Apache Flink Streaming

Fault tolerance

• Checkpointing and recovery of operator state is very fast

– Data processing does not block

• Executions based on CPU/operator time are not idempotent

• Other execution modes are based on timestamps of input streams (Event/Ingress time)

– Allows idempotent executions

– End-to-end exactly-once semantics

– In Flink version 0.10

Page 25: SICS: Apache Flink Streaming

Streaming in Apache Flink

• True streaming over stateful distributed dataflow engine

• Expressive Streaming API in Java/Scala

– Flexible window semantics

– Iterative computation

• Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery

Page 26: SICS: Apache Flink Streaming

Special Thanks to

Gyula Fora, SICS
Paris Carbone, KTH
Kostas Tzoumas, Data Artisans
Stephan Ewen, Data Artisans
Volker Markl, TU-Berlin