A Data Streaming Architecture with Apache Flink (Berlin Buzzwords 2016)

A Data Streaming Architecture with Apache Flink Robert Metzger @rmetzger_ [email protected] Berlin Buzzwords, June 7, 2016

Transcript of A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Page 1: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

A Data Streaming Architecture with Apache Flink

Robert Metzger, @rmetzger_, [email protected]

Berlin Buzzwords, June 7, 2016

Page 2: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Talk overview

• My take on the stream processing space, and how it changes the way we think about data
• Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)
• Demo

Page 3: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Apache Flink

Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed

Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production

Page 4: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Entering the streaming era


Page 5: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Streaming is the biggest change in data infrastructure since Hadoop

Page 6: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch

Page 7: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Real-world data is produced in a continuous fashion.

New systems like Flink and Kafka embrace the streaming nature of data.

[Diagram: Web server, Kafka topic, Stream processor]

Page 8: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Apache Flink stack

[Stack diagram, labels recovered from the slide]
• Libraries and integrations: Gelly, Table / SQL, ML, SAMOA, Hadoop M/R, Apache Beam (×2), Table / StreamSQL, Cascading, Storm API, Zeppelin, CEP
• APIs: DataSet (Java/Scala), DataStream (Java/Scala)
• Runtime: Streaming dataflow runtime
• Deployment: Local, Cluster, YARN

Page 9: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

What makes Flink flink?

True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)

Event Time
• Make more sense of data
• Works on real-time and historic data

Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state

APIs, Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing

Page 10: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Moving existing (batch) data analysis into streaming


Page 11: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT

Page 12: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data, in HDFS / S3 (“Data Lake”)

Page 13: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data, in HDFS / S3 (“Data Lake”)
• Periodic jobs
• Tier 1: Normalized, cleansed data, as Parquet / ORC in HDFS
• User

Page 14: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data, in HDFS / S3 (“Data Lake”)
• Periodic jobs
• Tier 1: Normalized, cleansed data, as Parquet / ORC in HDFS
• Periodic jobs
• Tier 2: Aggregated data (“Data Warehouse”)
• User (×2)

Page 15: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (Streaming ETL)

ETL: Move data from A to B and transform it on the way

Streaming approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data (“Data Lake”)

Page 16: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (Streaming ETL)

ETL: Move data from A to B and transform it on the way

Streaming approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data (“Data Lake”)
• Stream Processor: Kafka Connector, Cleansing, Transformation, Time-Window (×2), Alerts

Page 17: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (Streaming ETL)

ETL: Move data from A to B and transform it on the way

Streaming approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data (“Data Lake”)
• Stream Processor: Kafka Connector, Cleansing, Transformation, Time-Window (×2), Alerts, ES Connector, Rolling file sink
• Tier 1: Normalized, cleansed data, as Parquet / ORC in HDFS
• User, Batch Processing

Page 18: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Extract, Transform, Load (Streaming ETL)

ETL: Move data from A to B and transform it on the way

Streaming approach:

[Diagram]
• Data sources: Server Logs (×3), Mobile, IoT
• Tier 0: Raw data (“Data Lake”)
• Stream Processor: Kafka Connector, Cleansing, Transformation, Time-Window (×2), Alerts, ES Connector, Rolling file sink, JDBC sink, Cassandra sink
• Tier 1: Normalized, cleansed data, as Parquet / ORC in HDFS
• Tier 2: Aggregated data
• User (×2), Batch Processing
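The stages named inside the stream processor (cleansing, transformation, time window, alerts) can be sketched as a chain of Python generators. This is only an illustration in plain Python, not Flink API code. The log line format, the function names, the 60-second window, and the alert threshold are all assumptions made for the example, and the window stage assumes events arrive in timestamp order.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed (hypothetical) raw log format: "<epoch_seconds> <status_code> <path>"
Event = Dict[str, object]


def cleanse(lines: Iterable[str]) -> Iterator[Event]:
    """Cleansing: drop malformed lines and parse the rest into events."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # malformed record, skip it
        yield {"ts": int(parts[0]), "status": int(parts[1]), "path": parts[2]}


def transform(events: Iterable[Event]) -> Iterator[Event]:
    """Transformation: normalize the path and flag server errors."""
    for e in events:
        path = str(e["path"]).lower().rstrip("/") or "/"
        yield {**e, "path": path, "is_error": int(e["status"]) >= 500}


def window_error_counts(
    events: Iterable[Event], size_s: int = 60
) -> Iterator[Tuple[int, int]]:
    """Time window: count errors per tumbling window of size_s seconds.

    Assumes events arrive ordered by timestamp; a window is emitted as
    soon as the first event of a later window shows up.
    """
    current_start = None
    errors = 0
    for e in events:
        ts = int(e["ts"])
        start = ts - ts % size_s
        if current_start is None:
            current_start = start
        if start != current_start:
            yield current_start, errors
            current_start, errors = start, 0
        if e["is_error"]:
            errors += 1
    if current_start is not None:
        yield current_start, errors  # flush the last open window


def alerts(windows: Iterable[Tuple[int, int]], threshold: int = 2) -> Iterator[str]:
    """Alerts: report every window whose error count reaches the threshold."""
    for start, errors in windows:
        if errors >= threshold:
            yield f"ALERT window={start} errors={errors}"


def run_pipeline(raw_lines: Iterable[str]) -> List[str]:
    """Wire the stages together; each record flows through as it arrives."""
    return list(alerts(window_error_counts(transform(cleanse(raw_lines)))))


if __name__ == "__main__":
    sample = ["60 200 /Home/", "61 500 /api", "garbage line",
              "75 503 /api", "119 200 /", "120 500 /api", "130 200 /home"]
    for alert in run_pipeline(sample):
        print(alert)
```

Because every stage is a generator, a record is cleansed, transformed, and counted as soon as it is read, which mirrors the per-event processing of the streaming approach rather than a periodic load job.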

Page 19: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Streaming ETL: Low Latency

• Events are processed immediately
• No need to wait until the next “load” batch job is running

[Chart: approaches placed along a latency axis running from hours through minutes and seconds to milliseconds. Approaches: Periodic batch job, Batch processor with micro-batches, Stream processor. Annotations: “Less than 500 ms*”, “Less than 250 ms*”]

* Your mileage may vary. These are rule of thumb estimates.

Page 20: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Streaming ETL: Event-time aware

• Events derived from the same real-world activity might arrive out of order in the system
• Flink is event-time aware

[Diagram: three timelines, each marked 11:28 and 11:29, labelled “Same real-world activity”. Listed alongside: Out of sync clocks, Network delays, Machine failures]
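To make the idea of event-time awareness concrete, here is a simplified plain-Python model of grouping out-of-order events into windows by their own timestamps. It is not Flink's implementation or API. The function name, the 60-second window, the 30-second tolerance, and the rule that a window closes once a watermark (highest event time seen minus the tolerance) passes its end are assumptions chosen for the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def event_time_windows(
    events: Iterable[Tuple[int, str]],
    window_s: int = 60,
    max_delay_s: int = 30,
) -> Iterator[Tuple[int, List[str]]]:
    """Group (event_time_seconds, payload) pairs into tumbling windows
    by event time, regardless of the order in which they arrive.

    A window [start, start + window_s) is emitted once the watermark
    reaches its end. Events whose window was already emitted are
    dropped in this sketch.
    """
    open_windows: Dict[int, List[str]] = defaultdict(list)
    watermark = float("-inf")
    for ts, payload in events:
        start = ts - ts % window_s
        if start + window_s <= watermark:
            continue  # too late: this window has already been emitted
        open_windows[start].append(payload)
        watermark = max(watermark, ts - max_delay_s)
        # Emit every open window whose end the watermark has passed.
        ready = sorted(s for s in open_windows if s + window_s <= watermark)
        for s in ready:
            yield s, open_windows.pop(s)
    # End of input: flush whatever is still open, oldest first.
    for s in sorted(open_windows):
        yield s, open_windows.pop(s)


if __name__ == "__main__":
    # Timestamps are seconds since midnight; 41280 is 11:28:00.
    arrivals = [
        (41290, "a"),  # 11:28:10
        (41345, "c"),  # 11:29:05, arrives before "b"
        (41310, "b"),  # 11:28:30, out of order but within the tolerance
        (41380, "d"),  # 11:29:40
        (41405, "e"),  # 11:30:05
    ]
    for start, payloads in event_time_windows(arrivals):
        print(start, payloads)
```

In this example "b" arrives after "c" but still lands in the 11:28 window together with "a", because windows are assigned from the event's own timestamp rather than from its arrival time.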

Page 21: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Demo


Page 22: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Job Overview


Flink Twitter Source

Data Ingestion Job

“Streaming ETL” Job

Page 23: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Job Overview

[Job graph labels]
• (Rolling) file sink
• Filter operation (×2)
• Aggregation to ElasticSearch
• Streaming WordCount
• TopN operator
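The transcript does not include the demo's code. As a rough plain-Python illustration of what a filter, a word count, and a TopN step compute over a single window of tweets, consider the sketch below. The function name, the minimum word length used as the filter, and the alphabetical tie-breaking are assumptions for the example, not details taken from the demo.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_n_words(
    texts: Iterable[str], n: int = 3, min_len: int = 4
) -> List[Tuple[str, int]]:
    """Count words over one window of texts and return the n most frequent.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts: Counter = Counter()
    for text in texts:
        for word in text.lower().split():
            word = word.strip(".,!?#@:;\"'")
            if len(word) >= min_len:  # filter step: drop short tokens
                counts[word] += 1     # word count step
    # TopN step: highest count first, then alphabetical order.
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    window = [
        "Apache Flink does streaming",
        "Streaming with Flink and Kafka",
        "Kafka feeds Flink",
    ]
    print(top_n_words(window))
```

In a streaming job the same computation would run once per window, with the result of each window written to a sink.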

Page 25: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Closing


Page 26: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)


https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910

Page 27: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016

Early bird deadline: July 15, 2016

www.flink-forward.org

Page 28: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

We are hiring!

data-artisans.com/careers

Page 29: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Questions? Ask now!

• Email: [email protected]
• Twitter: @rmetzger_
• Follow: @ApacheFlink
• Read: flink.apache.org/blog, data-artisans.com/blog/
• Mailing lists: (news | user | dev)@flink.apache.org

Page 30: A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)

Appendix
