A Data Streaming Architecture with Apache Flink (Berlin Buzzwords 2016)

A Data Streaming Architecture with Apache Flink

Robert Metzger, @rmetzger_, rmetzger@apache.org

Berlin Buzzwords, June 7, 2016

Talk overview
• My take on the stream processing space, and how it changes the way we think about data
• Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)
• Demo

Apache Flink

Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed

Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production

Entering the streaming era

Streaming is the biggest change in data infrastructure since Hadoop

1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch

Real-world data is produced in a continuous fashion.

New systems like Flink and Kafka embrace the streaming nature of data.

[Diagram: Web server, Kafka topic, Stream processor]

Apache Flink stack

[Diagram: DataStream (Java / Scala) and DataSet (Java / Scala) APIs; streaming dataflow runtime; deployment: Local, Cluster, YARN]

What makes Flink flink?

Low latency

High Throughput

Well-behaved flow control (back pressure)

Make more sense of data

Works on real-time and historic data

True Streaming

Event Time

APIs, Libraries

Stateful Streaming

Globally consistent savepoints

Exactly-once semantics for fault tolerance

Windows & user-defined state

Flexible windows (time, count, session, roll-your-own)

Complex Event Processing
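
To make two of the window types named above concrete, here is a minimal plain-Python sketch of count windows and session windows. It does not use the Flink API. The function names, the `gap` parameter, and the rule that an inactivity gap strictly larger than `gap` starts a new session are assumptions made for illustration only.

```python
from typing import List, Sequence


def count_windows(items: Sequence, size: int) -> List[list]:
    """Group items into consecutive windows of `size` elements.

    Assumption for this sketch: a trailing partial window is kept.
    """
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def session_windows(timestamps: Sequence[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions separated by inactivity.

    A new session starts whenever the distance to the previous
    timestamp is strictly larger than `gap`.
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: new session
    return sessions


if __name__ == "__main__":
    print(count_windows(list("abcde"), size=2))
    print(session_windows([1, 2, 3, 10, 11, 30], gap=5))
```

Time windows group by timestamp ranges instead; the difference between these window types is only in the rule that decides where one window ends and the next begins.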

Moving existing (batch) data analysis into streaming

Extract, Transform, Load (ETL)
• ETL: Move data from A to B and transform it on the way
• Old approach:

[Diagram, built up over several slides: sources (server logs, mobile); HDFS / S3 “Data Lake” holding Tier 0: Raw data; periodic jobs producing Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS); periodic jobs producing Tier 2: Aggregated data (“Data Warehouse”)]

Extract, Transform, Load (Streaming ETL)
• ETL: Move data from A to B and transform it on the way
• Streaming approach:

[Diagram, built up over several slides: sources (server logs, mobile); “Data Lake” holding Tier 0: Raw data; a stream processor with a Kafka connector and the steps Cleansing, Transformation, Time-Window, and Alerts; outputs through an ES connector, a rolling file sink, a JDBC sink, and a Cassandra sink; Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS); Tier 2: Aggregated data; Batch Processing]
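
The stages named in the streaming approach (cleansing, transformation, time window, alerts) can be sketched as a chain of plain-Python generators. This is not Flink code and not the demo job. The log line format, the 60-second window, the error threshold, and all function names are assumptions for illustration, and the window logic assumes events arrive in timestamp order.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw log format (an assumption of this sketch):
#   "<epoch_seconds> <http_status> <path>"
RAW_LOGS = [
    "60 200 /index.html",
    "61 500 /checkout",
    "garbage line",
    "62 500 /checkout",
    "125 200 /index.html",
    "130 500 /cart",
]


def cleanse(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Cleansing: drop malformed records and parse the rest."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        yield int(parts[0]), int(parts[1]), parts[2]


def transform(events: Iterable[Tuple[int, int, str]]) -> Iterator[Dict]:
    """Transformation: normalize each event into a uniform record."""
    for ts, status, path in events:
        yield {"ts": ts, "path": path.lower(), "error": status >= 500}


def tumbling_error_counts(events: Iterable[Dict],
                          window_seconds: int = 60) -> Iterator[Tuple[int, int]]:
    """Time window: emit (window_start, error_count) when a window ends.

    Assumes in-order timestamps, so a window is complete as soon as an
    event for a different window arrives.
    """
    current_start = None
    count = 0
    for event in events:
        start = event["ts"] - event["ts"] % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, count
            count = 0
        current_start = start
        if event["error"]:
            count += 1
    if current_start is not None:
        yield current_start, count  # flush the last open window


def alerts(windows: Iterable[Tuple[int, int]], threshold: int = 2) -> List[int]:
    """Alerts: return the start of every window at or above the threshold."""
    return [start for start, errors in windows if errors >= threshold]


if __name__ == "__main__":
    windows = list(tumbling_error_counts(transform(cleanse(RAW_LOGS))))
    print(windows)
    print(alerts(windows))
```

Because every stage consumes and yields one record at a time, a result is available as soon as its window closes, rather than after a periodic batch job has run over the whole data set.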

Streaming ETL: Low Latency

Less than 500 ms*

Less than 250 ms*

* Your mileage may vary. These are rule of thumb estimates.

• Events are processed immediately
• No need to wait until the next “load” batch job is running

[Diagram: latency scale running from hours through minutes and seconds to milliseconds, with the approaches placed along it in this order: periodic batch job, batch processor with micro-batches, stream processor]

Streaming ETL: Event-time aware

Events derived from the same real-world activity might arrive out of order in the system

Flink is event-time aware

[Diagram: timeline from 11:28 to 11:29 showing events from the same real-world activity]

Causes of out-of-order arrival: out-of-sync clocks, network delays, machine failures
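
As a rough illustration of what event-time awareness means, the following plain-Python sketch assigns events to windows by the timestamp they carry, not by the order in which they arrive. It is a simplification and not Flink's implementation. The bounded-delay watermark (`max_delay`), the policy of dropping events whose window has already closed, and all names are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def event_time_windows(events: Iterable[Tuple[int, str]],
                       window_seconds: int = 60,
                       max_delay: int = 30) -> List[Tuple[int, List[str]]]:
    """Group (event_timestamp, value) pairs, given in ARRIVAL order,
    into tumbling windows based on their event timestamps.

    The watermark trails the highest timestamp seen by `max_delay`
    seconds; a window is emitted once the watermark reaches its end.
    """
    open_windows = defaultdict(list)
    watermark = float("-inf")
    results = []
    for ts, value in events:
        start = ts - ts % window_seconds
        if start + window_seconds <= watermark:
            continue  # window already emitted; this sketch drops the event
        open_windows[start].append(value)
        watermark = max(watermark, ts - max_delay)
        closed = sorted(s for s in open_windows if s + window_seconds <= watermark)
        for s in closed:
            results.append((s, sorted(open_windows.pop(s))))
    for s in sorted(open_windows):  # end of input: flush remaining windows
        results.append((s, sorted(open_windows.pop(s))))
    return results


if __name__ == "__main__":
    # Seconds since midnight: 41280 is 11:28:00, 41340 is 11:29:00.
    arrivals = [(41285, "a"), (41345, "c"), (41290, "b"), (41400, "d")]
    for window_start, values in event_time_windows(arrivals):
        print(window_start, values)
```

In the sample input, event "b" arrives after "c" but carries an 11:28 timestamp, so it is still counted in the 11:28 window together with "a".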

Job Overview

Flink Twitter Source

Data Ingestion Job

“Streaming ETL” Job

Job Overview

(Rolling) file sink

Filter operation

Filter operation

Aggregation to Elasticsearch

Streaming WordCount

TopN operator
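
The filter, streaming word count, and TopN steps listed above can be approximated in plain Python as follows. This is a sketch only and is not taken from the demo repository. The tokenizer, the retweet filter, the tie-breaking rule, and the sample messages are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def top_n(counts: Counter, n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent words; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def running_top_n(messages: Iterable[str],
                  n: int,
                  keep: Callable[[str], bool]) -> Iterator[List[Tuple[str, int]]]:
    """Filter messages, update a running word count, and emit the
    current top-n list after every message that passes the filter."""
    counts: Counter = Counter()
    for message in messages:
        if not keep(message):
            continue
        counts.update(tokenize(message))
        yield top_n(counts, n)


if __name__ == "__main__":
    # Hypothetical sample input standing in for a Twitter source.
    tweets = [
        "Flink does streaming",
        "RT ignore this retweet",
        "Streaming with Flink and Kafka",
        "flink flink",
    ]
    not_a_retweet = lambda text: not text.startswith("RT ")
    for ranking in running_top_n(tweets, n=2, keep=not_a_retweet):
        print(ranking)
```

The word counts are the only state kept between messages, which is the kind of user-defined state a stateful stream processor has to manage and recover after a failure.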

Demo code @ GitHub

https://github.com/rmetzger/flink-streaming-etl

Closing

https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016

Early bird deadline: July 15, 2016

www.flink-forward.org

We are hiring! data-artisans.com/careers

Questions?
• Ask now!
• Email: rmetzger@apache.org
• Twitter: @rmetzger_

• Follow: @ApacheFlink
• Read: flink.apache.org/blog, data-artisans.com/blog/
• Mailing lists: (news | user | dev)@flink.apache.org

Appendix

Sources

“Large scale ETL with Hadoop” http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop
