Apache Flink - Overview

Posted on 25-May-2015


A first overview of the Apache Flink (incubating) system

Transcript of Apache Flink - Overview

1

http://flink.incubator.apache.org
@ApacheFlink

Apache Flink: Next-Gen Data Analytics Platform

2

What is Apache Flink?

An efficient, distributed, general-purpose data analysis platform.

Runs on top of HDFS and YARN.

Focuses on ease of programming.

Project status
• Incubating at the Apache Software Foundation
• Version 0.6 stable, 0.7-SNAPSHOT in development
• Came out of the Stratosphere research project
• Growing community of contributors

4

Flink Stack

[Stack diagram]
• APIs: Java API, Scala API, Spargel (graphs); SQL and Python in development
• Flink Optimizer
• Flink Runtime
• Cluster manager: YARN, EC2, direct deployment
• Storage: HDFS, local files, S3, JDBC, …
• Compatibility: Hadoop MapReduce, Hive, …

5

Key Features

Easy-to-use developer APIs
• Java, Scala, graphs, nested data (Python & SQL under development)
• Flexible composition of large programs

Automatic optimization
• Join algorithms
• Operator chaining
• Reusing partitioning/sorting

High-performance runtime
• Complex DAGs of operators
• In-memory & out-of-core
• Data streamed between operations

Native iterations
• Embedded in the APIs
• Data streaming / in-memory
• Delta iterations speed up many programs by orders of magnitude

6

Flink Overview

7

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap(new Splitter())
    .groupBy(0)
    .aggregate(SUM, 1);

// map function implementation
class Splitter extends FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        for (String token : value.split("\\W")) {
            out.collect(new Tuple2<String, Integer>(token, 1));
        }
    }
}

Concise & rich APIs

Word Count, Java API

Can use regular POJOs!
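As a plain-Java sanity check of what the pipeline computes, the same word count can be done sequentially with a HashMap. This sketch has no Flink dependency and no parallelism; it only mirrors the flatMap/groupBy/sum logic of the slide:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Computes the same result as the Flink flatMap/groupBy/aggregate pipeline,
    // but sequentially with a HashMap (illustration only, not distributed).
    static Map<String, Integer> wordCount(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.split("\\W+")) {   // split on non-word chars
                if (token.isEmpty()) continue;
                counts.merge(token, 1, Integer::sum);   // group by token, sum counts
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = wordCount(Arrays.asList("to be or", "not to be"));
        System.out.println(c.get("to") + " " + c.get("be") + " " + c.get("or")); // 2 2 1
    }
}
```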

8

Concise & rich APIs

val input = TextFile(textInput)
val words = input flatMap { line => line.split("\\W+") }
val counts = words groupBy { word => word } count()

Word Count, Scala API

9

Flexible Data Pipelines

[Dataflow diagram: two sources feed Map, Join, Reduce, and Iterate operators, merging into a single sink.]

10

Rich Building Blocks

• Map, FlatMap, MapPartition
• Filter, Project
• Reduce, ReduceGroup, Aggregate, Distinct
• Join
• CoGroup
• Cross
• Iterate
• Iterate Delta
• Vertex-centric graph iteration (Pregel style)

11

DataSet<Tuple...> large  = env.readCsvFile(...);
DataSet<Tuple...> medium = env.readCsvFile(...);
DataSet<Tuple...> small  = env.readCsvFile(...);

DataSet<Tuple...> joined1 = large.join(medium)
    .where(3).equalTo(1)
    .with(new JoinFunction() { ... });

DataSet<Tuple...> joined2 = small.join(joined1)
    .where(0).equalTo(2)
    .with(new JoinFunction() { ... });

DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);

Example: Joins in Flink

Built-in strategies include partitioned join and replicated join with local sort-merge or hybrid-hash algorithms.
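The replicated (broadcast) strategy can be sketched in plain Java, without Flink: the small side becomes a hash table that every worker would hold a full copy of, and the large side is streamed past it. All names here are illustrative, not Flink APIs:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BroadcastHashJoin {
    // Broadcast hash join sketch: build a hash table from the small side
    // (in a cluster, replicated to every worker), then stream the large
    // side and probe locally. Keys are ints, payloads are Strings.
    static List<String> join(Map<Integer, String> smallSide,
                             List<Map.Entry<Integer, String>> largeSide) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : largeSide) {
            String match = smallSide.get(row.getKey());   // probe the hash table
            if (match != null) {
                out.add(row.getValue() + "|" + match);    // emit joined record
            }
        }
        return out;
    }

    // Helper to build a large-side row for the example.
    static Map.Entry<Integer, String> row(int key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }
}
```

A partitioned join would instead shuffle both inputs by key so matching keys meet on the same worker; the local probe logic stays the same.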

[Query plan: large ⋈ medium, the result ⋈ small, followed by grouping/aggregation γ.]

12

DataSet<Tuple...> large  = env.readCsvFile(...);
DataSet<Tuple...> medium = env.readCsvFile(...);
DataSet<Tuple...> small  = env.readCsvFile(...);

DataSet<Tuple...> joined1 = large.join(medium)
    .where(3).equalTo(1)
    .with(new JoinFunction() { ... });

DataSet<Tuple...> joined2 = small.join(joined1)
    .where(0).equalTo(2)
    .with(new JoinFunction() { ... });

DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);

Automatic Optimization

Possible execution:
1) Partitioned hash-join (large ⋈ medium)
2) Broadcast hash-join (small ⋈ result of 1)
3) Grouping/aggregation reuses the partitioning from step (1): no shuffle!

Partitioned ≈ reduce-side join; Broadcast ≈ map-side join
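The partitioned-vs-broadcast choice can be illustrated with a toy cost heuristic. This is not Flink's actual optimizer model; the formula and names are purely illustrative: broadcasting ships the small side to every worker, while repartitioning ships each side across the network once.

```java
public class JoinStrategyChooser {
    enum Strategy { BROADCAST, PARTITIONED }

    // Hypothetical, simplified network-cost heuristic (not Flink's real one):
    // broadcast cost grows with the number of workers, partition cost with
    // the total data volume. Pick whichever ships fewer bytes.
    static Strategy choose(long smallBytes, long largeBytes, int numWorkers) {
        long broadcastCost = smallBytes * numWorkers;   // small side to every worker
        long partitionCost = smallBytes + largeBytes;   // both sides shuffled once
        return broadcastCost < partitionCost ? Strategy.BROADCAST
                                             : Strategy.PARTITIONED;
    }
}
```

The real optimizer also weighs memory, existing partitionings, and local algorithms (sort-merge vs. hybrid-hash), but the asymmetry between the two strategies is the same.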

13

Running Programs

> bin/flink run prg.jar

Programs can be run in three ways:
• Packaged programs: the program JAR file is submitted to the master (bin/flink run)
• RemoteEnvironment.execute(): ships the program to the master via RPC & serialization
• LocalEnvironment.execute(): spawns an embedded multi-threaded environment in the local JVM

14

Flink Runtime

15

Distributed Runtime
• Master (JobManager) handles job submission, scheduling, and metadata
• Workers (TaskManagers) execute operations
• Data can be streamed between nodes
• All operators start in-memory and gradually go out-of-core

16

Runtime Architecture (comparison)

Other systems: distributed collections (e.g., List[WC])
• Collections of objects
• General-purpose serializer (Java / Kryo)
• Limited control over memory & less efficient spilling
• Deserialize all or nothing

public class WC {
    public String word;
    public int count;
}

Flink: pool of memory pages
• Works on pages of bytes
• Maps objects transparently to these pages
• Full control over memory, out-of-core enabled
• Algorithms work on the binary representation
• Address individual fields (no need to deserialize the whole object)
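The "address individual fields without deserializing the whole object" idea can be sketched with a plain ByteBuffer. This is a toy layout invented for illustration, not Flink's actual page format:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryPageSketch {
    // Toy binary page: WC records (word, count) are packed into one ByteBuffer.
    // Hypothetical layout per record: [int count][int wordLen][word bytes].
    // Because the count sits at a fixed position in each record, it can be read
    // directly from the page without materializing a WC object.
    static int writeRecord(ByteBuffer page, String word, int count) {
        int offset = page.position();                  // where this record starts
        byte[] w = word.getBytes(StandardCharsets.UTF_8);
        page.putInt(count).putInt(w.length).put(w);
        return offset;
    }

    // Read only the count field at a known record offset; no deserialization
    // of the word bytes happens.
    static int readCount(ByteBuffer page, int offset) {
        return page.getInt(offset);
    }
}
```

Sorting or grouping by count can then compare 4-byte fields directly inside the page, which is the gist of working on the binary representation.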

17

Iterative Programs

18

Why Iterative Algorithms

• Algorithms that need iterations
  o Clustering (K-Means, Canopy, …)
  o Gradient descent (e.g., Logistic Regression, Matrix Factorization)
  o Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality, …)
  o Graph communities / dense sub-components
  o Inference (belief propagation)
  o …
• The loop makes multiple passes over the data
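The "multiple passes over the data" pattern can be made concrete with a minimal PageRank power iteration in plain Java. The graph and damping factor are made-up example values; this sketches the loop structure, not a Flink program:

```java
import java.util.Arrays;

public class PageRankLoop {
    // Plain-Java power iteration for PageRank on a tiny adjacency list.
    // Each loop iteration reads every edge again, which is exactly the
    // repeated full pass that iteration-aware systems try to optimize.
    static double[] pageRank(int[][] outLinks, int iterations, double d) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                   // uniform initial ranks
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);           // teleport term
            for (int v = 0; v < n; v++) {
                for (int target : outLinks[v]) {      // full pass over all edges
                    next[target] += d * rank[v] / outLinks[v].length;
                }
            }
            rank = next;
        }
        return rank;
    }
}
```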

19

Iterations in other systems

Step Step Step Step Step

ClientLoop outside the system

Step Step Step Step Step

ClientLoop outside the system

Iterations in Flink

20

Streaming dataflow with feedback

[Diagram: iterative dataflow with map, join, and reduce operators and a feedback edge.]

The system is iteration-aware and performs automatic optimization.

21

Automatic Optimization for Iterative Programs

• Caching loop-invariant data
• Pushing work "out of the loop"
• Maintaining state as an index
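The "maintain state as an index" idea is the heart of delta iterations: keep per-element state in a lookup structure and, in each step, process only a workset of elements that changed. A plain-Java connected-components sketch (illustrative, not Flink's delta-iteration API):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class DeltaIterationSketch {
    // Delta-style connected components: comp[] is the indexed state (smallest
    // known component id per vertex), and the workset holds only vertices
    // whose state changed. A bulk iteration would rescan every vertex each
    // round; here the workset shrinks as the computation converges.
    static int[] connectedComponents(int n, int[][] edges) {
        int[] comp = new int[n];
        for (int v = 0; v < n; v++) comp[v] = v;      // initial state: own id
        List<List<Integer>> adj = new ArrayList<>();
        for (int v = 0; v < n; v++) adj.add(new ArrayList<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }

        Deque<Integer> workset = new ArrayDeque<>();
        for (int v = 0; v < n; v++) workset.add(v);
        while (!workset.isEmpty()) {
            int v = workset.poll();
            for (int u : adj.get(v)) {
                if (comp[v] < comp[u]) {              // propagate smaller id
                    comp[u] = comp[v];                // update indexed state
                    workset.add(u);                   // only changed vertices re-enter
                }
            }
        }
        return comp;
    }
}
```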

22

Unifies various kinds of computations

ExecutionEnvironment env = getExecutionEnvironment();

DataSet<Long> vertexIds = ...
DataSet<Tuple2<Long, Long>> edges = ...

DataSet<Tuple2<Long, Long>> vertices = vertexIds.map(new IdAssigner());

DataSet<Tuple2<Long, Long>> result = vertices.runOperation(
    VertexCentricIteration.withPlainEdges(
        edges, new CCUpdater(), new CCMessager(), 100));

result.print();
env.execute("Connected Components");

Pregel/Giraph-style Graph Computation

23

Delta Iterations speed up certain problems by a lot

[Chart: number of vertices updated (thousands) per iteration, bulk vs. delta, for the Twitter and Webbase (20) graphs; runtimes in seconds. Computations performed in each iteration for connected communities of a social graph.]

Cover typical use cases of Pregel-like systems with comparable performance in a generic platform and developer API.

24

Program Optimization

25

Why Program Optimization?

Do you want to hand-optimize that?

26

What is Automatic Optimization

The same program gets a different execution plan for each setting:
• Run on a sample on the laptop → Execution Plan A
• Run on large files on the cluster → Execution Plan B
• Run a month later, after the data evolved → Execution Plan C

Optimizer decisions: hash vs. sort, partition vs. broadcast, caching, reusing partitioning/sort

27

Using Flink

28

http://flink.incubator.apache.org

29

It's easy to get started…

If you have YARN, deploy a full Flink setup in 3 commands:

wget http://www.apache.org/dyn/closer.cgi/incubator/flink/flink-0.6-incubating-bin-hadoop2-yarn.tgz
tar xvzf flink-0.6-incubating-bin-hadoop2-yarn.tgz
./flink-0.6-incubating/bin/yarn-session.sh -n 4 -jm 1024 -tm 3000

Also works on Amazon Elastic MapReduce ;-)

Quickstart projects set up a program skeleton, including an embedded local execution/debugging environment…

For the experts…

Trying it out: running a local pseudo-cluster

$ wget https://.../flink-0.6-incubating.tgz
$ tar xzf flink-*.tgz
$ flink/bin/start-local.sh

Also available as a Debian package.

30

Tutorial Example

Where to find us:

flink.incubator.apache.org
github.com/apache/incubator-flink
@ApacheFlink