Fabian Hueske – Cascading on Flink


Transcript of Fabian Hueske – Cascading on Flink

Cascading on Flink

Fabian Hueske @fhueske

What is Cascading?

“Cascading is the proven application development platform for building data applications on Hadoop.”

(www.cascading.org)

Java API for large-scale batch processing

Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …

Open Source (AL2)
• Developed by Concurrent


Cascading on MapReduce

Originally for Hadoop MapReduce

Much better API than MapReduce
• DAG programming model
• Higher-level operators (join, coGroup, merge)
• Composable and reusable code

Automatic translation to MapReduce jobs
• Minimizes the number of MapReduce jobs

Rock-solid execution due to Hadoop MapReduce

Cascading Example


Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency / Inverse Document Frequency
• Used for weighting the relevance of terms in search engines

Building this against the MapReduce API is painful!

Example taken from docs.cascading.org/impatient
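The code for this example is not reproduced in the transcript. To give a flavor of the Cascading API, here is a condensed word-count-style sketch in the spirit of the Impatient tutorial (the full TF-IDF flow adds CoGroups and a HashJoin); paths, pipe names, and field names are illustrative:

```java
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountSketch {
  public static void main(String[] args) {
    String docPath = args[0];  // input documents on HDFS
    String wcPath  = args[1];  // output path for the word counts

    // source and sink taps: tab-delimited text files on HDFS
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap  = new Hfs(new TextDelimited(true, "\t"), wcPath);

    // Each: split the "text" field of every tuple into "token" tuples
    Fields token = new Fields("token");
    Fields text  = new Fields("text");
    Pipe docPipe = new Each("docs", text,
        new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"), Fields.RESULTS);

    // GroupBy + Every: group by token and count the occurrences per group
    Pipe wcPipe = new GroupBy("wc", docPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // wire taps and pipes into a flow and run it on Hadoop MapReduce
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    new Hadoop2MR1FlowConnector(new Properties()).connect(flowDef).complete();
  }
}
```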

Who uses Cascading?

Runs in many production environments
• Twitter, SoundCloud, Etsy, Airbnb, …

More APIs have been put on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)


Cascading 3.0

Released in June 2015

A new planner
• Execution backend can be changed

Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs


Cascading on Flink


Why Cascading on Flink?

Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory, on- and off-heap
• Efficient in-memory and out-of-core operators
• Sorting and hashing on binary data
• No tuning needed for robust operation (avoids OOMEs and GC pressure)

YARN integration


Cascading on Flink Released

Available on GitHub
• Apache License V2

Depends on
• Cascading 3.1 (WIP)
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink

Check GitHub for details: http://github.com/dataartisans/cascading-flink


Translation Details


Flow Translation

Implemented on top of the Java DataSet API

Using Cascading’s rule-based planner
• A flow is compiled into a single Flink job
• The operators of a job are partitioned into nodes
• Chaining of operators

Translation rules partition the flow
• where data is shuffled
• where data is processed by Flink’s internal operators
• where flows branch or merge
• at sources and sinks


Operator Translation


Cascading operators have a fixed execution strategy
• No degrees of freedom for Flink’s optimizer
• Strategies are fixed using hints for Flink’s optimizer

Cascading Operator         | Flink Operator(s)
(n-ary) GroupBy            | (Union →) Reduce
(n-ary) BufferJoin CoGroup | (Union →) Reduce
(n-ary) CoGroup            | (sequence of) binary hash-partitioned, sorted OuterJoins
(n-ary) HashJoin           | (sequence of) binary broadcasted HashJoins
(n-ary) Merge              | n-ary Union
Tap                        | Source or Sink
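To make the mapping above more concrete, here is a hand-written sketch of the kind of DataSet program that a GroupBy followed by a counting Every corresponds to; it is an illustration only, not the code that cascading-flink actually generates:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class GroupByTranslationSketch {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple2<String, Integer>> tokens = env.fromElements(
        new Tuple2<>("flink", 1),
        new Tuple2<>("cascading", 1),
        new Tuple2<>("flink", 1));

    // A Cascading GroupBy on the token field followed by an Every/Count
    // roughly corresponds to a grouped aggregation on the key position.
    tokens.groupBy(0)   // hash-partition and group by the token field
          .sum(1)       // aggregate the counts per group
          .print();
  }
}
```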

Serializers & Comparators Flink needs information about all processed data

types• Generation of serializer and comparators

Cascading supports• Schema-less tuples (no length, no types)• Definition of key fields by name and (relative) position• Null-valued fields and key fields

Custom type information for Cascading tuples• Native serializers & comparators for fields with known

type• Kryo for unknown field types• Support for null values by wrapping serializers &

comparators 13
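A minimal conceptual sketch of the null-wrapping idea, assuming a simplified field-serializer interface rather than Flink’s actual TypeSerializer API: a presence flag is written before each value so that null fields survive the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified field-serializer interface for illustration only.
interface FieldSerializer<T> {
  void write(T value, DataOutput out) throws IOException;
  T read(DataInput in) throws IOException;
}

// Adds null support to any field serializer by writing a presence flag
// before the value, so null-valued tuple fields can be handled without
// changing the wrapped serializer.
class NullWrappingSerializer<T> implements FieldSerializer<T> {
  private final FieldSerializer<T> inner;

  NullWrappingSerializer(FieldSerializer<T> inner) {
    this.inner = inner;
  }

  public void write(T value, DataOutput out) throws IOException {
    out.writeBoolean(value != null);   // presence flag
    if (value != null) {
      inner.write(value, out);         // delegate to the field serializer
    }
  }

  public T read(DataInput in) throws IOException {
    return in.readBoolean() ? inner.read(in) : null;
  }
}

public class NullWrappingDemo {
  public static void main(String[] args) throws IOException {
    FieldSerializer<String> strings = new FieldSerializer<String>() {
      public void write(String v, DataOutput out) throws IOException { out.writeUTF(v); }
      public String read(DataInput in) throws IOException { return in.readUTF(); }
    };
    FieldSerializer<String> nullable = new NullWrappingSerializer<>(strings);

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    nullable.write("flink", out);
    nullable.write(null, out);   // null field round-trips safely

    DataInput in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    System.out.println(nullable.read(in));   // prints: flink
    System.out.println(nullable.read(in));   // prints: null
  }
}
```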

Going Out-of-Core

Join and CoGroup must hold data in memory
• If the data exceeds memory, we need to go to disk

Cascading on MR uses spillable collections
• Spill to disk if #elements > threshold
• Part of Cascading (not MapReduce)
• Threshold is either too low or too high

Cascading on Flink uses Flink’s Join and OuterJoin (see the sketch below)
• Part of Flink (not Cascading)
• Backed by Flink’s managed memory
• Transparently spills to disk if necessary
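At the API level, “using Flink’s Join” roughly corresponds to the hand-written sketch below (illustrative data, not generated code); the hash- and sort-merge join operators behind it run on managed memory and spill to disk when an input does not fit:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinSketch {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple2<Integer, String>> users = env.fromElements(
        new Tuple2<>(1, "alice"), new Tuple2<>(2, "bob"));
    DataSet<Tuple2<Integer, String>> orders = env.fromElements(
        new Tuple2<>(1, "book"), new Tuple2<>(1, "pen"));

    // Equi-join on the first field of both inputs; the runtime chooses a
    // hash- or sort-based strategy and goes out-of-core if necessary.
    users.join(orders)
         .where(0)
         .equalTo(0)
         .print();
  }
}
```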


Running Cascading on Flink


How to Run Cascading on Flink

Add the cascading-flink Maven dependency to your Cascading project
• Available in the Sonatype Nexus repository
• Or build it from source (GitHub)

Change just one line of code in your Cascading program (see the sketch below)
• Replace Hadoop2MR1FlowConnector by FlinkConnector
• Do not change any application logic

Execute the Cascading program as a regular Flink program

Detailed instructions on GitHub
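A minimal sketch of that one-line change, assuming the connector class lives in the com.dataartisans.flink.cascading package and accepts a Properties object (check the repository for the exact coordinates):

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
// Package and constructor are assumptions; see the cascading-flink repository.
import com.dataartisans.flink.cascading.FlinkConnector;

public class RunOnFlink {

  // Runs an existing Cascading FlowDef on Flink. Compared to the MapReduce
  // version, only the connector changes: Hadoop2MR1FlowConnector is replaced
  // by FlinkConnector; taps, pipes, and the FlowDef stay untouched.
  public static void run(FlowDef flowDef, Properties properties) {
    FlowConnector connector = new FlinkConnector(properties);
    connector.connect(flowDef).complete();
  }
}
```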

(Preliminary!) Performance Evaluation

8 worker nodes
• 8 CPUs, 30GB RAM, 2 local SSDs

Hadoop 2.7.1 (YARN, HDFS, MapReduce)

Flink 0.10-SNAPSHOT

80GB of generated text data


Baseline Wordcount


[Bar chart: WordCount execution time in minutes (0–16) for MapReduce native, Flink native, Cascading on MR, and Cascading on Flink]

Cascading on MR is compiled to 1 MR job
• Similar execution strategy (hash-partition, sort)
• No significant speed gain expected

Verifies our implementation

Hash-Aggregators!

Something more complex: TF-IDF

Taken from “Cascading for the Impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin

http://docs.cascading.org/impatient

TF-IDF on MapReduce

Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs

Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS


TF-IDF on Flink

Cascading on Flink translates the TF-IDF job into one Flink job


TF-IDF on Flink

Shuffle is pipelined

Intermediate results are not written to or read from HDFS


[Bar chart: TF-IDF execution time in minutes (0–600) for Cascading on MR and Cascading on Flink]

Conclusion

Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes

Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache SAMOA (incubating)
• + Flink’s own APIs and libraries …

