Fabian Hueske – Cascading on Flink
Transcript of Fabian Hueske – Cascading on Flink
What is Cascading?
“Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
Java API for large-scale batch processing
Programs are specified as data flows
• pipes, taps, flows, cascades, …
• each, groupBy, every, coGroup, merge, …
Open source (Apache License 2.0)
• Developed by Concurrent
Cascading on MapReduce
Originally built for Hadoop MapReduce
A much better API than raw MapReduce
• DAG programming model
• Higher-level operators (join, coGroup, merge)
• Composable and reusable code
Automatic translation to MapReduce jobs
• Minimizes the number of MapReduce jobs
Rock-solid execution due to Hadoop MapReduce
Cascading Example
Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency × Inverse Document Frequency
• Used for weighting the relevance of terms in search engines
Building this against the raw MapReduce API is painful!
Example taken from docs.cascading.org/impatient
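As a reference for what the flow computes, here is a minimal, self-contained Java sketch of the TF-IDF metric itself (plain Java, no Cascading; the toy corpus is made up for illustration):

```java
import java.util.*;

public class TfIdfSketch {

    // tf(t, d): how often term t occurs in document d, relative to the document length
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // idf(t): log of (total #documents / #documents containing t);
    // rare terms get a high idf, ubiquitous terms a low one
    static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / containing);
    }

    // TF-IDF is the product of the two factors
    static double tfIdf(List<List<String>> corpus, List<String> doc, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("rain", "shadow", "rain"),
                Arrays.asList("rain", "dvd"),
                Arrays.asList("shadow", "cloud"));
        // "rain" appears in 2 of 3 documents -> lower idf; "dvd" in 1 of 3 -> higher idf
        System.out.printf("tfidf(rain, doc0) = %.4f%n", tfIdf(corpus, corpus.get(0), "rain"));
        System.out.printf("tfidf(dvd,  doc1) = %.4f%n", tfIdf(corpus, corpus.get(1), "dvd"));
    }
}
```

The Cascading version expresses exactly this computation, but as a pipe assembly over HDFS files instead of in-memory lists.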
Who uses Cascading?
Runs in many production environments
• Twitter, SoundCloud, Etsy, Airbnb, …
More APIs have been built on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)
Cascading 3.0
Released in June 2015
A new planner
• Execution backend can be changed
Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writes to HDFS between jobs
Why Cascading on Flink?
Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory, on- and off-heap
• Efficient in-memory and out-of-core operators
• Sorting and hashing on binary data
• Robust operation without tuning (no OOMEs, little GC pressure)
YARN integration
Cascading on Flink Released
Available on GitHub
• Apache License 2.0
Depends on
• Cascading 3.1 (work in progress)
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink
Check GitHub for details: http://github.com/dataartisans/cascading-flink
Flow Translation
Implemented on top of the Java DataSet API
Uses Cascading’s rule-based planner
• A flow is compiled into a single Flink job
• The operators of a job are partitioned into nodes
• Operators within a node are chained
Translation rules partition the flow where
• data is shuffled
• data is processed by Flink’s internal operators
• flows branch or merge
• at sources and sinks
Operator Translation
Cascading operators have a fixed execution strategy
• No degrees of freedom for Flink’s optimizer
• Strategies are fixed using hints for Flink’s optimizer

Cascading Operator          → Flink Operator(s)
(n-ary) GroupBy             → (Union +) Reduce
(n-ary) BufferJoin CoGroup  → (Union +) Reduce
(n-ary) CoGroup             → (sequence of) binary hash-partitioned, sorted OuterJoins
(n-ary) HashJoin            → (sequence of) binary broadcast HashJoins
(n-ary) Merge               → n-ary Union
Tap                         → Source or Sink
Serializers & Comparators
Flink needs information about all processed data types
• Generates serializers and comparators
Cascading supports
• Schema-less tuples (no fixed length, no types)
• Definition of key fields by name and (relative) position
• Null values in fields and key fields
Custom type information for Cascading tuples
• Native serializers & comparators for fields with known types
• Kryo for unknown field types
• Support for null values by wrapping serializers & comparators
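The null-handling trick can be sketched in plain Java: a wrapping serializer writes a presence flag before delegating to the serializer for the non-null type. (Purely illustrative; the interface below is made up and differs from Flink's actual TypeSerializer API.)

```java
import java.io.*;

// Illustrative delegate interface, not Flink's real TypeSerializer API.
interface FieldSerializer<T> {
    void write(DataOutput out, T value) throws IOException;
    T read(DataInput in) throws IOException;
}

// Wraps a serializer for a non-null type so that null fields survive a round trip.
class NullableSerializer<T> implements FieldSerializer<T> {
    private final FieldSerializer<T> inner;

    NullableSerializer(FieldSerializer<T> inner) {
        this.inner = inner;
    }

    @Override
    public void write(DataOutput out, T value) throws IOException {
        out.writeBoolean(value != null);          // presence flag
        if (value != null) {
            inner.write(out, value);              // delegate for the actual value
        }
    }

    @Override
    public T read(DataInput in) throws IOException {
        return in.readBoolean() ? inner.read(in) : null;
    }
}
```

The same flag-first idea works for comparators: null can be ordered before all non-null values, and only non-null pairs are delegated to the wrapped comparator.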
Going Out-of-Core
Join and CoGroup must hold data in memory
• If the data exceeds memory, we need to go to disk
Cascading on MapReduce uses spillable collections
• Spill to disk if #elements > threshold
• Part of Cascading (not MapReduce)
• Threshold is either too low or too high
Cascading on Flink uses Flink’s Join and OuterJoin
• Part of Flink (not Cascading)
• Backed by Flink’s managed memory
• Transparently spill to disk if necessary
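The difference between the two spilling policies can be illustrated with a toy buffer (purely illustrative; this is neither Cascading's nor Flink's actual code): a count-based threshold ignores record size, while a byte budget tracks actual memory use, which is why a fixed element count ends up "either too low or too high".

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory buffer that decides when to "spill" buffered records to disk.
class SpillBuffer {
    private final List<byte[]> records = new ArrayList<>();
    private long bytesUsed = 0;
    private int spills = 0;

    private final int countThreshold;   // Cascading-on-MR style: fixed #elements
    private final long byteBudget;      // Flink style: fixed memory budget in bytes
    private final boolean useByteBudget;

    SpillBuffer(int countThreshold, long byteBudget, boolean useByteBudget) {
        this.countThreshold = countThreshold;
        this.byteBudget = byteBudget;
        this.useByteBudget = useByteBudget;
    }

    void add(byte[] record) {
        records.add(record);
        bytesUsed += record.length;
        boolean mustSpill = useByteBudget
                ? bytesUsed > byteBudget          // reacts to actual memory pressure
                : records.size() > countThreshold; // blind to record size
        if (mustSpill) {
            spill();
        }
    }

    private void spill() {  // stand-in for writing a sorted run to disk
        records.clear();
        bytesUsed = 0;
        spills++;
    }

    int spillCount() {
        return spills;
    }
}
```

With a count threshold, a handful of huge records never triggers a spill (OOM risk), while thousands of tiny records spill needlessly; a byte budget avoids both failure modes.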
How to Run Cascading on Flink
Add the cascading-flink Maven dependency to your Cascading project
• Available in the Sonatype Nexus repository
• Or build it from source (GitHub)
Change just one line of code in your Cascading program
• Replace Hadoop2MR1FlowConnector with FlinkConnector
• Do not change any application logic
Execute the Cascading program as a regular Flink program
Detailed instructions on GitHub
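The one-line change looks roughly like this (a sketch against the Cascading 3 API; imports and the `properties`/`flowDef` setup are elided and unchanged from a regular Cascading program):

```java
// Before: run on Hadoop MapReduce
// FlowConnector connector = new Hadoop2MR1FlowConnector(properties);

// After: run on Flink -- the rest of the program stays exactly the same
FlowConnector connector = new FlinkConnector(properties);

Flow flow = connector.connect(flowDef);  // flowDef built as before
flow.complete();
```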
(Preliminary!) Performance Evaluation
8 worker nodes
• 8 CPUs, 30 GB RAM, 2 local SSDs
Hadoop 2.7.1 (YARN, HDFS, MapReduce)
Flink 0.10-SNAPSHOT
80 GB of generated text data
Baseline: WordCount

[Bar chart: execution time in minutes (0–16) for MapReduce native, Flink native, Cascading on MR, and Cascading on Flink]

Cascading on MR is compiled to 1 MR job
• Similar execution strategy (hash-partition, sort)
• No significant speed-up expected
• Verifies our implementation
Hash-Aggregators!
Something more complex: TF-IDF
Taken from “Cascading for the Impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
http://docs.cascading.org/impatient
TF-IDF on MapReduce
Cascading on MapReduce translates the TF-IDF program into 9 MapReduce jobs
Each job
• reads data from HDFS
• applies a Map function
• shuffles the data over the network
• sorts the data
• applies a Reduce function
• writes the data to HDFS
TF-IDF on Flink
The shuffle is pipelined
Intermediate results are not written to or read from HDFS

[Bar chart: execution time in minutes (0–600) for Cascading on MR vs. Cascading on Flink]
Conclusion
Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Requires virtually no code changes
Apache Flink’s runtime is very versatile
• Apache Hadoop MapReduce
• Apache Storm
• Google Dataflow
• Apache SAMOA (incubating)
• plus Flink’s own APIs and libraries …