Building Complex Data Workflows with Cascading on...

37
Building Complex Data Workflows with Cascading on Hadoop Gagan Agrawal, Sr. Principal Engineer, Snapdeal

Transcript of Building Complex Data Workflows with Cascading on...

Page 1: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Building Complex Data Workflows with Cascading on Hadoop

Gagan Agrawal, Sr. Principal Engineer, Snapdeal

Page 2: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Agenda

• What is Cascading• Why use Cascading• Cascading Building Blocks• Example Workflows• Testing• Advantages / Disadvantages• Cascading @ Snapdeal

Page 3: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

What is Cascading ?

Page 4: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

What is Cascading ?

• Abstraction over Map Reduce• API for data-processing workflows• JVM framework and SDK for creating abstracted

data flows• Translates data flows into actual

Hadoop/RDBMS/Local jobs

Page 5: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

Page 6: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Options in Hadoop

– Map Reduce API– Pig– Hive

Page 7: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Map Reduce is

– Low Level API– Requires lot of boilerplate code– Can be difficult to chain jobs

Page 8: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Pig

– Good for scripting– Can be difficult to manage multi-module

workflows– Can be difficult to Unit Test– UDF written in different language (Java,

Python (via streaming) etc.)

Page 9: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Hive

– Good for simple queries– Complex workflows can be very difficult to

implement

Page 10: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Hive

– Good for simple queries– Complex workflows can be very difficult to

implement

Page 11: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Cascading

– Helps create higher level data processing abstractions

– Sophisticated data pipelines– Re-usable components

Page 12: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

Page 13: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Tap– Source– Sink

• Pipes– Each– Every– Merge– GroupBy– CoGroup– HashJoin

Page 14: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Tap– Source– Sink

• Pipes– Each– Every– Merge– GroupBy– CoGroup– HashJoin

Page 15: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Tap– Source– Sink

• Pipes– Each– Every– Merge– GroupBy– CoGroup– HashJoin

Page 16: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Functions• Filter• Aggregator• Buffer• Sub Assemblies• Flows• Cascades•

Page 17: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Terminologies

• Flow – A path for data with some number of inputs,

some operations, and some outputs• Cascade

– A series of connected flows• Operation

– A function applied to data, yielding new data•

Page 18: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Terminologies

• Pipe– Moves data from some place to some other

place• Tap

– Feeds data from outside the flow into it and writes data from inside the flow out of it.

Page 19: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Pipe Assemblies

• Define work against a Tuple Stream• May have multiple sources and sinks

– Splits– Merges– Joins

Page 20: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Pipe

Page 21: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Example Workflows

Page 22: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Distributed Copy

Page 23: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Word Count

Page 24: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Custom Function

• A function expects a stream of individual tuples• Returns zero or more tuples• Used with Each pipe• To create custom function

– Sublcass cascading.operation.BaseOperation– Implement cascading.operation.Function

Page 25: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Custom Function

Page 26: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Joins

• CoGroup• HashJoin

Page 27: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Joins

Page 28: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

TF-IDF Example

Page 29: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Testing

Page 30: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Testing

• Create flows entirely in code on a local machine• Write tests for controlled sample data sets• Run tests as a regular old Java without needing

access to actual Hadoop or databases• Local machine and CI testing are easy

Page 31: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Advantages / Disadvantages

Page 32: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Advantages

• Re-usability– Pipe assemblies are designed for reuse– Once created and tested, use them in other

flows– Write logic to do something only once

• Simple Stack– Cascading creates DAG of dependent jobs

for us– Removes most of the need for oozie

Page 33: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Advantages

• Simpler Stack– Keeps track of where a flow fails and can

rerun from that point on failure• Common Code Base

– Since everything is written in java, everybody can use same terms and same tech

Page 34: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Disadvantages

• JVM Based– Java / Scala / Clojure

• Doesn't have job scheduler– Can figure out dependency graph for jobs,

but nothing to run them on regular interval– We still need scheduler like quartz

Page 35: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Disadvantages

• No real built-in monitoring• Easy to have a flow report what it has done;

hard to watch it in progress

Page 36: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading @ Snapdeal

Page 37: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –