
Page 1: logo here company Your Pavan Kumar Kolamuri & Pragya ...developermarch.com/developersummit/2016/report/...Falcon Overview How Apache Falcon is used @ InMobi Falcon Unit Native Scheduler

Big Data Processing simplified with Apache Falcon

Pavan Kumar Kolamuri & Pragya Mittal


$ whoami

• Pavan Kolamuri
  – Senior Software Engineer, InMobi
  – Committer, Apache Falcon
  – Contributor, Apache Oozie

• Pragya Mittal
  – Software Engineer, InMobi
  – Committer, Apache Falcon
  – Contributor, Apache Oozie


What is in store for you?

❖ Falcon Overview

❖ How Apache Falcon is used @ InMobi

❖ Falcon Unit

❖ Native Scheduler

❖ SLA Monitoring

❖ Feed LifeCycle

❖ Lineage Over Entities and Instances

❖ Data Import and Export

❖ Q&A


Once upon a time


Sample Pipeline

[Diagram: Click Logs → Click Enhancer → Enhanced Clicks → Hourly Aggregation → Hourly Clicks → Daily Aggregation → Daily clicks, with a Metadata dataset alongside.]

Dataset annotations on the diagram:

● Retention: 2 hours; Frequency: 5 mins; Late data arrival

● Retention: 2 days; Replication required

● Retention: 1 day

● Retention: 7 days; Replication required


What kept us up at night?

❏ Failures

❏ Data arriving late

❏ Re-processing

❏ Varied Data Replication

❏ Varied Data Retention

❏ Data Archival

❏ Lineage

❏ SLA monitoring


The pattern


Some maladies cured

Process Management

● Relays

● Late Data Handling

● Failure Retries

● Reruns

Data Management

● Data Import/Export

● Retention

● Replication

● Archival

Data Governance

● Lineage

● Audit

● SLA


The concoction


Concoction.. distributed


Sample pipeline in Falcon

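[Diagram: Click Logs → Click Enhancer → Enhanced Clicks → Hourly Aggregation → Hourly Clicks → Daily Aggregation → Daily clicks, with a Metadata dataset alongside. The legend marks datasets as Falcon Feeds and processing steps as Falcon Processes.]

Dataset annotations on the diagram:

● Retention: 2 hours; Frequency: 5 mins; Late data arrival

● Retention: 2 days; Replication required

● Retention: 1 day

● Retention: 7 days; Replication required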


Data management


Cluster Specification

<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
    <tags>[email protected], [email protected], _department_type=forecasting</tags>
    <interfaces>
        <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
        <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
        <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
        <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
        <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/projects/falcon/working"/>
    </locations>
    <properties>
        <property name="field1" value="value1"/>
    </properties>
</cluster>


Feed Specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(5)</frequency>
    <late-arrival cut-off="hours(1)"/>
    <sla slaLow="hours(2)" slaHigh="hours(3)"/>
    <clusters>
        <cluster name="corp" type="source">
            <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
        </cluster>
        <cluster name="secondary" type="target">
            <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
            <locations>
                <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
            </locations>
        </cluster>
    </clusters>
    …..
</feed>
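To make the path template in the feed specification concrete, the following Python sketch expands the `${YEAR}`, `${MONTH}`, `${DAY}`, `${HOUR}` and `${MINUTE}` variables for one feed instance time. It is an illustration only: `expand_feed_path` is a hypothetical helper that is not part of Falcon, and it assumes zero-padded UTC fields.

```python
from datetime import datetime, timezone


def expand_feed_path(template: str, instance_time: datetime) -> str:
    """Substitute ${VAR}-style time variables for one instance time.

    Hypothetical helper for illustration; zero-padding is an assumption.
    """
    fields = {
        "YEAR": f"{instance_time.year:04d}",
        "MONTH": f"{instance_time.month:02d}",
        "DAY": f"{instance_time.day:02d}",
        "HOUR": f"{instance_time.hour:02d}",
        "MINUTE": f"{instance_time.minute:02d}",
    }
    path = template
    for name, value in fields.items():
        # Replace each ${NAME} placeholder with its formatted value.
        path = path.replace("${" + name + "}", value)
    return path


TEMPLATE = "/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"
EXAMPLE_PATH = expand_feed_path(
    TEMPLATE, datetime(2013, 11, 15, 0, 5, tzinfo=timezone.utc)
)
```

With a 5-minute frequency, each instance of the feed would map to its own directory under this scheme.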


Process Specification

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="corp">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>LIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
    </inputs>
    <outputs>
        <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
    </outputs>
    <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
    <retry policy="periodic" delay="hours(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
    </late-process>
</process>
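The specifications express durations as expressions such as `minutes(5)`, `hours(1)` and `days(2)`. As a rough sketch of how such values can be interpreted, the Python below converts them to `timedelta` objects and estimates how many feed instances fall inside a retention window. Both function names are hypothetical, and only the three fixed-length units are handled; this is not Falcon's own parser.

```python
import re
from datetime import timedelta

# Only fixed-length units are supported in this sketch (an assumption).
_PATTERN = re.compile(r"^(minutes|hours|days)\((\d+)\)$")


def parse_frequency(expr: str) -> timedelta:
    """Turn an expression like 'hours(1)' into a timedelta."""
    match = _PATTERN.match(expr.strip())
    if match is None:
        raise ValueError(f"unsupported frequency expression: {expr!r}")
    unit, amount = match.group(1), int(match.group(2))
    return timedelta(**{unit: amount})


def instances_per_retention(frequency: str, retention: str) -> int:
    """Number of whole instances that fit in the retention window."""
    return parse_frequency(retention) // parse_frequency(frequency)
```

For example, a feed with `minutes(5)` frequency and `days(2)` retention keeps 576 instances under this reading.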


Why Falcon Unit ?


Motivation for Falcon Unit

• User errors caught only at deploy time.

– Input/Output feeds and paths not getting resolved.

– Errors in specification.

• Integration Tests require environment setup/tearDown.

– Messy deployment scripts.

– Time consuming.

• Debugging was cumbersome.

– Logs scattered.


Falcon Unit

In-process execution environment

● Local Oozie

● Local File System

● Local Job Runner

● Local Message Queue

Actual cluster

● Oozie

● HDFS

● YARN

● ActiveMQ

Test suite


Before Falcon Unit

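[Diagram: the sample pipeline's Falcon Feeds and Falcon Processes, each with a status label. Problems shown: Spec. Invalid, Invalid Output, Invalid Input, Improper Replication. The remaining entities are marked OK.]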


Capabilities with example

• Entity creation and data flow validation.

• Data Injection.

• Data Retention and Replication.

• Seamless API for cluster and local mode.


Falcon Unit Example

• Process Submission:

– submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Local

– submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Cluster Mode

• Process Scheduling:

– scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

• Process Verification:

– getInstanceStatus(EntityType.Process, "daily_clicks_agg", scheduleTime);


After Falcon Unit

[Diagram: the same pipeline after Falcon Unit. Every Falcon Feed and Falcon Process is marked OK, and the pipeline is marked TESTED.]


Native Scheduler


Why build a Native Scheduler?

Falcon uses Oozie for:

❖ DAG Execution

❖ Scheduling - Gaps exist

➢ Simple periodic scheduling without data gating.

➢ Cron + calendar based scheduling with/without data gating.

➢ Flexible data gating

➢ Support for a-periodic datasets and triggers based on data availability.

➢ Support for external triggers.
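The idea of periodic scheduling with data gating from the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Falcon's scheduler: `instance_times` and `ready_instances` are hypothetical names, the validity end is treated as exclusive, and data availability is modeled as a plain set of paths.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Iterator, List, Set


def instance_times(
    start: datetime, end: datetime, frequency: timedelta
) -> Iterator[datetime]:
    """Yield nominal instance times in [start, end) at a fixed frequency."""
    current = start
    while current < end:
        yield current
        current += frequency


def ready_instances(
    times: Iterable[datetime],
    required_inputs: Callable[[datetime], List[str]],
    available: Set[str],
) -> List[datetime]:
    """Data gating: keep only instances whose inputs are all available."""
    return [
        t for t in times
        if all(path in available for path in required_inputs(t))
    ]
```

Passing a function that returns an empty input list for every instance gives simple periodic scheduling without data gating, since every instance is then trivially ready.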


Scheduler - Before

Falcon Server Scheduler

Execution



Scheduler - The Plan

• Time based scheduling - Available in 0.9

• Data based gating - Will be available in 0.10

• Complete parity with Oozie and additional features - The release after


Scheduler - After

[Diagram labels: Falcon Server, Scheduler, DAG Executor, Execution, ANY DAG Executor]


Data Governance


▪ GraphDB - Data availability storage

▪ Lineage - Data flow

▪ SLA - Data monitoring


Graph DB

▪ DAG notation.

▪ Used to store information about entities, instances and their dependencies.

▪ Vertices - Entities/instances

▪ Edges - Relationships among vertices

▪ Graph APIs exposed to analyse metrics.

▪ Pipeline health check.


The following relationships are captured:

● owner of entities - User

● data classification tags

● groups defined in feeds

● Relationships between entities

○ Clusters associated with Feed and Process entity

○ Input and Output feeds for a Process

● Instances refer to corresponding entities
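The relationships listed above can be pictured as labeled edges between vertices. The Python sketch below stores a few of them in a small in-memory graph and walks the data flow downstream of a feed. It reuses the entity names from the earlier specifications (`clicks-enhanced`, `clicks-hourly`, `click-hourly`, `corp`), but the class, the edge labels (`input-of`, `produces`, `stored-in`, `runs-on`) and the traversal are all assumptions made for illustration, not Falcon's graph schema or API.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

# Hypothetical labels for edges that represent movement of data.
DATA_FLOW_LABELS = {"input-of", "produces"}


class LineageGraph:
    """Tiny directed graph with labeled edges (illustrative only)."""

    def __init__(self) -> None:
        self._out: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, src: str, label: str, dst: str) -> None:
        self._out[src].append((label, dst))

    def neighbours(self, src: str, label: str) -> List[str]:
        """Vertices reachable from src over one edge with the given label."""
        return [dst for lbl, dst in self._out[src] if lbl == label]

    def downstream(self, start: str) -> List[str]:
        """Breadth-first walk over data-flow edges, excluding the start."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            vertex = queue.popleft()
            for label, nxt in self._out[vertex]:
                if label in DATA_FLOW_LABELS and nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


graph = LineageGraph()
graph.add_edge("clicks-enhanced", "stored-in", "corp")
graph.add_edge("clicks-enhanced", "input-of", "clicks-hourly")
graph.add_edge("clicks-hourly", "runs-on", "corp")
graph.add_edge("clicks-hourly", "produces", "click-hourly")
```

The same structure works for instances: each instance vertex would carry one more edge pointing at the entity it belongs to.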


GraphDB is exposed in 3 ways:

● REST API

● CLI

● Dashboard


Entity Graph


Instance Graph


LINEAGE

▪ Logical flow/movement of data from source to destination.

▪ Falcon adds the ability to capture lineage for both entities and their associated instances.

▪ Captures the metadata tags associated with each of the entities as relationships.


Entity dependency

It returns the graph depicting the relationship between the various processes and feeds in a given pipeline.


Instance dependency

Gives information about the producer and consumer for a particular instance.


SLA Monitoring

▪ Alerts based on data availability

▪ Two types of SLA
  - SLA Low
  - SLA High

▪ Dashboard

▪ Pluggable alerting system
  - Email
  - JMS notifications


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="Yoda-Minutely" description="clicks log" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(1)</frequency>
    <sla slaLow="minutes(3)" slaHigh="minutes(5)"/>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(6)"/>
    <clusters>
        <cluster name="uh1-gold" type="source">
            <validity start="2015-11-29T10:28Z" end="2015-11-29T10:38Z"/>
            <retention limit="months(9000)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/tmp/falcon-regression/FeedSLA/input/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
    </locations>
    <ACL owner="pragya" group="dataqa" permission="*"/>
    <schema location="/schema/clicks" provider="protobuf"/>
</feed>


dataqa@lda01:~$ falcon entity -type feed -start 2015-11-04T08:03Z -slaAlert

name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:28Z, tags: Missed SLA High

name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:29Z, tags: Missed SLA High

name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:30Z, tags: Missed SLA High

name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:31Z, tags: Missed SLA Low
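One way to read the output above, given `slaLow="minutes(3)"` and `slaHigh="minutes(5)"`, is that an instance whose data has still not arrived is tagged once the check time passes the instance time plus the threshold. The Python sketch below encodes that reading. It is an assumption-laden illustration: `sla_status` is a hypothetical function, the inclusive comparison is a guess, and the check time of 10:35Z is chosen only because it is consistent with the four lines shown.

```python
from datetime import datetime, timedelta
from typing import Optional


def sla_status(
    instance_time: datetime,
    now: datetime,
    sla_low: timedelta,
    sla_high: timedelta,
    available: bool = False,
) -> Optional[str]:
    """Classify a feed instance against its SLA thresholds.

    Returns None when the data is available or no threshold has passed.
    """
    if available:
        return None
    if now >= instance_time + sla_high:
        return "Missed SLA High"
    if now >= instance_time + sla_low:
        return "Missed SLA Low"
    return None


# Assumed check time; not stated in the slides.
CHECK_TIME = datetime(2015, 11, 29, 10, 35)
STATUSES = [
    sla_status(
        datetime(2015, 11, 29, 10, minute),
        CHECK_TIME,
        timedelta(minutes=3),
        timedelta(minutes=5),
    )
    for minute in (28, 29, 30, 31)
]
```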


Audit Logs

▪ Falcon audits all user activity and captures it in a log file by default.

▪ Users may also extend the Falcon Audit plugin to send audits to systems like Apache Argus.


Data Import and Export


Data Management Actions


Missing Piece


Data Import

[Diagram: RDBMS → Falcon Feed → HDFS]

● Different modes of extraction: full or incremental

● Different modes of output (merge): snapshot, append

● Include/exclude columns


Data Export

[Diagram: HDFS → Falcon Feed → RDBMS]

● Different modes of load: insert, update-only

● Include/exclude columns


What else is out there?

Better Authentication and Authorization

Even Better UI

Even Better monitoring

Process SLAs

Streaming support

A more powerful scheduler

Pipeline Recovery


Questions?


If you want to ask later - [email protected]

If you want to contribute - [email protected]