Big Data Processing simplified with Apache Falcon
Pavan Kumar Kolamuri & Pragya Mittal
$whoami
• Pavan Kolamuri
  – Senior Software Engineer, InMobi
  – Committer, Apache Falcon
  – Contributor, Apache Oozie
• Pragya Mittal
  – Software Engineer, InMobi
  – Committer, Apache Falcon
  – Contributor, Apache Oozie
What is in store for you?
❖ Falcon Overview
❖ How Apache Falcon is used @ InMobi
❖ Falcon Unit
❖ Native Scheduler
❖ SLA Monitoring
❖ Feed LifeCycle
❖ Lineage Over Entities and Instances
❖ Data Import and Export
❖ Q&A
Once upon a time
Sample Pipeline
[Diagram: Click Logs and Metadata feed the Click Enhancer, which produces Enhanced Clicks; Hourly Aggregation rolls these up into Hourly Clicks, and Daily Aggregation rolls those up into Daily Clicks]
● Click Logs - retention: 2 hours, frequency: 5 mins, late data arrival
● Enhanced Clicks - retention: 2 days, replication required
● Hourly Clicks - retention: 1 day
● Daily Clicks - retention: 7 days, replication required
What kept us up at night?
❏ Failures
❏ Data arriving late
❏ Re-processing
❏ Varied Data Replication
❏ Varied Data Retention
❏ Data Archival
❏ Lineage
❏ SLA monitoring
The pattern
Some maladies cured
Data Management
● Data Import/Export
● Retention
● Replication
● Archival

Data Governance
● Lineage
● Audit
● SLA

Process Management
● Relays
● Late Data Handling
● Failure Retries
● Reruns
The concoction
Concoction... distributed
Sample pipeline in Falcon
[Diagram: the same pipeline expressed in Falcon - each dataset (Click Logs, Metadata, Enhanced Clicks, Hourly Clicks, Daily Clicks) is declared as a Falcon Feed, and each job (Click Enhancer, Hourly Aggregation, Daily Aggregation) as a Falcon Process; the retention, frequency and replication annotations become feed properties]
Data management
Cluster Specification

<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
    <tags>[email protected], [email protected], _department_type=forecasting</tags>
    <interfaces>
        <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
        <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
        <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
        <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
        <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/projects/falcon/working"/>
    </locations>
    <properties>
        <property name="field1" value="value1"/>
    </properties>
</cluster>
Feed Specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(5)</frequency>
    <late-arrival cut-off="hours(1)"/>
    <sla slaLow="hours(2)" slaHigh="hours(3)"/>
    <clusters>
        <cluster name="corp" type="source">
            <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
        </cluster>
        <cluster name="secondary" type="target">
            <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
            <locations>
                <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
            </locations>
        </cluster>
    </clusters>
    …
</feed>
Process Specification

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="corp">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>LIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
    </inputs>
    <outputs>
        <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
    </outputs>
    <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
    <retry policy="periodic" delay="hours(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
    </late-process>
</process>
Why Falcon Unit?
Motivation for Falcon Unit
• User errors caught only at deploy time.
– Input/Output feeds and paths not getting resolved.
– Errors in specification.
• Integration tests require environment setup/teardown.
– Messy deployment scripts.
– Time-consuming.
• Debugging was cumbersome.
– Logs scattered.
Falcon Unit
Falcon Unit
In-process execution environment:
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue

Actual cluster:
● Oozie
● HDFS
● YARN
● ActiveMQ

The same test suite runs against either environment.
Before Falcon Unit
[Diagram: a Falcon feed and process pass submission ("OK") but fail on the cluster with "Spec. Invalid", "Invalid Input", "Invalid Output" and "Improper Replication" - errors surface only after deployment]
Capabilities with example
• Entity creation and data flow validation.
• Data Injection.
• Data Retention and Replication.
• Seamless API for cluster and local mode.
Falcon Unit Example
• Process Submission:
  – submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Local Mode
  – submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Cluster Mode
• Process Scheduling:
  – scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);
• Process Verification:
  – getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
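Put together, a Falcon Unit test can exercise the whole flow in-process. Below is a minimal JUnit sketch, assuming the falcon-unit module's FalconUnitTestBase and the helper methods shown above; the spec paths and entity name are hypothetical, and the return type of getInstanceStatus is an assumption:

import org.apache.falcon.entity.v0.EntityType;
import org.apache.falcon.resource.InstancesResult;
import org.apache.falcon.unit.FalconUnitTestBase;
import org.junit.Assert;
import org.junit.Test;

// A minimal sketch, not the definitive API: base class and helper
// signatures follow the slide; paths and names are hypothetical.
public class DailyClicksAggTest extends FalconUnitTestBase {

    @Test
    public void testDailyClicksAgg() throws Exception {
        // Submit the cluster, feed and process specifications (local mode).
        submit(EntityType.CLUSTER, "/spec/corp-cluster.xml");
        submit(EntityType.FEED, "/spec/daily-clicks-feed.xml");
        submit(EntityType.PROCESS, "/spec/daily-clicks-agg.xml");

        // Schedule a single instance of the process on the local cluster.
        String scheduleTime = "2015-11-29T10:00Z";
        scheduleProcess("daily_clicks_agg", scheduleTime, 1, "corp");

        // Verify the instance (in practice, wait for it to complete first).
        InstancesResult.WorkflowStatus status =
                getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
        Assert.assertEquals(InstancesResult.WorkflowStatus.SUCCEEDED, status);
    }
}

The same test code runs unchanged against a real cluster, which is the "seamless API" capability called out above.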
After Falcon Unit
[Diagram: the same feed and process, with every step validated ("OK") and the pipeline marked "TESTED" before it ever reaches the cluster]
Native Scheduler
Why build a Native Scheduler?
Falcon uses Oozie for:
❖ DAG Execution
❖ Scheduling - gaps exist:
➢ Simple periodic scheduling without data gating.
➢ Cron + calendar based scheduling with/without data gating.
➢ Flexible data gating
➢ Support for aperiodic datasets and triggers based on data availability.
➢ Support for external triggers.
Scheduler - Before
[Diagram: the Falcon Server delegates both scheduling and DAG execution to Oozie]
Scheduler - The Plan
• Time based scheduling - available in 0.9
• Data based gating - will be available in 0.10
• Complete parity with Oozie and additional features - the release after
Scheduler - After
[Diagram: the scheduler now lives inside the Falcon Server and drives a pluggable DAG Executor, so execution can be handed to any DAG executor]
Data Governance
▪ GraphDB - Data availability storage
▪ Lineage - Data flow
▪ SLA - Data monitoring
Graph DB
▪ DAG notation.
▪ Used to store information about entities, instances and their dependencies.
▪ Vertices - entities/instances
▪ Edges - relationships among vertices
▪ Graph APIs exposed to analyse metrics.
▪ Pipeline health check.
The following relationships are captured:
● owner of entities - User
● data classification tags
● groups defined in feeds
● Relationships between entities
  ○ Clusters associated with feed and process entities
  ○ Input and output feeds for a process
● Instances refer to corresponding entities
GraphDB is exposed in 3 ways:
● REST API
● CLI
● Dashboard
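These views are easy to script against. Below is a minimal sketch of a REST call, assuming a Falcon server on its default port (15000) and the lineage vertices endpoint of the Falcon metadata API; the feed name used in the filter is hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// A minimal sketch: fetch the graph vertex for a feed by name.
// Host, port and the name filter value are assumptions.
public class LineageQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
                "http://localhost:15000/api/metadata/lineage/vertices?key=name&value=repl-feed");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        // Print the JSON describing the matching vertex and its properties.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}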
Entity Graph
Instance Graph
Lineage
▪ Logical flow/movement of data from source to destination.
▪ Falcon adds the ability to capture lineage for both entities and their associated instances.
▪ Captures the metadata tags associated with each of the entities as relationships.
Entity dependency
It returns the graph depicting the relationship between the various processes and feeds in a given pipeline.
Instance dependency
Gives information about the producer and consumer for a particular instance.
SLA Monitoring
▪ Alerts based on data availability
▪ Two types of SLA:
  - SLA Low
  - SLA High
▪ Dashboard
▪ Pluggable alerting system:
  - Email
  - JMS notifications
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="Yoda-Minutely" description="clicks log" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(1)</frequency>
    <sla slaLow="minutes(3)" slaHigh="minutes(5)"/>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(6)"/>
    <clusters>
        <cluster name="uh1-gold" type="source">
            <validity start="2015-11-29T10:28Z" end="2015-11-29T10:38Z"/>
            <retention limit="months(9000)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/tmp/falcon-regression/FeedSLA/input/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
    </locations>
    <ACL owner="pragya" group="dataqa" permission="*"/>
    <schema location="/schema/clicks" provider="protobuf"/>
</feed>
dataqa@lda01:~$ falcon entity -type feed -start 2015-11-04T08:03Z -slaAlert
name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:28Z,
tags: Missed SLA High
name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:29Z,
tags: Missed SLA High
name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:30Z,
tags: Missed SLA High
name: Yoda-Minutely, type: FEED, cluster: uh1-gold, instanceTime: 2015-11-29T10:31Z,
tags: Missed SLA Low
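The same alerts can be consumed programmatically via JMS. Below is a minimal listener sketch, assuming Falcon is configured to publish notifications to an ActiveMQ broker; the broker URL and topic name are assumptions to check against your falcon-server configuration:

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

// A minimal sketch: subscribe to the topic Falcon publishes
// notifications on. Broker URL and topic name are assumptions.
public class SlaAlertListener {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("FALCON.ENTITY.TOPIC"); // assumed topic name
        MessageConsumer consumer = session.createConsumer(topic);

        // Block on incoming messages and print any alert payloads.
        while (true) {
            Message message = consumer.receive();
            if (message instanceof TextMessage) {
                System.out.println(((TextMessage) message).getText());
            }
        }
    }
}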
Audit Logs
▪ Falcon audits all user activity and captures them into a log file by default.
▪ Users may also extend the Falcon Audit plugin to send audits to systems like Apache Argus.
Data Import and Export
Data Management Actions: the missing piece
Data Import
[Diagram: RDBMS → Falcon Feed → HDFS]
● Different modes of extraction: full or incremental
● Different modes of output (merge): snapshot, append
● Include/exclude columns
Data Export
[Diagram: HDFS → Falcon Feed → RDBMS]
● Different modes of load: insert, update-only
● Include/exclude columns
What else is out there?
Better Authentication and Authorization
Even Better UI
Even Better monitoring
Process SLAs
Streaming support
A more powerful scheduler
Pipeline Recovery
Questions?
If you want to ask later - [email protected]
If you want to contribute - [email protected]