Apache Falcon at Hadoop Summit Europe 2014
-
Upload
venkatesh-seetharam -
Category
Technology
-
view
1.504 -
download
2
description
Transcript of Apache Falcon at Hadoop Summit Europe 2014
![Page 1: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/1.jpg)
Data Management Platform on Hadoop
Venkatesh Seetharam
(Incubating)
![Page 2: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/2.jpg)
© Hortonworks Inc. 2011
whoamiHortonworks Inc.
–Architect/Developer
–Lead Data Management efforts
Apache–Apache Falcon Committer, IPMC
–Apache Knox Committer
–Apache Hadoop, Sqoop, Oozie Contributor
Part of the Hadoop team at Yahoo! since 2007–Senior Principal Architect of Hadoop Data at Yahoo!
–Built 2 generations of Data Management at Yahoo!
Page 2Architecting the Future of Big Data
![Page 3: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/3.jpg)
Agenda
2 Falcon Overview
1 Motivation
3 Falcon Architecture
4 Case Studies
![Page 4: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/4.jpg)
MOTIVATION
![Page 5: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/5.jpg)
Data Processing Landscape
External data source
Acquire (Import)
Data Processing (Transform/Pipeline)
Eviction Archive
Replicate(Copy)
Export
![Page 6: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/6.jpg)
Core ServicesProcessManagement
• Relays• Late data handling• Retries
Data Management
• Import/Export• Replication• RetentionD
ataGovernance
• Lineage• Audit• SLA
![Page 7: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/7.jpg)
FALCON OVERVIEW
![Page 8: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/8.jpg)
Holistic Declaration of Intent
picture courtersy: http://bigboxdetox.com
![Page 9: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/9.jpg)
Entity Dependency Graph
Hadoop / Hbase … Cluster
External data
source
feed Process
depends depends
depends
depends
![Page 10: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/10.jpg)
<?xml version="1.0"?><cluster colo=”NJ-datacenter" description="" name=”prod-cluster"> <interfaces> <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" /> <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" /> <interface type="execute" endpoint=”rm:8050" version="2.2.0" /> <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> <interface type=”registry" endpoint=”thrift://hms:9083" version=”0.12.0" /> <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> </interfaces> <locations> <location name="staging" path="/apps/falcon/prod-cluster/staging" /> <location name="temp" path="/tmp" /> <location name="working" path="/apps/falcon/prod-cluster/working" /> </locations></cluster>
Needed by distcp for replications
Writing to HDFS
Used to submit processes as MR
Submit Oozie jobs
Hive metastore to register/deregister partitions and get events on partition
availability
Used For alerts
HDFS directories used by Falcon server
Cluster Specification
![Page 11: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/11.jpg)
Feed Specification<?xml version="1.0"?><feed description=“" name=”testFeed” xmlns="uri:falcon:feed:0.1"> <frequency>hours(1)</frequency> <late-arrival cut-off="hours(6)”/> <groups>churnAnalysisFeeds</groups> <tags externalSource=TeradataEDW-1, externalTarget=Marketing> <clusters> <cluster name=”cluster-primary" type="source"> <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/> <retention limit="days(2)" action="delete"/> </cluster> <cluster name=”cluster-secondary" type="target"> <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/> <location type="data” path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR} "/>
<retention limit=”days(7)" action="delete"/> </cluster> </clusters> <locations> <location type="data” path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR} "/> </locations> <ACL owner=”hdfs" group="users" permission="0755"/> <schema location="/none" provider="none"/></feed>
Feed run frequency in mins/hrs/days/mths
Late arrival cutoff
Global location across clusters - HDFS paths
or Hive tables
Feeds can belong to multiple groups
One or more source & target clusters for
retention & replication
Access Permissions
Metadata tagging
![Page 12: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/12.jpg)
Process Specification<process name="process-test" xmlns="uri:falcon:process:0.1”> <clusters> <cluster name="cluster-primary"> <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" /> </cluster> </clusters> <parallel>1</parallel> <order>FIFO</order> <frequency>days(1)</frequency> <inputs> <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" /> </inputs> <outputs> <output instance="now(0,2)" feed="feed-clicks-clean" name="output" /> </outputs> <workflow engine="pig" path="/apps/clickstream/clean-script.pig" /> <retry policy="periodic" delay="minutes(10)" attempts="3"/> <late-process policy="exp-backoff" delay="hours(1)">
<late-input input="input" workflow-path="/apps/clickstream/late" /> </late-process></process>
How frequently does the process run , how many
instances can be run in parallel and in what order
Which cluster should the process run on and when
The processing logic.
Retry policy on failure
Handling late input feeds
Input & output feeds for process
![Page 13: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/13.jpg)
Late Data HandlingDefines how the late (out of band) data is handled
Each Feed can define a late cut-off value<late-arrival cut-off="hours(4)”/>
Each Process can define how this late data is handled<late-process policy="exp-backoff" delay="hours(1)”>
<late-input input="input" workflow-path="/apps/clickstream/late" />
</late-process>
Policies include: backoff exp-backoff final
![Page 14: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/14.jpg)
Retry PoliciesEach Process can define retry policy
<process name="[process name]">
...
<retry policy=[retry policy] delay=[retry delay]attempts=[attempts]/>
<retry policy="backoff" delay="minutes(10)" attempts="3"/>
...
</process>
Policies include:backoffexp-backoff
![Page 15: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/15.jpg)
Lineage
![Page 16: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/16.jpg)
Apache Falcon
Provides Orchestrates
Data Management Needs Tools
Multi Cluster Management Oozie
Replication Sqoop
Scheduling Distcp
Data Reprocessing Flume
Dependency Management Map / Reduce
Eviction Hive and Pig Jobs
Governance
Falcon provides a single interface to orchestrate data lifecycle.Sophisticated DLM easily added to Hadoop applications.
Falcon: One-stop Shop for Data Management
![Page 17: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/17.jpg)
FALCON ARCHITECTURE
![Page 18: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/18.jpg)
High Level Architecture
Apache Falcon
Oozie
Messaging
HCatalog
HDFS
Entity
Entity status
Process status / notification
CLI/REST
JMS
Config store
![Page 19: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/19.jpg)
Feed Schedule
Cluster xml
Feed xml Falcon
Falcon config store / Graph
Retention / Replication workflow
Oozie Scheduler HDFS
JMS Notification per action
Catalog service
Instance Management
![Page 20: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/20.jpg)
Process Schedule
Cluster/feed xml
Process xml
Falcon
Falcon config store / Graph
Process workflow
Oozie Scheduler HDFS
JMS Notification per available
feed
Catalog service
Instance Management
![Page 21: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/21.jpg)
Physical Architecture
• STANDALONE– Single Data Center– Single Falcon Server– Hadoop jobs and relevant
processing involves only one cluster
• DISTRBUTED– Multiple Data Centers– Falcon Server per DC– Multiple instances of hadoop
clusters and workflow schedulers
HADOOPStore & Process
FalconServer
(standalone)
Site 1
HADOOPStore & Process
replication
HADOOPStore & Process
FalconServer
(standalone)
Site 1
HADOOPStore & Process
replication
Site 2
FalconServer
(standalone)
Falcon PrismServer
(distributed)
![Page 22: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/22.jpg)
CASE STUDY Multi Cluster Failover
![Page 23: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/23.jpg)
Multi Cluster – Failover
> Falcon manages workflow, replication or both.> Enables business continuity without requiring full data reprocessing.> Failover clusters require less storage and CPU.
Staged Data
Cleansed Data
Conformed Data
Presented Data
Staged Data
Presented Data
BI and Analytics
Primary Hadoop Cluster
Failover Hadoop Cluster
Re
plic
atio
n
![Page 24: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/24.jpg)
Retention Policies
Staged Data
Retain 5 Years
Cleansed Data
Retain 3 Years
Conformed Data
Retain 3 Years
Presented Data
Retain Last Copy Only
> Sophisticated retention policies expressed in one place.> Simplify data retention for audit, compliance, or for data re-processing.
![Page 25: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/25.jpg)
CASE STUDY Distributed Processing
Example: Digital Advertising @ InMobi
![Page 26: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/26.jpg)
Processing – Single Data Center
Ad Request data
Impression render event
Click event
Conversion event
Continuous Streaming (minutely)
Hourly summary
Enrichment (minutely/5 minutely)
Summarizer
![Page 27: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/27.jpg)
Global Aggregation
Ad Request data
Impression render event
Click event
Conversion event
Continuous
Streaming (minutely)
Hourly summa
ry
Enrichment (minutely/5 minutely) Summarizer
Ad Request data
Impression render event
Click event
Conversion event
Continuous
Streaming (minutely)
Hourly summa
ry
Enrichment (minutely/5 minutely) Summarizer
……..
Dat
a C
ente
r 1
Dat
a C
ente
r N
Consumable global
aggregate
![Page 28: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/28.jpg)
HIGHLIGHTS
![Page 29: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/29.jpg)
Future
Data Governance
Data Pipeline Designer
Data Acquisition – file-based
Monitoring/Management Dashboard
1
2
3
4
![Page 30: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/30.jpg)
Summary
![Page 31: Apache Falcon at Hadoop Summit Europe 2014](https://reader035.fdocuments.us/reader035/viewer/2022062303/554f65dfb4c905bb178b4a46/html5/thumbnails/31.jpg)
Questions?Apache Falcon
http://falcon.incubator.apache.orgmailto: [email protected]
Venkatesh [email protected]#innerzeal