Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer
![Page 1: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/1.jpg)
Hadoop First ETL On Apache Falcon
Srikanth Sundarrajan Naresh Agarwal
![Page 2: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/2.jpg)
About Authors
- Srikanth Sundarrajan
  - Principal Architect, InMobi Technology Services
- Naresh Agarwal
  - Director – Engineering, InMobi Technology Services
![Page 3: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/3.jpg)
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
![Page 4: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/4.jpg)
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
![Page 5: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/5.jpg)
ETL (Extract Transform Load)
Intelligence
Information
Data
Value
![Page 6: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/6.jpg)
ETL Use cases
Data Warehouse
Data Migration
Data Consolidation
Master Data Management
Data Synchronization
Data Archiving
![Page 7: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/7.jpg)
ETL Authoring
Hand coded
In-house tools
Off-the-shelf tools
![Page 8: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/8.jpg)
ETL & Big Data – Challenges
Challenges
Volume
Variety
Velocity
![Page 9: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/9.jpg)
Big Data ETL
- Mostly hand coded (high cost: implementation + maintenance)
  - Map Reduce
  - Hive (i.e. SQL)
  - Pig
  - Crunch / Cascading
  - Spark
- Off-the-shelf tools (scale / performance)
  - Mostly retrofitted
![Page 10: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/10.jpg)
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
![Page 11: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/11.jpg)
Apache Falcon
- Off the shelf, Falcon provides standard data management functions through declarative constructs
  - Data movement recipes
    - Cross data center replication
    - Cross cluster data synchronization
  - Data retention recipes
    - Eviction
    - Archival
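As an illustration of what a retention recipe decides, here is a minimal Python sketch of eviction: given the timestamps of a feed's instances and a retention window, it picks the instances that fall outside the window. The function name, the daily instances, and the two-day window are assumptions made for the example; this is not Falcon's implementation, only the idea behind the declarative setting.

```python
from datetime import datetime, timedelta

def instances_to_evict(instance_times, retention, now):
    """Return the feed instances that are older than the retention window."""
    cutoff = now - retention
    return sorted(t for t in instance_times if t < cutoff)

# Hypothetical feed with one instance per day for the last five days.
now = datetime(2014, 6, 4)
daily_instances = [now - timedelta(days=d) for d in range(5)]

# With a two-day retention window, only the two oldest instances are evicted.
expired = instances_to_evict(daily_instances, timedelta(days=2), now)
print(expired)
```

In Falcon the user only states the limit in the feed definition; the selection and deletion shown here are what the platform takes over.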
![Page 12: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/12.jpg)
Apache Falcon
- However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only
  - Orchestration
  - Late data handling / Change data capture
  - Retries
  - Monitoring
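To make the "Retries" item concrete, the following Python sketch shows how a scheduler might space out retry attempts under a fixed-interval or an exponential-backoff policy. The policy names, function names, and the `sleep` callback are assumptions for illustration and do not describe Falcon's internals.

```python
def retry_delays(policy, base_minutes, attempts):
    """Return the wait (in minutes) before each retry attempt."""
    if policy == "periodic":
        return [base_minutes] * attempts
    if policy == "exp-backoff":
        return [base_minutes * 2 ** i for i in range(attempts)]
    raise ValueError(f"unknown retry policy: {policy}")

def run_with_retries(action, delays, sleep):
    """Run action once, then retry after each delay until it succeeds."""
    last_error = None
    for wait in [None] + list(delays):
        if wait is not None:
            sleep(wait)  # the caller decides how waiting is implemented
        try:
            return action()
        except Exception as exc:
            last_error = exc
    raise RuntimeError("action failed after all retries") from last_error

print(retry_delays("periodic", 10, 3))
print(retry_delays("exp-backoff", 10, 3))
```

Passing `sleep` in as a parameter keeps the sketch testable without real waiting.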
![Page 13: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/13.jpg)
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
![Page 14: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/14.jpg)
Pipeline Designer – Basics
![Page 15: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/15.jpg)
Pipeline Designer – Basics
- Feed
  - A feed is a data entity that Falcon manages and that is physically present in a cluster.
  - Data in a feed conforms to a schema, and its partitions are registered with HCatalog.
  - Data management functions such as eviction and archival are declaratively specified through Falcon feed definitions.
![Page 16: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/16.jpg)
Pipeline Designer – Basics
![Page 17: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/17.jpg)
Pipeline Designer – Basics
- Process
  - A workflow that defines the actions that need to be performed, along with their control flow
  - Executes at a specified frequency on one or more clusters
- Pipelines
  - Logical grouping of Falcon processes owned and operated together
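The two ideas on this slide, a process that runs at a fixed frequency and a pipeline that groups processes, can be sketched in a few lines of Python. The `Process` class, its fields, the process and pipeline names, and the choice of an exclusive end time are all assumptions for illustration, not Falcon's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Process:
    name: str
    pipeline: str          # logical group the process belongs to
    frequency: timedelta   # how often an instance is scheduled
    clusters: tuple        # clusters the process executes on

def instance_times(process, start, end):
    """Yield nominal run times from start (inclusive) to end (exclusive)."""
    t = start
    while t < end:
        yield t
        t += process.frequency

def group_by_pipeline(processes):
    """Map each pipeline name to the names of its processes."""
    groups = defaultdict(list)
    for p in processes:
        groups[p.pipeline].append(p.name)
    return dict(groups)

hourly_agg = Process("hourly-agg", "reporting", timedelta(hours=1), ("primary",))
daily_rollup = Process("daily-rollup", "reporting", timedelta(days=1), ("primary", "backup"))

runs = list(instance_times(hourly_agg, datetime(2014, 6, 1), datetime(2014, 6, 1, 3)))
print(runs)
print(group_by_pipeline([hourly_agg, daily_rollup]))
```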
![Page 18: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/18.jpg)
Pipeline Designer – Basics
![Page 19: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/19.jpg)
Pipeline Designer – Basics
- Actions
  - Actions in the designer are the building blocks of process workflows.
  - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
  - Actions can transition to other actions
    - Default / Success transition
    - Failure transition
    - Conditional transition
  - A transformation action is a special action that is itself a collection of transforms
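The action model described above (shared output variables plus success, failure, and conditional transitions) behaves like a small state machine. The Python sketch below is one possible reading of it; the `Action` class, its field names, and the example actions are hypothetical and are not taken from the designer's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Action:
    name: str
    run: Callable[[dict], dict]          # reads earlier variables, returns new ones
    on_success: Optional[str] = None     # default transition
    on_failure: Optional[str] = None     # taken when run() raises
    conditions: List[Tuple[Callable[[dict], bool], str]] = field(default_factory=list)

def execute(actions: Dict[str, Action], start: str):
    """Walk the flow from start; return the visited actions and final variables."""
    variables, trace = {}, []
    current = start
    while current is not None:
        action = actions[current]
        trace.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        # The first matching conditional transition wins; otherwise use the default.
        current = next(
            (target for cond, target in action.conditions if cond(variables)),
            action.on_success,
        )
    return trace, variables

flow = {
    "count": Action("count", lambda v: {"rows": 0}, on_success="transform",
                    conditions=[(lambda v: v["rows"] == 0, "notify")]),
    "transform": Action("transform", lambda v: {"done": True}),
    "notify": Action("notify", lambda v: {"notified": True}),
}
trace, variables = execute(flow, "count")
print(trace, variables)
```

Here the `count` action emits `rows`, and the conditional transition routes an empty input to `notify` instead of the default `transform`.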
![Page 20: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/20.jpg)
Pipeline Designer – Basics
![Page 21: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/21.jpg)
Pipeline Designer – Basics
- Transforms
  - A transform is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
  - Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
  - Composite transformations
    - Transforms that are built by combining multiple primitive transforms
  - It is possible to add more transforms and extend the system
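A minimal Python sketch of stitching primitive transforms into a composite one follows. For brevity it assumes each transform takes a single input and produces a single output, whereas the slide allows several of each; the helper names `filter_rows`, `project`, and `compose` and the sample rows are invented for the example.

```python
from typing import Callable, Dict, Iterable

Row = Dict[str, object]
Transform = Callable[[Iterable[Row]], Iterable[Row]]

def filter_rows(predicate) -> Transform:
    """Primitive transform: keep only rows matching the predicate."""
    return lambda rows: (r for r in rows if predicate(r))

def project(*columns) -> Transform:
    """Primitive transform: keep only the named columns."""
    return lambda rows: ({c: r[c] for c in columns} for r in rows)

def compose(*transforms) -> Transform:
    """Composite transform: apply the given transforms in order."""
    def composite(rows):
        for t in transforms:
            rows = t(rows)
        return rows
    return composite

clicks = [
    {"user": "a", "country": "IN", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
]
india_clicks = compose(
    filter_rows(lambda r: r["country"] == "IN"),
    project("user", "clicks"),
)
result = list(india_clicks(clicks))
print(result)
```

Because a composite has the same shape as a primitive, it can itself be passed to `compose`, which is one way to read "possible to add more transforms and extend the system".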
![Page 22: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/22.jpg)
Pipeline Designer – Basics
- Deployment & Monitoring
  - Once a process and the pipeline are composed, they are deployed in Falcon as a standard process
![Page 23: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/23.jpg)
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
![Page 24: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/24.jpg)
Pipeline Designer Service
[Architecture diagram. Components shown: Designer UI; Falcon Dashboard; Pipeline Designer, containing the Pipeline Designer Service (REST API, Versioned Storage, Flow / Action / Transforms Compiler + Optimizer); Falcon Server; HCatalog Service. Labels shown: Process, Feed, Schema.]
![Page 25: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/25.jpg)
Pipeline Designer – Internals
- Transformation actions are compiled into Pig scripts
- Actions and flows are compiled into Falcon process definitions
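To give a feel for what "compiled into Pig scripts" could mean, the Python sketch below turns a short list of transform steps into Pig Latin text. It is an assumption-laden toy, not the designer's compiler: the step format, the relation names, the table names, and the choice of HCatLoader / HCatStorer for input and output are all illustrative.

```python
def compile_to_pig(source_table, steps, target_table):
    """Emit a Pig Latin script for a linear chain of filter/project steps."""
    lines = [
        f"rel0 = LOAD '{source_table}' "
        "USING org.apache.hive.hcatalog.pig.HCatLoader();"
    ]
    for i, (kind, arg) in enumerate(steps, start=1):
        prev, cur = f"rel{i - 1}", f"rel{i}"
        if kind == "filter":
            lines.append(f"{cur} = FILTER {prev} BY {arg};")
        elif kind == "project":
            lines.append(f"{cur} = FOREACH {prev} GENERATE {', '.join(arg)};")
        else:
            raise ValueError(f"unknown transform: {kind}")
    lines.append(
        f"STORE rel{len(steps)} INTO '{target_table}' "
        "USING org.apache.hive.hcatalog.pig.HCatStorer();"
    )
    return "\n".join(lines)

# Hypothetical transformation action: filter by country, then keep two columns.
script = compile_to_pig(
    "db.clicks",
    [("filter", "country == 'IN'"), ("project", ["user", "clicks"])],
    "db.in_clicks",
)
print(script)
```

A real compiler would also need schema checks and, as the architecture slide suggests, an optimization pass; neither is attempted here.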
![Page 26: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/26.jpg)
![Page 27: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/27.jpg)
![Page 28: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/28.jpg)
![Page 29: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/29.jpg)
![Page 30: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/30.jpg)
![Page 31: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/31.jpg)
![Page 32: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/32.jpg)
![Page 33: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/33.jpg)
![Page 34: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/34.jpg)
![Page 35: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/35.jpg)
![Page 36: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/36.jpg)
![Page 37: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/37.jpg)
![Page 38: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/38.jpg)
![Page 39: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer](https://reader033.fdocuments.us/reader033/viewer/2022060108/554f5ceab4c905c8088b477c/html5/thumbnails/39.jpg)
Q & A