Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Management Platform on Hadoop, Srikanth Sundarrajan, Venkatesh Seetharam (Incubating)

description

Hadoop and its ecosystem of products have made storing and processing massive amounts of data commonplace. This has enabled numerous businesses to gain valuable insights that they never could have in the past. While it is easy to leverage Hadoop for crunching large volumes of data, organizing data, managing its life cycle, and processing it are fairly involved. These problems are solved adequately in a traditional data platform built on data warehouses and standard ETL (extract-transform-load) tools, but remain largely unsolved on Hadoop today. Besides data processing complexities, Hadoop presents a new set of challenges relating to the management of data. Data management on Hadoop encompasses data motion (import/export), process orchestration (data pipelines, late data handling and reprocessing, scheduling), lifecycle management (retention, replication, DR, anonymization, archival), and data discovery (data classification, lineage), among other concerns that are beyond ETL. The presentation focuses on Falcon, a new data processing and management platform for Hadoop that attempts to solve this problem by leveraging existing stacks in the Hadoop ecosystem. Falcon has been in production for nearly a year at InMobi and has been managing hundreds of feeds and processes.

Transcript of Falcon - Data Management Platform on Hadoop (Beyond ETL)

Page 1: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Management Platform on Hadoop

Srikanth Sundarrajan, Venkatesh Seetharam

(Incubating)

Page 2: Falcon - Data Management Platform on Hadoop (Beyond ETL)

whoami

Srikanth Sundarrajan

Principal Architect, InMobi

Apache Hadoop Contributor

Hadoop Team @ Yahoo!

Venkatesh Seetharam

Architect/Developer, Hortonworks

Apache Hadoop Contributor

Data Management @ Yahoo!

Page 3: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Agenda

1 Motivation

2 Falcon Overview

3 Case Studies

4 Questions & Answers

Page 4: Falcon - Data Management Platform on Hadoop (Beyond ETL)

MOTIVATION

Page 5: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Processing Landscape

External data source

Acquire (Import)

Data Processing (Transform/Pipeline)

Eviction Archive

Replicate(Copy)

Export

Page 6: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Core Services

Process

• Late data management
• Relays

Data management

• Acquisition
• Replication
• Retention

Operability

• SLA
• Lineage

Page 7: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Process Management – Relays

picture courtesy: http://istockphoto.com/

Page 8: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Late Data Management

picture courtesy: http://iwebask.com

Page 9: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Retention As Service

picture courtesy: http://vimeo.com/

Page 10: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Replication As Service

picture courtesy: http://boylesmedia.com

Page 11: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Acquisition As Service

picture courtesy: http://wmpu.org

Page 12: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Operability – Dashboard

picture courtesy: http://www.opentrack.ch/

Page 13: Falcon - Data Management Platform on Hadoop (Beyond ETL)

FALCON OVERVIEW

Page 14: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Holistic Declaration of Intent

picture courtesy: http://bigboxdetox.com

Page 15: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Entity Dependency Graph

[Diagram: Cluster (Hadoop / HBase …), External data source, Feed, and Process entities, linked by "depends" edges]
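The dependency relationships on this slide can be sketched in a few lines of Python. This is an illustrative toy, not Falcon's implementation: the `EntityGraph` class, its method names, and the entity names are all hypothetical, and the edge directions (a feed references a cluster, a process references clusters and feeds) are an assumption about what the "depends" arrows mean. The point it shows is that an entity cannot be registered before the entities it references, and cannot be removed while something still depends on it.

```python
from collections import defaultdict


class EntityGraph:
    """Toy entity dependency graph (hypothetical; not Falcon's API)."""

    def __init__(self):
        self._entities = set()
        self._depends_on = defaultdict(set)   # entity -> entities it references
        self._dependents = defaultdict(set)   # entity -> entities referencing it

    def add(self, name, depends_on=()):
        # Reject an entity that references something not yet registered.
        missing = sorted(d for d in depends_on if d not in self._entities)
        if missing:
            raise ValueError(f"{name} references unknown entities: {missing}")
        self._entities.add(name)
        for dep in depends_on:
            self._depends_on[name].add(dep)
            self._dependents[dep].add(name)

    def dependents(self, name):
        # Entities that reference `name`, in sorted order.
        return sorted(self._dependents.get(name, ()))

    def remove(self, name):
        # Refuse to remove an entity that others still depend on.
        blockers = self.dependents(name)
        if blockers:
            raise ValueError(f"{name} is still referenced by {blockers}")
        for dep in self._depends_on.pop(name, set()):
            self._dependents[dep].discard(name)
        self._entities.discard(name)
```

With this sketch, adding a cluster, then a feed on that cluster, then a process that consumes the feed succeeds in that order only, and removal has to proceed in the reverse order.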

Page 16: Falcon - Data Management Platform on Hadoop (Beyond ETL)

High Level Architecture

Apache Falcon

Oozie

Messaging

HCatalog

Hadoop

Entity

Entity status

Process status / notification

CLI/REST

JMS

Config store

Page 17: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Feed Schedule

Cluster xml

Feed xml

Falcon

Falcon config store / Graph

Retention / Replication workflow

Oozie Scheduler HDFS

JMS Notification per action

Catalog service

Instance Management

Page 18: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Process Schedule

Cluster/feed xml

Process xml

Falcon

Falcon config store / Graph

Process workflow

Oozie Scheduler HDFS

JMS Notification per available feed

Catalog service

Instance Management

Page 19: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Physical Architecture

Falcon Colo 1

Falcon Colo 2

Falcon Colo 3

Scheduler

Scheduler

Scheduler

Falcon – Prism

Global view

Page 20: Falcon - Data Management Platform on Hadoop (Beyond ETL)

CASE STUDY Multi Cluster Failover

Page 21: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Multi Cluster – Failover

> Falcon manages workflow, replication or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.

Staged Data

Cleansed Data

Conformed Data

Presented Data

Staged Data

Presented Data

BI and Analytics

Primary Hadoop Cluster

Failover Hadoop Cluster

Replication

Page 22: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Retention Policies

Staged Data

Retain 5 Years

Cleansed Data

Retain 3 Years

Conformed Data

Retain 3 Years

Presented Data

Retain Last Copy Only

> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or for data re-processing.
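As a rough illustration of expressing these retention policies in one place, the sketch below keeps the four limits from the slide in a single table and computes which data instances fall outside their window. It is not Falcon's retention engine: the `RETENTION` table, the `instances_to_evict` function, and the approximation of a year as 365 days are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Retention limits from the slide, kept in one table.
# Assumption: a year is approximated as 365 days.
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,  # "Retain Last Copy Only"
}


def instances_to_evict(dataset, instance_times, now):
    """Return the instance timestamps that fall outside the retention policy."""
    limit = RETENTION[dataset]
    ordered = sorted(instance_times)
    if limit is None:
        # Keep only the newest instance; everything older is evictable.
        return ordered[:-1]
    cutoff = now - limit
    return [t for t in ordered if t < cutoff]
```

For example, with `now = datetime(2013, 6, 1)`, a cleansed instance dated 2009-01-01 would be evicted, while the same instance in the staged dataset would still be within its five-year window.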

Page 23: Falcon - Data Management Platform on Hadoop (Beyond ETL)

CASE STUDY Distributed Processing

Example: Digital Advertising @ InMobi

Page 24: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Hadoop @ InMobi

About InMobi

World's leading independent mobile advertising company

Hadoop usage at InMobi

~ 6 clusters
> 1PB of storage
> 5TB new data ingested each day
> 20TB data crunched each day
> 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
> 175K Hadoop jobs / day
> 60K Oozie workflows / day
300+ Falcon feed definitions
100+ Falcon process definitions

Page 25: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Processing – Single Data Center

Ad Request data

Impression render event

Click event

Conversion event

Continuous Streaming (minutely)

Hourly summary

Enrichment (minutely/5 minutely)

Summarizer

Page 26: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Global Aggregation

Ad Request data

Impression render event

Click event

Conversion event

Continuous Streaming (minutely)

Hourly summary

Enrichment (minutely/5 minutely)

Summarizer

Ad Request data

Impression render event

Click event

Conversion event

Continuous Streaming (minutely)

Hourly summary

Enrichment (minutely/5 minutely)

Summarizer

……..

Data Center 1

Data Center N

Consumable global aggregate
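The final step on this slide, combining the hourly summaries from each data center into one global aggregate, can be sketched as a simple merge of per-data-center counts. This is only an illustration under stated assumptions: the `global_aggregate` function and the `(hour, event type)` key layout are hypothetical, and the slides do not say how InMobi's summaries are actually structured or merged.

```python
from collections import Counter


def global_aggregate(per_datacenter_summaries):
    """Sum hourly summaries from every data center into one global view.

    Each summary is assumed to map (hour, event_type) -> count.
    """
    total = Counter()
    for summary in per_datacenter_summaries:
        # Counter.update adds counts for keys that already exist.
        total.update(summary)
    return dict(total)
```

Each data center would contribute one such summary per hour once its local summarizer has run, and the merged result is what downstream consumers read.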

Page 27: Falcon - Data Management Platform on Hadoop (Beyond ETL)

HIGHLIGHTS

Page 28: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Future

Security

Embed Pig/Hive scripts

Data Acquisition – file-based

Monitoring/Management Dashboard


Page 29: Falcon - Data Management Platform on Hadoop (Beyond ETL)

Summary