Data Governance in Apache Falcon - Hadoop Summit Brussels 2015

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Apache FalconHadoop Data Governance

Hortonworks. We do Hadoop.


Venkatesh Seetharam

Architect, Data Management

Hortonworks Inc.

PMC, Apache Falcon

PMC, Apache Knox

Proposed Apache Atlas


Agenda

Overview Components Features Governance

• Motivation

• High Level

Summary

• Entities:

• Clusters

• Feeds

• Process

• Monitoring

• Tracing

• Replication

• Retention

• Governance

• Replication to Cloud

• Recipes

• User Interface


Motivation for Apache Falcon


Simple Data Pipeline…

Page 5


Add Data Management Capability to the Pipeline

Page 6


Pipeline Becomes Considerably More Complex

Results in Many Complex Oozie Workflows

Data Management Requirements


Introduction to Apache Falcon


Falcon Overview

Centrally Manage Data Lifecycle– Centralized definition & management of pipelines for data ingest, process &

export

Business Continuity & Disaster Recovery– Out of the box policies for data replication & retention

– End to end monitoring of data pipelines

Address audit & compliance requirements– Visualize data pipeline lineage

– Track data pipeline audit logs

– Tag data with business metadata

The data traffic cop


Complicated Pipeline Simplified with Apache Falcon

Falcon Generates and Instruments Oozie Workflows


Falcon Architecture

Centralized Falcon Orchestration Framework

Hadoop ecosystem tools

Falcon Server JMS

API&UI

AMBARI

HDFS / Hive

Oozie

Entity Specs

Scheduled Jobs Process Status

MapRed / Pig / Hive / Sqoop / Flume / DistCP

Data stewards

+ Hadoop admins


Falcon Basic Concepts

• Cluster: Represents the “interfaces” to a Hadoop cluster• Feed: Defines a “dataset” File, Hive Table or Stream• Process: Consumes feeds, invokes processing logic & produces feeds

Page 12

All these put together represent ‘Data Pipelines’ in Hadoop

CLUSTER

FEEDaka

DATASETPROCESS

RUNS ON

STORED IN

INPUT TO

CREATES


Data Pipeline: Definition

• Flexible based pipeline specification– JAXB / JSON / JAVA / XML– Modular - Clusters, feeds & processes defined separately and then linked together– Easy to re-use across multiple pipelines

• Out of the box policies– Predefined policies for replication, late data handling & eviction – Easily customization of policies

• Extensible– Plug in external solutions at any step of the pipeline– Eg. Invoke third party data obfuscation components


Flexibility in Processing

Common types of processing engines can be tied to Falcon processes

Oozie workflows Pig scripts HQL scripts


Data Pipeline: Monitoring

DATA

Primary site DR site

Centralized monitoring of data pipeline With Falcon + Ambari

Pipeline run alerts

Hadoop Cluster-1 Hadoop Cluster-2

Pipeline run history

Pipeline Scheduling

raw clean prep raw clea

n prep


Replication with Falcon

Staged DataPresented

DataCleansed

DataConformed

Data


Data

Rep

licat

ion

Failover Hadoop Cluster

Primary Hadoop Cluster

Rep

licat

ion

BI / Analytics

BusinessObjects BI

• Falcon manages workflow and replication• Enables business continuity without requiring full data reprocessing• Failover clusters can be smaller than primary clusters


Data Retention with Falcon


DataCleansed

DataConformed

Data

Retain 5 Years

Retain Last Copy Only

Retain 3 Years

Retain 3 Years

• Sophisticated retention policies expressed in one place• Simplify data retention for audit, compliance, or for data re-processing

Ret

entio

n P

olic

y


Late Data Handling with Falcon

Staged Data Combined Data

Online Transaction Data

(via Sqoop)

Web Log Data (via FTP)

Wait up to 4 hours for FTP data to arrive

• Processing waits until all required input data is available• Checks for late data arrivals, issues retrigger processing as necessary• Eliminates writing complex data handling rules within applications


HCatalog

Table access

Aligned metadata

REST API

• Raw Hadoop data• Inconsistent, unknown• Tool specific access

Apache Falcon provides metadata services via HCatalog

Metadata Services with HCatalog

• Consistency of metadata and data models across tools (MapReduce, Pig, Hbase, and Hive)

• Accessibility: share data as tables in and out of HDFS• Availability: enables flexible, thin-client access via REST API

Shared table and schema management opens the platform

Page 19


Data Governance in Apache Falcon


Data Pipeline: Tracing

.

Purchase feed

Customer feed

Product feedStore feed

View dependencies between clusters,

datasets and processes

Data pipeline dependencies

Add arbitrary tags to feeds & processes

Credit

feed

Sensitive Encrypted

Data pipeline tagging

Coming Soon

Know who modified a dataset when and into

what

Data pipeline audits

File-1

File-2

File-3

Analyze how a dataset reached a

particular state

Data pipeline lineage


Custom Metadata in Falcon

• Metadata on Ingest (Content)– What is the format I expect my data to be in?– What source systems did the data come from, owners?– Answer: ingest descriptors + Hcat schema versioning

• Metadata for Security (Access Controls)– How is each column blinded or encrypted?– Can I trust that I can join data across tables? What if email is encrypted differently?– Answer: security descriptors

• Metadata for lineage (Source, History)– How do I chase down sources of data leading to reports and data?– Answer: lineage carried forward per workflow

• Metadata for marts (Usage Constraints, Enrichment)– How do I materialize views and drop views as needed?


Entity Dependency in Falcon

• Dependencies between Falcon entity definitions: cluster, feed & process– Lineage attributes: workflows, input/output feed windows, user, input and output paths, workflow engine,

input/output size


Lineage in Falcon


Audit, Tagging and Access Control

• Tagging– Allows custom tags in entities– Can decorate process entities pipeline names

• Access Control– Support for ACL in entities– Authorization driven based on ACLs in entities

• Audit– Each execution is controlled by Falcon and runs are audited– Correlate the execution with Lineage (Design)

• Search– Search based on Tags, Pipelines, etc.– Full-text search


Technology

• Metadata Repository– Titan Graph Database – Pluggable backing store, berkelydbje, Hbase

• Entity Metadata– Tags, Entities are stored in the repository

• Execution Metadata– Execution metadata are stored in the repository as well – this is unique to Falcon– Optional inputs

• Search– Pluggable backend – Solr or Elastic Search


New in Apache Falcon 0.6.0What is coming soon?


DR Mirroring of HDFS with Recipes

•Mirroring for Disaster Recovery and Business continuity use cases.

•Customizable for multiple targets and frequency of synchronization

•Recipes: Template model re-use of complex workflows

Recipe

Reduce

Cleanse

Replicate

Properties

WorkflowTemplate

Recipe

Reduce

Cleanse

Replicate

Properties

Recipe

Reduce

Cleanse

Replicate

Properties

WorkflowTemplate


Replication to Cloud

•Seemlessly replicate to Cloud targets

•Replicate from Cloud as a source.

•Support for Amazon S3 and Microsoft Azure

On Prem Cluster


Q & A


Thank you!

Learn more at:hortonworks.com/hadoop/falcon/

Data Governance in Apache Falcon - Hadoop Summit Brussels 2015

Technology

Transcript of Data Governance in Apache Falcon - Hadoop Summit Brussels 2015