Data Pipeline testing now made easy!
With Apache Falcon
$whoami
❖ Pallavi Rao
➢ Architect, InMobi
➢ Committer, Apache Falcon
➢ Contributor, Apache Pig
❖ Pavan Kumar Kolamuri
➢ Sr. Software Engineer, InMobi
➢ Contributor, Apache Falcon and Oozie
What is in store for you?
❖ Some history and introduction to Apache Falcon
❖ Falcon Unit - A new feature in v0.7
❖ Falcon Unit - How it simplifies testing pipelines
❖ Demo, Q&A
Once upon a time...
What kept us up at night?
❏ Failures
❏ Data arriving late
❏ Re-processing
❏ Varied Data Replication
❏ Varied Data Retention
❏ Data Archival
❏ Lineage
❏ SLA monitoring
The pattern
The concoction
Concoction.. distributed
Some maladies cured
Process Management
● Relays
● Late Data Handling
● Failure Retries
● Reruns
Data Management
● Data Import/Export
● Retention
● Replication
● Archival
Data Governance
● Lineage
● Audit
● SLA
Sample pipeline in Falcon
[Pipeline diagram]
Falcon Feeds: Click Logs, Enhanced Clicks, Hourly Clicks, Daily Clicks, Metadata
Falcon Processes: Click Enhancer, Hourly Aggregation, Daily Aggregation
Feed annotations:
● Retention: 2 hours; Frequency: 5 mins; Late data arrival
● Retention: 2 days; Replication required
● Retention: 1 day
● Retention: 7 days; Replication required
Cluster Specification
<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
  <tags>[email protected], [email protected], _department_type=forecasting</tags>
  <interfaces>
    <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
    <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
    <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
    <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
    <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
  </interfaces>
  <locations>
    <location name="staging" path="/projects/falcon/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/projects/falcon/working"/>
  </locations>
  <properties>
    <property name="field1" value="value1"/>
  </properties>
</cluster>
How to access data
Where to execute
HCat
Falcon cache
User defined props
Feed specification
<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
  <frequency>minutes(5)</frequency>
  <late-arrival cut-off="hours(1)"/>
  <sla slaLow="hours(2)" slaHigh="hours(3)"/>
  <clusters>
    <cluster name="corp" type="source">
      <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="secondary" type="target">
      <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
      <locations>
        <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
      </locations>
    </cluster>
  </clusters>
  …..
</feed>
Frequency
Location
SLA Monitoring
Data Retention
Data Replication
Process specification
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="corp">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>LIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
  </inputs>
  <outputs>
    <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
  </outputs>
  <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
  <retry policy="periodic" delay="hours(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
  </late-process>
</process>
Where should the process run?
How should the process run?
What to consume?
What to produce?
Processing logic
Late Data processing
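The process specification names two policies, "periodic" (on retry) and "exp-backoff" (on late-process), each with a delay. As a rough illustration of how such policies usually differ, the sketch below lists the wait before each attempt. RetrySchedule is a hypothetical helper, and the doubling rule used for "exp-backoff" is an assumption made for illustration, not a statement of how Falcon computes it.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache Falcon.
class RetrySchedule {

    // Returns the wait before each attempt for the given policy name.
    static List<Duration> delays(String policy, Duration delay, int attempts) {
        List<Duration> result = new ArrayList<>();
        for (int attempt = 0; attempt < attempts; attempt++) {
            if ("periodic".equals(policy)) {
                // Same wait before every attempt.
                result.add(delay);
            } else if ("exp-backoff".equals(policy)) {
                // Assumption: the wait doubles with every attempt.
                result.add(delay.multipliedBy(1L << attempt));
            } else {
                throw new IllegalArgumentException("Unknown policy: " + policy);
            }
        }
        return result;
    }
}
```

With the values from the slide, RetrySchedule.delays("periodic", Duration.ofHours(10), 3) yields three waits of 10 hours each.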
Why Falcon Unit?
Before Falcon Unit
Unit Tests for each module using either PigUnit or MRUnit or JUnit.
Integration Tests executed by bringing up AWS instances with the entire stack.
Before Falcon Unit
[Diagram: the sample pipeline of Falcon Feeds and Falcon Processes. Some stages are marked OK; others are marked Spec. Invalid, Invalid Input, Invalid Output, or Improper Replication.]
Motivation for Falcon Unit
❖ User errors caught only at deploy time.
➢ Input/Output feeds and paths not getting resolved.
➢ Errors in specification.
❖ Integration Tests require environment setup/tearDown.
➢ Messy deployment scripts.
➢ Time consuming.
❖ Debugging was cumbersome.
➢ Logs scattered.
Falcon Unit
Falcon Unit
In-process execution env.
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue
Actual cluster
● Oozie
● HDFS
● YARN
● ActiveMQ
Test suite
What you can test
Data Management
● Data creation
● Data injection
● Retention
● Replication
Data Governance
● Lineage
● Data availability for verification
Process Management
● Validation of definition
● Entity scheduling and status verification
● Correctness of data window being picked up
● Reruns
● Missing dependencies/properties
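To make "validation of definition" and "missing dependencies" concrete, here is a minimal sketch of one such check done outside Falcon: parse a process definition and report every input or output whose feed attribute does not name a known feed. FeedReferenceCheck and unresolvedFeeds are hypothetical names, and the check only illustrates the idea using the JDK XML parser; it is not how Falcon or Falcon Unit validates entities.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Apache Falcon.
class FeedReferenceCheck {

    // Returns the feed names referenced by <input>/<output> elements
    // that are not present in knownFeeds.
    static List<String> unresolvedFeeds(String processXml, Set<String> knownFeeds) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(processXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Malformed XML is reported as an invalid definition.
            throw new IllegalArgumentException("Invalid process definition", e);
        }
        List<String> missing = new ArrayList<>();
        for (String tag : new String[] {"input", "output"}) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                String feed = ((Element) nodes.item(i)).getAttribute("feed");
                if (!knownFeeds.contains(feed)) {
                    missing.add(feed);
                }
            }
        }
        return missing;
    }
}
```

Run against the process specification shown earlier with only "clicks-enhanced" registered, the check would report "click-hourly" as unresolved.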
After Falcon Unit
[Diagram: the same pipeline of Falcon Feeds and Falcon Processes. Every stage is now marked OK, and the pipeline is marked TESTED.]
Falcon Unit Illustrated
Capabilities with example
❖ Entity creation and data flow validation.
❖ Data Injection.
❖ Data Retention and Replication.
❖ Seamless API for cluster and local mode.
Example Pipeline
Hourly Clicks
Daily clicks
Daily Aggregation
...
Consumes
Produces
Deferred Clicks
...
Cluster Creation
→ Local Mode
submit(EntityType.CLUSTER, coloName, clusterName, propsMap);
submitCluster(); // uses defaults
→ Cluster Mode
submit(EntityType.CLUSTER, <Path to Cluster XML>);
Feed Creation
Submit Feed
submit(EntityType.FEED, <Path to Hourly Clicks XML>);
Inject Data
createData("HourlyClicks", "local", scheduleTime, <test data path>, numInstances);
/projects/falcon/clicks/hourly/2015/09/28/00/<data>
…./01/<data>
…./02/<data>
…./03/<data>
…./04/<data>
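The listing above shows one directory per hourly instance. A minimal sketch of how such paths follow from a feed location template is given below. InstancePaths is a hypothetical helper, and the template string is an assumption inferred from the listed paths and the ${YEAR}/${MONTH}/${DAY}/${HOUR} variables in the feed specification; the slides do not say whether createData counts instances forward or backward from scheduleTime, so the sketch simply counts forward.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Falcon Unit.
class InstancePaths {

    // Expands the template once per hour, starting at 'start' (treated as UTC).
    static List<String> hourly(String template, LocalDateTime start, int numInstances) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            LocalDateTime t = start.plusHours(i);
            paths.add(template
                    .replace("${YEAR}", String.format("%04d", t.getYear()))
                    .replace("${MONTH}", String.format("%02d", t.getMonthValue()))
                    .replace("${DAY}", String.format("%02d", t.getDayOfMonth()))
                    .replace("${HOUR}", String.format("%02d", t.getHour())));
        }
        return paths;
    }
}
```

Calling InstancePaths.hourly("/projects/falcon/clicks/hourly/${YEAR}/${MONTH}/${DAY}/${HOUR}", LocalDateTime.of(2015, 9, 28, 0, 0), 5) returns the five directories from /projects/falcon/clicks/hourly/2015/09/28/00 through .../04.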
Process Creation
Process Submission:
submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Local Mode
submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Cluster Mode
Process Scheduling:
scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);
Process Verification:
getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
Data Retention
● Data retention can be validated by scheduling the feed in both cluster mode and local mode.
createData("HourlyClicks", "local", timeStamp, <test data path>);
schedule(EntityType.FEED, "HourlyClicks", "local");
status = getInstanceStatus(EntityType.FEED, "HourlyClicks");
● Falcon Unit provides APIs for validating the existence of paths.
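When asserting on retention, it helps to know in advance which injected instances should be gone. The sketch below encodes the basic rule implied by a retention limit such as days(2): an instance older than the limit is eligible for deletion. RetentionCheck is a hypothetical helper, and both the rule and its boundary behaviour (an instance exactly at the limit is kept) are simplifying assumptions, not Falcon's exact eviction logic.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of Falcon Unit.
class RetentionCheck {

    // True when the instance is strictly older than the retention limit.
    static boolean isExpired(Instant instanceTime, Instant now, Duration limit) {
        return instanceTime.isBefore(now.minus(limit));
    }
}
```

A retention test could inject instances on both sides of the limit, schedule the feed, and then assert that the paths for which isExpired returns true no longer exist while the others do.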
Data Replication
Data replication can also be tested using Falcon Unit:
submitCluster(coloName, srcCluster, propsMap);
submitCluster(coloName, targetCluster, propsMap);
createData("HourlyClicks", srcCluster, timeStamp, <test data path>);
schedule(EntityType.FEED, "HourlyClicks", targetCluster);
status = getInstanceStatus(EntityType.FEED, feed, targetCluster);
Assert.assertEquals(status, WorkflowStatus.SUCCEEDED);
Going forward ...
❖ Improved data injection
➢ Generation of test data from template
➢ Sampling of production data for testing
❖ Support for other data lifecycle operations
➢ Data ingestion, export
❖ Maven plugin for build-time validation of definitions.
Demo
Questions?
If you want to ask later - [email protected]
If you want to contribute - [email protected]