Introduction to Apache Airflow - Data Day Seattle 2016

68
Introducing Apache Airflow ( Incubating ) Sid Anand ( @r39132 ) Data Day Seattle 2016 1

Transcript of Introduction to Apache Airflow - Data Day Seattle 2016

Introducing Apache Airflow (Incubating)

Sid Anand (@r39132) Data Day Seattle 2016

1

About Me

2

Work [ed | s] @

Maintainer on

Reports to

Co-Chair for

Apache Airflow

3

What is it?

4

Apache Airflow : What is it?Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)

5

Apache Airflow : What is it?Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)

It ships with • DAG Scheduler • Web application (UI) • Powerful CLI

6

Apache Airflow : What is it?

Airflow - Authoring DAGs

7

Airflow: Visualizing a DAG

8

Airflow: Author DAGs in Python! No need to bundle many XML files!

Airflow - Authoring DAGs

9

Airflow: The Tree View offers a view of DAG Runs over time!

Airflow - Authoring DAGs

Airflow - Performance Insights

10

Airflow: Gantt charts reveal the slowest tasks for a run!

11

Airflow: …And we can easily see performance trends over time

Airflow - Performance Insights

12

Apache Airflow : What is it?When would you use a Workflow Scheduler like Airflow?

• ETL Pipelines

• Machine Learning Pipelines

• Predictive Data Pipelines • Fraud Detection, Scoring/Ranking, Classification,

Recommender System, etc…

• General Job Scheduling (e.g. Cron) • DB Back-ups, Scheduled code/config deployment

13

Apache Airflow : What is it?

What should a Workflow Scheduler do well? • Schedule a graph of dependencies

• where Workflow = A DAG of Tasks

• Handle task failures

• Report / Alert on failures

• Monitor performance of tasks over time

• Enforce SLAs • E.g. Alerting if time or correctness SLAs are not met

• Scale

14

Apache Airflow : What is it?What Does Apache Airflow Add?

• Configuration-as-code

• Usability - Stunning UI / UX

• Centralized configuration

• Resource Pooling

• Extensibility

Apache Airflow

15

Incubating

16

Apache Airflow : Incubating

Timeline • Airflow was created @ Airbnb in 2015 by Maxime

Beauchemin • Max launched it @ Hadoop Summit in Summer 2015 • On 3/31/2016, Airflow —> Apache Incubator

Today • 166+ Contributors • 300+ Users • 40+ companies officially using it! • 9 Committers/Maintainers <— We’re growing here

Agari

17

What We Do!

Agari : What We Do

18

19

Agari : What We Do

20

Agari : What We Do

21

Agari : What We Do

22

Agari : What We Do

23

Enterprise Customers

email metadata

apply trust

models

email md + trust score

Agari’s Current Product

Agari : What We Do

24

email metadata

apply trust

models

email md+ trust score

Agari’s Future ProductEnterprise Customers

Agari : What We Do

Apache Airflow @ AgariHow Do We Use It?

25

Classes of Orchestration

26

apply trust models

(message scoring)

build trust models

cron++ (general

job scheduler)

New Product (Enterprise Protect)

Operational Automation BI / ETL

N / A

Classes of Orchestration

27

apply trust models

(message scoring)

build trust models

cron++ (general

job scheduler)

New Product (Enterprise Protect)

Operational Automation BI / ETL

N / A

This Talk

Use-Case : Message ScoringBatch Pipeline Architecture

28

Use-Case : Message Scoring

29

enterprise Aenterprise Benterprise C

S3

S3 uploads every 15 minutes

Use-Case : Message Scoring

30

enterprise Aenterprise Benterprise C

S3

Airflow kicks of a Spark message scoring job

every hour

Use-Case : Message Scoring

31

enterprise Aenterprise Benterprise C

S3

Spark job writes scored messages and stats to

another S3 bucket

S3

Use-Case : Message Scoring

32

enterprise Aenterprise Benterprise C

S3

This triggers SNS/SQS messages events

S3

SNS

SQS

Use-Case : Message Scoring

33

enterprise Aenterprise Benterprise C

S3

An Autoscale Group (ASG) of Importers spins up when it detects SQS

messages

S3

SNS

SQS

Importers

ASG

34

enterprise Aenterprise Benterprise C

S3

The importers rapidly ingest scored messages and aggregate statistics into

the DB

S3

SNS

SQS

Importers

ASGDB

Use-Case : Message Scoring

35

enterprise Aenterprise Benterprise C

S3

Users receive alerts of untrusted emails & can review them in

the web app

S3

SNS

SQS

Importers

ASGDB

Use-Case : Message Scoring

36

enterprise Aenterprise Benterprise C

S3 S3

SNS

SQS

Importers

ASGDB

Airflow manages the entire process

Use-Case : Message Scoring

Use-Case : Message ScoringAirflow DAG

37

38

Airflow DAG

39

5 minute wait for S3 eventual consistency

Airflow DAG

40

1 hour a day, we also build new models

Airflow DAG

41

build models

Airflow DAG

42

dummy needed for branch operator

Airflow DAG

43

• trigger_rule: one_success

Airflow DAG

44

Prep Spark Run

Airflow DAG

45

• Run Spark

• Verify a record is written to the DB

• Wait for the SQS queue to empty

Airflow DAG

46

• Compute discrepancies

• Send email report

• Update monitoring graphs

• Raise SLA (correctness) alerts

Airflow DAG

SLAs & InsightsAirflow

47

48

Desirable Qualities of a Resilient Data Pipeline

OperabilityCorrectness

Timeliness Scalable/Available

• Data Integrity (no loss, etc…) • Expected data distributions

• All output within time-bound SLAs (e.g. 1 hour)

• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs

• Quick Recoverability

• ASGs, SQS, SNS, S3

49

Desirable Qualities of a Resilient Data Pipeline

OperabilityCorrectness

Timeliness

• Data Integrity (no loss, etc…) • Expected data distributions

• All output within time-bound SLAs (e.g. 1 hour)

• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs

• Quick Recoverability

SLA

SLA• ASGs, SQS, SNS, S3 Scalable/Available

50

Correctness : Email Reporting

orgs

51

Correctness : Email Reporting

For each org, we check for duplicate or missing data as a count & percentage

orgs

52

Correctness : Email Reporting

These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!

orgs

53

Correctness : Monitoring

54

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

55

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

56

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)

57

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

Correctness SLA miss

58

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps

Use-Case : Model Building v2For Both Batch & Near Realtime Scoring Pipelines

59

60

Airflow DAG

61

Model Building DAG

Launch an EMR cluster

62

Run model building as EMR steps

Model Building DAG

63

Validate models

Send email notification if tests fail

Model Building DAG

64

Terminate EMR cluster

Model Building DAG

Apache Airflow Next Steps

65

Areas for Improvement

66

Apache Airflow Next Steps

Improvement Areas

• Security

• API (though we do have a CLI)

• Deployment / Versioning

• Execution Scale Out

• On-demand Execution

Acknowledgments

67

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones

• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle

None of this work would be possible without the contributions of the strong team below

Questions? (@r39132)

68