Resilient Predictive Data Pipelines (GOTO Chicago 2016)
![Page 1: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/1.jpg)
Resilient Predictive Data Pipelines
Sid Anand (@r39132) GOTO Chicago 2016
1
![Page 2: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/2.jpg)
About Me
2
Work [ed | s] @
Committer & PPMC on
Report to
Co-Chair for
![Page 3: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/3.jpg)
Motivation
Why is a Data Pipeline talk in this Always Available Track?
3
![Page 4: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/4.jpg)
Motivation
4
Always On work has traditionally focused on the availability of Serving Systems :
• Synchronous or Semi-Synchronous
• Often Transactional
• Latency-sensitive
![Page 5: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/5.jpg)
HA Goals of Serving Systems
5
Outages are Big News Items!
![Page 6: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/6.jpg)
And sometimes your failures become your brand!
6
![Page 7: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/7.jpg)
Motivation
7
Always On work has traditionally focused on the availability of Serving Systems :
• Synchronous or Semi-Synchronous
• Often Transactional
• Latency-sensitive
![Page 8: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/8.jpg)
Motivation
8
Arguably, the more valuable parts of online services are driven by Data Flow Systems (a.k.a. Data Pipelines):
• Asynchronous
• Throughput-sensitive
![Page 9: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/9.jpg)
Data Products
9
![Page 10: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/10.jpg)
Serving + Data Pipelines
10
A business’s viability is based on its ability to
•keep the site up (Always-On Serving Architectures) &
•maintain engagement (views & clicks) with customers (Always-On Data Pipelines)
This talk is about Always On Data Pipelines!
![Page 11: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/11.jpg)
Serving + Data Pipelines
11
Serving:
• FE Load Balancers
• Web Servers
• Microservice Layer
• Data Layer (DB, Search, Caching, Graph DB, Object Store)
Data Pipeline:
• Data Integration Layer
• DAGs + Scheduler + Distributed Computation Engine
![Page 12: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/12.jpg)
Data Pipeline Challenges
12
![Page 13: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/13.jpg)
Data Pipeline Challenges
Problem 1 : The Blast Radius Problem
13
![Page 14: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/14.jpg)
The Blast Radius Problem
14
• A developer introduces a bug in Data Pipeline Job 1
• Data Pipeline Job 1 reads Data A & writes Data B
![Page 15: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/15.jpg)
The Blast Radius Problem
15
• A developer introduces a bug in Data Pipeline Job 1
• Data Pipeline Job 1 reads Data A & writes Data B
• Data Pipeline Job 2 reads Data B & writes Data C
![Page 16: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/16.jpg)
The Blast Radius Problem
16
• A developer introduces a bug in Data Pipeline Job 1
• Data Pipeline Job 1 reads Data A & writes Data B
• Data Pipeline Job 2 reads Data B & writes Data C
• Data Pipeline Job 3 reads Data C & writes Data D to a Serving System DB
![Page 17: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/17.jpg)
The Blast Radius Problem
17
• A developer introduces a bug in Data Pipeline Job 1
• Data Pipeline Job 1 reads Data A & writes Data B
• Data Pipeline Job 2 reads Data B & writes Data C
• Data Pipeline Job 3 reads Data C & writes Data D to a Serving System DB
• Serving System 4 reads Data D, where the bug is discovered!
![Page 18: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/18.jpg)
The Blast Radius Problem
18
•The previous diagram only shows one path of a tree
•The reality is much worse
•For each data set produced, there are multiple consuming jobs and hence multiple bad downstream outputs
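A minimal sketch of the point above, assuming a hypothetical dependency graph (the job names and the `consumers` mapping are illustrative, not from the talk). It walks the graph breadth-first to list every downstream job that consumed bad data and would need a rerun:

```python
from collections import deque


def blast_radius(consumers, buggy_job):
    """Return every job downstream of buggy_job, in breadth-first order."""
    affected = []
    seen = {buggy_job}
    queue = deque([buggy_job])
    while queue:
        job = queue.popleft()
        for downstream in consumers.get(job, []):
            if downstream not in seen:
                seen.add(downstream)
                affected.append(downstream)
                queue.append(downstream)
    return affected


# Hypothetical graph: job -> jobs that read its output.
consumers = {
    "job1": ["job2", "job2b"],
    "job2": ["job3"],
    "job2b": ["job3b"],
    "job3": ["serving_system"],
}

# Everything that has to be rerun after fixing the bug in job1.
to_rerun = blast_radius(consumers, "job1")
```

With two consumers per data set instead of one, the rerun list grows with every level of the tree, which is why the single path in the earlier diagram understates the problem.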
![Page 19: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/19.jpg)
The Blast Radius Problem
An acute pain point
19
![Page 20: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/20.jpg)
The Blast Radius Problem
20
Detect Bug
Job 1
Job 2
Job 3
Serving System
![Page 21: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/21.jpg)
The Blast Radius Problem
21
Detect Bug
Identify Cause
![Page 22: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/22.jpg)
Identify Cause
22
![Page 23: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/23.jpg)
The Blast Radius Problem
23
Detect Bug
Identify Cause
![Page 24: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/24.jpg)
The Blast Radius Problem
24
Detect Bug
Identify Cause
Deploy a Fix
![Page 25: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/25.jpg)
Roll out a Fix & Rerun all Downstream Jobs in the Affected Time Window
25
![Page 26: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/26.jpg)
The Blast Radius Problem
26
Detect Bug
Identify Cause
Re-run All Jobs over a Time Window
![Page 27: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/27.jpg)
Take Aways?
27
• The cost of a Data Pipeline bug in people, time, and morale is high, and such bugs can occur frequently.
• Testing is invaluable in most areas of software, but less so in data pipelines
• Data Pipeline bugs can be due to a logic problem or bad input data!
• Best Option : Detect & Rollback/Fix Forward
![Page 28: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/28.jpg)
The Blast Radius Solution
28
Detect Bug
Identify Cause
Re-run 1 Job
![Page 29: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/29.jpg)
Data Pipeline Challenges
Timeliness
29
![Page 30: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/30.jpg)
Timeliness
Job 1 Job 2 Job 3
Definition : job = workflow = DAG of tasks
Job 3’s output is pushed to a serving system
![Page 31: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/31.jpg)
Timeliness
Consider the Daily Run Schedule below:
Run 1 : Monday
Run 2 : Tuesday
Run 3 : Wednesday
![Page 32: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/32.jpg)
32
Timeliness
Within Time SLA OUT
Run 4
Run 5
Run 6
![Page 33: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/33.jpg)
Timeliness
Why do jobs get slower?
33
![Page 34: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/34.jpg)
34
Timeliness
![Page 35: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/35.jpg)
35
new features
Timeliness
![Page 36: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/36.jpg)
36
Algo taking longer
Timeliness
![Page 37: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/37.jpg)
37
Algo bug fix
Timeliness
![Page 38: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/38.jpg)
38
Algo bug fixes
Timeliness
![Page 39: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/39.jpg)
39
new features
Timeliness
![Page 40: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/40.jpg)
Take Aways?
40
• Data Science & Engineering work is a virtuous cycle of adding features (and the like) + tuning performance
• Latency does matter (a bit)
![Page 41: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/41.jpg)
Design Goals
Desirable Qualities of a Resilient Data Pipeline
41
![Page 42: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/42.jpg)
42
Desirable Qualities of a Resilient Data Pipeline
• Operability
• Correctness
• Timeliness
• Cost
![Page 43: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/43.jpg)
43
Desirable Qualities of a Resilient Data Pipeline
• Correctness: Data Integrity (no loss, etc…); Expected data distributions
• Timeliness: All output within time-bound SLAs
• Operability: Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs; Quick Recoverability
• Cost: Pay-as-you-go
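One way to make the correctness qualities checkable is a small gate that runs before a job's output is handed downstream. This is only a sketch under assumptions: the function name, the use of a mean score as the "expected distribution", and the thresholds are all hypothetical, not taken from the talk.

```python
from statistics import mean


def check_output(rows_in, scores, expected_mean, tolerance):
    """Return a list of problems found in a job's output (empty means OK)."""
    problems = []
    # Data integrity: every input row should yield exactly one output score.
    if len(scores) != rows_in:
        problems.append("row count mismatch: %d in, %d out" % (rows_in, len(scores)))
    # Expected distribution: the mean score should stay near its usual value.
    if scores and abs(mean(scores) - expected_mean) > tolerance:
        problems.append("mean score drifted from %.2f" % expected_mean)
    return problems


good = check_output(4, [0.2, 0.4, 0.6, 0.8], expected_mean=0.5, tolerance=0.1)
bad = check_output(4, [0.9, 0.95, 1.0], expected_mean=0.5, tolerance=0.1)
```

A non-empty result would be a reason to alert and hold the output back, which keeps a bad run from widening the blast radius.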
![Page 44: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/44.jpg)
Quickly Recoverable
44
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR
![Page 45: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/45.jpg)
Implementation
Using AWS to meet Design Goals
45
![Page 46: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/46.jpg)
SQS
Simple Queue Service
46
![Page 47: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/47.jpg)
SQS - Overview
47
AWS’s low-latency, highly scalable, highly available message queue
Infinitely Scalable Queue (though not FIFO)
Low End-to-end latency (generally sub-second)
Pull-based
![Page 48: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/48.jpg)
SQS - Typical Operation Flow
48
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timer running on m1]
Step 1: A consumer reads a message from SQS. This starts a visibility timer!
![Page 49: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/49.jpg)
SQS - Typical Operation Flow
49
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timer running on m1]
Step 2: Consumer persists message contents to DB
![Page 50: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/50.jpg)
SQS - Typical Operation Flow
50
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timer running on m1]
Step 3: Consumer ACKs message in SQS
![Page 51: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/51.jpg)
SQS - Time Out Example
51
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timer running on m1]
Step 1: A consumer reads a message from SQS
![Page 52: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/52.jpg)
SQS - Time Out Example
52
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timer running on m1]
Step 2: Consumer attempts to persist message contents to DB
![Page 53: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/53.jpg)
SQS - Time Out Example
53
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timeout on m1]
Step 3: A Visibility Timeout occurs & the message becomes visible again.
![Page 54: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/54.jpg)
SQS - Time Out Example
54
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; m1 shown twice, visibility timer running]
Step 4: Another consumer reads and persists the same message
![Page 55: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/55.jpg)
SQS - Time Out Example
55
[Diagram: Producers → SQS (m1 m2 m3 m4 m5) → Consumers → DB; visibility timer running on m1]
Step 5: Consumer ACKs message in SQS
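The read, persist, ACK flow and the time-out case can be replayed with a toy in-memory queue. This is a sketch, not the SQS API: the `VisibilityQueue` class, its method names, and the explicit `now` clock are assumptions made so the example runs without AWS.

```python
class VisibilityQueue:
    """Toy in-memory queue that mimics a visibility timeout (not the real SQS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}         # message id -> body
        self.invisible_until = {}  # message id -> time it becomes visible again

    def send(self, msg_id, body):
        self.messages[msg_id] = body
        self.invisible_until[msg_id] = 0

    def receive(self, now):
        """Return the first visible message and hide it for visibility_timeout."""
        for msg_id, body in self.messages.items():
            if self.invisible_until[msg_id] <= now:
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def ack(self, msg_id):
        """Delete the message so it is never delivered again."""
        self.messages.pop(msg_id, None)
        self.invisible_until.pop(msg_id, None)


db = {}
queue = VisibilityQueue(visibility_timeout=30)
queue.send("m1", "payload-1")

first = queue.receive(now=0)    # consumer A reads m1; the timer starts
hidden = queue.receive(now=10)  # m1 is invisible, so no other consumer sees it
second = queue.receive(now=31)  # A never ACKed; m1 is visible again for consumer B

db[second[0]] = second[1]       # keyed write, so persisting m1 twice is harmless
queue.ack(second[0])
empty = queue.receive(now=100)  # nothing left after the ACK
```

Because the same message can reach two consumers, the write to the DB is keyed by message id here; any consumer behind such a queue needs some form of idempotent persist.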
![Page 56: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/56.jpg)
SQS - Dead Letter Queue
56
[Diagram: Producers → SQS (m2 m3 m4 m5) → Consumers → DB, with a visibility timer on m1; an SQS DLQ with Redrive rule : 2x receives m1]
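The redrive rule can be sketched as a bounded retry loop. The function below is a hypothetical stand-in (in SQS itself the limit is configured on the queue's redrive policy rather than in consumer code); `max_receives=2` mirrors the 2x rule on the slide.

```python
def deliver_with_redrive(message, handler, max_receives=2):
    """Try a handler up to max_receives times, then report the message as dead-lettered."""
    for attempt in range(1, max_receives + 1):
        try:
            handler(message)
            return "processed", attempt
        except Exception:
            # A failed attempt stands in for a visibility timeout without an ACK.
            continue
    return "dead_lettered", max_receives


dead_letter_queue = []


def always_fails(message):
    raise ValueError("cannot parse " + message)


status, attempts = deliver_with_redrive("m1", always_fails)
if status == "dead_lettered":
    dead_letter_queue.append("m1")

ok_status = deliver_with_redrive("m2", lambda message: None)
```

Parking a poison message in the DLQ keeps it from being redelivered forever while m2 through m5 continue to flow.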
![Page 57: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/57.jpg)
SNS
Simple Notification Service
57
![Page 58: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/58.jpg)
SNS - Overview
58
Highly Scalable, Highly Available, Push-based Topic Service
Whereas SQS is pull-based, SNS is push-based
• There is no message retention & there is a finite retry count: No Reliable Message Delivery
• Whereas SQS ensures each message is seen by at least 1 consumer, SNS ensures that each message is seen by every consumer: Reliable Multi-Push
Can we work around this limitation while getting Reliable Multi-push?
![Page 59: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/59.jpg)
SNS + SQS Design Pattern
59
[Diagram: messages m1 m2 published to SNS T1 are copied into SQS Q1 and SQS Q2]
SNS provides Reliable Multi Push
SQS provides Reliable Message Delivery
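The fan-out pattern can be illustrated with plain in-memory objects. This is a toy model, not the SNS or SQS API: the `Topic` class is hypothetical and the `deque` objects merely stand in for Q1 and Q2.

```python
from collections import deque


class Topic:
    """Toy push-based topic: publish copies each message to every subscriber queue."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        for queue in self.subscribers:
            queue.append(message)


q1, q2 = deque(), deque()  # stand-ins for SQS Q1 and SQS Q2
t1 = Topic()               # stand-in for SNS T1
t1.subscribe(q1)
t1.subscribe(q2)

for message in ("m1", "m2"):
    t1.publish(message)

# One consumer group drains Q1 at its own pace; Q2 still holds its own copies.
first_group_got = [q1.popleft(), q1.popleft()]
```

Each queue retains its copy until its own consumers ACK it, so a slow or failing consumer group on one queue does not affect the other.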
![Page 60: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/60.jpg)
SNS + SQS
60
[Diagram: Producers → SNS T1 → SQS Q1 (m1 m2) → Consumers → DB, and SNS T1 → SQS Q2 (m1 m2) → Consumers → ES]
![Page 61: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/61.jpg)
Batch Pipeline Architecture
Putting the Pieces Together
61
![Page 62: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/62.jpg)
But First ….
62
![Page 63: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/63.jpg)
What Does Agari Do?
63
![Page 64: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/64.jpg)
What Does Agari Do?
64
Agari’s Current Product
Customers → email metadata → apply trust models → email + trust score
![Page 65: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/65.jpg)
What Does Agari Do?
65
Agari’s Future Product
Enterprise Customers → email metadata → apply trust models → email md + trust score
![Page 66: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/66.jpg)
Batch Pipeline Architecture
Putting the Pieces Together
66
![Page 67: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/67.jpg)
Batch Architecture
67
•S3 to hold all source & computed data (Avro)
•EMR Spark for scoring + summarization
•Apache Airflow for hourly job scheduling
•SNS+SQS for messaging
•ASG Importer to import
•WebApp in Ruby-on-Rails
![Page 68: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/68.jpg)
Tackling Cost & Timeliness
Leveraging the AWS Cloud
68
![Page 69: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/69.jpg)
Tackling Cost
69
[Figure panels: Between Hourly Runs | During Hourly Runs]
![Page 70: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/70.jpg)
Tackling Timeliness
Auto Scaling Group (ASG)
70
![Page 71: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/71.jpg)
ASG - Overview
71
What is it?
A means to automatically scale out/in clusters to handle variable load/traffic
A means to keep a cluster/service of a fixed size always up
![Page 72: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/72.jpg)
ASG - Data Pipeline
72
[Diagram: importer instances in the Importer ASG (scale out / in) read from SQS and write to a DB]
![Page 73: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/73.jpg)
73
ASG : CPU-based
[Chart series: Sent, CPU, ACKd/Recvd]
CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
![Page 74: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/74.jpg)
ASG : CPU-based
74
[Chart series: Sent, CPU, Recv; annotation: Premature Scale-in]
Premature Scale-in:
• The CPU drops to noise-levels before all messages are consumed
• This causes scale-in to occur while the last few messages are still being committed
![Page 75: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/75.jpg)
75
ASG : Queue-based
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
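The two queue-based rules can be written as one decision function. This is a sketch under assumptions: the function name, the one-instance step, and the `min_size`/`max_size` bounds are hypothetical. In CloudWatch, the two inputs would correspond to the SQS metrics `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible`.

```python
def desired_capacity(current, visible, in_flight, min_size=0, max_size=4):
    """Return the importer count the ASG should move to.

    visible   -- messages waiting in the queue (queue depth)
    in_flight -- messages read by a consumer but not yet ACKed
    """
    if visible > 0:
        # Scale-out: there is still work waiting in the queue.
        return min(current + 1, max_size)
    if in_flight == 0:
        # Scale-in: the last in-flight message has been ACKed.
        return min_size
    # Queue is drained but commits are still in progress: hold steady.
    return current


start_of_run = desired_capacity(current=0, visible=5, in_flight=0)
at_ceiling = desired_capacity(current=4, visible=5, in_flight=2)
draining = desired_capacity(current=3, visible=0, in_flight=2)
finished = desired_capacity(current=3, visible=0, in_flight=0)
```

The `draining` case is the one CPU-based scaling gets wrong: CPU is already low, but shrinking the group would interrupt messages that are still being committed.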
![Page 76: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/76.jpg)
76
Desirable Qualities of a Resilient Data Pipeline
• Operability
• Correctness
• Timeliness: ASG, EMR Spark
• Cost: ASG, EMR Spark
![Page 77: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/77.jpg)
Tackling Operability & Correctness
Leveraging Tooling
77
![Page 78: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/78.jpg)
78
Tackling Operability : Requirements
• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools
![Page 79: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/79.jpg)
Apache Airflow
Workflow Automation & Scheduling
79
![Page 80: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/80.jpg)
80
Apache Airflow - Authoring DAGs
Airflow: Author DAGs in Python! No need to bundle many config files!
![Page 81: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/81.jpg)
81
Apache Airflow - Authoring DAGs
Airflow: Visualizing a DAG
![Page 82: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/82.jpg)
82
Apache Airflow - Managing DAGs
Airflow: It’s easy to manage multiple DAGs
![Page 83: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/83.jpg)
Apache Airflow - Perf. Insights
83
Airflow: Gantt chart view reveals the slowest tasks for a run!
![Page 84: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/84.jpg)
84
Apache Airflow - Perf. Insights
Airflow: Task Duration chart view shows task completion time trends!
![Page 85: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/85.jpg)
85
Apache Airflow - Alerting
Airflow: …And easy to integrate with Ops tools!
![Page 86: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/86.jpg)
86
Apache Airflow - Correctness
![Page 87: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/87.jpg)
87
Desirable Qualities of a Resilient Data Pipeline
• Operability
• Correctness
• Timeliness
• Cost
![Page 88: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/88.jpg)
Near-Real Time Data Pipelines
Stream Processing @ Agari
88
![Page 89: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/89.jpg)
NRT Architecture
89
![Page 90: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/90.jpg)
NRT Architecture
90
![Page 91: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/91.jpg)
91
NRT Architecture
The Architecture is composed of repeated patterns of :
• ASG-based compute
• Kinesis streams (i.e. AWS’ managed “Kafka”)
• Lambda-based Avro Schema Registry
![Page 92: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/92.jpg)
Avro Schema Registry
Avro Schema Storage
92
![Page 93: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/93.jpg)
93
What is Avro?
A self-describing (schema’d) serialization format
{
  "namespace": "com.agari.ep.collector.model",
  "type": "record",
  "doc": "This Schema describes the server-side configuration of Agari's Enterprise-Protect Collector",
  "name": "collector_config",
  "fields": [
    {"name": "email_log_enabled", "type": "boolean"},
    {"name": "email_log_interval_seconds", "type": ["int", "null"]},
    {"name": "email_log_bucket_name", "type": "string"},
    {"name": "phone_home_interval_seconds", "type": "int"},
    {"name": "phone_home_sns_topic_ARN", "type": "string"},
    {"name": "config_pull_interval_seconds", "type": "int"},
    {"name": "receiver_netblocks", "type": ["null", {"type": "array", "items": "string"}]},
    {
      "name": "connecting_ip",
      "type": [
        "null",
        {
          "type": "record",
          "name": "connecting_ip_record",
          "fields": [
            {"name": "received_header_index", "type": "int"},
          ]
……
![Page 94: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/94.jpg)
What is Avro?
94
Typically, the schema is stored in the same file as the data it represents
In HDFS, where files are typically large, the schema overhead is negligible
![Page 95: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/95.jpg)
Schema Registry
95
[Diagram: Producers (P) → Kinesis Stream → Consumers (C)]
In streaming, where each record may be sent individually, the schema will be the majority of the data transmitted!
This is a fat message
Can we be smarter?
![Page 96: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/96.jpg)
Schema Registry
96
[Diagram: Producers (P) → Kinesis Stream → Consumers (C), alongside a Schema Registry (Lambda) + DynamoDB; operations: register_schema, get_schema_by_id]
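The idea behind the registry can be sketched in a few lines: register the schema once, then ship only a short id with each record. Everything here is an assumption for illustration: the `SchemaRegistry` class is an in-memory stand-in for the Lambda + DynamoDB service, the hash-based id scheme is hypothetical, and JSON stands in for Avro's binary encoding. Only the two operation names come from the slide.

```python
import hashlib
import json


class SchemaRegistry:
    """In-memory stand-in for a schema registry service."""

    def __init__(self):
        self._schemas = {}

    def register_schema(self, schema):
        """Store the schema and return a short, content-derived id."""
        canonical = json.dumps(schema, sort_keys=True)
        schema_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        self._schemas[schema_id] = canonical
        return schema_id

    def get_schema_by_id(self, schema_id):
        return json.loads(self._schemas[schema_id])


# A cut-down, illustrative schema (two fields borrowed from the slide's example).
schema = {
    "type": "record",
    "name": "collector_config",
    "fields": [
        {"name": "email_log_enabled", "type": "boolean"},
        {"name": "phone_home_interval_seconds", "type": "int"},
    ],
}
record = {"email_log_enabled": True, "phone_home_interval_seconds": 60}

registry = SchemaRegistry()
schema_id = registry.register_schema(schema)  # producer side, done once

fat_message = json.dumps({"schema": schema, "record": record})
thin_message = json.dumps({"schema_id": schema_id, "record": record})

# Consumer side: look the schema up by the id carried in the thin message.
looked_up = registry.get_schema_by_id(json.loads(thin_message)["schema_id"])
```

Consumers can cache schemas by id, so the registry is consulted only when a previously unseen id arrives.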
![Page 97: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/97.jpg)
What is AWS Lambda?
97
AWS-hosted code execution environment (Python, Node, Java, Ruby)
You upload some code & specify a simple memory and CPU profile (e.g. medium CPU, 256 MB memory)
The code will get a new version (e.g. v2)
Code rollback is as easy as setting the $LATEST alias to a previous version (e.g. $LATEST=v1)
![Page 98: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/98.jpg)
Elastic Stream Processing
How Do We Handle Increasing Traffic?
98
![Page 99: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/99.jpg)
Elastic Stream Processing
99
[Diagram: Producers (P) → Kinesis Stream → Consumers (C)]
![Page 100: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/100.jpg)
Elastic Stream Processing
100
[Diagram: Producers (P) → Kinesis Stream → Consumers (C)]
![Page 101: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/101.jpg)
Elastic Stream Processing
101
[Diagram: Producers (P) → Kinesis Stream → Consumers (C), now six instead of three, with Agari Scaling Utils]
![Page 102: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/102.jpg)
Open Source Plans
102
In late Q2/early Q3, we plan to open-source our cloud tools for :
• Avro Schema Registry &
• Agari (Kinesis+ASG) scaling tools
To be notified, follow @AgariEng & @r39132
![Page 103: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/103.jpg)
Acknowledgments
103
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones
• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang
![Page 104: Resilient Predictive Data Pipelines (GOTO Chicago 2016)](https://reader036.fdocuments.us/reader036/viewer/2022062316/5870656a1a28ab48378b4dd7/html5/thumbnails/104.jpg)
Questions? (@r39132)
104