Cloud Native Data Pipelines (DataEngConf SF 2017)
Transcript of Cloud Native Data Pipelines (DataEngConf SF 2017)
![Page 1: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/1.jpg)
Cloud Native Data Pipelines
1
Sid Anand (@r39132) DataEngConf SF 2017
![Page 2: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/2.jpg)
About Me
2
Work [ed | s] @
Committer & PPMC on
Father of 2
Co-Chair for
Apache Airflow
![Page 3: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/3.jpg)
Agari
3
What We Do!
![Page 4: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/4.jpg)
Agari : What We Do
4
![Page 5: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/5.jpg)
5
Agari : What We Do
![Page 6: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/6.jpg)
6
Agari : What We Do
![Page 7: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/7.jpg)
7
Agari : What We Do
![Page 8: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/8.jpg)
8
Agari : What We Do
![Page 9: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/9.jpg)
9
Enterprise Customers
email metadata
apply trust models
email md + trust score
Agari’s Previous EP Version
Agari : What We Do
Batch
![Page 10: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/10.jpg)
10
email metadata
apply trust models
email md + trust score
Agari’s Current EP Version
Enterprise Customers
Agari : What We Do
Near-real time
Quarantine, Label,
PassThrough
![Page 11: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/11.jpg)
Data Pipelines
BI vs Predictive
11
![Page 12: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/12.jpg)
Data Pipelines (BI)
12
Web Servers
OLTP DB
Data Warehouse
Reporting Tools
Query Browsers
ETL (batch)
MySQL, Oracle, Cassandra
Teradata, Redshift, BigQuery
![Page 13: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/13.jpg)
OLTP DB or cache
ETL (batch or streaming)
MySQL, Oracle, Cassandra, Redis
Spark, Flink, Beam, Storm
Web Servers
Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention
Data Source
Data Pipelines (Predictive)
13
![Page 14: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/14.jpg)
Data Products
14
![Page 15: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/15.jpg)
BI Predictive
Common Focus of this talk
Data Pipelines
15
Web Servers
OLTP DB
Data Warehouse
Reporting Tools
Query Browsers
ETL (batch)
MySQL, Oracle, Cassandra
Teradata, Redshift, BigQuery
OLTP DB or cache
ETL (batch or streaming)
MySQL, Oracle, Cassandra, Redis
Spark, Flink, Beam, Storm
Web Servers
Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention
Data Source
![Page 16: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/16.jpg)
Motivation
Cloud Native Data Pipelines
16
![Page 17: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/17.jpg)
Cloud Native Data Pipelines
17
Big Data Companies like LinkedIn, Facebook, Twitter, & Google have large teams to manage their data pipelines
Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
![Page 18: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/18.jpg)
Cloud Native Data Pipelines
18
Cloud Native Techniques
Open Source Technologies
Data Pipelines seen in Big Data companies
~
![Page 19: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/19.jpg)
Design Goals
Desirable Qualities of a Resilient Data Pipeline
19
![Page 20: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/20.jpg)
20
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
Cost
![Page 21: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/21.jpg)
21
Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Minimize Operational Fatigue / Automate Everything
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
![Page 22: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/22.jpg)
Predictive Analytics @ Agari
Use Cases
22
![Page 23: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/23.jpg)
Use Cases
23
Apply trust models (message scoring)
batch + near real time
Build trust models
batch
(Enterprise Protect)
![Page 24: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/24.jpg)
Use Cases
24
Apply trust models (message scoring)
batch + near real time
Build trust models
batch
(Enterprise Protect)
Focus of this talk
![Page 25: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/25.jpg)
Use-Case : Message Scoring (batch)
Batch Pipeline Architecture
25
![Page 26: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/26.jpg)
Use-Case : Message Scoring
26
enterprise A, enterprise B, enterprise C
S3
An Avro file is uploaded to S3 every 15 minutes
![Page 27: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/27.jpg)
Use-Case : Message Scoring
27
enterprise A, enterprise B, enterprise C
S3
Airflow kicks off a Spark message scoring job every hour (EMR)
![Page 28: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/28.jpg)
Use-Case : Message Scoring
28
enterprise A, enterprise B, enterprise C
S3
Spark job writes scored messages and stats to another S3 bucket
S3
![Page 29: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/29.jpg)
Use-Case : Message Scoring
29
enterprise A, enterprise B, enterprise C
S3
This triggers SNS/SQS message events
S3
SNS
SQS
![Page 30: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/30.jpg)
Use-Case : Message Scoring
30
enterprise A, enterprise B, enterprise C
S3
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
S3
SNS
SQS
Importers
ASG
![Page 31: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/31.jpg)
31
enterprise A, enterprise B, enterprise C
S3
The importers rapidly ingest scored messages and aggregate statistics into the DB
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
![Page 32: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/32.jpg)
32
enterprise A, enterprise B, enterprise C
S3
Users receive alerts of untrusted emails & can review them in the web app
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
![Page 33: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/33.jpg)
33
enterprise A, enterprise B, enterprise C
S3 S3
SNS
SQS
Importers
ASG
DB
Airflow manages the entire process
Use-Case : Message Scoring
![Page 34: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/34.jpg)
34
Architectural Components

| Component | Role | Uses | Salient Features | Operability Model |
| --- | --- | --- | --- | --- |
| S3 | Data Lake | All data stored in S3; all processing uses S3 | Scalable, Available, Performant | Serverless |
| SNS, SQS | Messaging | Reliable, Transactional, Pub/Sub | Scalable, Available, Performant | Serverless |
| ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Performant | Managed |
| Spark | Data Science Processing | Aggregation, Model Building, Scoring | Nice programming model at the cost of debugging complexity | We Operate |
| Airflow | Workflow Engine | Coordinates all Spark Jobs & complex flows | Lightweight, DAGs as Code, Steep learning curve | We Operate |
| DB | Persistence for WebApp | Holds subset of data needed for Web App | Rails + Postgres ‘nuff said | We Operate |
![Page 35: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/35.jpg)
Tackling Cost & Timeliness
Leveraging the AWS Cloud
35
![Page 36: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/36.jpg)
Tackling Cost
36
Between Daily Runs During Daily Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
![Page 37: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/37.jpg)
Tackling Cost
37
Between Hourly Runs During Hourly Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
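The cost argument on these two slides can be checked with a little arithmetic. The sketch below assumes that every instance launch is rounded up to a whole billed hour, as the slide describes; the 45-minute run length and the function name are hypothetical.

```python
import math


def billed_instance_hours(run_minutes: float, runs_per_day: int) -> int:
    """Instance-hours billed per day for one instance that is launched for
    each run and terminated afterwards, assuming each launch is rounded up
    to a whole hour (the billing model described on the slide)."""
    hours_per_run = math.ceil(run_minutes / 60)
    return hours_per_run * runs_per_day


# Hypothetical 45-minute run.
daily = billed_instance_hours(run_minutes=45, runs_per_day=1)
hourly = billed_instance_hours(run_minutes=45, runs_per_day=24)
always_on = 24  # instance-hours per day if the instance never stops

print(daily, hourly, always_on)
```

With one run per day the instance is billed for 1 of 24 hours. With one run per hour it is billed for all 24, the same as leaving it running.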
![Page 38: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/38.jpg)
Tackling Timeliness
Auto Scaling Group (ASG)
38
![Page 39: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/39.jpg)
ASG - Overview
39
What is it?
A means to automatically scale out/in clusters to handle variable load/traffic
A means to keep a cluster/service of a fixed size always up
![Page 40: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/40.jpg)
ASG - Data Pipeline
40
importer
importer
importer
importer
Importer ASG
scale out / in
SQS
DB
![Page 41: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/41.jpg)
41
Sent
CPU
ACKd/Recvd
CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
ASG : CPU-based
![Page 42: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/42.jpg)
ASG : CPU-based
42
Sent
CPU
Recv
Premature Scale-in
Premature Scale-in:
• The CPU drops to noise-levels before all messages are consumed
• This causes scale in to occur while the last few messages are still being committed
![Page 43: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/43.jpg)
43
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
ASG : Queue-based
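The two queue-based rules can be sketched as a small decision function. This is an illustration, not the alarm configuration used in the talk. The inputs correspond to the SQS CloudWatch metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible. The function name, the "hold" state, and the choice to let scale-out win when both rules match are assumptions.

```python
def desired_action(visible: int, in_flight: int) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' from two queue metrics.

    visible   -- messages waiting in the queue (queue depth)
    in_flight -- messages received by an importer but not yet ACK'd
    """
    if visible > 0:
        # Work is waiting: grow the ASG. Assumption: this rule takes
        # precedence if the scale-in condition is also true.
        return "scale_out"
    if in_flight == 0:
        # Nothing waiting and nothing being processed: shrink the ASG.
        return "scale_in"
    # Queue is empty but the last messages are still being committed.
    return "hold"


print(desired_action(visible=5, in_flight=2))
print(desired_action(visible=0, in_flight=3))
print(desired_action(visible=0, in_flight=0))
```

The "hold" case is the one CPU-based scaling gets wrong: CPU is already at noise level, but in-flight messages are not yet committed.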
![Page 44: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/44.jpg)
Auto Scaling Groups
Build & Deploy
44
![Page 45: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/45.jpg)
ASG - Build & Deploy
45

| Component | Role | Details |
| --- | --- | --- |
| Terraform | Spins up Cloud Resources | Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them |
| Ansible | Sets up an EC2 instance | A better version of Chef & Puppet. Agentless, idempotent, & declarative tool to set up EC2 instances, by installing & configuring packages, and more |
| Packer | Spins up an EC2 instance for the purposes of building an AMI! | Can be used with Ansible & Terraform to bake AMIs & launch Auto Scaling Groups |
![Page 46: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/46.jpg)
ASG - Build & Deploy
46
EC2 Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
![Page 47: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/47.jpg)
EC2
ASG - Build & Deploy
47
EC2 Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
![Page 48: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/48.jpg)
EC2
ASG - Build & Deploy
48
EC2
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
![Page 49: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/49.jpg)
EC2
ASG - Build & Deploy
49
EC2
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
Step 4 : Packer terminates the EC2 instance!
![Page 50: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/50.jpg)
EC2
ASG - Build & Deploy
50
EC2
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
Step 4 : Packer terminates the EC2 instance!
Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)
ASG
![Page 51: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/51.jpg)
51
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness: • ASG • EMR Spark
Cost: Daily • ASG • EMR Spark; Hourly • ASG • No Cost Savings
![Page 52: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/52.jpg)
Tackling Operability & Correctness
Leveraging Tooling
52
![Page 53: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/53.jpg)
53
A simple way to author, configure, and manage workflows
Provides visual insight into the state & performance of workflow runs
Integrates with our alerting and monitoring tools
Tackling Operability : Requirements
![Page 54: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/54.jpg)
Apache Airflow
Workflow Automation & Scheduling
54
![Page 55: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/55.jpg)
55
Airflow: Author DAGs in Python! No need to bundle many config files!
Apache Airflow - Authoring DAGs
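To show what "DAGs as code" means without depending on Airflow itself, here is a library-free sketch using Python's standard graphlib module (Python 3.9+). It is not the Airflow API. The task names are hypothetical and only loosely follow the batch pipeline described earlier.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical.
dag = {
    "wait_for_avro_upload": set(),
    "score_messages_spark": {"wait_for_avro_upload"},
    "import_scored_messages": {"score_messages_spark"},
    "send_alerts": {"import_scored_messages"},
}


def run_order(graph):
    """Return one valid execution order for the tasks.

    Raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(graph).static_order())


order = run_order(dag)
print(order)
```

Because the workflow is ordinary Python data and functions, it can be versioned, reviewed, and tested like any other code, which is the point the slide makes about avoiding bundles of config files.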
![Page 56: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/56.jpg)
56
Airflow: Visualizing a DAG
Apache Airflow - Authoring DAGs
![Page 57: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/57.jpg)
57
Airflow: It’s easy to manage multiple DAGs
Apache Airflow - Managing DAGs
![Page 58: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/58.jpg)
Apache Airflow - Perf. Insights
58
Airflow: Gantt chart view reveals the slowest tasks for a run!
![Page 59: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/59.jpg)
59
Apache Airflow - Perf. Insights
Airflow: Task Duration chart view shows task completion time trends!
![Page 60: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/60.jpg)
60
Airflow: …And easy to integrate with Ops tools!
Apache Airflow - Alerting
![Page 61: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/61.jpg)
61
Apache Airflow - Correctness
![Page 62: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/62.jpg)
62
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
Cost
![Page 63: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/63.jpg)
Use-Case : Message Scoring (near-real time)
NRT Pipeline Architecture
63
![Page 64: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/64.jpg)
Use-Case : Message Scoring
64
enterprise A, enterprise B, enterprise C
Kinesis batch put every second
K
![Page 65: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/65.jpg)
Use-Case : Message Scoring
65
enterprise A, enterprise B, enterprise C
K
An ASG of scorers is scaled up to one process per core per Kinesis shard
Scorers
ASG
![Page 66: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/66.jpg)
Use-Case : Message Scoring
66
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Scorers apply the trust model and send scored messages downstream
![Page 67: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/67.jpg)
Use-Case : Message Scoring
67
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
An ASG of importers is scaled up to rapidly import messages
DB
![Page 68: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/68.jpg)
Use-Case : Message Scoring
68
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are also consumed by the alerter
DB
K
Alerters
ASG
![Page 69: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/69.jpg)
Use-Case : Message Scoring
69
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are also consumed by the alerter
DB
K
Alerters
ASG
Quarantine Email
![Page 70: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/70.jpg)
70
Stream Processing Architecture

| Component | Role | Details | Pros | Operability Model |
| --- | --- | --- | --- | --- |
| S3 | Data Lake | All data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless |
| Kinesis | Messaging | Streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless |
| Lambda | General Processing | ASG Replacement except for Rails Apps | Scalable, Available, Serverless | Serverless |
| ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed |
| Spark | Data Science Processing | Model Building | | We Operate |
| Airflow | Workflow Engine | Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate |
| DB | Persistence for WebApp | Holds smaller subset of data needed for Web App | Rails + Postgres ‘nuff said | We Operate |
| ES, Elasticache Redis | Persistence for WebApp | Aggregation + Search moved from DB to ES; Model Building queries moved to Elasticache Redis | Faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence) | Managed |
![Page 71: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/71.jpg)
Innovations
NRT Pipeline Architecture
71
![Page 72: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/72.jpg)
Apache Avro
What is Avro?
72
![Page 73: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/73.jpg)
73
What is Avro?
Avro is a self-describing serialization format that supports
primitive data types : int, long, boolean, float, string, bytes, etc…
complex data types : records, arrays, unions, maps, enums, etc…
many language bindings : Java, Scala, Python, Ruby, etc…
![Page 74: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/74.jpg)
74
What is Avro?
Avro is a self-describing serialization format that supports
primitive data types : int, long, boolean, float, string, bytes, etc…
complex data types : records, arrays, unions, maps, enums, etc…
many language bindings : Java, Scala, Python, Ruby, etc…
The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!
![Page 75: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/75.jpg)
Apache Avro
Why is it useful?
75
![Page 76: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/76.jpg)
76
Why is Avro Useful?
Agari is an IoT company!
Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS
Data is sent via Kinesis!
enterprise A, enterprise B, enterprise C
Kinesis
Agari SAAS in AWS
![Page 77: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/77.jpg)
77
Why is Avro Useful?
enterprise A : v1
enterprise B : v2
enterprise C : v3
Kinesis
Agari is an IoT company!
Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS
Data is sent via Kinesis!
At any point in time, customers run different versions of the Agari Sensor
Agari SAAS in AWS
![Page 78: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/78.jpg)
78
Why is Avro Useful?
enterprise A : v1
enterprise B : v2
enterprise C : v3
Kinesis
Agari is an IoT company!
Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS
Data is sent via Kinesis!
At any point in time, customers run different versions of the Agari Sensor
These Sensors might send different format versions of the data!
Agari SAAS in AWS
![Page 79: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/79.jpg)
79
Why is Avro Useful?
enterprise A : v1
enterprise B : v2
enterprise C : v3
Kinesis
Agari SAAS in AWS
v4
Agari is an IoT company!
Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS
Data is sent via Kinesis!
At any point in time, customers run different versions of the Agari Sensor
These Sensors might send different format versions of the data!
![Page 80: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/80.jpg)
80
Why is Avro Useful?
enterprise A : v1
enterprise B : v2
enterprise C : v3
Avro allows Agari to seamlessly handle different IoT data format versions
Agari SAAS in AWS
Kinesis v4
from avro.io import DatumReader
datum_reader = DatumReader(writers_schema=writers_schema,
                           readers_schema=readers_schema)
Requirements:
• Schemas are backward-compatible
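What the reader/writer schema pair buys you can be illustrated without the Avro library. The sketch below is a simplified stand-in for Avro schema resolution on flat records, not Avro's implementation: fields the reader expects but the writer did not send are filled from the reader's default, and fields the reader does not know are dropped. The function name and the sample fields are hypothetical.

```python
def resolve(record: dict, writer_fields: list, reader_fields: list) -> dict:
    """Project a record written with writer_fields onto reader_fields.

    Each field is a dict such as {"name": "x"} or
    {"name": "x", "default": None}.
    """
    written = {field["name"] for field in writer_fields}
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            resolved[name] = record[name]        # present in both schemas
        elif "default" in field:
            resolved[name] = field["default"]    # added later, has a default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # writer-only fields are silently dropped


# Hypothetical sensor v1 (writer) and SAAS v4 (reader) field lists.
v1_fields = [{"name": "name"}, {"name": "legacy_flag"}]
v4_fields = [{"name": "name"}, {"name": "favorite_color", "default": None}]

print(resolve({"name": "Ada", "legacy_flag": True}, v1_fields, v4_fields))
```

The ValueError branch is why backward compatibility is listed as a requirement: a field added without a default cannot be filled in for data written by older sensors.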
![Page 81: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/81.jpg)
81
Why is Avro Useful?
Agari SAAS in AWS
S1 S2 S3
s3 Spark
Avro Everywhere!
Avro is so useful, we don’t just use it to communicate between our Sensors & our SAAS infrastructure
We also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment
![Page 82: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/82.jpg)
82
Why is Avro Useful?
Agari SAAS in AWS
S1 S2 S3
s3 Spark
Avro Everywhere!
Good Language Bindings :
Data Pipelines services are written in Java, Ruby, & Python
![Page 83: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/83.jpg)
Apache Avro
By Example
83
![Page 84: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/84.jpg)
84
Avro Schema Example
{
  "namespace": "agari",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
![Page 85: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/85.jpg)
85
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
complex type (record)
Avro Schema Example
![Page 86: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/86.jpg)
86
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
complex type (record)
Schema name : User
Avro Schema Example
![Page 87: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/87.jpg)
87
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
complex type (record)
Schema name : User
3 fields in the record: 1 required, 2 optional
Avro Schema Example
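The "1 required, 2 optional" reading of the schema can be reproduced with the standard json module. In this sketch a field counts as optional when its type is a union that includes "null", which is how the slide uses the term. The function name is hypothetical.

```python
import json

SCHEMA_JSON = """
{"namespace": "agari", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
"""


def split_fields(schema: dict):
    """Return (required, optional) field names for a record schema.

    A field is treated as optional when its type is a union (a JSON list)
    that contains "null".
    """
    required, optional = [], []
    for field in schema["fields"]:
        field_type = field["type"]
        nullable = isinstance(field_type, list) and "null" in field_type
        (optional if nullable else required).append(field["name"])
    return required, optional


required, optional = split_fields(json.loads(SCHEMA_JSON))
print(required, optional)
```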
![Page 88: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/88.jpg)
88
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Avro Schema Data File Example
Schema (0.0001 %) + Data x 1,000,000,000 (99.999 %)
![Page 89: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/89.jpg)
89
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Avro Schema Streaming Example
Binary Data block: Schema (99 %) + Data (1 %)
![Page 90: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/90.jpg)
90
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Avro Schema Streaming Example
Binary Data block: Schema (99 %) + Data (1 %)
OVERHEAD!!
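The overhead contrast between a data file and a single streamed message comes down to how many records share one copy of the schema. The sketch below computes that share. The byte sizes are hypothetical and are not meant to reproduce the exact percentages on the slides.

```python
def schema_share(schema_bytes: int, record_bytes: int, n_records: int) -> float:
    """Fraction of the payload taken up by the schema when one copy of the
    schema is shipped alongside n_records records."""
    total = schema_bytes + record_bytes * n_records
    return schema_bytes / total


# Hypothetical sizes: a 200-byte schema and 20-byte encoded records.
file_share = schema_share(200, 20, 1_000_000_000)  # schema written once per file
message_share = schema_share(200, 20, 1)           # schema sent with every message

print(f"file: {file_share:.10f}, message: {message_share:.3f}")
```

In a large file the schema is a rounding error. In a one-record message it dominates the payload, which is the overhead the schema registry on the following slides removes.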
![Page 91: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/91.jpg)
Apache Avro
Schema Registry
91
![Page 92: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/92.jpg)
92
Schema Registry
(Lambda)
Avro Schema Registry
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
register_schema
Message Producer (P)
![Page 93: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/93.jpg)
93
Schema Registry
(Lambda)
register_schema returns a UUID
Message Producer (P)
Avro Schema Registry
![Page 94: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/94.jpg)
94
Schema Registry
(Lambda)
Message Producer sends UUID + Data
Message Producer (P)
Message Consumer (C)
Avro Schema Registry
![Page 95: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/95.jpg)
95
Schema Registry
(Lambda)
Message Producer (P)
Data
Message Consumer (C)
getSchemaById (UUID)
Avro Schema Registry
![Page 96: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/96.jpg)
96
Schema Registry
(Lambda)
Message Producer (P)
Data
Message Consumer (C)
getSchemaById (UUID)
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Avro Schema Registry
![Page 97: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/97.jpg)
97
Schema Registry
(Lambda)
Message Producer (P)
Message Consumer (C)
getSchemaById (UUID)
{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Message Consumers
• download & cache the schema
• then decode the data
Avro Schema Registry
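The register / send / look up / cache flow can be sketched in a few lines. This is an in-memory stand-in, not the Lambda-backed registry from the talk. The class names, the message layout, and the choice to return the same UUID for an identical schema are assumptions, and JSON stands in for Avro binary decoding.

```python
import json
import uuid


class InMemorySchemaRegistry:
    """Hypothetical stand-in for the schema registry service."""

    def __init__(self):
        self._schema_by_id = {}
        self._id_by_schema = {}

    def register_schema(self, schema: dict) -> str:
        """Store the schema and return its UUID (assumed idempotent)."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._id_by_schema:
            schema_id = str(uuid.uuid4())
            self._id_by_schema[key] = schema_id
            self._schema_by_id[schema_id] = schema
        return self._id_by_schema[key]

    def get_schema_by_id(self, schema_id: str) -> dict:
        return self._schema_by_id[schema_id]


class Consumer:
    """Downloads each schema once, caches it, then decodes the data."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}
        self.lookups = 0  # number of registry round trips

    def decode(self, message: dict) -> dict:
        schema_id = message["schema_id"]
        if schema_id not in self._cache:
            self.lookups += 1
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        schema = self._cache[schema_id]
        body = json.loads(message["payload"])  # JSON stands in for Avro binary
        return {field["name"]: body.get(field["name"]) for field in schema["fields"]}


registry = InMemorySchemaRegistry()
schema = {"namespace": "agari", "type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "favorite_number", "type": ["int", "null"]},
                     {"name": "favorite_color", "type": ["string", "null"]}]}
schema_id = registry.register_schema(schema)          # producer side
message = {"schema_id": schema_id,
           "payload": json.dumps({"name": "Ada", "favorite_number": 7})}

consumer = Consumer(registry)
first = consumer.decode(message)
second = consumer.decode(message)
print(first, consumer.lookups)
```

Each message now carries a short UUID instead of the full schema, and the consumer pays for the registry lookup only once per schema version.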
![Page 98: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/98.jpg)
98
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are also consumed by the alerter
DB
K
Alerters
ASG
SR
SR
SR
Avro Schema Registry
![Page 99: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/99.jpg)
![Page 100: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/100.jpg)
Acknowledgments
100
None of this work would be possible without the essential contributions of the team below
• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Chris Buchanan • Neil Chapin • Wil Collins • Don Spencer
• Scot Kennedy • Natia Chachkhiani • Patrick Cockwell • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle • Gabriel Poon • Spencer Sun • Nathan Bryant
![Page 101: Cloud Native Data Pipelines (DataEngConf SF 2017)](https://reader034.fdocuments.us/reader034/viewer/2022051710/5a65d0967f8b9aaf638b4ad1/html5/thumbnails/101.jpg)
Questions? (@r39132)
101