Designing a Real Time Data Ingestion Pipeline
![Page 1: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/1.jpg)
Designing Real-Time Data Ingestion Pipeline
Badar Ahmed
![Page 2: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/2.jpg)
About Us

DataScience Inc.
▪ Data Science as a service
▪ Customers from Sonos to Belkin
▪ Ranked #1 among "Best Places to Work in Los Angeles for 2015"
▪ Visit datascience.com!

Badar Ahmed
▪ Software Engineer
▪ Background in high performance computing & cloud computing
▪ Work across the stack on Big Data problems
![Page 3: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/3.jpg)
Importance of Data Ingestion
▪ Data ingestion is a precursor to any analysis
▪ Characteristics:
✦ Reliability
✦ Correctness
✦ Speed
✦ Scalability
![Page 4: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/4.jpg)
Types of Data Ingestion
▪ Broad topic with many different architectural patterns
▪ Real Time
▪ Batch
▪ Structured Data
▪ Unstructured Data
![Page 5: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/5.jpg)
Ingestion Evolution @ DataScience
▪ Legacy API existed
▪ But ...
✦ Expensive
✦ Ops Heavy
✦ Hard to scale
✦ No batch interface
![Page 6: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/6.jpg)
![Page 7: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/7.jpg)
What was needed
▪ Scalable ingestion system
▪ Batch Ingest
▪ Lower Ops and $$$ Cost
![Page 8: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/8.jpg)
Idea #1
▪ Asynchronous API
▪ Queue requests and process them later

Pros:
▪ Fast
▪ Scalable
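
The queue-and-process-later idea can be sketched roughly as follows. This is only an illustration, not the design from the talk: the in-memory `pending` queue, the `status` dictionary, the `datastore` list, and the `submit`/`worker`/`run_demo` names are all hypothetical stand-ins for a real message queue and datastore. The caller gets a tracking id back immediately and the write happens later in a background worker.

```python
import queue
import threading
import uuid

# Hypothetical in-memory stand-ins for a real message queue and datastore.
pending = queue.Queue()
status = {}      # request_id -> "queued" or "done"
datastore = []   # records that have actually been written


def submit(record):
    """Accept a record, enqueue it, and return at once with a tracking id."""
    request_id = str(uuid.uuid4())
    status[request_id] = "queued"
    pending.put((request_id, record))
    return request_id


def worker():
    """Drain the queue in the background and write each record."""
    while True:
        item = pending.get()
        if item is None:          # sentinel value: stop the worker
            pending.task_done()
            break
        request_id, record = item
        datastore.append(record)  # the deferred write
        status[request_id] = "done"
        pending.task_done()


def run_demo():
    """Submit three records, wait for the worker to finish, return their ids."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ids = [submit({"event": i}) for i in range(3)]
    pending.join()     # block until every queued request has been processed
    pending.put(None)  # ask the worker to exit
    thread.join()
    return ids
```

Note that the caller has to hold on to the returned id and look it up in `status` later to learn whether the write succeeded.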
![Page 9: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/9.jpg)
![Page 10: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/10.jpg)
Issues with Idea #1
▪ Failure introduces complexity
▪ Decoupled systems can be more difficult to debug
▪ Poorer UX if users need to keep track of async requests
▪ A lot of deviation from the simpler API model of ConnectHQ
![Page 11: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/11.jpg)
Idea #2
▪ Synchronous batch, so ...
✦ UX remains the same
▪ Use concurrency to do parallel writes to the datastore
✦ Caveat: Concurrent code is difficult to write & debug
![Page 12: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/12.jpg)
First Step: Prototype
![Page 13: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/13.jpg)
First Step: Prototype
![Page 14: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/14.jpg)
![Page 15: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/15.jpg)
![Page 16: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/16.jpg)
Integration Testing
![Page 17: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/17.jpg)
Integration Testing
![Page 18: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/18.jpg)
Unit Testing with Mocks
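
The slide content is not in the transcript, so the following is only a generic sketch of what unit testing with mocks can look like for an ingestion function. The `ingest` function and its `datastore.write` interface are hypothetical; `unittest.mock.Mock` replaces the datastore so the tests run without any external service.

```python
import unittest
from unittest import mock


def ingest(records, datastore):
    """Hypothetical function under test: write valid records, return the count."""
    written = 0
    for record in records:
        if "id" not in record:
            continue  # skip records without an id
        datastore.write(record)
        written += 1
    return written


class IngestTest(unittest.TestCase):
    def test_writes_only_valid_records(self):
        datastore = mock.Mock()
        count = ingest([{"id": 1}, {"name": "no id"}, {"id": 2}], datastore)
        self.assertEqual(count, 2)
        self.assertEqual(datastore.write.call_count, 2)
        datastore.write.assert_has_calls(
            [mock.call({"id": 1}), mock.call({"id": 2})]
        )

    def test_datastore_failure_propagates(self):
        datastore = mock.Mock()
        # Simulate an outage without touching a real datastore.
        datastore.write.side_effect = ConnectionError("datastore down")
        with self.assertRaises(ConnectionError):
            ingest([{"id": 1}], datastore)
```

Saved in a test module, these cases can be run with `python -m unittest`.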
![Page 19: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/19.jpg)
More Testing
![Page 20: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/20.jpg)
More Testing
![Page 21: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/21.jpg)
Test & Refactor Cycle
![Page 22: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/22.jpg)
Test & Refactor Cycle
![Page 23: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/23.jpg)
![Page 24: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/24.jpg)
Questions?
![Page 25: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/25.jpg)
Thank you.
![Page 26: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/26.jpg)
Development
![Page 27: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/27.jpg)
Operations & Monitoring
![Page 28: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/28.jpg)
Operations & Monitoring
![Page 29: Designing a Real Time Data Ingestion Pipeline](https://reader036.fdocuments.us/reader036/viewer/2022081520/58cf29b61a28ab00168b49fb/html5/thumbnails/29.jpg)
Batch Data Loading