Lambda-less Stream Processing @Scale in LinkedIn
Yi Pan (Apache Samza PMC/Committer)
Kartik Paramasivam (Mgr - Streams Infra)
June, 2016
Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
  – Data Accuracy
  – Reprocessing
• Conclusion
Newsfeed
Cyber-security
Internet of Things
Data Accuracy
• Can Stream Processing generate accurate results?
  – Yes, but it is not trivial.
Case Study
[Figure: an ad rendered in HTML at 1:00pm emits AdViewEvents; a click at 1:01pm emits AdClickEvents. Both event streams feed the AdQuality processor.]
AdQuality processor: did the AdClick happen within 2min of the AdView?
• Yes: Good Ad
• No: Bad Ad
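The decision in the case study can be sketched as below. This is a minimal illustration, not the production AdQuality processor: the function name `classify_ad`, the treatment of a missing click as a bad ad, and the inclusive 2-minute boundary are all assumptions.

```python
from datetime import datetime, timedelta

CLICK_WINDOW = timedelta(minutes=2)  # the 2min threshold from the slide


def classify_ad(view_time, click_time):
    """Return "good" if the click happened within CLICK_WINDOW after the view.

    A missing click (None) is treated as "bad"; that choice is an assumption.
    """
    if click_time is None:
        return "bad"
    delay = click_time - view_time
    return "good" if timedelta(0) <= delay <= CLICK_WINDOW else "bad"


view = datetime(2016, 6, 1, 13, 0)                       # AdView at 1:00pm
print(classify_ad(view, datetime(2016, 6, 1, 13, 1)))    # click at 1:01pm
print(classify_ad(view, datetime(2016, 6, 1, 13, 5)))    # click at 1:05pm
print(classify_ad(view, None))                           # no click at all
```

The rule itself is simple; the difficulty discussed in the following slides is that the two events may not reach the processor together or in order.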
Delays in Event Stream
[Figure: DATACENTER 1 and DATACENTER 2 each contain a Services Tier, Kafka, and an Ad Quality Processor (Samza); Kafka is mirrored between the two datacenters. The AdViewEvent and the AdClick event enter through the load balancer (LB).]
• Late Arrival
• Out of Order Arrival
Lambda at LinkedIn
[Figure: Clients (browser, devices, ...) talk to the Services Tier. Ingestion: Kafka. Processing: Real Time Processing (Samza) and Batch Processing (Hadoop/Spark). Serving: Voldemort R/O (bulk upload) and Espresso.]
Data Accuracy - with Lambda
• Basic Assumption: Batch jobs have the full data-set
• But, how about edges?
• Smaller batch size == more edges!
[Figure: system-time axis marked 10:00, 11:00, 12:00, 13:00. Graph kudos to Stream Processing 101 from Tyler Akidau (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)]
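The edge problem can be illustrated with a toy hourly batch join. This is a sketch under assumptions: times are minutes since midnight, each batch sees only the events that fall in its own hour, and the names `hour_bucket` and `batch_join` are made up for the example.

```python
from collections import defaultdict


def hour_bucket(minute_of_day):
    """Map a time (minutes since midnight) to its hourly batch."""
    return minute_of_day // 60


def batch_join(views, clicks, max_gap_min=2):
    """Join clicks to views batch by batch; a batch sees only its own hour."""
    views_by_batch = defaultdict(dict)
    for ad_id, t in views:
        views_by_batch[hour_bucket(t)][ad_id] = t
    matched = set()
    for ad_id, t in clicks:
        view_t = views_by_batch[hour_bucket(t)].get(ad_id)
        if view_t is not None and 0 <= t - view_t <= max_gap_min:
            matched.add(ad_id)
    return matched


views = [("ad-1", 10 * 60 + 30), ("ad-2", 10 * 60 + 59)]
clicks = [("ad-1", 10 * 60 + 31), ("ad-2", 11 * 60 + 0)]
# ad-2 was clicked one minute after its view, but the pair straddles 11:00,
# so the hourly batches never see the two events together.
print(batch_join(views, clicks))  # {'ad-1'}
```

Every batch boundary is a place where such a pair can be split, which is why smaller batches mean more edges.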
Fixing Lambda
[Figure: the same components as in "Lambda at LinkedIn" (Clients, Services Tier, Kafka, Real Time Processing (Samza), Batch Processing (Hadoop/Spark), Voldemort R/O with bulk upload, Espresso), plus a Kafka Audit component that is used to check a safe start time.]
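A safe-start check might look roughly like the sketch below. The slides do not describe how Kafka Audit is queried, so everything here is an assumption: the per-time-bucket message counts, the `produced` and `landed` dictionaries, and the function `safe_to_start` are hypothetical stand-ins.

```python
def safe_to_start(produced, landed, buckets, min_completeness=1.0):
    """Return True only if every time bucket the batch needs is complete.

    produced / landed: hypothetical audit counts per time bucket, i.e. how
    many messages were sent to Kafka and how many reached the batch system.
    """
    for bucket in buckets:
        sent = produced.get(bucket, 0)
        got = landed.get(bucket, 0)
        if got < sent * min_completeness:
            return False  # some events for this bucket are still in flight
    return True


produced = {"10:00": 1000, "11:00": 1200}
landed = {"10:00": 1000, "11:00": 1150}
print(safe_to_start(produced, landed, ["10:00"]))            # True
print(safe_to_start(produced, landed, ["10:00", "11:00"]))   # False
```

The point of the sketch is only that the batch job needs an extra system to tell it when its input is complete.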
Observation
• Data Accuracy is still very hard with Lambda
  – An additional system (e.g. Kafka Audit) has to be used to safely start the batch jobs
• Duplication in Online/Offline system:
  – Development cost
  – Operational overhead
  – Maintenance overhead
Going Lambda-less
• Handle late arrivals and out of order arrivals
• Eventually correct results
  – Compute results at the end of the ‘window’.
  – Re-compute when events arrive late
• Influenced by “Google MillWheel”
Going Lambda-less
[Figure: AdViewEvent and AdClickEvent streams, each read from Kafka, feed the AdQuality processor. Events timestamped 1:00pm through 1:02pm fall into window(“1:00pm”, “2min”), which spans 1:00pm to 1:02pm.]
Window output is computed at the end of the window (2min after the window is created).
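A window that emits its output only once it has ended can be sketched as follows. This is not Samza's window operator; the `Window` class, its method names, and the use of minutes since midnight for time are assumptions for illustration.

```python
class Window:
    """A window identified by its creation time and length (in minutes)."""

    def __init__(self, start_min, length_min):
        self.start = start_min
        self.end = start_min + length_min
        self.views = {}   # ad_id -> view time
        self.clicks = {}  # ad_id -> click time

    def add_view(self, ad_id, t):
        self.views[ad_id] = t

    def add_click(self, ad_id, t):
        self.clicks[ad_id] = t

    def output(self, now_min):
        """Return {ad_id: "good" | "bad"} once the window has ended, else None."""
        if now_min < self.end:
            return None
        return {ad_id: ("good" if ad_id in self.clicks else "bad")
                for ad_id in self.views}


w = Window(13 * 60, 2)            # window("1:00pm", "2min")
w.add_view("ad-1", 13 * 60)       # AdView at 1:00pm
w.add_click("ad-1", 13 * 60 + 1)  # AdClick at 1:01pm
w.add_view("ad-2", 13 * 60 + 1)   # AdView at 1:01pm, never clicked
print(w.output(13 * 60 + 1))      # None: window still open
print(w.output(13 * 60 + 2))      # {'ad-1': 'good', 'ad-2': 'bad'}
```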
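Handling ‘late arrival’
[Figure: AdViewEvent and AdClickEvent streams from Kafka feed the AdQuality processor. Events timestamped 1:00pm through 1:02pm have been processed for window(“1:00pm”, “2min”); a further event timestamped 1:01pm then arrives late.]
Re-compute window(“1:00pm”, “2min”)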
Handling ‘out of order arrival’
[Figure: AdViewEvent and AdClickEvent streams from Kafka feed the AdQuality processor. Step 1: events timestamped 1:01pm and 1:02pm arrive first, giving a null join result in window(“1:00pm”, “2min”). Step 2: events timestamped 1:00pm and 1:01pm then arrive out of order.]
Re-compute window(“1:00pm”, “2min”)
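Both cases come down to keeping the events of each side and re-computing the join result whenever either side changes. The sketch below assumes a hypothetical `AdQualityJoin` class with times in minutes since midnight; it emits a provisional result on every event rather than modelling the window boundary.

```python
class AdQualityJoin:
    """Keeps views and clicks per ad and re-computes the result on every event."""

    def __init__(self, max_gap_min=2):
        self.max_gap = max_gap_min
        self.views = {}   # ad_id -> view time
        self.clicks = {}  # ad_id -> click time

    def _recompute(self, ad_id):
        view = self.views.get(ad_id)
        click = self.clicks.get(ad_id)
        if view is None:
            return None  # null join result: the view has not arrived yet
        if click is not None and 0 <= click - view <= self.max_gap:
            return "good"
        return "bad"

    def on_view(self, ad_id, t):
        self.views[ad_id] = t
        return self._recompute(ad_id)

    def on_click(self, ad_id, t):
        self.clicks[ad_id] = t
        return self._recompute(ad_id)


job = AdQualityJoin()
# Out-of-order arrival: the 1:01pm click is seen before the 1:00pm view.
print(job.on_click("ad-1", 13 * 60 + 1))  # None
print(job.on_view("ad-1", 13 * 60))       # good
# Late arrival: the result is "bad" until the click shows up.
print(job.on_view("ad-2", 13 * 60))       # bad
print(job.on_click("ad-2", 13 * 60 + 1))  # good
```

Downstream consumers therefore see results that are eventually correct: an earlier output can be superseded by a re-computed one.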
Samza based Solution
[Figure: a Samza Job with SamzaContainer-0 and SamzaContainer-1 running Task1, Task2, and Task3; the AdView and AdClicks inputs come from Kafka, and the job writes a Changelog in Kafka.]
Events are saved into a RocksDB-based local message store, which is backed up durably in Kafka.
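The idea of a local store backed by a changelog can be sketched as below. This is a toy model, not Samza's API: a Python dict stands in for RocksDB, a list stands in for the Kafka changelog topic, and `ChangelogBackedStore` is a hypothetical name.

```python
class ChangelogBackedStore:
    """Local key-value store whose writes are also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a Kafka changelog topic
        self.data = {}              # stand-in for the local RocksDB store

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))

    def get(self, key):
        return self.data.get(key)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog; last write wins."""
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value
        return store


log = []
store = ChangelogBackedStore(log)
store.put("ad-1:view", 780)
store.put("ad-1:click", 781)
store.put("ad-1:view", 779)   # overwrite

# Simulate losing the machine: only the changelog survives.
recovered = ChangelogBackedStore.restore(log)
print(recovered.get("ad-1:view"), recovered.get("ad-1:click"))  # 779 781
```

Reads and writes stay local to the task, while the changelog makes the state recoverable on another machine.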
Performance
[Figure: same Samza Job layout: SamzaContainer-0 and SamzaContainer-1 running Task1, Task2, and Task3, AdView and AdClicks inputs from Kafka, Changelog in Kafka.]
Performance of Samza’s local RocksDB store:
- 1.1 Million TPS (read/write) on a single machine (SSD)
- Largest production job has 1.5 Terabyte of local state
Reprocessing
• What is reprocessing?
  – Process events that happened in the past.
Case Study: Title Standardization
[Figure: a member changes ‘Title’ on their LinkedIn Profile (Before: Architect; After: Chief Technology Nerd). The Title Standardizer sits between the profile and its consumers: Search, Ads, ...]
Title Standardizer - Implementation
[Figure components: Member Database (Espresso), Profile Updates, Databus, Kafka, (Samza) Title-Standardizer, Machine Learning model, output.]
Reprocessing - dealing with bugs
[Figure components: Member Database (Espresso), Profile Updates, Databus, Kafka, (Samza) Title-Standardizer, Machine Learning model, output. The input is rewound by 4 hours ("rewind 4 hours").]
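Rewinding by a fixed amount of time amounts to finding the first offset at or after a target timestamp and replaying from there. The sketch below models the input as a list of per-offset timestamps in seconds; the helper names `offset_for_time` and `rewind` are made up, and how the actual job locates the offset is not stated in the slides.

```python
from bisect import bisect_left


def offset_for_time(timestamps, target):
    """Return the first offset whose timestamp is >= target.

    timestamps: per-offset message timestamps, sorted ascending (an assumption).
    """
    return bisect_left(timestamps, target)


def rewind(timestamps, now, hours):
    """Offset to restart from in order to reprocess the last `hours` hours."""
    return offset_for_time(timestamps, now - hours * 3600)


# Toy log: one profile update per hour, offsets 0..9.
timestamps = [h * 3600 for h in range(10)]
now = 9 * 3600
start = rewind(timestamps, now, 4)
print(start)                   # 5
print(len(timestamps[start:])) # 5 messages will be replayed
```

After the bug is fixed, the job restarts from `start` and overwrites the output it produced during the faulty period.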
Reprocessing - entire Dataset
[Figure, shown in three build steps.
Step 1: Member Database (Espresso), Profile Updates, Databus, Kafka, (Samza) Title-Standardizer with Machine Learning model (NEW), output; a Database Backup (NFS) feeds a Bootstrap path, with "set offset=0".
Step 2: a second (Samza) Title-Standardizer and a second Databus are added, so that one instance handles Profile Updates and the other handles the Bootstrap from Backup.
Step 3: a (Samza) "Merge and Store Results" job is added after the two Title-Standardizer instances.]
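The slides do not say how the merge job resolves conflicts between online and reprocessed results. One plausible rule, shown here purely as an assumption, is to tag each result with the version of the profile it was computed from and keep the result from the newest version, so that a bootstrap result never overwrites a fresher online one. The names `merge_result` and `profile_version` are hypothetical.

```python
def merge_result(store, member_id, title, profile_version):
    """Store the result unless a result from a newer profile version exists.

    store: dict mapping member_id -> (standardized_title, profile_version).
    Returns True if the store was updated.
    """
    current = store.get(member_id)
    if current is None or profile_version >= current[1]:
        store[member_id] = (title, profile_version)
        return True
    return False


results = {}
# The online path has already processed a recent profile update (version 5).
merge_result(results, "member-1", "Chief Technology Officer", 5)
# The bootstrap path later emits a result computed from the backup (version 3).
applied = merge_result(results, "member-1", "Architect", 3)
print(applied)               # False: the stale bootstrap result is dropped
print(results["member-1"])   # ('Chief Technology Officer', 5)
```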
Reprocessing - Caveats
• Stream processors are fast. They can DoS the system if you reprocess
  – Control max-concurrency of your job
  – Quotas for Kafka, Databases
• Reprocessing a 100 TB source?
  – Capacity?
  – Saturation of NICs, top-of-rack switches
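One common way to keep a reprocessing job from overwhelming a downstream system is a token-bucket throttle. The slides only mention concurrency limits and quotas, so this technique is a substitute chosen for illustration; the `TokenBucket` class is hypothetical and takes the current time as an argument so the example is deterministic.

```python
class TokenBucket:
    """Allows at most `rate_per_sec` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def try_acquire(self, now):
        """Refill according to elapsed time, then take one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, capacity=2)
print([bucket.try_acquire(0.0) for _ in range(3)])  # [True, True, False]
print(bucket.try_acquire(0.5))                      # True: one token refilled
```

A reprocessing task would call `try_acquire` before each database or Kafka request and back off when it returns False.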
Reprocessing larger datasets
[Figure. Online path: Profile Updates, Databus, Kafka, (Samza) Title-Standardizer with Machine Learning model, output. Reprocessing path on Hadoop: Database Dump in HDFS, ML Model in HDFS, (Samza) Title-Standardizer. Both paths lead to (Samza) Merge and Store Results.]
Experimentation
[Figure: on Hadoop, the (Samza) Title-Standardizer works with the Database Dump in HDFS, the ML Model in HDFS, and Output in HDFS.]
● Offline experimentation before pushing the logic online
  ○ Most datasets are already available in Hadoop (at LinkedIn)
  ○ Fast iteration with minimum impact to production
Conclusion
1. It is possible to avoid code duplication (hot/cold path) to support
  – Accuracy
  – Reprocessing
2. Some Lambda-related problems still linger when reprocessing entire datasets
  – e.g. merging online/reprocessing results
References
• MillWheel: http://research.google.com/pubs/pub41378.html
• DataFlow: http://research.google.com/pubs/pub41378.html
• Samza: http://samza.apache.org/
• Window Operator in Samza: https://issues.apache.org/jira/browse/SAMZA-552
• Lambda Architecture: https://www.manning.com/books/big-data
• Stream Processing 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• Stream Processing 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Thank You!