Lambda-less stream processing - linked in

Post on 16-Apr-2017



Lambda-less Stream Processing @Scale in LinkedIn

Yi Pan (Apache Samza PMC/Committer)
Kartik Paramasivam (Mgr - Streams Infra)

June, 2016

Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
– Data Accuracy
– Reprocessing
• Conclusion

Newsfeed

Cyber-security

Internet of Things


Data Accuracy
• Can Stream Processing generate accurate results?
– Yes, but it is not trivial.

Case Study

[Figure, built up over three slides: Ads HTML emits AdViewEvents (1:00pm) and, on Click!, AdClickEvents (1:01pm) to the AdQuality processor.]

Did AdClick happen within 2min of AdView?
• Yes: Good Ad
• No: Bad Ad
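The decision rule in this case study can be sketched in a few lines of Python. This is only an illustration, not the AdQuality processor's actual code: the function name `classify_ad`, the inclusive 2-minute boundary, and treating a missing click (or a click timestamped before the view) as "bad" are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

CLICK_WINDOW = timedelta(minutes=2)  # the 2min threshold from the slide


def classify_ad(view_time: datetime, click_time: Optional[datetime]) -> str:
    """Return "good" if a click followed the view within CLICK_WINDOW, else "bad"."""
    if click_time is None:
        return "bad"  # assumption: no click at all counts as a bad ad
    delta = click_time - view_time
    # assumption: boundary is inclusive, and a click before the view is "bad"
    return "good" if timedelta(0) <= delta <= CLICK_WINDOW else "bad"


view = datetime(2016, 6, 1, 13, 0)                      # AdView at 1:00pm
print(classify_ad(view, datetime(2016, 6, 1, 13, 1)))   # click at 1:01pm
print(classify_ad(view, None))                          # no click seen
```

The rule itself is trivial; the hard part, covered in the following slides, is that the view and click events may reach the processor late or in the wrong order.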

Delays in Event Stream

[Figure: DATACENTER 1 and DATACENTER 2, each with a Services Tier, Kafka, and an Ad Quality Processor (Samza); Kafka is mirrored between the datacenters. An AdViewEvent arrives through the load balancer (LB).]

Delays in Event Stream: Late Arrival

[Figure: DATACENTER 1 and DATACENTER 2, each with a Services Tier, Kafka, and Real Time Processing (Samza); Kafka is mirrored between the datacenters. An AdClick Event arrives through the load balancer (LB).]

Delays in Event Stream: Out of Order Arrival

[Figure: DATACENTER 1 and DATACENTER 2, each with a Services Tier, Kafka, and Real Time Processing (Samza); Kafka is mirrored between the datacenters. An AdClick Event arrives through the load balancer (LB).]

Lambda at LinkedIn

[Figure: Clients (browser, devices, ...) talk to the Services Tier. Ingestion: Kafka. Processing: Real Time Processing (Samza) and Batch Processing (Hadoop/Spark). Serving: Espresso and Voldemort R/O (bulk upload).]

Data Accuracy - with Lambda
• Basic assumption: batch jobs have the full data set
• But how about edges?
• Smaller batch size == more edges!

[Figure: batches along a system-time axis (10:00, 11:00, 12:00, 13:00). Graph kudos to Stream Processing 101 from Tyler Akidau (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)]
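The "edges" problem can be made concrete with a small sketch. It assumes hourly batches keyed by event minute and at most one view and one click per ad in each batch; the function name and the sample data are hypothetical. A view at 10:59 and its click at 11:00 land in different batches, so a per-batch join never pairs them.

```python
from collections import defaultdict


def hourly_batch_join(views, clicks):
    """Join views and clicks inside each hourly batch.

    views / clicks are lists of (ad_id, minute_of_day). Returns the set of
    ad ids whose click was found within 2 minutes of the view.
    """
    batches = defaultdict(lambda: {"views": {}, "clicks": {}})
    for ad_id, minute in views:
        batches[minute // 60]["views"][ad_id] = minute
    for ad_id, minute in clicks:
        batches[minute // 60]["clicks"][ad_id] = minute

    matched = set()
    for batch in batches.values():
        for ad_id, view_minute in batch["views"].items():
            click_minute = batch["clicks"].get(ad_id)
            if click_minute is not None and 0 <= click_minute - view_minute <= 2:
                matched.add(ad_id)
    return matched


views = [("ad-1", 10 * 60 + 30), ("ad-2", 10 * 60 + 59)]
clicks = [("ad-1", 10 * 60 + 31), ("ad-2", 11 * 60)]
matched = hourly_batch_join(views, clicks)
print(matched)  # ad-2 is missing: its view and click straddle the 11:00 edge
```

Shrinking the batch from an hour to a few minutes would multiply the number of such boundaries, which is the point of "smaller batch size == more edges".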

Fixing Lambda

[Figure: Clients (browser, devices, ...) talk to the Services Tier. Ingestion: Kafka. Processing: Real Time Processing (Samza) and Batch Processing (Hadoop/Spark). Serving: Espresso and Voldemort R/O (bulk upload). Added: Kafka Audit, labeled "Check Safe Start Time".]

Observation
• Data accuracy is still very hard with Lambda
– An additional system (e.g. Kafka Audit) has to be used to safely start the batch jobs
• Duplication across the online and offline systems:
– Development cost
– Operational overhead
– Maintenance overhead

Going Lambda-less
• Handle late arrivals and out-of-order arrivals
• Eventually correct results
– Compute results at the end of the ‘window’.
– Re-compute when events arrive late
• Influenced by “Google MillWheel”
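A minimal sketch of "compute at end of window, re-compute on late arrival" follows. It is not Samza's window operator: the class name `RecomputingWindow`, the event-time-aligned 2-minute windows, the explicit `on_time` tick, and the use of an event count as the "result" are all assumptions chosen to keep the example small.

```python
from collections import defaultdict

WINDOW_MINUTES = 2


class RecomputingWindow:
    def __init__(self):
        self.events = defaultdict(list)  # window start -> buffered events
        self.closed = set()              # windows whose first result was emitted
        self.outputs = []                # (window_start, count, is_correction)

    def window_start(self, event_minute):
        return event_minute - event_minute % WINDOW_MINUTES

    def on_event(self, event_minute, payload):
        start = self.window_start(event_minute)
        self.events[start].append(payload)
        if start in self.closed:
            # Late arrival: the window was already emitted, so re-compute it
            # and emit a corrected result.
            self.outputs.append((start, len(self.events[start]), True))

    def on_time(self, now_minute):
        # Emit the first result for every window whose end has passed.
        for start in sorted(self.events):
            if start not in self.closed and now_minute >= start + WINDOW_MINUTES:
                self.closed.add(start)
                self.outputs.append((start, len(self.events[start]), False))


w = RecomputingWindow()
w.on_event(780, "view-a")     # 1:00pm, minute 780 of the day
w.on_event(781, "view-b")     # 1:01pm
w.on_time(782)                # 1:02pm: window(1:00pm, 2min) is emitted
w.on_event(781, "view-late")  # a 1:01pm event shows up afterwards
print(w.outputs)
```

Downstream consumers see the first result and then a correction for the same window, which is what "eventually correct" means here. Keeping the buffered events around after the window closes is what makes the re-computation possible.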

Going Lambda-less

[Figure: AdViewEvent and AdClickEvent streams, the AdQuality processor, and Kafka on either side. Timeline labels: 1:00pm, 1:01pm, 1:01pm, 1:02pm, 1:02pm; 1:00pm, 1:02pm.]

window(“1:00pm”, “2min”)

Window output is computed at the end of the window (2min after the window is created)

Handling ‘late arrival’

[Figure: AdViewEvent and AdClickEvent streams, the AdQuality processor, and Kafka on either side. Timeline labels: 1:00pm, 1:01pm, 1:01pm, 1:02pm, 1:02pm; 1:00pm, 1:02pm. A 1:01pm event is marked "Late-arrival".]

Re-compute window(“1:00pm”, “2min”)

Handling ‘out of order arrival’

[Figure: AdViewEvent and AdClickEvent streams, the AdQuality processor, and Kafka on either side. Timeline labels: 1:01pm, 1:02pm; 1:00pm, 1:02pm.]

null join result in window(“1:00pm”, “2min”)

Handling ‘out of order arrival’

[Figure: AdViewEvent and AdClickEvent streams, the AdQuality processor, and Kafka on either side. Timeline labels: 1:01pm, 1:02pm, 1:00pm, 1:01pm; 1:00pm, 1:02pm. An event is marked "Out-of-order arrival".]

Re-compute window(“1:00pm”, “2min”)
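The out-of-order case can be sketched as a two-sided join that remembers both streams. This is an illustrative toy, not the Samza implementation: the class `AdQualityJoin`, its method names, and the single explicit `on_window_end` call are assumptions. When the click is seen before its view, the window first yields a null join result; once the view arrives, the join is re-computed.

```python
class AdQualityJoin:
    def __init__(self):
        self.views = {}       # ad_id -> view minute
        self.clicks = {}      # ad_id -> click minute
        self.emitted = set()  # ad ids that already produced a result
        self.results = []     # (ad_id, verdict); verdict None = null join

    def _emit(self, ad_id):
        view = self.views.get(ad_id)
        click = self.clicks.get(ad_id)
        if view is None:
            verdict = None  # click without a matching view (yet)
        elif click is not None and 0 <= click - view <= 2:
            verdict = "good"
        else:
            verdict = "bad"
        self.results.append((ad_id, verdict))
        self.emitted.add(ad_id)

    def on_view(self, ad_id, minute):
        self.views[ad_id] = minute
        if ad_id in self.emitted:  # arrived after the window output
            self._emit(ad_id)

    def on_click(self, ad_id, minute):
        self.clicks[ad_id] = minute
        if ad_id in self.emitted:
            self._emit(ad_id)

    def on_window_end(self):
        for ad_id in sorted(set(self.views) | set(self.clicks)):
            if ad_id not in self.emitted:
                self._emit(ad_id)


join = AdQualityJoin()
join.on_click("ad-1", 781)  # the 1:01pm click arrives first
join.on_window_end()        # null join result for the window
join.on_view("ad-1", 780)   # the 1:00pm view arrives out of order
print(join.results)
```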

Samza based Solution

[Figure: Kafka topics AdClicks and AdView; a Samza Job with SamzaContainer-0 and SamzaContainer-1 running Task1, Task2, Task3; Kafka; Changelog in Kafka.]

Events are saved into a RocksDB-based local message store, which is backed up durably in Kafka
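The idea of a local store backed by a changelog can be sketched as follows. This is a toy model, not Samza's API: a Python dict stands in for RocksDB, a list stands in for the Kafka changelog topic, and the class and method names are hypothetical. Every write goes to the local store and is appended to the changelog; after a failure, the local state is rebuilt by replaying the changelog.

```python
class ChangelogBackedStore:
    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a Kafka changelog topic
        self.local = {}             # stand-in for the local RocksDB store

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.local.pop(key, None)
        self.changelog.append((key, None))  # None acts as a tombstone

    def get(self, key):
        return self.local.get(key)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog from the start."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store


log = []
store = ChangelogBackedStore(log)
store.put("view:ad-1", 780)
store.put("view:ad-2", 781)
store.delete("view:ad-1")

# Simulate losing the machine: only the changelog survives.
recovered = ChangelogBackedStore.restore(log)
print(recovered.get("view:ad-2"), recovered.get("view:ad-1"))
```

Reads and writes stay local and fast, while durability comes from the changelog rather than from a remote database on the hot path.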

Performance

[Figure: Kafka topics AdClicks and AdView; a Samza Job with SamzaContainer-0 and SamzaContainer-1 running Task1, Task2, Task3; Kafka; Changelog in Kafka.]

Performance of Samza’s local RocksDB store:
- 1.1 Million TPS (read/write) on a single machine (SSD)
- Largest production job has 1.5 Terabyte of local state


Reprocessing
• What is reprocessing?
– Process events that happened in the past.

Case Study: Title Standardization

[Figure: a LinkedIn Profile changes ‘Title’ (Before: Architect; After: Chief Technology Nerd). The change goes through the Title Standardizer, whose output is used by Search, Ads, ….]

Title Standardizer - Implementation

[Figure components: Profile Updates, Member Database (espresso), Databus, (Samza) Title-Standardizer with a Machine Learning model, Kafka, output.]

Reprocessing - dealing with bugs

[Figure components: Profile Updates, Member Database (espresso), Databus, (Samza) Title-Standardizer with a Machine Learning model, Kafka, output. Label: rewind 4 hours.]
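Rewinding a job by a fixed duration comes down to translating "now minus 4 hours" into a position in the input stream. The sketch below shows that lookup over an in-memory list of message timestamps; it is not how Samza, Databus, or Kafka expose it, and `rewind_offset` and the sample timestamps are hypothetical. It assumes timestamps are non-decreasing with offset.

```python
from bisect import bisect_left

HOUR_MS = 60 * 60 * 1000


def rewind_offset(timestamps_ms, now_ms, rewind_ms):
    """Return the first offset whose timestamp is >= now_ms - rewind_ms.

    timestamps_ms[i] is the timestamp of the message at offset i and must be
    sorted in non-decreasing order.
    """
    target = now_ms - rewind_ms
    return bisect_left(timestamps_ms, target)


# One message per hour, offsets 0..10 (hypothetical data).
timestamps = [hour * HOUR_MS for hour in range(11)]
start = rewind_offset(timestamps, now_ms=10 * HOUR_MS, rewind_ms=4 * HOUR_MS)
print(start)  # the job would resume from this offset after the bug fix
```

Because the job re-reads events it has already processed, its output for that range needs to be safe to produce twice, for example by overwriting results keyed by member.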

Reprocessing - entire Dataset

[Figure components: Profile Updates, Member Database (espresso), Databus, (Samza) Title-Standardizer with a Machine Learning model (NEW), Kafka, output. Added: Backup to Database Backup (NFS), Bootstrap, and the label "set offset=0".]

Reprocessing - entire Dataset

[Figure components: Profile Updates, Member Database (espresso), Database Backup (NFS), Backup, Bootstrap, two Databus instances, two (Samza) Title-Standardizer instances, Machine Learning model (NEW), Kafka, output. Label: set offset=0.]

Reprocessing - entire Dataset

[Figure components: Profile Updates, Backup, Bootstrap, two Databus instances, two (Samza) Title-Standardizer instances, Machine Learning model (NEW), Kafka, output, and a (Samza) Merge and Store Results job.]

Reprocessing - Caveats
• Stream processors are fast. They can DoS the system if you reprocess
– Control the max-concurrency of your job
– Quotas for Kafka, databases
• Reprocessing a 100 TB source?
– Capacity?
– Saturation of NICs, top-of-rack switches
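One common way to keep a reprocessing job from overwhelming downstream systems is a token-bucket throttle in front of the calls it makes. The sketch below is a generic illustration, not a Samza or Kafka quota feature; the class `TokenBucket` and its parameters are assumptions, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec  # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = now

    def try_acquire(self, now, n=1):
        """Refill based on elapsed time, then take n tokens if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, capacity=2)
decisions = [
    bucket.try_acquire(now=0.0),  # burst: allowed
    bucket.try_acquire(now=0.0),  # burst: allowed
    bucket.try_acquire(now=0.0),  # bucket empty: rejected
    bucket.try_acquire(now=0.5),  # one token refilled after 0.5s: allowed
    bucket.try_acquire(now=0.5),  # empty again: rejected
]
print(decisions)
```

A rejected call would typically make the job wait and retry, so the reprocessing run takes longer but the database or Kafka cluster it writes to stays within its quota.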

Reprocessing larger datasets

[Figure components: Profile Updates, Databus, Kafka, (Samza) Title-Standardizer with a Machine Learning model, output, and a (Samza) Merge and Store Results job. On Hadoop: Database Dump in HDFS, ML Model in HDFS, and a second (Samza) Title-Standardizer.]

Experimentation

[Figure components, on Hadoop: Database Dump in HDFS, ML Model in HDFS, (Samza) Title-Standardizer, Output in HDFS.]

● Offline experimentation before pushing the logic online
○ Most datasets are already available in Hadoop (at LinkedIn)
○ Fast iteration with minimum impact to production

Conclusion
1. It is possible to avoid code duplication (hot/cold path) to support
– Accuracy
– Reprocessing
2. Some Lambda-related problems still linger when reprocessing entire datasets
– e.g. merging online/reprocessing results

References
• MillWheel: http://research.google.com/pubs/pub41378.html
• DataFlow: http://research.google.com/pubs/pub41378.html
• Samza: http://samza.apache.org/
• Window Operator in Samza: https://issues.apache.org/jira/browse/SAMZA-552
• Lambda Architecture: https://www.manning.com/books/big-data
• Stream Processing 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• Stream Processing 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Thank You!