Samza at LinkedIn: Taking Stream Processing to the Next Level

Apache Samza: Taking stream processingto the next levelMartin Kleppmann — @martinkl

Martin KleppmannHacker, designer, inventor, entrepreneur

Co-founded two startups, Rapportive ⇒ LinkedIn Committer on Avro & Samza ⇒ Apache Writing book on data-intensive apps ⇒ O’Reilly martinkl.com | @martinkl

Apache Kafka Apache Samza

Credit: Jason Walsh on Flickrhttps://www.flickr.com/photos/22882695@N00/2477241427/

Credit: Lucas Richarz on Flickrhttps://www.flickr.com/photos/22057420@N00/5690285549/

Things we would like to do(better)

Provide timely, relevant updates to your newsfeed

Update search results with new information as it appears

“Real-time” analysis of logs and metrics

Tools?

Response latency

Kafka & Samza

Milliseconds to minutesLoosely coupled

SynchronousClosely coupled

Hours to days

Loosely coupled

Service 1

Kafkaevents/messages

Analytics Cache maintenance

Notifications

subscribe subscribe subscribe

publish publish

Service 2

Publish / subscribe Event / message = “something happened”– Tracking: User x clicked y at time z– Data change: Key x, old value y, set to new

value z– Logging: Service x threw exception y in request z– Metrics: Machine x had free memory y at time z

Many independent consumers High throughput (millions msgs/sec) Fairly low latency (a few ms)

Kafka at LinkedIn 350+ Kafka brokers 8,000+ topics 140,000+ Partitions

278 Billion messages/day 49 TB/day in 176 TB/day out

Peak Load– 4.4 Million messages per second– 6 Gigabits/sec Inbound– 21 Gigabits/sec Outbound

public interface StreamTask {void process(

IncomingMessageEnvelope envelope,MessageCollector

collector,TaskCoordinator

coordinator);}

Samza API: processing messages

getKey(), getMsg()

commit(), shutdown()sendMsg(topic, key, value)

Familiar ideas from MR/Pig/Cascading/… Filter records matching

condition Map record ⇒ func(record) Join two/more datasets by key Group records with the same

value in field Aggregate records within the same group Pipe job 1’s output ⇒ job 2’s input

MapReduce assumes fixed dataset.Can we adapt this to unbounded streams?

Operations on streams Filter records matching

condition ✔ easy Map record ⇒ func(record) ✔

easy Join two/more datasets by key

…within time window, need buffer

Group records with the same value in field

…within time window, need buffer

Aggregaterecords within the same group✔ ok… when do

you emit result? Pipe job 1’s output ⇒ job 2’s input

✔ ok… but what about faults?

Stateful stream processing(join, group, aggregate)

Joining streams requires state

User goes to lunch ⇒ click long after impression Queue backlog ⇒ click before impression “Window join”

Join and aggregate

Click-through rate

Key-valuestore

Ad impressionsAd clicks

Remote state or local state?

Samza job partition 0

e.g. Cassandra, MongoDB, …

100-500k msg/sec/node 100-500k msg/sec/node

1-5k queries/sec??

Remote state or local state?

LocalLevelDB/RocksDB

Another example: Newsfeed & following User 138 followed user 582 User 463 followed user 536 User 582 posted: “I’m at Berlin Buzzwords and it

rocks” User 507 unfollowed user 115 User 536 posted: “Nice weather today, going for

a walk” User 981 followed user 575

Expected output: “inbox” (newsfeed) for each user

Newsfeed & following

Fan out messages to followersDelivered messages

582 => [ 138, 721, …]

Follow/unfollow eventsPosted messages

User 582 posted: “I’m at BerlinBuzzwords and it rocks” User 138 followed user 582

Notify user 138: {User 582 posted:“I’m at Berlin Buzzwords and it rocks”}Push notifications etc.

Local state:

Bring computation and datatogether in one place

Fault tolerance

Kafka Kafka

YARN NodeManager YARN NodeManagerYARN

RM Samza Container

Samza Container

Machine 1 Machine 2

Task Task

Kafka Kafka

YARN NodeManager YARN NodeManagerYARN

RM Samza Container

Samza Container

Machine 1 Machine 2

Task Task

YARN NodeManager

Samza Container

YARN NodeManager

Samza Container

Machine 2

Task Task

Machine 3

Task Task

Fault-tolerant local state

LocalLevelDB/RocksDB

LocalLevelDB/RocksDBDurable

changelog

replicate writes

YARN NodeManager

Samza Container

YARN NodeManager

Samza Container

Machine 2

Task Task

Machine 3

Task Task

Samza’s fault-tolerant local state Embedded key-value: very fast Machine dies ⇒ local key-value store is lost

Solution: replicate all writes to Kafka! Machine dies ⇒ restart on another machine Restore key-value store from changelog Changelog compaction in the background (Kafka

0.8.1)

When things go slow…

Owned byTeam X

Team Y Team Z

Cascades of jobs

Stream BStream A

Job 2 Job 3

Job 4 Job 5

Consumer goes slow

Backpressure Queue upDrop data

Other jobs grindto a halt

Run out ofmemory

Spill to disk

Oh wait… Kafka does this anyway!

No thanks

Stream BStream A

Job 2 Job 3

Job 4 Job 5

Job 1 output

Job 2 output Job 3 output

Samza always writesjob output to Kafka

MapReduce always writesjob output to HDFS

Every job output is a named stream Open: Anyone can consume it Robust: If a consumer goes slow, nobody else is

affected Durable: Tolerates machine failure Debuggable: Just look at it Scalable: Clean interface between teams Clean: loose coupling between jobs

Problem SolutionNeed to buffer job outputfor downstream consumers

Write it to Kafka!

Need to make local statestore fault-tolerant

Write it to Kafka!

Need to checkpoint jobstate for recovery

Write it to Kafka!

Apache Kafka Apache Samza

kafka.apache.org samza.incubator.apache.org

Hello Samza (try Samza in 5 mins)

Thank you!Samza:• Getting started:

samza.incubator.apache.org• Underlying thinking:bit.ly/jay_on_logs• Start contributing:

bit.ly/samza_newbie_issues

Me:• Twitter: @martinkl• Blog: martinkl.com

Samza at LinkedIn: Taking Stream Processing to the Next Level

Engineering

Transcript of Samza at LinkedIn: Taking Stream Processing to the Next Level

StatSever-Samza: Near Real-Time Analytics

Samza: Stateful Scalable Stream Processing at LinkedIn · within seconds, ad campaigns to orient ads based on cur-rent user activity, and data from IoT (Internet of Things) to be

Apache Incubator Samza: Stream Processing at LinkedIn

Introduction to - · PDF fileTask running within Activity Streams deployment that sources documents for the stream, likely in their original data format.! ... Samza! Spark Streaming!

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE … SF... · STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & ... One of the initial authors of Apache Kafka, ... Introduction to

Lambda-less Stream Processing @Scale in LinkedIn

Intro to Big Data · •Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka messaging system. •While Kafka can be used by many stream processing

Data-Trace Types for Distributed Stream Processing Systemsmamouras/papers/Trace-Transductions.pdf · and Apache Samza [2] ... During compilation and deployment, this dataﬂow graph

Streaming@Twitter - IEEE Computer Societysites.computer.org/debull/A15dec/p15.pdf · Apache Samza [4] developed at LinkedIn, is a real-time, asynchronous computational framework for

Locality-Aware Routing in Stateful Streaming Applications · Apache Storm [23], Flink [1], S4 [18], Samza [2], and Twit-ter Heron [14], are examples of popular stream processing engines.

Architectures for massive data management - Apache Kafka, Samza ...

Apache samza past, present and future

Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream

Samza la hug

Data Stream Ingestion & Complex Event Processing Systems ... · PDF fileData Stream Ingestion & Complex Event Processing Systems for ... Apache Samza and Apache Storm have received

Samza: Stateful Scalable Stream Processing at LinkedIn · 2018-01-31 · Indranil Gupta*, and Roy H. Campbell* *University of Illinois at Urbana-Champaign, yLinkedIn Corp fabdolla2,

Samza tech talk_2015 - strata

A STORM ARCHITECTURE FOR FUSING IOT DATAp-comp.di.uoa.gr/resources/FUSION.pdf · Batch & Stream processing systems Apache Samza, real time stream processing system uses Apache Kafka

Building Real-time Data Products at LinkedIn with Apache Samza

Event Stream Processing with Kafka and Samza

Samza: Stateful Scalable Stream Processing at LinkedIn · 2018-01-31 · Indranil Gupta, and Roy H. Campbell *University of Illinois at Urbana-Champaign, yLinkedIn Corp fabdolla2,