Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

29
Apache Samza Reliable Stream Processing Atop Apache Kafka and Hadoop YARN Jakob Homan London HUG

description

Overview of Apache Samza presented to the London HUG, October 22, 2013

Transcript of Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Page 1: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Apache Samza

Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Jakob Homan London HUG

Page 2: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Who I am

• Samza for five months• Before that Hadoop, Hive, Giraph• Say hi: @blueboxtraveler

Page 3: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Things we would like to do(better)

Page 4: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Provide timely, relevant updates to your newsfeed

Page 5: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Update search results with new information as it appears

Page 6: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Sculpt metrics and logs into useful shapes

Page 7: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Tools?

Response latency

Samza

Milliseconds to minutes

RPC

Synchronous Later. Possibly much later.

Page 8: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Frame(work) of reference

ClassicHadoop

Samza

Storage layerExecutionengine API

HDFS

Kafka

Map-Reduce

YARN

map(k, v) => (k,v)reduce(k, list(v)) => (k,v)

process(msg(k,v)) => msg(k,v)

Page 9: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Storage layer: Kafka

Page 10: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Apache Kafka

• Persistent, reliable,distributed message queue

Shiny new logo!

Page 11: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

At LinkedIn

10+ billionwrites per day

172kmessages per second

(average)

55+ billionmessages per day

to real-time consumers

Page 12: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Quick aside…

Kafka: First among (pluggable) equals

LinkedIn: Espresso and Databus

Coming soon? HDFS, ActiveMQ, Amazon SQS

Page 13: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Kafka in four bullet points

• Producers send messages to brokers• Messages are key, value pairs• Brokers store messages in topics for

consumers• Consumers pull messages from brokers

Page 14: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

A Kafka Topic

“Very sleepy”53 4 “Car nicked!”75 5 “The ref’s blind!”23 4 “Nicked a car!”53 4

Topic: StatusUpdateEvent

Key: User ID of user who updated the status

Value: Timestamp, new status, geolocation, etc.

Page 15: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Kafka topics are partitioned

Message contentsKe y Message

contentsKe yMessage contentsKe y Message

contentsKe y Message contentsKe y Message

contentsKe y

Message contentsKe y Message

contentsKe y Message contentsKe y Message

contentsKe yPartition 0

Partition 1

Partition 2

For our purposes, hash partitioned on the key!

Page 16: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

A Samza job

Input topics

• StatusUpdateEvent• NewConnectionEvent• LikeUpdateEvent

Some code

MyStreamTask implements StreamTask{ …………. }

Output topics

• NewsUpdatePost• UpdatesPerHourMetric

Page 17: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Execution engine: YARN

Page 18: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

What we use YARN for

• Distributing our tasks across multiple machines

• Letting us know when one has died• Distributing a replacement• Isolating our tasks from each other

Page 19: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Machine 1 Machine 1

YARN: Execution and reliability

MyStreamTask:process()

Samza TaskRunner: Partition 0

MyStreamTask:process()

Samza TaskRunner: Partition 1

Node Manager 2Node Manager 1

Samza App Master

Kafka Broker Kafka Broker

Page 20: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Co-partitioning of topics

MyStreamTask:process()

Samza TaskRunner: Partition 0StatusUpdateEvent, Partition 0

NewConnectionEvent, Partition 0

NewsUpdatePost

An instance of StreamTask is responsible for a specific partition

Page 21: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

API: process()

Page 22: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

public interface StreamTask { void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator ) }

getKey(), getMsg()

sendMsg(topic, key, value)commit(), shutdown()

Page 23: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Awesome feature: State

• Generic data store interface• Key-value out-of-box– More soon? Bloom filter, lucene, etc.

• Restored by Samza upon task crash

MyStreamTask:process()

Samza TaskRunner: Partition 0

Store state

Page 24: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

(Pseudo)code snippet: Newsfeed

• Consume StatusUpdateEvent– Send those updates to all your conmections via

the NewsUpdatePost topic• Consume NewConnectionEvent– Maintain state of connections to know who to

send to

Page 25: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

public class NewsFeed implements StreamTask { void process(envelope, collector, coordinator) { msg = env.getMsg() userId = msg.get(“userID”); if(msg.get(“type”)==STATUS_UPDATE) { foreach(conn: kvStore.get(userId) { collector.send(“NewsUpdatePost”, new Msg(conn, msg.get(“newStatus”))

} } else { newConn = msg.get(“newConnection”) connections = kvStore.get(userId) kvStore.put(userID, connections ++ newConn) }

Page 26: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Current status

Page 27: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Hello, Samza!

Cool, eh? bit.ly/hello-samza

Consume Wikipedia edits live

Up and running in 3 minutes

Generate stats on those edits

Page 28: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

samza.incubator.apache.org bit.ly/samza_newbie_issues

Page 29: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Cheers!

• Quick start: bit.ly/hello-samza• Project homepage: samza.incubator.apache.org• Newbie issues: bit.ly/samza_newbie_issues• Detailed Samza and YARN talk: bit.ly/samza_and_yarn• Twitter: @samzastream