Real-time streams and logs with Storm and Kafka

Posted on 27-Jan-2015



Description

Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of "big data" platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis across commodity machines, but these analyses can still take hours to run and do not respond well to rapidly changing data sets. A new generation of data processing platforms, which we call "stream architectures," converts data sources into streams of data that can be processed and analyzed in real time. This has led to the development of distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams. In this talk, we give an overview of these technologies and how they fit into the Python ecosystem. As part of this presentation, we also released streamparse, a new Python library that makes it easy to debug and run large Storm clusters.

Links:

http://parse.ly/code

https://github.com/Parsely/streamparse

https://github.com/getsamsa/samsa

Transcript of Real-time streams and logs with Storm and Kafka

Real-time Streams & Logs

Andrew Montalenti, CTO

Keith Bourgoin, Backend Lead

1 of 47

Agenda

Parse.ly problem space

Aggregating the stream (Storm)

Organizing around logs (Kafka)

2 of 47

Admin

Our presentations and code:

http://parse.ly/code

This presentation's slides:

http://parse.ly/slides/logs

This presentation's notes:

http://parse.ly/slides/logs/notes

3 of 47

What is Parse.ly?

4 of 47

What is Parse.ly?

Web content analytics for digital storytellers.

5 of 47

Velocity

Average post has a shelf life of under 48 hours.

6 of 47

Volume

Top publishers write thousands of posts per day.

7 of 47

Time series data

8 of 47

Summary data

9 of 47

Ranked data

10 of 47

Benchmark data

11 of 47

Information radiators

12 of 47

Architecture evolution

13 of 47

Queues and workers

Queues: RabbitMQ => Redis => ZeroMQ

Workers: Cron Jobs => Celery

14 of 47

Workers and databases

15 of 47

Lots of moving parts

16 of 47

In short: it started to get messy

17 of 47

Introducing Storm

Storm is a distributed real-time computation system.

Hadoop provides a set of general primitives for doing batch processing.

Storm provides a set of general primitives for doing real-time computation.

Perfect as a replacement for ad-hoc workers-and-queues systems.

18 of 47

Storm features

Speed

Fault tolerance

Parallelism

Guaranteed Messages

Easy Code Management

Local Dev

19 of 47

Storm primitives

Streaming Data Set, typically from Kafka.

ZeroMQ used for inter-process communication.

Bolts & Spouts; Storm's Topology is a DAG.

Nimbus & Workers manage execution.

Tuneable parallelism + built-in fault tolerance.

20 of 47

Wired Topology

21 of 47

Tuple Tree

Tuple tree, anchoring, and retries.
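
To make anchoring concrete, here is a toy model of at-least-once replay. This is illustrative only, not Storm's actual implementation (Storm tracks tuple trees with a far more compact, XOR-based scheme across the cluster); the PendingLog name is made up for this sketch.

import uuid

class PendingLog(object):
    """Toy model of emit/ack/fail: pending tuples are kept until
    acked, and a failed tuple is replayed by re-emitting it."""

    def __init__(self):
        self.pending = {}  # tuple id -> values awaiting an ack

    def emit(self, values):
        tup_id = uuid.uuid4().hex
        self.pending[tup_id] = values
        return tup_id

    def ack(self, tup_id):
        # Fully processed by every downstream bolt: forget it.
        self.pending.pop(tup_id, None)

    def fail(self, tup_id):
        # Processing failed somewhere in the tuple tree: replay.
        values = self.pending.pop(tup_id)
        return self.emit(values)

log = PendingLog()
tup_id = log.emit(['dog'])
tup_id = log.fail(tup_id)  # 'dog' is re-emitted under a fresh id
log.ack(tup_id)            # now fully processed
assert not log.pending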

22 of 47

Word Stream Spout (Storm)

;; spout configuration
{"word-spout" (shell-spout-spec
    ;; Python Spout implementation:
    ;; - fetches words (e.g. from Kafka)
    ["python" "words.py"]
    ;; - emits (word,) tuples
    ["word"]
    )}

23 of 47

Word Stream Spout in Python

import itertools

from streamparse import storm


class WordSpout(storm.Spout):

    def initialize(self, conf, ctx):
        self.words = itertools.cycle(['dog', 'cat',
                                      'zebra', 'elephant'])

    def next_tuple(self):
        word = next(self.words)
        storm.emit([word])


WordSpout().run()

24 of 47

Word Count Bolt (Storm)

;; bolt configuration
{"count-bolt" (shell-bolt-spec
    ;; Bolt input: Spout and field grouping on word
    {"word-spout" ["word"]}
    ;; Python Bolt implementation:
    ;; - maintains a Counter of words
    ;; - increments as new words arrive
    ["python" "wordcount.py"]
    ;; emits latest word count for most recent word
    ["word" "count"]
    ;; parallelism = 2
    :p 2
    )}

25 of 47

Word Count Bolt in Python

from collections import Counter

from streamparse import storm


class WordCounter(storm.Bolt):

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        storm.emit([word, self.counts[word]])
        storm.log('%s: %d' % (word, self.counts[word]))


WordCounter().run()

26 of 47

streamparse

sparse provides a CLI front-end to streamparse, a framework for creating Python projects for running, debugging, and submitting Storm topologies for data processing. (Still in development.)

After installing lein (the only dependency), you can run:

pip install streamparse

This installs a command-line tool, sparse. Use:

sparse quickstart

27 of 47

Running and debugging

You can then run the local Storm topology using:

$ sparse run
Running wordcount topology...
Options: {:spec "topologies/wordcount.clj", ...}
#<StormTopology StormTopology(spouts:{word-spout=...
storm.daemon.nimbus - Starting Nimbus with conf {...
storm.daemon.supervisor - Starting supervisor with id 4960ac74...
storm.daemon.nimbus - Received topology submission with conf {...
... lots of output as topology runs ...

Interested? Lightning talk!

28 of 47

Organizing around logs

29 of 47

Not all logs are application logs

A "log" could be any stream of structured data:

Web logs

Raw data waiting to be processed

Partially processed data

Database operations (e.g. mongo's oplog)

A series of timestamped facts about a given system.
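
For example, one of these "timestamped facts" can be as small as a dict with a timestamp. A minimal sketch; the field names below are illustrative, not an actual Parse.ly schema.

import json
import time

# A single timestamped fact in a web log stream: a pageview event.
event = {
    'ts': time.time(),                 # when it happened
    'action': 'pageview',              # what happened
    'url': 'http://example.com/post',  # what it happened to
}
print(json.dumps(event))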

30 of 47

LinkedIn's lattice problem

31 of 47

Enter the unified log

32 of 47

Log-centric is simpler

33 of 47

Parse.ly is log-centric, too

34 of 47

Introducing Apache Kafka

Log-centric messaging system developed at LinkedIn.

Designed for throughput; efficient resource use.

Persists to disk; recent data is served from memory.

Little to no overhead for new consumers.

Scalable to tens of thousands of messages per second.

As of 0.8, full replication of topic data.

35 of 47

Kafka concepts

Concept          Description
Cluster          An arrangement of Brokers & ZooKeeper nodes
Broker           An individual node in the Cluster
Topic            A group of related messages (a stream)
Partition        Part of a topic, used for replication
Producer         Publishes messages to a stream
Consumer Group   A group of related processes reading a topic
Offset           Point in a topic that the consumer has read to
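
To make the Producer row concrete, here is a minimal sketch using the same kafka-python library as the consumer example later in this deck. It assumes a broker on localhost:9092 and a 'raw_data' topic, and uses the SimpleProducer API from kafka-python's 0.8-era releases.

from kafka.client import KafkaClient
from kafka.producer import SimpleProducer

# Connect to one broker; the client discovers the rest of the cluster.
kafka = KafkaClient('localhost:9092')
producer = SimpleProducer(kafka)

# Publish a few messages to the 'raw_data' topic (stream).
for i in range(3):
    producer.send_messages('raw_data', 'message %d' % i)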

36 of 47

What's the catch?

Replication isn't perfect. Network partitions can cause problems.

No out-of-order acknowledgement:

The "offset" is a marker of where the consumer is in the log; nothing more.

On a restart, you know where to start reading, but not whether individual messages before the stored offset were fully processed.

In practice, this is not as much of a problem as it sounds; see the idempotency sketch below.
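
A common mitigation is to make processing idempotent, so replayed messages are harmless. A minimal sketch, assuming each message carries a unique id (the 'id' field and the in-memory set are illustrative; production code would track seen ids in a database):

seen = set()  # ids already processed

def handle(msg):
    print('processing %s' % msg['id'])

def process_once(msg):
    # After a restart, messages past the stored offset may be
    # re-delivered; skipping already-seen ids makes that a no-op.
    if msg['id'] in seen:
        return
    seen.add(msg['id'])
    handle(msg)

process_once({'id': 'a1'})
process_once({'id': 'a1'})  # replayed duplicate: ignored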

37 of 47

Kafka is a "distributed log"

Topics are logs, not queues.

Consumers read into offsets of the log.

Logs are maintained for a configurable period of time.

Messages can be "replayed".

Consumers can share identical logs easily (see the toy model below).
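
Here is a toy in-memory model of those semantics (illustrative only, nothing Kafka-specific): the log is append-only, and each consumer owns nothing but its own offset, so adding a reader or replaying history costs the log nothing.

class ToyLog(object):
    """Append-only log; each consumer tracks only its own offset."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset):
        # Reading never mutates the log, unlike popping from a queue.
        return self.messages[offset:]

log = ToyLog()
for word in ['dog', 'cat', 'zebra']:
    log.append(word)

# Two independent consumers share the identical log:
offsets = {'analytics': 0, 'archiver': 2}
print(log.read(offsets['analytics']))  # ['dog', 'cat', 'zebra']
print(log.read(offsets['archiver']))   # ['zebra']
offsets['analytics'] = 3               # "commit" after processing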

38 of 47

Multi-consumer

Even if Kafka's availability and scalability story isn't interesting to you, the multi-consumer story should be.

39 of 47

Queue problems, revisited

Traditional queues (e.g. RabbitMQ / Redis):

not distributed / highly available at their core

not persistent ("overflows" easily)

more consumers mean more queue server load

Kafka solves all of these problems.

40 of 47

Kafka + Storm

Good fit for at-least-once processing.

No need for out-of-order acks.

Community work is ongoing for at-most-once processing.

Able to keep up with Storm's high-throughput processing.

Great for handling backpressure during traffic spikes (a Kafka-fed spout sketch follows).
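
As a sketch of how the two fit together, here is a spout that feeds a topology from Kafka, combining the streamparse spout pattern from earlier with kafka-python's SimpleConsumer. The topic and group names are assumptions, and the get_messages call and message layout follow the 0.8-era kafka-python API, so treat this as a sketch rather than production code.

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

from streamparse import storm


class RawDataSpout(storm.Spout):

    def initialize(self, conf, ctx):
        kafka = KafkaClient('localhost:9092')
        self.consumer = SimpleConsumer(kafka, 'storm_consumer', 'raw_data')

    def next_tuple(self):
        # Non-blocking fetch so the spout never stalls the topology.
        for kafka_msg in self.consumer.get_messages(count=1, block=False):
            storm.emit([kafka_msg.message.value])


RawDataSpout().run()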

41 of 47

Kafka in Python (1)

kafka-python (0.8+)

https://github.com/mumrah/kafka-python

import time

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kafka = KafkaClient('localhost:9092')
consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

count = 0
start = time.time()
for msg in consumer:
    count += 1
    if count % 1000 == 0:
        # 1000 messages took `dur` seconds: rate is 1000 / dur
        dur = time.time() - start
        print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
        start = time.time()

42 of 47

Kafka in Python (2)

samsa (0.7.x)

https://github.com/getsamsa/samsa

import time

from kazoo.client import KazooClient
from samsa.cluster import Cluster

zk = KazooClient()
zk.start()
cluster = Cluster(zk)
queue = cluster.topics['raw_data'].subscribe('test_consumer')

count = 0
start = time.time()
for msg in queue:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
        queue.commit_offsets()  # commit to zk every 1k msgs
        start = time.time()

43 of 47

Other Log-Centric Companies

Company      Logs     Workers
LinkedIn     Kafka*   Samza
Twitter      Kafka    Storm*
Pinterest    Kafka    Storm
Spotify      Kafka    Storm
Wikipedia    Kafka    Storm
Outbrain     Kafka    Storm
LivePerson   Kafka    Storm
Netflix      Kafka    ???

(* = where the project originated)

44 of 47

Conclusion

45 of 47

What we've learned

There is no silver bullet data processing technology.

Log storage is very cheap, and getting cheaper.

"Timestamped facts" is rawest form of data available.

Storm and Kafka allow you to develop atop those facts.

Organizing around real-time logs is a wise decision.

46 of 47

Questions?

Go forth and stream!

Parse.ly:

http://parse.ly/code

http://twitter.com/parsely

Andrew & Keith:

http://twitter.com/amontalenti

http://twitter.com/kbourgoin

47 of 47