Twitter Stream Processing

26
Twitter stream processing Colin Surprenant @colinsurprenant Lead ninja Big Data Montreal April 2012

description

Process the Twitter stream using Storm & Redstorm with Ruby & JRuby. Full working demo, code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/

Transcript of Twitter Stream Processing

Page 1: Twitter Stream Processing

Twitterstream processing

Colin Surprenant@colinsurprenantLead ninja

Big Data MontrealApril 2012

Page 2: Twitter Stream Processing

Twitter - Spring 2012

● 350 million tweets/day● 140 million active users● >1 million applications using API

Page 3: Twitter Stream Processing

Daily Twitter @ Needium

50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data

Page 4: Twitter Stream Processing

Anatomy of a tweet

What's in a tweet? Anything else?

avatar

usertimestamp

message

Page 5: Twitter Stream Processing
Page 6: Twitter Stream Processing

How to get the tweets?

● Streaming API

Subscribe to realtime feeds moving forward ● REST Search API

Search request on past data (1 week)

Page 7: Twitter Stream Processing

Streaming APIpublic statuses from all users

● status/filtertrack/location/follow○ 5000 follow user ids

○ 400 track keywords

○ 25 location boxes

○ rate limited

● status/sample○ 1% of all public statuses (message id mod 100)

○ two status/sample streams will result in same data

Page 8: Twitter Stream Processing

Streaming APIper user streams

● User Streams

○ all data required to update a user's display○ requires user's OAuth token○ statuses from followings, direct messages, mentions○ cannot open large number of user streams from same host

● Site Streams○ multiplexing of multiple User Streams

Page 9: Twitter Stream Processing

Streaming APIFirehose

need more/full data? only through partners ● gnip.com● datasift.com

● filtering/tracking

● partial to full Firehose

What's the catch? $$$lots of it

Page 10: Twitter Stream Processing

Search API

● REST API (http request/response)

○ search query○ geocode (lat, long, radius)○ result type (mixed/recent/popular)○ since id

● max 100 rpp and 1500 results● rate limited (~1 request/sec)

Page 11: Twitter Stream Processing

StormDistributed and fault-tolerant realtime computation

https://github.com/nathanmarz/storm

Page 12: Twitter Stream Processing

Storm

The promise ● Guaranteed data processing● Horizontal scalability● Fault-tolerance● No intermediate message brokers● Higher level abstraction than message passing● Just work

Page 13: Twitter Stream Processing

RedStorm

Storm: Java + Closure

RedStorm: JRuby integration & DSL for Storm

Ruby + Storm on JVM

https://github.com/colinsurprenant/redstorm

+

Page 14: Twitter Stream Processing

StormTypical use cases

Streamprocessing

Continuouscomputation

DistributedRPC

Page 15: Twitter Stream Processing

StormConcepts

Streams

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple

Page 16: Twitter Stream Processing

StormConcepts

Spouts

Source of streams

TupleTuple

TupleTuple

Tuple

TupleTuple

TupleTuple

Tuple

Page 17: Twitter Stream Processing

StormConcepts

Bolts

Processes input streams and produce new streams

Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple

Page 18: Twitter Stream Processing

StormConcepts

Topology

Network of spouts and bolts

Page 19: Twitter Stream Processing

StormConcepts

bolt A bolt B

bolt A bolt B

field X

field Y

bolt A bolt B

bolt A bolt B

random+fair distribution across tasks

replication to all tasks

partition on specified field entire stream to single task

Shuffle

GlobalFields

All

Grouping

Page 20: Twitter Stream Processing

Storm

What Storm does ● Distributes code● Robust process management● Monitors topologies and reassigns failed tasks● Provides reliability by tracking tuple trees● Routing and partitioning of stream

Page 21: Twitter Stream Processing

TweitgeistLive top 10 trending hashtags on Twitter

DEMO

https://github.com/colinsurprenant/tweitgeistLive demo: http://tweitgeist.needium.com

Page 22: Twitter Stream Processing

TweitgeistArchitecture

Twitterstream

Queue

Parallel extraction

Partitionedcounting

Partitionedcounting

Partitionedranking

Partitionedranking

Merging

Queue

Frontend

N partitions N partitions

Page 23: Twitter Stream Processing

TweitgeistArchitecture

● Parallel computation across partitions of the stream● Scalable architecture● Work for full Twitter firehose with very little mods

Page 24: Twitter Stream Processing

TweitgeistStorm Topology

TwitterStreamSpout

ExtractMessageBolt

ExtractHashtagBolt

RollingCountBolt

shuf

fle

shuf

fle

field

hash

tag

RankBolt

field

hash

tag

MergeBolt

glob

al

Shuffle grouping: Tuples are randomly distributed across the bolt's tasksFields grouping: The stream is partitioned by the fields specified in the groupingGlobal grouping: The entire stream goes to a single one of the bolt's tasks

Redisqueue

Redisqueue

Twitter U

I

streamreader

messageextract

hashtagextract

rollingcounter

ranking merging

Page 25: Twitter Stream Processing

TweitgeistTopology definition

Page 26: Twitter Stream Processing

TweitgeistWhere

Codehttps://github.com/colinsurprenant/tweitgeist

Live demohttp://tweitgeist.needium.com