Real Time Analytics for Big Data - A twitter inspired case study


Transcript of Real Time Analytics for Big Data - A twitter inspired case study

Page 1: Real Time Analytics for Big Data - A twitter inspired case study

Real Time Analytics for Big Data: A Twitter Case Study

Uri Cohen Head of Product

@uri1803

Page 2: Real Time Analytics for Big Data - A twitter inspired case study

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 3: Real Time Analytics for Big Data - A twitter inspired case study

We’re Living in a Real Time World…

Google RT Web Analytics

Google Real Time Search

Facebook Real Time Social Analytics

Twitter paid tweet analytics

Real Time User Engagement

New Real Time Analytics Startups


Page 4: Real Time Analytics for Big Data - A twitter inspired case study

3 Flavors to Analytics

• Counting
• Correlating
• Research


Page 5: Real Time Analytics for Big Data - A twitter inspired case study

Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics: countries and cities, gender, age groups, device types, …


Page 6: Real Time Analytics for Big Data - A twitter inspired case study

Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?


Page 7: Real Time Analytics for Big Data - A twitter inspired case study

Analytics @ Twitter – Research

Sentiment analysis: “Obama is popular”

Trends: “People like to tweet after watching American Idol”

Spam patterns: How can you tell when a user spams?


Page 8: Real Time Analytics for Big Data - A twitter inspired case study

It’s All about Timing

“Real time” (< a few seconds)

Reasonably Quick (seconds - minutes)

Batch (hours/days)


Page 9: Real Time Analytics for Big Data - A twitter inspired case study

It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)


This is what we’re here to discuss

Page 10: Real Time Analytics for Big Data - A twitter inspired case study

Twitter in Numbers (March 2011)

It takes a week for users to send 1 billion Tweets.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 11: Real Time Analytics for Big Data - A twitter inspired case study

Twitter in Numbers (March 2011)

On average, 140 million tweets get sent every day.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 12: Real Time Analytics for Big Data - A twitter inspired case study

Twitter in Numbers (March 2011)

The highest throughput to date is 6,939 tweets/sec.

Source: http://blog.twitter.com/2011/03/numbers.html
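
The daily average and the peak rate quoted on these slides can be cross-checked with simple arithmetic. The sketch below only uses the slide figures; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000     # daily average quoted on the slides
peak_tweets_per_sec = 6_939      # highest throughput quoted on the slides

avg_tweets_per_sec = tweets_per_day / SECONDS_PER_DAY
tweets_per_week = tweets_per_day * 7
peak_to_avg = peak_tweets_per_sec / avg_tweets_per_sec

print(f"average: {avg_tweets_per_sec:.0f} tweets/sec")
print(f"weekly:  {tweets_per_week:,} tweets")
print(f"peak is {peak_to_avg:.1f}x the average")
```

The weekly total comes out just under 1 billion, which agrees with the "1 billion Tweets in a week" figure, and the peak is roughly four times the average rate, so a real-time pipeline has to be sized for bursts rather than for the mean.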

Page 13: Real Time Analytics for Big Data - A twitter inspired case study

Twitter in Numbers (March 2011)

460,000 new accounts are created daily.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 14: Real Time Analytics for Big Data - A twitter inspired case study

Twitter in Numbers

5% of the users generate 75% of the content.

Source: http://www.sysomos.com/insidetwitter/

Page 15: Real Time Analytics for Big Data - A twitter inspired case study

Challenge 1 – Computing Reach


Page 16: Real Time Analytics for Big Data - A twitter inspired case study

Challenge 1 – Computing Reach

Diagram: Tweets → Followers → Distinct Followers → Count → Reach
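
A minimal sketch of the reach computation, assuming reach means the number of distinct users who follow at least one account that tweeted about a topic. The `followers_of` mapping and the function name are hypothetical stand-ins for a follower lookup service.

```python
from typing import Dict, Iterable, Set


def compute_reach(tweeters: Iterable[str], followers_of: Dict[str, Set[str]]) -> int:
    """Count distinct followers across everyone who tweeted the topic."""
    distinct_followers: Set[str] = set()
    for user in tweeters:
        # Set union removes followers shared by several tweeters.
        distinct_followers |= followers_of.get(user, set())
    return len(distinct_followers)


# Illustrative data: "dave" follows both tweeters but is counted once.
followers_of = {
    "alice": {"carol", "dave"},
    "bob": {"dave", "erin"},
}
reach = compute_reach(["alice", "bob"], followers_of)
print(reach)  # 3
```

The expensive part at Twitter scale is the de-duplication step, since follower sets for popular accounts are very large and overlap heavily.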


Page 17: Real Time Analytics for Big Data - A twitter inspired case study

Challenge 2 – Word Count

Diagram: Tweets → Count? → Word:Count

• Hottest topics
• URL mentions
• etc.
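
As a starting point, word and URL counting over a small batch of tweets might look like the sketch below. The regular expressions and the sample tweets are assumptions made for illustration, not the tokenization rules used in the talk.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z0-9']+")


def count_tokens(tweets):
    """Return (word counts, URL counts) for an iterable of tweet texts."""
    words, urls = Counter(), Counter()
    for text in tweets:
        urls.update(URL_RE.findall(text))
        # Strip URLs first so their fragments are not counted as words.
        without_urls = URL_RE.sub(" ", text).lower()
        words.update(WORD_RE.findall(without_urls))
    return words, urls


tweets = [
    "Big data is big http://bit.ly/a",
    "Real time data http://bit.ly/a",
]
words, urls = count_tokens(tweets)
print(words.most_common(2))
print(urls.most_common(1))
```

`most_common` gives the "hottest topics" view directly; the rest of the deck is about doing the same thing continuously on a stream instead of on a list.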

Page 18: Real Time Analytics for Big Data - A twitter inspired case study


URL Mentions – Here’s One Use Case

Page 19: Real Time Analytics for Big Data - A twitter inspired case study

Analyze the Problem

• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
• A few 10s of thousands of words (or hundreds of thousands if we include URLs)
• System needs to linearly scale

Page 20: Real Time Analytics for Big Data - A twitter inspired case study


It’s Really Just an Online Map/Reduce
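
One way to read "online map/reduce": the map step still emits (word, 1) pairs, but the reduce step folds each pair into a running total as it arrives instead of waiting for the whole input. A minimal sketch, with hypothetical names and a short list standing in for the unbounded tweet stream:

```python
from collections import defaultdict


def map_tweet(text):
    """Map step: emit a (word, 1) pair for every word in one tweet."""
    for word in text.lower().split():
        yield word, 1


class OnlineReducer:
    """Reduce step applied incrementally, one pair at a time."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reduce(self, key, value):
        self.counts[key] += value


reducer = OnlineReducer()
for tweet in ["storm is fast", "storm scales"]:  # stands in for a live stream
    for word, one in map_tweet(tweet):
        reducer.reduce(word, one)
    # The running counts can be queried after every tweet, not only at the end.

print(dict(reducer.counts))
```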

Page 21: Real Time Analytics for Big Data - A twitter inspired case study


Event Driven Flow

Diagram: Tokenize → Filter → Count → Store Counters; Store Raw; Research

Page 22: Real Time Analytics for Big Data - A twitter inspired case study

It’s Not That Simple…

• (Tens of) thousands of tweets to tokenize every second, even more words to filter: CPU bottleneck
• Tens/hundreds of thousands of counters to update: counter contention
• Tens/hundreds of thousands of counters to persist: database bottleneck
• (Tens of) thousands of tweets to store every second in the database: database bottleneck

Page 23: Real Time Analytics for Big Data - A twitter inspired case study


Implementation Challenges

Scalability. Twitter is growing by the day. How can this scale?

Fault tolerance. What happens if a server fails?

Consistency. Counters should be accurate!

Page 24: Real Time Analytics for Big Data - A twitter inspired case study

Solutions, Solutions


Page 25: Real Time Analytics for Big Data - A twitter inspired case study

Dealing with the Massive Tweet Stream

• We need a system that could scale linearly
• Event partitioning to the rescue!
• What to partition by?

Diagram: Tokenizer 1 … Tokenizer n, each paired with Filterer 1 … Filterer n
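
A minimal sketch of event partitioning. It assumes tweets are routed by tweet ID, which is only one possible answer to "what to partition by?"; the function name and the ID format are hypothetical.

```python
import zlib


def partition_for(routing_key: str, num_partitions: int) -> int:
    """Stable hash: the same key always maps to the same partition."""
    # zlib.crc32 is used instead of hash() because str hashes are
    # randomized per process and would not agree across machines.
    return zlib.crc32(routing_key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
tweet_ids = [f"tweet-{i}" for i in range(1000)]

# Tally how many tweets each tokenizer/filterer pair would receive.
load = [0] * NUM_PARTITIONS
for tweet_id in tweet_ids:
    load[partition_for(tweet_id, NUM_PARTITIONS)] += 1

print(load)
```

Because tokenizing and filtering one tweet does not depend on any other tweet, any evenly distributed key works at this stage, and adding partitions adds capacity.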

Page 26: Real Time Analytics for Big Data - A twitter inspired case study

Counters Persistence & Contention

• Why not keep things in memory?
• Treat it as a system of record (SoR)

Diagram: Tokenizer 1 … n → Filterer 1 … n → Counter Updater 1 … n → IMDG

Page 27: Real Time Analytics for Big Data - A twitter inspired case study

Counters Persistence & Contention

• Why not keep things in memory (data grid)?
• Question: How to partition counters?

Diagram: Tokenizer 1 … n → Filterer 1 … n → Counter Updater 1 … n → IMDG
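
One possible answer to the partitioning question, sketched under the assumption that counters are partitioned by the word itself: every update for a given word then lands on a single partition, so no counter is ever updated from two places. The names below are hypothetical, and plain in-process `Counter` objects stand in for grid partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 3
partitions = [Counter() for _ in range(NUM_PARTITIONS)]  # stand-ins for grid partitions


def owner_of(word: str) -> int:
    """The single partition that owns the counter for this word."""
    return zlib.crc32(word.encode("utf-8")) % NUM_PARTITIONS


def increment(word: str, delta: int = 1) -> None:
    partitions[owner_of(word)][word] += delta


def read(word: str) -> int:
    return partitions[owner_of(word)][word]


for word in ["hadoop", "storm", "hadoop", "s4", "hadoop"]:
    increment(word)

print(read("hadoop"))  # 3
```

The trade-off is skew: a very hot word makes its owning partition hotter than the others.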

Page 28: Real Time Analytics for Big Data - A twitter inspired case study

Counters Persistence & Contention

• Why not keep things in memory?
• Question: How to partition counters?

Diagram: Tokenizer 1 … n → Filterer 1 … n → Counter Updater 1 … n → IMDG

Facebook keeps 80% of its data in memory (Stanford research)

RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Page 29: Real Time Analytics for Big Data - A twitter inspired case study

Database Bottleneck – Storing Raw Tweets

• RDBMS won’t cut it (unless you have $$$$$)
• Hadoop / NoSQL to the rescue
• Use the data grid for batching and as a reliable transaction log

Diagram: Persister 1 … Persister n
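
The batching idea can be sketched as a small write-behind buffer: tweets accumulate in memory and are written to the big data store in bulk. This is an illustration only. `sink` is a hypothetical bulk-write hook, and a plain Python list stands in for the replicated data grid that would make the pending log reliable.

```python
class BatchingPersister:
    """Buffers tweets and writes them to the backing store in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # callable taking a list of tweets (assumed bulk-write hook)
        self.batch_size = batch_size
        self.pending = []             # in-memory log of tweets not yet persisted

    def append(self, tweet):
        self.pending.append(tweet)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch = list(self.pending)
        self.sink(batch)              # if this raises, pending is left intact for a retry
        del self.pending[:len(batch)]


written_batches = []
persister = BatchingPersister(written_batches.append, batch_size=3)
for i in range(7):
    persister.append(f"tweet {i}")
persister.flush()                     # write the final partial batch

print([len(batch) for batch in written_batches])  # [3, 3, 1]
```

Seven tweets result in three writes instead of seven, which is the point of batching in front of a store that prefers bulk inserts.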

Page 30: Real Time Analytics for Big Data - A twitter inspired case study

Counter Consistency & Resiliency

• Primary/hot backup setup, sync replication
• Transactional, atomic updates
• “On Demand” backups

Diagram: IMDG with four primary (P) and four backup (B) partitions

Page 31: Real Time Analytics for Big Data - A twitter inspired case study

Counter Consistency & Resiliency

• Primary/hot backup setup, sync replication
• Transactional, atomic updates
• “On Demand” backups

Diagram: IMDG with four primary (P) and four backup (B) partitions, plus one additional P
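
A toy model of the primary/hot backup setup, assuming that "sync replication" means an update is acknowledged only after both copies have applied it, and that on failure the backup is promoted and a fresh backup is started. The class and method names are hypothetical, and both replicas live in one process here.

```python
class Replica:
    """One copy of a partition's counters."""

    def __init__(self):
        self.counters = {}

    def apply(self, word, delta):
        self.counters[word] = self.counters.get(word, 0) + delta


class Partition:
    """A primary with one hot backup, replicated synchronously."""

    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()

    def increment(self, word, delta=1):
        self.primary.apply(word, delta)
        self.backup.apply(word, delta)   # applied before the caller gets its answer
        return self.primary.counters[word]

    def fail_primary(self):
        # Promote the backup, then start a new backup from a copy of its state.
        self.primary = self.backup
        self.backup = Replica()
        self.backup.counters = dict(self.primary.counters)


partition = Partition()
partition.increment("obama")
partition.increment("obama")
partition.fail_primary()
after_failover = partition.increment("obama")
print(after_failover)  # 3
```

No increment is lost across the failover because the backup had already applied every acknowledged update.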

Page 32: Real Time Analytics for Big Data - A twitter inspired case study

Implementing Event Flows

• Use the data grid as event bus
• Stateful objects as transactional events

Diagram: Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter, running on a primary (P) and backup (B) pair
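
The "stateful objects as events" idea can be sketched with a list standing in for the data grid: each handler picks up objects in the state it listens for and moves them to the next state. The state names follow the diagram labels (Raw, Tokenized, Filtered); the final `COUNTED` state, the stop-word list, and all class and function names are assumptions, and transactions are not modeled.

```python
class TweetEvent:
    """An object whose state field tells which handler should process it next."""

    def __init__(self, text):
        self.text = text
        self.state = "RAW"
        self.tokens = []


STOP_WORDS = {"is", "the", "a"}   # illustrative filter list
counters = {}


def tokenizer(event):
    event.tokens = event.text.lower().split()
    event.state = "TOKENIZED"


def filterer(event):
    event.tokens = [t for t in event.tokens if t not in STOP_WORDS]
    event.state = "FILTERED"


def counter(event):
    for token in event.tokens:
        counters[token] = counters.get(token, 0) + 1
    event.state = "COUNTED"


# Each handler is registered against the state it consumes.
HANDLERS = {"RAW": tokenizer, "TOKENIZED": filterer, "FILTERED": counter}


def run(space):
    """Dispatch until no object is left in a state that has a handler."""
    progressed = True
    while progressed:
        progressed = False
        for event in space:
            handler = HANDLERS.get(event.state)
            if handler:
                handler(event)
                progressed = True


space = [TweetEvent("The grid is fast"), TweetEvent("a fast grid")]
run(space)
print(counters)
```

Because the state lives on the object itself, a handler that fails midway leaves the object in its previous state, where it can be picked up again.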

Page 33: Real Time Analytics for Big Data - A twitter inspired case study

Putting It All Together

Diagram: IMDG with four primary (P) and four backup (B) partitions

Page 34: Real Time Analytics for Big Data - A twitter inspired case study

Putting It All Together

Diagram: IMDG with four primary (P) and four backup (B) partitions

• Use an IMDG
• Partition events, counters
• Collocate event handlers with data for low latency and simple scaling
• Use atomic updates to update counters in memory
• Persist asynchronously to a big data store

Page 35: Real Time Analytics for Big Data - A twitter inspired case study

Reducing Counter Contention - Optimization

Store processed tweets locally in each partition

Use periodic batch writes to update counters

Process batches as a whole

Diagram labels: word1, word2, word3; 1 sec
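
A sketch of the batching optimization: words are tallied locally and pushed to the shared counters once per interval, so a word seen many times in the interval costs one counter update. The class name is hypothetical, a `Counter` stands in for the grid-held counters, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import Counter


class BatchedCounterUpdater:
    """Aggregates word occurrences locally and flushes them periodically."""

    def __init__(self, shared_counters, interval_sec=1.0):
        self.shared = shared_counters   # stands in for counters held in the data grid
        self.interval_sec = interval_sec
        self.local = Counter()          # occurrences seen since the last flush
        self.last_flush = 0.0
        self.writes = 0                 # updates applied to the shared counters

    def on_words(self, words, now):
        self.local.update(words)
        if now - self.last_flush >= self.interval_sec:
            self.flush(now)

    def flush(self, now):
        for word, delta in self.local.items():
            self.shared[word] += delta  # one write per distinct word in the batch
            self.writes += 1
        self.local.clear()
        self.last_flush = now


shared = Counter()
updater = BatchedCounterUpdater(shared, interval_sec=1.0)
updater.on_words(["word1", "word2", "word1"], now=0.2)
updater.on_words(["word1", "word3"], now=0.7)
updater.on_words(["word2"], now=1.1)    # interval elapsed, the whole batch is flushed

print(dict(shared), updater.writes)
```

Six word occurrences produce three shared-counter updates here. The cost is that counters can lag by up to one interval.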


Page 36: Real Time Analytics for Big Data - A twitter inspired case study

Keep Your Eyes Open…


Page 37: Real Time Analytics for Big Data - A twitter inspired case study

References

Learn and fork the code on github: https://github.com/Gigaspaces/rt-analytics

Detailed blog post: http://bit.ly/gs-bigdata-analytics

Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html

Twitter Storm: http://bit.ly/twitter-storm

Apache S4: http://incubator.apache.org/s4/
