Real Time Analytics for Big Data - A Twitter-Inspired Case Study
Real Time Analytics for Big Data: A Twitter Case Study
Uri Cohen, Head of Product
@uri1803
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
We’re Living in a Real Time World…
Google RT Web Analytics
Google Real Time Search
Facebook Real Time Social Analytics
Twitter paid tweet analytics
Real Time User Engagement
New Real Time Analytics Startups
3 Flavors to Analytics
Counting
Correlating
Research
Analytics @ Twitter – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics: countries and cities, gender, age groups, device types, …
Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
Analytics @ Twitter – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
It’s All about Timing
“Real time” (< a few seconds)
Reasonably quick (seconds to minutes)
Batch (hours/days)
It’s All about Timing
• Event driven / stream processing: high resolution (every tweet gets counted)
• Ad-hoc querying: medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce): low resolution (trends & patterns)
This is what we’re here to discuss
Twitter in Numbers (March 2011)
It takes a week for users to send 1 billion Tweets.
On average, 140 million tweets get sent every day.
The highest throughput to date is 6,939 tweets/sec.
460,000 new accounts are created daily.
Source: http://blog.twitter.com/2011/03/numbers.html
Twitter in Numbers
5% of the users generate 75% of the content.
Source: http://www.sysomos.com/insidetwitter/
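To put the figures above on a per-second scale, here is a short back-of-the-envelope calculation. It uses only the numbers quoted on the slides; the variable names are illustrative. The average comes out at roughly 1,600 tweets/sec, so the quoted peak is a little over four times the average, and one billion tweets takes about a week at the average daily rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_day = 140_000_000   # average, from the March 2011 figures
peak_tweets_per_sec = 6_939    # highest throughput quoted on the slides

# Average rate a system must sustain, and how much headroom the peak needs.
avg_tweets_per_sec = tweets_per_day / SECONDS_PER_DAY
peak_to_avg = peak_tweets_per_sec / avg_tweets_per_sec

# Cross-check against the "1 billion tweets per week" figure.
days_to_billion = 1_000_000_000 / tweets_per_day

print(f"average rate: {avg_tweets_per_sec:.0f} tweets/sec")
print(f"peak / average: {peak_to_avg:.1f}x")
print(f"days to reach 1 billion tweets: {days_to_billion:.1f}")
```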
Challenge 1 – Computing Reach
Tweets → Followers → Distinct Followers → Count → Reach
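The reach computation can be sketched as a set union followed by a count. This is a minimal in-memory illustration, not the implementation from the talk: `compute_reach` and the `followers_of` mapping are hypothetical names, and a real system would fetch followers from a partitioned store.

```python
from typing import Dict, Iterable, Set


def compute_reach(tweeters: Iterable[str], followers_of: Dict[str, Set[str]]) -> int:
    """Count the distinct followers of everyone who tweeted about a topic."""
    distinct_followers: Set[str] = set()
    for user in tweeters:
        # Set union removes followers shared by several tweeters.
        distinct_followers |= followers_of.get(user, set())
    return len(distinct_followers)


# Hypothetical sample data: "dave" follows both alice and bob.
followers_of = {
    "alice": {"carol", "dave"},
    "bob": {"dave", "erin"},
}
reach = compute_reach(["alice", "bob", "alice"], followers_of)
print(reach)
```

The distinct step is what makes reach expensive at scale: follower lists for popular accounts are large, and the union has to be computed across all of them.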
Challenge 2 – Word Count
Tweets → Count? → Word:Count
• Hottest topics
• URL mentions
• etc.
URL Mentions – Here’s One Use Case
Analyze the Problem
(Tens of) thousands of tweets per second to process
Assumption: Need to process in near real time
Aggregate counters for each word
A few 10s of thousands of words (or hundreds of thousands if we include URLs)
System needs to linearly scale
It’s Really Just an Online Map/Reduce
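One way to read "online map/reduce" is that the map step runs per tweet as it arrives and the reduce step folds each result into running totals, instead of waiting for a batch. The sketch below illustrates that idea on a single process; `map_tweet` and `reduce_into` are hypothetical names and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_tweet(tweet: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one tweet."""
    for word in tweet.lower().split():
        yield word, 1


def reduce_into(counts: Counter, pairs: Iterable[Tuple[str, int]]) -> None:
    """Reduce step: fold the emitted pairs into the running totals."""
    for word, n in pairs:
        counts[word] += n


counts: Counter = Counter()
# Each tweet is processed as soon as it arrives, so the totals are always current.
for tweet in ["Big data is big", "real time data"]:
    reduce_into(counts, map_tweet(tweet))

print(counts.most_common(2))
```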
Event Driven Flow
Stages: Tokenize, Store Raw, Filter, Count, Store Counters, Research
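The stages named above can be chained as plain functions, each triggered by the arrival of a tweet. This is a single-process sketch under stated assumptions: the stop-word list, the tokenizing regular expression, and the in-memory `raw_store` and `counters` are all placeholders for whatever filter rules and stores a real deployment would use.

```python
import re
from collections import Counter
from typing import List

STOP_WORDS = {"a", "the", "is", "to", "of"}  # hypothetical filter list

raw_store: List[str] = []      # stands in for the raw tweet store
counters: Counter = Counter()  # stands in for the counter store


def store_raw(tweet: str) -> str:
    """Keep the unprocessed tweet so it is available for later research."""
    raw_store.append(tweet)
    return tweet


def tokenize(tweet: str) -> List[str]:
    """Split a tweet into lowercase tokens, keeping hashtags and mentions."""
    return re.findall(r"[\w#@']+", tweet.lower())


def filter_tokens(tokens: List[str]) -> List[str]:
    """Drop tokens that should not be counted."""
    return [t for t in tokens if t not in STOP_WORDS]


def count(tokens: List[str]) -> None:
    """Update the per-word counters."""
    counters.update(tokens)


def on_tweet(tweet: str) -> None:
    """Event handler: run one tweet through every stage in order."""
    count(filter_tokens(tokenize(store_raw(tweet))))


on_tweet("The grid is fast")
on_tweet("Fast data, fast grid")
print(counters)
```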
It’s Not That Simple…
(Tens of) thousands of tweets to tokenize every second, even more words to filter: CPU bottleneck
Tens/hundreds of thousands of counters to update: counter contention
Tens/hundreds of thousands of counters to persist: database bottleneck
(Tens of) thousands of tweets to store every second in the database: database bottleneck
Implementation Challenges
Twitter is growing by the day, how can this scale?
Fault tolerance. What happens if a server fails?
Consistency. Counters should be accurate!
Solutions, Solutions
Dealing with the Massive Tweet Stream
We need a system that could scale linearly
Event partitioning to the rescue!
What to partition by?
Tokenizer 1 → Filterer 1
Tokenizer 2 → Filterer 2
Tokenizer 3 → Filterer 3
Tokenizer n → Filterer n
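Event partitioning usually comes down to hashing a routing key and taking it modulo the number of partitions, so the same key always lands on the same Tokenizer/Filterer pair. The sketch below assumes the tweet id as the routing key, which is only one possible answer to "what to partition by?". The partition count and function name are illustrative. `zlib.crc32` is used because Python's built-in `hash()` for strings varies between runs.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a routing key to a partition index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Route each incoming event to the partition that owns its key.
partitions: Dict[int, List[str]] = defaultdict(list)
for tweet_id, text in [("1001", "hello"), ("1002", "world"), ("1001", "again")]:
    partitions[partition_for(tweet_id)].append(text)

for index, events in sorted(partitions.items()):
    print(index, events)
```

Applying the same function to a word instead of a tweet id would send every update for that word to one partition.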
Counters Persistence & Contention
Why not keep things in memory? Treat it as a SoR (system of record)
Tokenizer 1…n → Filterer 1…n → Counter Updater 1…n → IMDG (in-memory data grid)
Why not keep things in memory (data grid)?
Question: How to partition counters?
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
Database Bottleneck – Storing Raw Tweets
RDBMS won’t cut it (unless you have $$$$$)
Hadoop / NoSQL to the rescue
Use the data grid for batching and as a reliable transaction log
Persister 1, Persister 2, …, Persister n
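A persister that batches writes can be sketched as a buffer that is flushed to the backing store once it reaches a threshold. This is a simplified, single-threaded illustration: `BatchingPersister`, `write_batch`, and the batch size are hypothetical, and the plain Python list standing in for the data grid has none of a grid's replication guarantees. The buffer is cleared only after the write call returns, so a failed write leaves the tweets in place for a retry, which loosely mirrors the transaction-log role described above.

```python
from typing import Callable, List


class BatchingPersister:
    """Buffers raw tweets and writes them to the store in batches."""

    def __init__(self, write_batch: Callable[[List[str]], None], batch_size: int = 3):
        self._write_batch = write_batch  # e.g. a bulk insert into a NoSQL store
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def add(self, tweet: str) -> None:
        self._buffer.append(tweet)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        # Write a copy first; clear only if the write did not raise.
        self._write_batch(list(self._buffer))
        self._buffer.clear()


written: List[List[str]] = []  # stands in for the big data store
persister = BatchingPersister(written.append, batch_size=3)
for i in range(7):
    persister.add(f"tweet-{i}")
persister.flush()  # push out the final partial batch

print([len(batch) for batch in written])
```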
Counter Consistency & Resiliency
Primary/hot backup setup, sync replication
Transactional, atomic updates
“On Demand” backups
IMDG: primary partitions (P), each paired with a hot backup (B)
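The primary/hot-backup idea can be illustrated with two in-process copies of a partition: an update is acknowledged only after the backup has it, so promoting the backup loses nothing. This is a toy model under stated assumptions. The class and method names are hypothetical, and network failures, transactions, and concurrent writers are all ignored.

```python
from typing import Dict


class Partition:
    """One copy of a partition's counters."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}


class ReplicatedPartition:
    """A primary with a synchronously replicated hot backup."""

    def __init__(self) -> None:
        self.primary = Partition()
        self.backup = Partition()

    def increment(self, word: str, delta: int = 1) -> int:
        new_value = self.primary.counters.get(word, 0) + delta
        # Synchronous replication: apply to the backup before acknowledging.
        self.backup.counters[word] = new_value
        self.primary.counters[word] = new_value
        return new_value

    def fail_over(self) -> None:
        """Promote the backup and seed a fresh backup from it."""
        self.primary = self.backup
        self.backup = Partition()
        self.backup.counters = dict(self.primary.counters)


p = ReplicatedPartition()
p.increment("storm")
p.increment("storm", 2)
p.fail_over()  # simulate losing the primary
after_failover = p.increment("storm")
print(after_failover)
```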
Implementing Event Flows
Use the data grid as event bus
Stateful objects as transactional events
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter (in each primary partition P, replicated to its backup B)
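Treating stateful objects as events means each handler picks up objects in the state it cares about, processes them, and writes them back in the next state, which in turn triggers the next handler. The sketch below models that with a plain list in place of the data grid. The state names follow the slide, while `TweetEvent`, `drain`, and the length-based filter rule are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TweetEvent:
    """A stateful object whose `state` field decides which handler runs next."""
    text: str
    state: str = "RAW"
    tokens: List[str] = field(default_factory=list)


counts: Dict[str, int] = {}


def tokenizer(event: TweetEvent) -> None:
    event.tokens = event.text.lower().split()
    event.state = "TOKENIZED"


def filterer(event: TweetEvent) -> None:
    # Hypothetical rule: ignore very short tokens.
    event.tokens = [t for t in event.tokens if len(t) > 2]
    event.state = "FILTERED"


def counter(event: TweetEvent) -> None:
    for token in event.tokens:
        counts[token] = counts.get(token, 0) + 1
    event.state = "COUNTED"


# Each handler is registered for the state it consumes.
HANDLERS: Dict[str, Callable[[TweetEvent], None]] = {
    "RAW": tokenizer,
    "TOKENIZED": filterer,
    "FILTERED": counter,
}


def drain(space: List[TweetEvent]) -> None:
    """Keep dispatching until no object is in a state that has a handler."""
    pending = True
    while pending:
        pending = False
        for event in space:
            handler = HANDLERS.get(event.state)
            if handler is not None:
                handler(event)
                pending = True


space = [TweetEvent("Storm and S4 are stream processors")]
drain(space)
print(space[0].state, counts)
```

In a data grid each state change would be a transactional write, so a handler that fails midway leaves the object in its previous state to be picked up again.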
Putting It All Together
• Use an IMDG
• Partition events, counters
• Collocate event handlers with data for low latency and simple scaling
• Use atomic updates to update counters in memory
• Persist asynchronously to a big data store
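Atomic in-memory counter updates can be illustrated with a lock that makes each read-modify-write indivisible, so concurrent handlers never lose an increment. This is a minimal single-process sketch: `CounterPartition` is a hypothetical stand-in for one grid partition, and a real data grid would provide its own atomic or transactional update mechanism.

```python
import threading
from collections import Counter


class CounterPartition:
    """In-memory counters for one partition, updated atomically."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def increment(self, word: str, delta: int = 1) -> int:
        # The lock makes the read-modify-write a single atomic step.
        with self._lock:
            self._counts[word] += delta
            return self._counts[word]

    def get(self, word: str) -> int:
        with self._lock:
            return self._counts[word]


partition = CounterPartition()


def worker() -> None:
    for _ in range(1000):
        partition.increment("bigdata")


# Four concurrent handlers updating the same counter.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(partition.get("bigdata"))
```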
Reducing Counter Contention - Optimization
Store processed tweets locally in each partition
Use periodic batch writes to update counters
Process batches as a whole
(Diagram: word1, word2, word3 accumulated locally and written as one batch every 1 sec)
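The optimization can be sketched as local aggregation followed by one write per interval: words are tallied in a private buffer, and the shared counters are touched once per flush rather than once per word. `BatchedCounterUpdater` and its methods are hypothetical names. The flush is called by hand here, whereas a deployment would presumably trigger it from a timer such as the 1-second interval shown on the slide.

```python
from collections import Counter
from typing import Iterable


class BatchedCounterUpdater:
    """Aggregates word counts locally and applies them in periodic batches."""

    def __init__(self, shared_counters: Counter) -> None:
        self._shared = shared_counters
        self._pending: Counter = Counter()
        self.flushes = 0  # how many times the shared counters were touched

    def record(self, words: Iterable[str]) -> None:
        # Cheap local update with no contention on the shared counters.
        self._pending.update(words)

    def flush(self) -> None:
        if self._pending:
            # One batched update replaces many single increments.
            self._shared.update(self._pending)
            self._pending.clear()
            self.flushes += 1


shared: Counter = Counter()
updater = BatchedCounterUpdater(shared)
for tweet_words in (["word1", "word2"], ["word1", "word3"], ["word1"]):
    updater.record(tweet_words)
updater.flush()  # would normally be driven by a periodic timer

print(shared, updater.flushes)
```

The trade-off is freshness: counters lag by up to one flush interval in exchange for far fewer contended updates.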
Keep Your Eyes Open…
References
Learn and fork the code on GitHub: https://github.com/Gigaspaces/rt-analytics
Detailed blog post: http://bit.ly/gs-bigdata-analytics
Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
Twitter Storm: http://bit.ly/twitter-storm
Apache S4: http://incubator.apache.org/s4/