On Real-Time Twitter Analysis.

Mikio L. Braun, http://blog.mikiobraun.de, mikiobraun

twimpact UG (haftungsbeschränkt), http://twimpact.com

with Matthias Jugel thinkberg

Apache Hadoop Get Together, Berlin, April 28, 2012

Big Data and Data Science and Social Media

● There's a lot you can do with social media data

● Trend analysis (“trending topics”)

● Sentiment analysis

● Impact analysis (Klout, Kred, etc.)

● More general studies (diameter of network, distribution patterns, etc.)

● Types of data

● Event streams (Twitter stream)

● Graph data (user relationships, retweet networks)

● Text data (sentiment analysis, word clouds)

● URLs

● …

Social Media Streaming Data

● Examples

● Twitter firehose/sprinkler

● Click-through data

● bit.ly URL resolution requests

● Some numbers:

● up to a few thousand events per second

● events are small, up to a few kilobytes

What's in a Tweet?

[Figure: an annotated tweet with the labels Timestamp, Retweeting User, Retweeted User, Hashtag, User Mention, Keywords, Tweet, Retweeted Tweet]
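A minimal sketch of pulling these parts out of a tweet, assuming the JSON layout of Twitter's classic streaming API (`created_at`, `user.screen_name`, `retweeted_status`, `entities.hashtags`, `entities.user_mentions`). The sample tweet and the `extract_fields` helper are made up for illustration and are not taken from the talk.

```python
import json

# Hypothetical sample in the layout of Twitter's classic streaming API.
SAMPLE = json.dumps({
    "created_at": "Sat Apr 28 10:00:00 +0000 2012",
    "text": "RT @alice: Stream mining rocks #bigdata",
    "user": {"screen_name": "bob"},
    "entities": {
        "hashtags": [{"text": "bigdata"}],
        "user_mentions": [{"screen_name": "alice"}],
    },
    "retweeted_status": {
        "id": 1,
        "text": "Stream mining rocks #bigdata",
        "user": {"screen_name": "alice"},
    },
})


def extract_fields(raw):
    """Return the parts of a tweet that matter for retweet analysis."""
    tweet = json.loads(raw)
    original = tweet.get("retweeted_status")  # present only on retweets
    entities = tweet.get("entities", {})
    return {
        "timestamp": tweet["created_at"],
        "retweeting_user": tweet["user"]["screen_name"] if original else None,
        "retweeted_user": original["user"]["screen_name"] if original else None,
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "text": tweet["text"],
        "retweeted_text": original["text"] if original else None,
    }


fields = extract_fields(SAMPLE)
```

Keywords are not extracted here; they would need tokenization of `text`, which this sketch leaves out.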

TWIMPACT - Retweet trends

● Trending by retweet activity

● Robust matching of tweets even if shortened, edited (slightly)

● Compute trends for links, hashtags, URLs

● Aggregate TWIMPACT score for users
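The slides do not say how TWIMPACT matches shortened or edited retweets. As one possible illustration only, the sketch below normalizes both texts and accepts a match if one is a prefix of the other or if they are sufficiently similar. The helpers `normalize` and `same_tweet` and the 0.8 threshold are assumptions, not the authors' method.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip retweet markers, URLs, and punctuation."""
    text = text.lower()
    text = re.sub(r"\brt\s+@\w+:?", " ", text)   # "RT @user:" markers
    text = re.sub(r"https?://\S+", " ", text)    # links
    text = re.sub(r"[^\w\s#@]", " ", text)       # punctuation, ellipses
    return " ".join(text.split())


def same_tweet(a, b, threshold=0.8):
    """Guess whether a and b are copies of the same tweet."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return False
    short, long_ = sorted((na, nb), key=len)
    if long_.startswith(short):      # shortened copy
        return True
    # Slightly edited copy: fall back to a similarity ratio.
    return SequenceMatcher(None, na, nb).ratio() >= threshold


original = "Stream mining keeps the hot set in memory http://t.co/abc"
shortened = "RT @alice: Stream mining keeps the hot set..."
unrelated = "Cassandra compaction is slow today"
```

A production matcher would also need a minimum length for the prefix test and an index to avoid comparing every pair of tweets.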

How to scale stream processing?

History of approaches

● Started in June 2009

● Free Twitter stream (capped at 50 tweets/s)

[Table: Version 1, Version 2, Version 3 compared by Language and Storage backend; recoverable entry: Stream mining + in memory]

Putting it all in a database

● Insert millions of rows into the database

● Get reports by

SELECT id, COUNT(*) FROM events
WHERE created_at > … AND created_at < …
GROUP BY id
ORDER BY COUNT(*) DESC
LIMIT 100;

● Hardly real-time. Also, databases will become slower and slower...
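The database approach can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `events` table with an integer `id` and an integer `created_at`, the sample rows, and the `top_retweets` helper are all hypothetical, and the talk's own systems were not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created_at INTEGER)")
# Hypothetical events: (retweeted tweet id, unix timestamp)
rows = [(1, 100), (2, 101), (1, 102), (3, 103), (1, 104), (2, 250)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)


def top_retweets(conn, start, end, limit=100):
    """Count events per id inside the open interval (start, end)."""
    query = (
        "SELECT id, COUNT(*) AS n FROM events "
        "WHERE created_at > ? AND created_at < ? "
        "GROUP BY id ORDER BY n DESC, id LIMIT ?"
    )
    return conn.execute(query, (start, end, limit)).fetchall()


report = top_retweets(conn, 99, 200)
```

Every call regroups all rows in the time window, so the cost of a report grows with the amount of stored data.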

NoSQL: Cassandra

● Structure: Families → Tables → Rows → Key Value pairs

● Easy clustering (peer-to-peer configuration)

● Flexible consistency, read-repair, hinted handoff, etc.

● No locking, (in 0.6.x:) no support for indices, counters → complete rewrite

● Operations profile (about 50:50 read/write)

Cassandra: Multithreading

● Multithreading helps (but without locking support?)

[Figure: benchmark chart, axis label: Seconds; machine: Core i7, 4 cores (2 + 2 HT)]

Cassandra: Configuration

[Figure: chart annotated with "Compaction" and "Memtables, indexes, etc."]

Size of Memtable: 128M, JVM Heap: 3G, #CF: 12

Cassandra: Configuration

[Figure: chart annotated with "Compaction" and "“Big” GC"]

NoSQL/Cassandra - Summary

● Works quite well, faster than PostgreSQL (from 200 to 600 tps)

● Lack of locking/indices requires a lot of manual management

● Configuration messy

● 4 node cluster vs. single node: single node consistently 1.5 to 3 times faster!

● Ultimately, becomes slower and slower

● Doesn't handle deletions gracefully

Stream processing frameworks

● Stream processing = scalable actor based concurrency

● For example:

● Twitter's (BackType's) Storm https://github.com/nathanmarz/storm

● Yahoo's S4 http://incubator.apache.org/s4/

● Esper http://esper.codehaus.org/

● Streambase http://www.streambase.com

Stream processing - some thoughts

● Maximum throughput hard to estimate

● Not everything can be parallelized

● Scalable storage system still necessary

● How to deal with failure/congestion?

● Persistent messaging middleware not what you might want.

The DataSift infrastructure

http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html

● C++, PHP, Java/Scala, Ruby

● MySQL on SSDs, HBase (30 nodes, 400TB), memcached, Redis for some queues

● 0MQ, Kafka (LinkedIn)

● 936 CPU cores

● Analyzes 250 million tweets per day

● Peak throughput: 120,000 t/s

● monitoring & accounting

[Figure: processing pipeline with the stages Parse, Augment Content, Custom Filters, Delivery]

Throughput: 120,000 tweets per second

but: 120,000 / 936 = 128.2 tweets per second per core

Principles of Stream Processing

● Keep resource needs constant

● Control maximum processing rates

● Disks too slow, keep data in RAM
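One way to read the first two principles is a fixed-capacity in-memory buffer that drops events during bursts and hands out a capped batch per processing tick. The sketch below illustrates that reading; the `BoundedBuffer` class and its parameters are hypothetical, not a component described in the talk.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity in-memory buffer that sheds load instead of growing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        """Accept the event if there is room; otherwise drop and count it."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(event)
        return True

    def drain(self, max_events):
        """Hand out at most max_events, capping the processing rate per tick."""
        batch = []
        while self.items and len(batch) < max_events:
            batch.append(self.items.popleft())
        return batch


buf = BoundedBuffer(capacity=3)
accepted = [buf.offer(n) for n in range(5)]   # burst of 5 events
batch = buf.drain(max_events=2)               # one processing tick
```

Memory use is bounded by `capacity` regardless of the input rate, and the `dropped` counter makes the amount of shed load visible.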

Stream mining

[Figure: stream mining with a fixed number of slots]

● Focus on relevant data, discard the rest

● Provably approximates true counts

● Keep data in memory

Space Saving algorithm (Metwally, Agrawal, Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, International Conference on Database Theory, 2005.)
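A compact sketch of the Space Saving idea: keep a fixed number of counters, and when a new item arrives while all slots are taken, evict the smallest counter and let the new item inherit its count plus one. The class below is an illustration, not TWIMPACT's code; it finds the minimum by a linear scan, whereas the paper describes a dedicated data structure for doing this efficiently.

```python
class SpaceSaving:
    """Approximate top-k counting with a fixed number of slots."""

    def __init__(self, slots):
        self.slots = slots
        self.counts = {}   # item -> estimated count (never below true count)
        self.errors = {}   # item -> maximum overestimation of that count

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.slots:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count as its possible overestimation.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, n):
        """Return the n largest counters, ties broken by item name."""
        return sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


ss = SpaceSaving(slots=3)
for item in ["a", "a", "a", "b", "b", "c", "d", "a"]:
    ss.add(item)
```

For every monitored item, the true count lies between `counts[item] - errors[item]` and `counts[item]`, which is the sense in which the estimates approximate the true counts.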

TWIMPACT - Real-time Twitter Retweet Analysis

● Stream mining to keep a “hot set” of a few hundred thousand most active retweets in memory

● Secondary indices, bipartite graphs, object stores

● Write snapshots to disk for later analysis

● Up to several thousand tweets per second in single threaded operation.

2011 in Retweets

Our Analysis Pipeline

[Figure: pipeline diagram with the labels Tweets; JSON parsing; Thread 1 to Thread k; synchronized worker threads; Retweet Matching & Retweet Trends; single threaded; Snapshots; Trends; map reduce like]
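The mix of stages in the pipeline can be sketched as follows. The assignment of stages to threads (parallel JSON parsing, then single-threaded retweet counting, then a snapshot) is an assumption read off the diagram labels, and `run_pipeline` and the sample input are hypothetical.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw input lines; only the fields used below are included.
RAW = [
    json.dumps({"id": 10, "retweeted_status": {"id": 1}}),
    json.dumps({"id": 11, "retweeted_status": {"id": 1}}),
    json.dumps({"id": 12}),                      # not a retweet
    json.dumps({"id": 13, "retweeted_status": {"id": 2}}),
]


def run_pipeline(raw_lines, workers=4):
    # Stage 1: parse JSON on worker threads; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tweets = list(pool.map(json.loads, raw_lines))

    # Stage 2: single-threaded retweet counting, so no locks are needed.
    counts = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original is not None:
            counts[original["id"]] += 1

    # Stage 3: a snapshot is a plain copy that later jobs can analyze.
    return dict(counts)


snapshot = run_pipeline(RAW)
```

In CPython the global interpreter lock limits the speedup of the parsing stage, so the sketch shows the structure rather than the performance.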

Analyzing dependent trends (links/hashtags/etc.)

Most retweeted users

Most retweeted tweets

Social network buzz

Summary

● Many interesting challenges in social media.

● Many different data types, including streams.

● MapReduce doesn't really fit stream processing

● You can't just scale into real-time

● Principles of Stream Processing

● Bounded “hot set” of data in memory

● Mine stream, discard irrelevant data

● Real world applications often include a mixture of multithreading, stream processing, map reduce and single thread stages.
