On Real-Time Twitter Analysis.

Post on 14-Oct-2014

Slides for the talk given at the Apache Hadoop Get Together in Berlin on April 18, 2012.

Apache Hadoop Get Together, April 18, 2012, Berlin © 2012 TWIMPACT

On Real-Time Twitter Analysis

Mikio L. Braun, http://blog.mikiobraun.de, mikiobraun

twimpact UG (haftungsbeschränkt), http://twimpact.com

with Matthias Jugel thinkberg

Apache Hadoop Get Together, Berlin, April 18, 2012

Big Data and Data Science and Social Media

● There's a lot you can do with social media data

● Trend analysis (“trending topics”)

● Sentiment analysis

● Impact analysis (Klout, Kred, etc.)

● More general studies (diameter of network, distribution patterns, etc.)

● Types of data

● Event streams (Twitter stream)

● Graph data (user relationships, retweet networks)

● Text data (sentiment analysis, word clouds)

● URLs

● …

Social Media Streaming Data

● Examples

● Twitter firehose/sprinkler

● Click-through data

● bit.ly URL resolution requests

● Some numbers:

● up to a few thousand events per second

● events are small, up to a few kilobytes

What's in a Tweet?

[Figure: annotated tweet. Labels: Timestamp, Retweeting User, Retweeted User, Hashtag, Link, User Mention, Keywords, Tweet, Retweeted Tweet]

TWIMPACT - Retweet trends

● Trending by retweet activity

● Robust matching of tweets even if shortened, edited (slightly)

● Compute trends for links, hashtags, URLs

● Aggregate TWIMPACT score for users
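The slides do not say how TWIMPACT matches shortened or lightly edited retweets. The following is only a minimal sketch of one way such matching could work, assuming a hypothetical normalization step (strip leading "RT @user:" markers and a trailing ellipsis) followed by a prefix check and a similarity threshold. The function names, the 0.8 threshold, and the 20-character minimum are illustrative assumptions, not the authors' method.

```python
import re
from difflib import SequenceMatcher

# Hypothetical normalization rules; the talk does not describe the real ones.
RT_PREFIX = re.compile(r"^(?:rt\s+@\w+:?\s*)+", re.IGNORECASE)
TRAILING_ELLIPSIS = re.compile(r"(?:\.\.\.|…)\s*$")


def normalize(text: str) -> str:
    """Strip leading "RT @user:" markers and a trailing ellipsis,
    lowercase the text, and collapse runs of whitespace."""
    text = RT_PREFIX.sub("", text.strip())
    text = TRAILING_ELLIPSIS.sub("", text)
    return " ".join(text.lower().split())


def is_retweet_of(candidate: str, original: str, threshold: float = 0.8) -> bool:
    """Return True if candidate looks like a shortened or lightly
    edited copy of original (assumed heuristic)."""
    a, b = normalize(candidate), normalize(original)
    if not a or not b:
        return False
    # A shortened retweet is a (sufficiently long) prefix of the original.
    if len(a) >= 20 and b.startswith(a):
        return True
    # Otherwise fall back to a character-level similarity ratio.
    return SequenceMatcher(None, a, b).ratio() >= threshold


original = "Big data is not about the data, it is about the questions you ask"
shortened = "RT @alice: Big data is not about the data, it is about the..."
edited = "RT @bob: big data is not about data, it is about the questions you ask!"
unrelated = "Completely different tweet about cats"
```

Pairwise comparison like this does not scale to a full stream by itself; a real system would need an index over normalized texts to find candidates first.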

How to scale stream processing?

History of approaches

● Started in June 2009

● Free Twitter stream (capped at 50 tweets/s)

[Table: Version 1, Version 2, Version 3 by Language and Storage backend; surviving entry: Stream mining + in memory]

Putting it all in a database

● Insert millions of rows into a database

● Get reports by

    SELECT id, COUNT(*)
    FROM events
    WHERE created_at > … AND created_at < …
    GROUP BY id
    ORDER BY COUNT(*) DESC
    LIMIT 100;

● Hardly real-time. Also, databases become slower and slower...
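To make the reporting approach concrete, here is a small sketch that runs the same kind of query against an in-memory SQLite database. The two-column `events` layout, the integer timestamps, and the helper name `top_retweets` are assumptions for illustration; the slides only show the query itself.

```python
import sqlite3


def top_retweets(conn, start, end, limit=100):
    """Return (id, count) pairs for the ids seen most often
    strictly between the timestamps start and end."""
    query = """
        SELECT id, COUNT(*) AS n
        FROM events
        WHERE created_at > ? AND created_at < ?
        GROUP BY id
        ORDER BY n DESC, id ASC
        LIMIT ?
    """
    return conn.execute(query, (start, end, limit)).fetchall()


# Assumed schema: one row per retweet event, keyed by the retweeted tweet's id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT, created_at INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5), ("b", 6)],
)

report = top_retweets(conn, 0, 6)
```

Every report has to scan and group all events in the time window again, so the cost grows with the amount of stored data rather than with the size of the answer.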

NoSQL: Cassandra

● Structure: Families → Tables → Rows → key-value pairs

● Easy clustering (peer-to-peer configuration)

● Flexible consistency, read-repair, hinted handoff, etc.

● No locking, (in 0.6.x:) no support for indices, counters → complete rewrite

● Operations profile (about 50:50 read/write)

Cassandra: Multithreading

● Multithreading helps (but without locking support?)

[Plot: tweets per second over time (seconds) for 1, 2, 4, 8, 16, 32, and 64 threads; Core i7, 4 cores (2 + 2 HT)]

Cassandra: Configuration

[Plot annotations: Flush; Compaction; Memtables, indexes, etc.]

Size of Memtable: 128M, JVM Heap: 3G, #CF: 12

Cassandra: Configuration

[Plot: tweets per second, annotated with “Compaction” and “Big” GC]

NoSQL/Cassandra - Summary

● Works quite well, faster than PostgreSQL (from 200 to 600 tps)

● Lack of locking/indices requires a lot of manual management

● Configuration messy

● 4 node cluster vs. single node: single node consistently 1.5 to 3 times faster!

● Ultimately, becomes slower and slower

● Doesn't handle deletions gracefully

Stream processing frameworks

● Stream processing = scalable actor based concurrency

● For example:

● Twitter's (BackType's) Storm https://github.com/nathanmarz/storm

● Yahoo's S4 http://incubator.apache.org/s4/

● Esper http://esper.codehaus.org/

● Streambase http://www.streambase.com

Stream processing: some thoughts

● Maximum throughput hard to estimate

● Not everything can be parallelized

● Scalable storage system still necessary

● How to deal with failure/congestion?

● Persistent messaging middleware not what you might want.

The DataSift infrastructure

http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html

● C++, PHP, Java/Scala, Ruby

● MySQL on SSDs, HBase (30 nodes, 400TB), memcached, Redis for some queues

● 0MQ, Kafka (LinkedIn)

● 936 CPU cores

● Analyzes 250 million tweets per day

● Peak throughput: 120,000 t/s

● monitoring & accounting

[Diagram stages: Parse, Augment Content, Custom Filters, Delivery]

Throughput: 120,000 tweets per second

but: 120,000 / 936 = 128.2 tweets per second per core
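The per-core figure can be checked with a few lines of arithmetic. The sketch below also derives an average rate from the quoted 250 million tweets per day; that average and the per-core average are derived here and do not appear on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
peak_tweets_per_second = 120_000
cpu_cores = 936
tweets_per_day = 250_000_000

# Peak throughput divided evenly over all cores (the slide's calculation).
peak_per_core = peak_tweets_per_second / cpu_cores

# Derived here, not on the slide: average rate over a whole day.
average_tweets_per_second = tweets_per_day / SECONDS_PER_DAY
average_per_core = average_tweets_per_second / cpu_cores

print(f"peak per core:    {peak_per_core:.1f} tweets/s")
print(f"average overall:  {average_tweets_per_second:.1f} tweets/s")
print(f"average per core: {average_per_core:.2f} tweets/s")
```

Under these numbers the daily average is far below the quoted peak, so the per-core load in normal operation would be only a few tweets per second.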

Principles of Stream Processing

● Keep resource needs constant

● Control maximum processing rates

● Disks are too slow, keep data in RAM
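One way to read the first two principles is as a fixed-capacity buffer between the stream and the processing step. The sketch below is an illustration under that assumption: the class name `BoundedBuffer` and the policy of dropping new events when the buffer is full are choices made here, not something the talk prescribes.

```python
import queue


class BoundedBuffer:
    """Fixed-capacity buffer that sheds load instead of growing
    when the consumer falls behind (assumed policy: drop new events)."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, event) -> bool:
        """Try to enqueue an event; count and discard it if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_events: int) -> list:
        """Take at most max_events events, which caps the work per processing step."""
        batch = []
        while len(batch) < max_events:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buffer = BoundedBuffer(capacity=3)
accepted = [buffer.offer(i) for i in range(5)]
first_batch = buffer.drain(max_events=2)
second_batch = buffer.drain(max_events=10)
```

Memory use stays constant regardless of the input rate, and the `max_events` argument bounds how much work each step performs.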

Stream mining

[Figure: table with a fixed number of slots, each holding an item and its count]

● Focus on relevant data, discard the rest

● Provably approximates true counts

● Keep data in memory

Space Saving algorithm (Metwally, Agrawal, Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, International Conference on Database Theory, 2005.)
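The basic Space Saving update rule can be sketched in a few lines. This version is a simplified illustration: it finds the smallest counter with a linear scan, whereas the paper describes a dedicated data structure for constant-time updates. The class and attribute names are made up for this sketch and are not taken from TWIMPACT's code.

```python
class SpaceSaving:
    """Approximate frequent-item counting with a fixed number of slots."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}  # item -> estimated count (never below the true count)
        self.errors = {}  # item -> maximum possible overestimation

    def add(self, item) -> None:
        if item in self.counts:
            # Monitored item: just increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Free slot available: start monitoring the item exactly.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Table full: evict the item with the smallest count and
            # let the new item inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int) -> list:
        """Return the k items with the largest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
        return ranked[:k]


sketch = SpaceSaving(capacity=3)
stream = ["a", "a", "a", "b", "b", "c", "d", "a"]
for item in stream:
    sketch.add(item)
```

For each monitored item, the true count lies between `counts[item] - errors[item]` and `counts[item]`, which is how the estimate stays close to the true count while memory stays bounded.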

TWIMPACT: Real-time Twitter Retweet Analysis

● Stream mining to keep a “hot set” of a few hundred thousand most active retweets in memory

● Secondary indices, bipartite graphs, object stores

● Write snapshots to disk for later analysis

● Up to several thousand tweets per second in single-threaded operation.

2011 in Retweets

Our Analysis Pipeline

[Diagram labels: Tweets; JSON parsing; Thread 1 … Thread k; synchronized worker threads; Retweet Matching & Retweet Trends; single threaded; Snapshots (Day 1, Day 2, …, Day n); map reduce like; Analyzing dependent trends (links/hashtags/etc.); Trends]
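The diagram's layout did not survive extraction, so the following is only one plausible reading of it: JSON parsing on worker threads, a single-threaded stage that produces per-day snapshots, and a map-reduce-like pass over the snapshots. The field names `day` and `hashtags` are hypothetical and not the Twitter API's, and grouping by day stands in for the actual retweet matching.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(line: str) -> dict:
    """Stage 1 (worker threads): decode one raw JSON tweet."""
    return json.loads(line)


def run_pipeline(raw_lines, workers: int = 4):
    # Stage 1: parse on a thread pool; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tweets = list(pool.map(parse, raw_lines))

    # Stage 2: single-threaded pass that groups tweets into per-day
    # snapshots (a stand-in for retweet matching and trend computation).
    snapshots = {}
    for tweet in tweets:
        snapshots.setdefault(tweet["day"], []).append(tweet)

    # Stage 3: map-reduce-like analysis of dependent trends:
    # count hashtags per snapshot (map), then merge the counts (reduce).
    per_day = [
        Counter(tag for tweet in day_tweets for tag in tweet["hashtags"])
        for day_tweets in snapshots.values()
    ]
    totals = sum(per_day, Counter())
    return snapshots, totals


raw = [
    json.dumps({"day": 1, "hashtags": ["a", "b"]}),
    json.dumps({"day": 1, "hashtags": ["a"]}),
    json.dumps({"day": 2, "hashtags": ["b"]}),
]
snapshots, totals = run_pipeline(raw)
```

In CPython the thread pool mainly illustrates the structure, since JSON decoding holds the global interpreter lock; the staging, not the speedup, is the point of the sketch.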

Most retweeted users

Most retweeted tweets

Social network buzz

Summary

● Many interesting challenges in social media.

● Many different data types, including streams.

● MapReduce doesn't really fit stream processing

● You can't just scale into real-time

● Principles of Stream Processing

● Bounded “hot set” of data in memory

● Mine stream, discard irrelevant data

● Real world applications often include a mixture of multithreading, stream processing, map reduce and single thread stages.