Language Technologies Institute 1 Analysis of Social Media Trend Analysis Mohit Kumar Oct 31, 2007.

Language Technologies Institute

Analysis of Social Media

Trend Analysis

Mohit Kumar
Oct 31, 2007

Page 2: Roadmap

• The Predictive Power of Online Chatter – Gruhl et al., KDD’05
• Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends – Wang and McCallum, KDD’06
• Briefly:
  • Visualizing Tags over Time – Dubinko et al., WWW’06

Page 4: Motivation

• Demonstration of link between online content (blogs) and customer behavior (purchase decision)

• Predict spikes in sales rank based on online chatter

Page 5: Domain

• Sales rank of books on Amazon.com
• Postings in blogs, media and webpages

Page 6: Findings

• Hand-crafted queries produce matching posts whose volume predicts sales rank

• These queries can be automatically generated

• Successfully predict spikes in sales rank (not general sales rank motion)

Page 7: Causation

• Bloggers are most likely non-causative indicators of some other root cause (typically an event in the outside world)
• Possible explanations for the delay between postings and changes in sales rank, linked to profiling of bloggers:
  • Forward-thinking people who write and buy early but represent only a small fraction of the population
  • Representative of the population, but the threshold for writing about a product may be lower than the threshold for buying it

Page 8: Data Details

• IBM Web Fountain
  • 300K blogs
  • 200K postings per day
  • 3B web pages
  • 200K media articles per day (Factiva media feed)
• Amazon sales rank data
  • 2430 books
  • 480K sales rank readings
  • Duration – 120 days

Page 9: Correlation between Sales Rank and Blog Mentions

• Spike: the rank reaches a minimum m, and all the ranks that do not occur within a week of that minimum are “large enough”
• Large enough ~ at least max(m + 50, 1.5m)
• 50 books contain ‘spikes’ during the time interval of the study
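
The spike definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: one rank reading per day, "within a week" read as 7 positions either side of the first minimum, and "large enough" read as at least max(m + 50, 1.5m). The function name `has_spike` is made up for this sketch and does not come from the paper.

```python
from typing import Sequence


def has_spike(ranks: Sequence[float], window: int = 7) -> bool:
    """Check the slide's spike definition on a series of daily sales ranks.

    Assumptions (not from the paper): one reading per day, and the
    first occurrence of the minimum rank is taken as the spike day.
    """
    ranks = list(ranks)
    if not ranks:
        return False
    m = min(ranks)                      # best (lowest) rank reached
    i_min = ranks.index(m)              # day of the best rank
    threshold = max(m + 50, 1.5 * m)    # "large enough" bound
    # Ranks more than `window` days away from the minimum.
    outside = [r for i, r in enumerate(ranks) if abs(i - i_min) > window]
    # With no readings outside the window there is nothing to compare against.
    return bool(outside) and all(r >= threshold for r in outside)


print(has_spike([500] * 10 + [100] + [500] * 10))  # True
print(has_spike([100] * 21))                       # False
```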

Page 10: Example 1

Query: Lance Armstrong OR Tour de France

• Spike corresponds to Armstrong winning the Tour de France on July 25th

Page 11: Example 2

Query: What not to wear

• Two plausible factors:
  • Message board corresponding to a TV show started taking style questions
  • Authors had another book release

Page 12: Example 3

Query: Vanity Fair OR William Thackeray

• Spike coincides with movie released on Sep 1

Page 13: Cross correlation

Query: Lance Armstrong OR Tour de France
• Leading best lag

Query: What not to wear
• Leading best lag

Query: Vanity Fair OR William Thackeray
• Slightly trailing best lag

• Out of 50 books with spikes, 10 have highly correlated blog mentions
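
To make "best lag" concrete, here is a minimal sketch of lagged cross-correlation between a mention series and a sales series. It assumes two equally long daily series and uses plain Pearson correlation at each shift. The names `pearson` and `best_lag` are illustrative, and the sales series is assumed to be a signal where larger means more sales (for example, negated rank). A positive best lag means mentions lead sales, and a negative one means they trail.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lag(mentions: Sequence[float], sales: Sequence[float],
             max_lag: int = 14) -> Tuple[int, float]:
    """Return (lag, correlation) maximizing corr(mentions[t], sales[t + lag])."""
    best, best_r = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = mentions[: len(mentions) - lag], sales[lag:]
        else:
            x, y = mentions[-lag:], sales[: len(sales) + lag]
        n = min(len(x), len(y))
        if n < 3:                       # too little overlap to correlate
            continue
        r = pearson(x[:n], y[:n])
        if r > best_r:
            best, best_r = lag, r
    return best, best_r


# Toy data: the mention burst precedes the sales burst by two days.
mentions = [0, 0, 0, 5, 0, 0, 0, 0, 0, 0]
sales = [0, 0, 0, 0, 0, 5, 0, 0, 0, 0]
print(best_lag(mentions, sales, max_lag=4)[0])  # 2
```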

Page 14: Correlation between Sales Rank and Blog Mentions

Page 15: Factors affecting sales rank but not Blog mentions

• Marketing promotions
• Book release (filtered out books released in 2004 from their dataset)
• Wholesale purchase
• Lower ranking books get spikes but not public attention

Page 16: Separate Example - Trend

Query: The Notebook AND Nicholas Sparks

• High point following the release of the film – Jun 28 ’04
• Steadily falling

Page 17: Findings

• Hand-crafted queries produce matching posts whose volume predicts sales rank

• These queries can be automatically generated

• Successfully predict spikes in sales rank (not general sales rank motion)

Page 18: Automatically generated query

Query: Buster Olney

• Query generation based on author name
• Fairly simple
• Needs more exploration

Page 19: Automatically generated query

Scatter plot of cross-correlation versus lag for 182 automatically-generated queries.

Page 20: Findings

• Hand-crafted queries produce matching posts whose volume predicts sales rank

• These queries can be automatically generated

• Successfully predict spikes in sales rank (not general sales rank motion)

Page 21: Problem statement

• Given
  • Time series representing sales rank up to time t
• Is
  • Addition of blog mention data
• Helpful in predicting
  • Sales rank future trend?

Page 22: Predicting Motion (Sales up or down)

• 2-class classification problem
• Natural predictors
  • Moving averages
    • 63% accuracy – best classifier
  • Least squares predictors
    • 60% accuracy – best classifier
  • Markov predictor
    • 63% accuracy
• Not explicitly mentioned how the Blog mention data is used
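
The slides do not say how the moving-average predictor is set up, so the following is only one plausible reading, not the paper's method. It predicts "up" when the mean of the last `window` values exceeds the mean of the `window` values before them, and scores that guess against the actual next-step direction. Here "up" means a numeric increase in the series, so for raw sales rank an increase is a worse rank. All names and the window size are assumptions.

```python
from statistics import mean
from typing import Sequence


def predict_direction(history: Sequence[float], window: int = 3) -> int:
    """Return +1 (up) or -1 (down) by comparing two adjacent moving averages."""
    recent = mean(history[-window:])
    earlier = mean(history[-2 * window:-window])
    return 1 if recent > earlier else -1


def directional_accuracy(series: Sequence[float], window: int = 3) -> float:
    """Fraction of time steps whose actual direction matches the prediction."""
    hits = total = 0
    for t in range(2 * window, len(series)):
        actual = 1 if series[t] > series[t - 1] else -1
        if predict_direction(series[:t], window) == actual:
            hits += 1
        total += 1
    return hits / total if total else 0.0


# A steadily rising toy series is predicted perfectly by this rule.
print(directional_accuracy(list(range(20))))  # 1.0
```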

Page 23: Predicting Volatility (difference in sales by a threshold)

• Threshold chosen to indicate volatility – 44
• 72% accuracy – Best classifier
• Not explicitly mentioned how the Blog mention data is used

Page 24: Predicting Spikes

• Create labeled data for evaluation
  • Tag spikes
• Problem:
  • Given: Product, time t, blog mentions
  • Output: spike in near future (binary classification)

Page 25: Predicting Spikes – Heuristic Algorithm

• Three principles for finding spikes:
  • Biggest ever
  • Exceed historical averages significantly
  • Rise relatively quickly
• Translates into a function with 3 linear equations
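
The slide gives the three principles but not the equations themselves. The sketch below is a hypothetical reading in which each principle becomes one linear inequality on today's mention count. The coefficients `avg_factor` and `rise_factor`, the `lookback` of 3 days, and the function name are all made up for illustration and should not be taken as the paper's values.

```python
from typing import Sequence


def predicts_spike(mentions: Sequence[float], avg_factor: float = 2.0,
                   rise_factor: float = 1.5, lookback: int = 3) -> bool:
    """Flag a spike when today's mention count passes three linear tests.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if len(mentions) <= lookback:
        return False                     # not enough history to judge
    today = mentions[-1]
    past = mentions[:-1]
    # 1. Biggest ever: today exceeds every earlier count.
    biggest_ever = today > max(past)
    # 2. Exceeds the historical average by a clear margin.
    above_average = today > avg_factor * (sum(past) / len(past))
    # 3. Rose quickly compared with the count `lookback` days ago.
    rose_quickly = today > rise_factor * mentions[-1 - lookback]
    return biggest_ever and above_average and rose_quickly


print(predicts_spike([10, 12, 11, 10, 40]))  # True
print(predicts_spike([10, 12, 11, 10, 12]))  # False
```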

Page 26: Evaluation

• Not treated as binary classification at each time instance
• Predictions are instead scored as Leading or Trailing within a 2 week window
• Fairly weak/complicated evaluation
• 2/3 of the predictions made are Leading/Trailing predictions, so accuracy may be about 66%
• Recall ~ 0.5
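
One way to read this windowed evaluation: a prediction counts as correct if a tagged spike falls within two weeks of it (leading or trailing), and a spike counts as found if some prediction falls within two weeks of it. The sketch below implements that reading only; the matching rule, the function name, and the toy day numbers are assumptions chosen for illustration, not data or procedure from the paper.

```python
from typing import Sequence, Tuple


def window_precision_recall(predicted_days: Sequence[int],
                            spike_days: Sequence[int],
                            window: int = 14) -> Tuple[float, float]:
    """Precision and recall when a match means 'within `window` days'."""
    def near(day: int, others: Sequence[int]) -> bool:
        return any(abs(day - o) <= window for o in others)

    matched_predictions = sum(1 for p in predicted_days if near(p, spike_days))
    found_spikes = sum(1 for s in spike_days if near(s, predicted_days))
    precision = matched_predictions / len(predicted_days) if predicted_days else 0.0
    recall = found_spikes / len(spike_days) if spike_days else 0.0
    return precision, recall


# Toy day indices: two of three predictions land near a spike,
# and two of four spikes are found.
precision, recall = window_precision_recall([10, 50, 90], [12, 55, 200, 300])
print(round(precision, 2), recall)  # 0.67 0.5
```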

Page 27: Conclusion

• Online chatter ‘may’ represent an early indicator of real-world behavior

Page 28: Critique

• Preliminary work in exploring a significant/important problem
• Good roadmap for future research by decomposing the problem as follows:
  • Get ‘relevant’ blog mentions
  • Correlate mentions with sales rank

Page 29: Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends

KDD’06

Xuerui Wang, U.Mass, Amherst
Andrew McCallum, U.Mass, Amherst

Acknowledgement: Slides borrowed from Linda Buisman, Australian National University

Page 30: Motivation

• Information retrieval & text mining
  • Text is high-dimensional
• Topic models
  • Discover summaries of documents
  • Reduce dimensions
  • Model co-occurrences of words
    • mouse, cat, Tweety -> cartoons
    • mouse, keyboard -> computer supplies
• Topics over time
  • Co-occurrences are dynamic
  • Additional modality – time
    • united, states, war @ 1850 -> Mexican-American War
    • united, states, war @ 1918 -> World War I
    • united, states, war @ 2006 -> War in Iraq

Page 31: Modeling time

• Earlier approaches
  • Discretize
    • Fixed interval size does not fit all topics
  • Markov model
    • State at time t+1 depends on t, but not earlier
• Solution
  • Treat time as a continuous variable
  • Time is a parameter in a Bayesian network

Page 32: Bayesian network

• Generative model
  • vs discriminative (SVM, NN, …)
  • Bayes’ rule: P(H | X) = P(X | H) P(H) / P(X)
• Bayesian network
  • Directed graph of parameters
  • A connected to B:
    • Probability of B conditionally depends on A
• Generation step
  • Estimate conditional probabilities for all (hidden) parameters
• Goal
  • Predict probability of hypothesis H being true for observation X

Page 33: Topics-over-time model

• Based on an earlier topic model, LDA
• “Bag-of-words” approach
  • Word count in a document is significant
  • Position and order are not significant
• Timestamp of document becomes another parameter
• Generate Bayesian network from existing documents
  • Exact inference computationally infeasible
  • Use approximate inference
• Goal
  • Predict the probability of a document belonging to topic T
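
A small sketch of the bag-of-words representation mentioned above: only word counts are kept, so position and order are lost. The tokenizer (lowercase, ASCII letters only) and the function name are simplifying assumptions for this illustration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a document to word counts, discarding position and order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Counts are kept, so repeated words matter.
print(bag_of_words("the cat and the mouse")["the"])  # 2
# Order is discarded, so a phrase and its separate words look identical.
print(bag_of_words("Computer science") == bag_of_words("computer, science"))  # True
```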

Page 34: Language Technologies Institute 1 Analysis of Social Media Trend Analysis Mohit Kumar Oct 31, 2007.

Language Technologies Institute

34

Model
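
The diagram from this slide is not recoverable from the transcript. As a hedged reminder, the generative process usually given for Topics over Time extends LDA with a per-topic Beta distribution over normalized timestamps. The symbols below follow common LDA notation and are assumptions, since the slide's own notation is lost.

```latex
\begin{align*}
\phi_z   &\sim \mathrm{Dirichlet}(\beta)          && \text{for each topic } z \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)         && \text{for each document } d \\
z_{di}   &\sim \mathrm{Multinomial}(\theta_d)     && \text{for each word } i \text{ in document } d \\
w_{di}   &\sim \mathrm{Multinomial}(\phi_{z_{di}}) \\
t_{di}   &\sim \mathrm{Beta}(\psi_{z_{di}})       && \text{timestamp, normalized to } (0, 1)
\end{align*}
```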

Page 35: Results

Words associated with a topic

Distribution of topic over time

Page 36: Comparison with LDA

Figure panels: TOT, LDA

Figure annotations:
• Confuses Mexican war with WWI
• Confuses Panama Canal with other activities in Central America

Page 37: KL Divergence between topics

TOT topics are more distinct from each other
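
For reference, KL divergence between two topics can be computed from their word distributions. The sketch below assumes each topic is a dict from word to probability and floors missing probabilities at a small `eps` to avoid division by zero. That smoothing choice and the toy topics are assumptions for illustration, not taken from the paper.

```python
from math import log
from typing import Dict


def kl_divergence(p: Dict[str, float], q: Dict[str, float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) over words, flooring q's probabilities at `eps`."""
    total = 0.0
    for word, p_w in p.items():
        if p_w > 0:
            total += p_w * log(p_w / max(q.get(word, 0.0), eps))
    return total


# Toy topics: they share one word and differ on the other.
topic_a = {"war": 0.5, "mexico": 0.5}
topic_b = {"war": 0.5, "germany": 0.5}
print(kl_divergence(topic_a, topic_a))      # 0.0
print(kl_divergence(topic_a, topic_b) > 0)  # True
```

Larger values mean the two topics put their probability mass on different words, which is the sense in which topics are "more distinct".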

Page 38: Time Prediction

Task: predict the decade given the text of the State of the Union (SoU) address

Page 39: Topic Distribution Profile

NIPS dataset

Emphasis on Neural networks, analog circuits and cells

Emphasis on SVMs, Optimization, Probability and Inference

Page 40: Topic Co-occurrences over time

Co-occurrence of topics with the “classification” topic in NIPS dataset

Page 41: Analysis

• Generative vs discriminative methods
  • Discriminative usually faster
  • Accuracy depends on application
  • Generative model offers more information
    • E.g. not just topic(s) of a document, but also:
      • Predict time-stamp, given a document
      • Distribution of topics over time

Page 42: Analysis (cont)

• Limitations and simplifications
  • “Bag-of-words” instead of word sequences or phrases
    • Computer science vs computer, science
  • No account of position within document
    • Title, introduction, body, footnote