Trend Analysis

Trend Analysis Yi-Chia Wang LTI 2nd year Master student Analysis of Social Media


Transcript of Trend Analysis

Page 1: Trend Analysis

Trend Analysis

Yi-Chia Wang

LTI 2nd year Master student

Analysis of Social Media

Page 2: Trend Analysis

Oct-30 Analysis of Social Media 2007 2

Introduction
• Document streams
  • Arrive continuously over time
  • E-mail, news articles, search engine query logs, …
• Identify topics in document streams
  • Topic detection and tracking
  • Text mining
  • Visualization
  • …
• Is there a better organizing principle for the enormous archives of document streams?
  • Temporal information in document streams

Trausan-Matu et al., 2007

Page 3: Trend Analysis


“Burst of activity”
• Topics appear, grow in intensity for a period of time, and then fade away.
• Bursts correspond to points at which the intensity of message arrivals increases sharply
• Problems with naive identification of bursts
  • Easily identifying large numbers of short bursts
  • Fragmenting a long burst into many smaller ones
• Goal: identifying bursts only when they have sufficient intensity

Page 4: Trend Analysis

Bursty and Hierarchical Structure in Streams

Jon Kleinberg
Department of Computer Science, Cornell University
SIGKDD ‘02

Page 5: Trend Analysis


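Two-state Automaton (A) Model
• Idea: periods of lower message intensity interleave with periods of higher message intensity
• A begins in state q0
• A changes state with probability p
• When in state q0, messages are emitted at a slow rate; when in state q1, messages are emitted at a faster rate

[Figure: message intensity over time, alternating between states q0 and q1]

[Figure: state diagram of A; q0 and q1 switch to each other with probability p and stay put with probability 1-p]

$f_0(x) = \alpha_0 e^{-\alpha_0 x}$ (state q0), $f_1(x) = \alpha_1 e^{-\alpha_1 x}$ (state q1), with $\alpha_1 > \alpha_0$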
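The two-state model above can be simulated in a few lines. This is only a minimal sketch: the function name `simulate_two_state`, the parameter names, and the choice to flip the state before each emission are assumptions made for illustration, not something the slide specifies.

```python
import random

def simulate_two_state(n, p, alpha0, alpha1, seed=0):
    """Sample n inter-arrival gaps from a two-state automaton.

    Assumption: the automaton starts in q0 and, before each message,
    switches state with probability p (otherwise it stays put).
    """
    rng = random.Random(seed)
    state = 0                      # 0 = q0 (slow rate), 1 = q1 (fast rate)
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:       # change state with probability p
            state = 1 - state
        rate = alpha1 if state == 1 else alpha0
        gaps.append(rng.expovariate(rate))   # exponential gap with this rate
        states.append(state)
    return states, gaps

states, gaps = simulate_two_state(n=20, p=0.1, alpha0=1.0, alpha1=10.0)
print(states)
```

Gaps drawn while the automaton sits in q1 tend to be much shorter, which is what shows up as a burst in the arrival times.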

Page 6: Trend Analysis


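Exponential Distribution

• Modeling the message emission rate
• Modeling the time gap $x$ between messages $i$ and $i+1$
• Modeling by exponential distribution with parameter $\alpha$ being the rate of message arrivals

$f(x) = \alpha e^{-\alpha x}, \quad x \ge 0$

[Figure: exponential probability density plotted against x; source: Wikipedia]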
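As a small illustration of the exponential gap model, the sketch below evaluates the density and the expected gap $1/\alpha$. The helper names `exp_pdf` and `mean_gap` are made up for this example.

```python
import math

def exp_pdf(x, alpha):
    """Density of an exponential gap with arrival rate alpha (alpha > 0)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha * math.exp(-alpha * x) if x >= 0 else 0.0

def mean_gap(alpha):
    """Expected time between messages: the reciprocal of the arrival rate."""
    return 1.0 / alpha

# A higher arrival rate means shorter expected gaps between messages.
print(mean_gap(1.0), mean_gap(4.0))
```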

Page 7: Trend Analysis


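Two-state Automaton (A) Model
• Formally, given:
  • $n+1$ messages with specified arrival times
  • $\mathbf{x} = (x_1, x_2, \ldots, x_n)$: inter-arrival gaps
• We want to determine the conditional probability of a state sequence $\mathbf{q} = (q_{i_1}, \ldots, q_{i_n})$

$$\Pr[\mathbf{q} \mid \mathbf{x}] = \frac{\Pr[\mathbf{q}]\, f_{\mathbf{q}}(\mathbf{x})}{\sum_{\mathbf{q}'} \Pr[\mathbf{q}']\, f_{\mathbf{q}'}(\mathbf{x})}$$

$$f_{\mathbf{q}}(\mathbf{x}) = \prod_{t=1}^{n} f_{i_t}(x_t)$$

$$\Pr[\mathbf{q}] = p^{b} (1-p)^{n-b}, \quad b \text{ is the number of state transitions}$$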
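For very short gap sequences, the conditional probability can be checked by brute force: enumerate all $2^n$ state sequences, weight each by $p^b(1-p)^{n-b}$ times the product of the exponential densities, and normalize. This is a sketch for intuition only; the function name `posterior`, the default rates, and counting a transition out of the initial state q0 are assumptions.

```python
import itertools
import math

def posterior(gaps, p, alphas=(1.0, 4.0)):
    """Exact Pr[q | x] over all two-state sequences (feasible only for small n)."""
    n = len(gaps)
    weights = {}
    for q in itertools.product((0, 1), repeat=n):
        # b = number of state transitions, counting a move away from the start state q0
        b = sum(1 for prev, cur in zip((0,) + q, q) if prev != cur)
        prior = p ** b * (1 - p) ** (n - b)
        likelihood = math.prod(
            alphas[i] * math.exp(-alphas[i] * x) for i, x in zip(q, gaps)
        )
        weights[q] = prior * likelihood
    z = sum(weights.values())            # normalizing sum over all sequences q'
    return {q: w / z for q, w in weights.items()}

post = posterior([5.0, 5.0, 5.0], p=0.1)
print(max(post, key=post.get))
```

The enumeration grows exponentially in $n$, so it is useful only as a sanity check on tiny inputs.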

Page 8: Trend Analysis


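Two-state Automaton (A) Model
• Finding a state sequence q maximizing the probability

$$\mathbf{q}^{*} = \arg\max_{\mathbf{q}} \Pr[\mathbf{q} \mid \mathbf{x}]$$

• Equivalently, minimizing the following cost function:

$$c(\mathbf{q} \mid \mathbf{x}) = b \ln\left(\frac{1-p}{p}\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right)$$

• First term: favoring sequences with a small number of state transitions
• Second term: favoring state sequences that conform well to the sequence x of gap values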
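The minimization can be carried out with a standard Viterbi-style dynamic program over the two states. The sketch below is an illustration under stated assumptions, not the paper's code: the name `two_state_viterbi` is invented, and the automaton is assumed to start in q0 and to be allowed a (paid) transition before the first gap.

```python
import math

def two_state_viterbi(gaps, p, alpha0, alpha1):
    """Return (states, cost) minimizing b*ln((1-p)/p) + sum(-ln f_i(x_t))."""
    alphas = (alpha0, alpha1)
    switch = math.log((1 - p) / p)            # cost paid per state transition

    def emit(i, x):                           # -ln f_i(x) for an exponential density
        return alphas[i] * x - math.log(alphas[i])

    cost = [0.0, float("inf")]                # assumption: start in q0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in (0, 1):
            options = [cost[i] + (switch if i != j else 0.0) for i in (0, 1)]
            best = 0 if options[0] <= options[1] else 1
            new_cost.append(options[best] + emit(j, x))
            ptr.append(best)
        cost = new_cost
        back.append(ptr)

    j = 0 if cost[0] <= cost[1] else 1        # cheapest final state
    total = cost[j]
    states = [j]
    for ptr in reversed(back[1:]):            # follow back-pointers to recover the path
        j = ptr[j]
        states.append(j)
    states.reverse()
    return states, total

gaps = [1.0] * 5 + [0.05] * 10 + [1.0] * 5
print(two_state_viterbi(gaps, p=0.1, alpha0=1.0, alpha1=10.0)[0])
```

With these illustrative parameters, ten consecutive short gaps outweigh the cost of two transitions, whereas a single short gap among long ones does not; that is the "sufficient intensity" behavior the cost function is meant to produce.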

Page 9: Trend Analysis


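Infinite-state Automata Model
• Cost Function

$$c(\mathbf{q} \mid \mathbf{x}) = \left(\sum_{t=0}^{n-1} \tau(i_t, i_{t+1})\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right)$$

$$\tau(i, j) = \begin{cases} (j - i)\,\gamma \ln n, & j > i \\ 0, & j \le i \end{cases}$$

where $\gamma > 0$ controls the ease with which the automaton can change states.

$$\alpha_i = \hat{g}^{-1} s^{i} = \frac{n}{T}\, s^{i}$$

where $s > 1$ is a scaling parameter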
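The two ingredients of this cost function are easy to write down directly. In the sketch below, the function names are invented, and $T$ is assumed to be the total time spanned by the gaps, so that $n/T$ is the average arrival rate.

```python
import math

def transition_cost(i, j, gamma, n):
    """tau(i, j): moving up costs (j - i) * gamma * ln(n); moving down is free."""
    return (j - i) * gamma * math.log(n) if j > i else 0.0

def state_rates(gaps, s, k):
    """Rates alpha_i = (n / T) * s**i for the k lowest states.

    Assumption: T is the sum of the gaps (total time span).
    """
    n, total_time = len(gaps), sum(gaps)
    base = n / total_time
    return [base * s ** i for i in range(k)]

print(state_rates([1.0, 1.0, 1.0, 1.0], s=2.0, k=3))
print(transition_cost(0, 2, gamma=1.0, n=100))
```

State 0 emits at the stream's average rate, and each higher state is faster by a factor of $s$.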

Page 10: Trend Analysis


Computing a minimum-cost state sequence
• THEOREM: If q* is an optimal state sequence in $A^{*}_{s,\gamma}$, then it is also an optimal state sequence in $A^{k}_{s,\gamma}$
• Dynamic programming is used for searching an optimal state sequence
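Once the automaton is restricted to $k$ states, the search is an ordinary dynamic program over (time step, state) pairs. The following is a minimal sketch under assumptions: the function name `kleinberg_states` and its default parameter values are illustrative, $k$ is passed in by hand rather than derived from the data, and the automaton is assumed to start in state 0.

```python
import math

def kleinberg_states(gaps, s=2.0, gamma=1.0, k=4):
    """Minimum-cost state sequence for a k-state burst automaton."""
    n, total_time = len(gaps), sum(gaps)
    alphas = [(n / total_time) * s ** i for i in range(k)]

    def tau(i, j):                            # cost of moving from state i to state j
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    def emit(i, x):                           # -ln f_i(x)
        return alphas[i] * x - math.log(alphas[i])

    cost = [0.0] + [float("inf")] * (k - 1)   # assumption: start in state 0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in range(k):
            best = min(range(k), key=lambda i: cost[i] + tau(i, j))
            new_cost.append(cost[best] + tau(best, j) + emit(j, x))
            ptr.append(best)
        cost = new_cost
        back.append(ptr)

    j = min(range(k), key=lambda i: cost[i])  # cheapest final state
    states = [j]
    for ptr in reversed(back[1:]):            # trace back-pointers to the first gap
        j = ptr[j]
        states.append(j)
    states.reverse()
    return states

gaps = [10.0] * 10 + [0.1] * 20 + [10.0] * 10
print(kleinberg_states(gaps))
```

The run time is proportional to $n k^2$, which is why the restriction to finitely many states matters in practice.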

Page 11: Trend Analysis


Bursts exhibit a natural nested structure

A burst of intensity j is a maximal interval over which the state sequence stays in a state of index j or higher

Bursts can also be represented as a tree. Each burst is a node in the tree
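Given a state sequence, the nested bursts can be read off level by level. This is a small sketch; the function name `extract_bursts` and the `(level, start, end)` tuple layout are choices made for this example.

```python
def extract_bursts(states):
    """Return (level, start, end) for every maximal run with state >= level."""
    bursts = []
    top = max(states, default=0)
    for level in range(1, top + 1):
        start = None
        for t, q in enumerate(states):
            if q >= level and start is None:
                start = t                              # a burst of this level opens
            elif q < level and start is not None:
                bursts.append((level, start, t - 1))   # the burst closes
                start = None
        if start is not None:                          # burst still open at the end
            bursts.append((level, start, len(states) - 1))
    return bursts

print(extract_bursts([0, 1, 1, 2, 2, 1, 0, 1, 0]))
# → [(1, 1, 5), (1, 7, 7), (2, 3, 4)]
```

Every burst of level j+1 lies inside exactly one burst of level j, so linking each burst to the enclosing one at the level below gives the tree representation.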

Page 12: Trend Analysis


Experiments
• The model makes sense for many datasets (of an analogous flavor)
  • Email
  • Titles of conference papers
  • U.S. Presidential State of the Union Addresses
  • Web clickstreams

Page 13: Trend Analysis


Email Dataset
• Does the appearance of messages containing particular words exhibit a burst in the vicinity of significant times such as deadlines?
• Author’s own collection of email
  • June 9, 1997 – August 23, 2001
  • 34344 messages (41.7 MB)
  • Focusing on the response set

Page 14: Trend Analysis


Results for the Word - ITR
• ITR is the name of a large NSF program
• The author wrote 2 proposals for it in 1999-2000; one is a small proposal while the other is a large one
• The intervals are annotated with the first and last dates of the messages
• The first subtree splits further into 2 subtrees
• For the 2nd subtree, there is no burst since the author did not continue the submission

Page 15: Trend Analysis


Results for the Word - prelim
• Prelim is the term used at Cornell for non-final exams
• The author taught courses in 4 of the 8 semesters covered by the collection of email, and each of these courses had 2 prelims
• For the first of these courses, there was a special course email account
• Each of the remaining 3 courses corresponds to a long burst and 2 shorter, more intense bursts for the particular prelims

The 2 structures suggest how a large folder of email might naturally be divided into a hierarchical set of sub-folders around certain key events, based only on the rate of message arrivals

Page 16: Trend Analysis


Titles of Conference Papers
• Goal: extracting bursts in term usage from the titles of conference papers over the past several decades
• Problem: conference papers arrive in discrete batches every six months or every year, so there are no message inter-arrival gaps
• Modified automaton model:
  • Generating batched arrivals
  • For each state, there is an expected fraction of relevant documents
  • A burst is identified if the fraction of relevant documents increases

Page 17: Trend Analysis


Titles of Conference Papers
• Cost function for each arrival batch:

$$\sigma(i, r_t, d_t) = -\ln\left[\binom{d_t}{r_t}\, p_i^{r_t} (1 - p_i)^{d_t - r_t}\right]$$

where the $t$-th batch contains $r_t$ relevant documents out of a total of $d_t$

• The weight of the burst $[t_1, t_2]$: the improvement in cost by using state q1 rather than state q0

$$\sum_{t=t_1}^{t_2} \left(\sigma(0, r_t, d_t) - \sigma(1, r_t, d_t)\right)$$
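The batch cost and the burst weight translate directly into code. In this sketch the function names, the `(r, d)` tuple layout, and the example probabilities `probs = (0.1, 0.3)` are all illustrative assumptions; they are not values taken from the experiments.

```python
import math

def sigma(i, r, d, probs):
    """Cost of explaining a batch with r relevant documents out of d in state i."""
    p = probs[i]
    log_binomial = math.log(math.comb(d, r)) + r * math.log(p) + (d - r) * math.log(1 - p)
    return -log_binomial

def burst_weight(batches, t1, t2, probs):
    """Improvement in cost from using state q1 instead of q0 over batches t1..t2."""
    return sum(
        sigma(0, r, d, probs) - sigma(1, r, d, probs)
        for r, d in batches[t1 : t2 + 1]
    )

probs = (0.1, 0.3)                       # hypothetical relevant fractions for q0 and q1
batches = [(1, 10), (5, 10), (6, 10), (1, 10)]
print(burst_weight(batches, 1, 2, probs))
```

A positive weight means the higher state explains the interval better; batches whose relevant fraction is near the baseline contribute a negative amount.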

Page 18: Trend Analysis


SIGMOD & VLDB, 1975-2001
• Considering each word in paper titles
• The 30 bursts of highest weight
• For bursts with no ending date, the interval extends to the most recent conference
• These bursty words are different from a list of common words
• The bursts are picking up trends in language use

Page 19: Trend Analysis


STOC & FOCS, 1969-2001

• The 30 bursts of highest weight
• Particular titling conventions that were in fashion for certain periods
  • “How to construct random functions”
  • …

Page 20: Trend Analysis


U.S. Presidential State of the Union Addresses

Kleinberg, SIGKDD ‘02

Page 21: Trend Analysis


Web usage data – clickstreams
• Settings:
  • 80 undergraduate students
  • Two and a half months in Spring 2000
  • For every URL w, all bursts in the stream of visits to w are determined
  • Focusing on high-weighted bursts as well as those that involve at least 10 distinct users
• Results:
  • High-ranked bursts involve the URLs of the online class reading assignments, centered on intervals shortly before and during the weekly sessions at which they were discussed

Page 22: Trend Analysis


Conclusions
• Modeling streams using an infinite-state automaton
• State transitions lead to bursts
• First story detection: a single message on which the associated state transition occurred
• The model offers a means of structuring the information from our patterns of interacting and communicating
• Document streams have a strong temporal character
• In many domains, we are accumulating detailed records of our own communication and behavior