
An N-gram Topic Model for Time-Stamped Documents

Shoaib Jameel and Wai Lam

The Chinese University of Hong Kong

Shoaib Jameel and Wai Lam ECIR-2013, Moscow, Russia

Outline

Introduction and Motivation
  - The Bag-of-Words (BoW) assumption
  - Temporal nature of data

Related Work
  - Temporal Topic Models
  - N-gram Topic Models

Overview of our model
  - Background
      - Topics Over Time (TOT) Model - proposed earlier
      - Our proposed n-gram model

Empirical Evaluation

Conclusions and Future Directions


The ‘popular’ Bag-of-Words Assumption

Many works in the topic modeling literature assume exchangeability among the words. As a result, they generate ambiguous words in topics. For example, consider a few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1        Topic 2    Topic 3        Topic 4     Topic 5
architecture   order      connectionist  potential   prior
recurrent      first      role           membrane    bayesian
network        second     binding        current     data
module         analysis   structures     synaptic    evidence
modules        small      distributed    dendritic   experts

The problem with the LDA model
Words in topics are not insightful.
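For illustration only (not part of the original slides), here is a minimal sketch of how such unigram topics can be produced, assuming the gensim library and toy stand-in documents rather than the actual NIPS collection:

```python
from gensim import corpora, models

# Toy stand-in documents; the real experiment in the slides used the NIPS collection.
docs = [["recurrent", "network", "hidden", "units"],
        ["membrane", "synaptic", "dendritic", "potential"],
        ["bayesian", "prior", "evidence", "data"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]   # bag-of-words: word order is discarded here

lda = models.LdaModel(bow, num_topics=3, id2word=dictionary, passes=10, random_state=0)
for k in range(lda.num_topics):
    # Each topic is only a ranked list of unigrams, hence the ambiguity discussed above.
    print(k, lda.show_topic(k, topn=5))
```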



The problem with the bag-of-words assumption

1 The logical structure of the document is lost. For example, we do not know whether “the cat saw a dog” or “a dog saw a cat”.

2 Computational models cannot tap the extra word-order information inherent in the text, which affects performance.

3 The usefulness of maintaining word order has also been illustrated in Information Retrieval, Computational Linguistics, and many other fields.




Why capture topics over time?

1 We know that data evolves over time.
2 What people are talking about today, they may not be talking about tomorrow or a year from now.

[Figure: examples of topics trending on Wikipedia in different years (2010, 2011, 2012), e.g. Burj Khalifa, the Manila hostage crisis, a volcano, the Iraq War, the N.Z. earthquake, Osama bin Laden, the Higgs boson, the Gaza Strip, Sachin Tendulkar, China, and Apple Inc.]

3 Models such as LDA do not capture such temporal characteristics in data.


Related Work: Temporal Topic Models

Discrete-time assumption models
  - Blei and Lafferty (2006) - Dynamic Topic Models - assume that the topics in one year depend on the topics of the previous year.
  - Knights, Mozer, and Nicolov (2009) - Compound Topic Model - train a topic model on the most recent K months of data.

The problem here
One needs to select an appropriate time-slice value manually. The question is which time slice should be chosen: day, month, year, etc.?



Related Work: Temporal Topic Models

Continuous-time topic models
  - Kawamae (2011) - Trend Analysis Model - the model has a probability distribution over temporal words and topics, and a continuous distribution over time.
  - Nodelman, Shelton, and Koller (2002) - Continuous Time Bayesian Networks - build a graph in which each node holds a variable whose value changes over time.

The problem with the above models
All assume the notion of exchangeability and thus lose important collocation information inherent in the document.



Related Work: N-gram Topic Models

1 Wallach (2006) - Bigram Topic Model - maintains word order during the topic generation process, but generates only bigrams in topics.

2 Griffiths, Steyvers, and Tenenbaum (2007) - LDA Collocation Model - introduces binary random variables that decide whether to generate a unigram or a bigram.

3 Wang, McCallum, and Wei (2007) - Topical N-gram Model - extends the LDA Collocation Model and gives a topic assignment to every word in a phrase.

The problem with the above models
They cannot capture the temporal dynamics in the data.



Topics Over Time (TOT) (Wang and McCallum, 2006)

1 Our model extends this model.
2 It assumes the notion of word and topic exchangeability.

Generative Process
1 Draw T multinomials φ_z from a Dirichlet prior β, one for each topic z.
2 For each document d, draw a multinomial θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in the document d:
    1 Draw a topic z_i^(d) from Multinomial(θ^(d)).
    2 Draw a word w_i^(d) from Multinomial(φ_{z_i^(d)}).
    3 Draw a time-stamp t_i^(d) from Beta(Ω_{z_i^(d)}).

Fig. 1. The TOT model (plate diagram with variables α, θ, z, w, t, φ, β, Ω; plates over D documents, N_d words per document, and T topics).
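For concreteness, a minimal sketch (not the authors' code; hyperparameter values and corpus sizes are arbitrary assumptions) of this generative process:

```python
# Sketch of the TOT generative process: each topic has a word distribution phi_z
# and a Beta distribution Omega_z over normalized time-stamps.
import numpy as np

rng = np.random.default_rng(0)
T, W, D, N_d = 5, 1000, 10, 50                       # topics, vocabulary, documents, words per doc
alpha, beta = 0.1, 0.01                              # symmetric Dirichlet hyperparameters
phi = rng.dirichlet(np.full(W, beta), size=T)        # per-topic word distributions
Omega = rng.uniform(0.5, 5.0, size=(T, 2))           # per-topic Beta(t | Omega_z1, Omega_z2) parameters

docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))         # per-document topic proportions
    words, times = [], []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)                   # draw a topic
        w = rng.choice(W, p=phi[z])                  # draw a word from that topic
        t = rng.beta(Omega[z, 0], Omega[z, 1])       # draw a normalized time-stamp from the topic's Beta
        words.append(w); times.append(t)
    docs.append((words, times))
```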

Topics Over Time Model (TOT)

1 The model assumes a continuous distribution over time associated with each topic.

2 Topics are responsible for generating both the observed time-stamps and the words.

3 The model does not capture the sequence of state changes with a Markov assumption.




Topics Over Time Model (TOT): Posterior Inference

1 In Gibbs sampling, we compute the conditional:

P(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \Omega) \quad (1)

2 We can thus write the updating equation as:

P(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \Omega) \;\propto\;
\bigl(m_{z_i^{(d)}} + \alpha_{z_i^{(d)}} - 1\bigr)
\times \frac{n_{z_i^{(d)} w_i^{(d)}} + \beta_{w_i^{(d)}} - 1}{\sum_{v=1}^{W} \bigl(n_{z_i^{(d)} v} + \beta_v\bigr) - 1}
\times \frac{\bigl(1 - t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 1} - 1} \bigl(t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 2} - 1}}{B\bigl(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2}\bigr)} \quad (2)
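A rough sketch of evaluating the unnormalized conditional in Equation (2) for one word position; the count-array layout and the convention that counts exclude the current token (which absorbs the "-1" terms) are my assumptions, not the authors' code:

```python
import numpy as np
from scipy.special import beta as beta_fn   # the Beta function B(a, b)

def tot_conditional(w, t, d, m_dz, n_zw, alpha, beta, Omega):
    """Return the unnormalized P(z | w, t, ...) over all T topics.

    w, t, d : word id, normalized time-stamp in (0, 1), document index
    m_dz    : (D, T) topic counts per document, current token excluded
    n_zw    : (T, W) word counts per topic, current token excluded
    alpha, beta : scalar symmetric Dirichlet hyperparameters
    Omega   : (T, 2) per-topic Beta parameters
    """
    doc_term = m_dz[d] + alpha                                        # document-topic factor
    word_term = (n_zw[:, w] + beta) / (n_zw.sum(axis=1) + beta * n_zw.shape[1])
    time_term = ((1 - t) ** (Omega[:, 0] - 1) * t ** (Omega[:, 1] - 1)
                 / beta_fn(Omega[:, 0], Omega[:, 1]))                 # Beta time density per topic
    return doc_term * word_term * time_term

# A new topic assignment would then be drawn with probabilities proportional to this vector.
```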


Our Model: N-gram Topics Over Time Model

1 The model assumes a continuous distribution over time associated with each topic.

2 Topics are responsible for generating both the observed time-stamps and the words.

3 The model does not capture the sequence of state changes with a Markov assumption.

4 It maintains the order of words during the topic generation process.
5 It generates words as unigrams, bigrams, etc. in topics.
6 This results in more interpretable topics.



Graphical Model: N-gram Topics Over Time Model

Fig. 1. Our model (plate diagram with word positions w_{i-1}, w_i, w_{i+1}, topic assignments z_{i-1}, z_i, z_{i+1}, switch variables x_i, x_{i+1}, x_{i+2}, and time-stamps t_{i-1}, t_i, t_{i+1}; parameters θ with prior α, φ with prior β, σ with prior δ, ψ with prior γ, and Ω; plates over D documents, T topics, and TW topic-word pairs).


Generative Process: N-gram Topics Over Time Model

Draw Discrete(φ_z) from Dirichlet(β) for each topic z;
Draw Bernoulli(ψ_zw) from Beta(γ) for each topic z and each word w;
Draw Discrete(σ_zw) from Dirichlet(δ) for each topic z and each word w;
For every document d, draw Discrete(θ^(d)) from Dirichlet(α);
foreach word w_i^(d) in document d do
    Draw x_i^(d) from Bernoulli(ψ_{z_{i-1}^(d) w_{i-1}^(d)});
    Draw z_i^(d) from Discrete(θ^(d));
    Draw w_i^(d) from Discrete(σ_{z_i^(d) w_{i-1}^(d)}) if x_i^(d) = 1;
    otherwise, draw w_i^(d) from Discrete(φ_{z_i^(d)});
    Draw a time-stamp t_i^(d) from Beta(Ω_{z_i^(d)});
end
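A minimal runnable sketch of this generative process for a single document (my own illustration, with symmetric priors and arbitrary sizes, not the authors' implementation); x_i decides whether w_i continues a phrase started by w_{i-1}:

```python
import numpy as np

rng = np.random.default_rng(1)
T, W, N_d = 3, 500, 40
alpha, beta_h, gamma, delta = 0.1, 0.01, 1.0, 0.01

phi = rng.dirichlet(np.full(W, beta_h), size=T)          # unigram word distributions, per topic
psi = rng.beta(gamma, gamma, size=(T, W))                # P(x_i = 1 | z_{i-1}, w_{i-1})
sigma = rng.dirichlet(np.full(W, delta), size=(T, W))    # bigram distributions, per (topic, previous word)
Omega = rng.uniform(0.5, 5.0, size=(T, 2))               # per-topic Beta time parameters

theta = rng.dirichlet(np.full(T, alpha))                 # topic proportions for one document d
doc, z_prev, w_prev = [], None, None
for i in range(N_d):
    # For the first word there is no previous word, so force the unigram route (x = 0).
    x = 0 if i == 0 else rng.binomial(1, psi[z_prev, w_prev])
    z = rng.choice(T, p=theta)
    w = rng.choice(W, p=sigma[z, w_prev] if x == 1 else phi[z])
    t = rng.beta(Omega[z, 0], Omega[z, 1])               # normalized time-stamp for this token
    doc.append((w, z, x, t))
    z_prev, w_prev = z, w
```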


Posterior Inference: Collapsed Gibbs Sampling

P(z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{x}_{\neg i}^{(d)}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \gamma, \delta, \Omega) \;\propto\;
\bigl(\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i} - 1\bigr)
\times \bigl(\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1\bigr)
\times \frac{\bigl(1 - t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 1} - 1} \bigl(t_i^{(d)}\bigr)^{\Omega_{z_i^{(d)} 2} - 1}}{B\bigl(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2}\bigr)}
\times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} \bigl(\beta_v + n_{z_i^{(d)} v}\bigr) - 1} & \text{if } x_i^{(d)} = 0 \\[2ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} \bigl(\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}\bigr) - 1} & \text{if } x_i^{(d)} = 1
\end{cases} \quad (3)

Posterior Estimates

\theta_z^{(d)} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T} (\alpha_t + q_{dt})} \quad (4)

\phi_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W} (\beta_v + n_{zv})} \quad (5)

\psi_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k=0}^{1} (\gamma_k + p_{zwk})} \quad (6)

\sigma_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v=1}^{W} (\delta_v + m_{zwv})} \quad (7)

\Omega_{z1} = \bar{t}_z \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \quad (8)

\Omega_{z2} = (1 - \bar{t}_z) \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \quad (9)

Here \bar{t}_z and s_z^2 denote the sample mean and sample variance of the time-stamps currently assigned to topic z (method of moments).
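A rough sketch of how the joint conditional in Equation (3) could be evaluated for a single word position; the variable names and the convention that count arrays exclude the current token (absorbing the "-1" terms) are assumptions on my part, not the authors' code:

```python
import numpy as np
from scipy.special import beta as beta_fn   # the Beta function B(a, b)

def ngram_tot_conditional(w, w_prev, z_prev, t, d,
                          q_dz, p_zwx, n_zw, m_zwv,
                          alpha, beta, gamma, delta, Omega):
    """Return an unnormalized (T, 2) table over topics z (rows) and the switch x (columns).

    q_dz  : (D, T) document-topic counts        p_zwx : (T, W, 2) switch counts
    n_zw  : (T, W) unigram counts               m_zwv : (T, W, W) bigram counts
    alpha, beta, gamma, delta : scalar symmetric hyperparameters; Omega : (T, 2)
    """
    T, W = n_zw.shape
    time_term = ((1 - t) ** (Omega[:, 0] - 1) * t ** (Omega[:, 1] - 1)
                 / beta_fn(Omega[:, 0], Omega[:, 1]))                       # Beta time density per topic
    doc_term = q_dz[d] + alpha                                              # document-topic factor
    uni = (n_zw[:, w] + beta) / (n_zw.sum(axis=1) + W * beta)               # x = 0: unigram model phi
    bi = (m_zwv[:, w_prev, w] + delta) / (m_zwv[:, w_prev, :].sum(axis=1) + W * delta)  # x = 1: bigram model sigma
    switch = gamma + p_zwx[z_prev, w_prev, :]                               # factor from psi's Beta prior
    table = np.empty((T, 2))
    table[:, 0] = switch[0] * doc_term * time_term * uni
    table[:, 1] = switch[1] * doc_term * time_term * bi
    return table
```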


Inference Algorithm

Input: γ, δ, α, T, β, Corpus, MaxIteration
Output: Topic assignments for all the n-gram words, with temporal information

Initialization: randomly initialize the n-gram topic assignments for all words;
Zero all count variables;
for iteration ← 1 to MaxIteration do
    for d ← 1 to D do
        for w ← 1 to N_d, following the word order, do
            Draw z_w^(d), x_w^(d) as defined in Equation 3;
            if x_w^(d) = 0 then
                Update n_zw;
            else
                Update m_zw;
            end
            Update q_dz, p_zw;
        end
    end
    for z ← 1 to T do
        Update Ω_z by the method of moments as in Equations 8 and 9;
    end
end
Compute the posterior estimates of θ, φ, ψ, σ as defined in Equations 4, 5, 6, 7;
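The per-topic Beta parameters Ω_z in the loop above are refit by the method of moments (Equations 8 and 9); a small sketch, assuming the time-stamps are normalized to the interval (0, 1):

```python
import numpy as np

def update_omega(timestamps_z):
    """timestamps_z: 1-D array of normalized time-stamps currently assigned to topic z."""
    t_bar = timestamps_z.mean()                  # sample mean of the assigned time-stamps
    s2 = timestamps_z.var()                      # sample variance of the assigned time-stamps
    common = t_bar * (1.0 - t_bar) / s2 - 1.0
    omega_z1 = t_bar * common                    # Eq. (8)
    omega_z2 = (1.0 - t_bar) * common            # Eq. (9)
    return omega_z1, omega_z2
```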


Empirical Evaluation: Data Sets

We have conducted experiments on two datasets:
1 U.S. Presidential State-of-the-Union speeches [1] from 1790 to 2002.
2 NIPS conference papers - the original raw NIPS dataset [2] consists of 17 years of conference papers, but we supplemented it with some new raw NIPS documents [3], giving 19 years of papers in total.

Preprocessing
1 Removed stopwords.
2 Did not perform word stemming.

[1] http://infomotions.com/etexts/gutenberg/dirs/etext04/suall11.txt
[2] http://www.cs.nyu.edu/~roweis/data.html
[3] http://ai.stanford.edu/~gal/Data/NIPS/


Qualitative Results

[Figure: histograms over the years 1800-2000 of the "Mexican War" topic under our model and under TOT.]

Our Model (Mexican War):
1. east bank                8. military
2. american coins           9. general herrera
3. mexican flag            10. foreign coin
4. separate independent    11. military usurper
5. american commonwealth   12. mexican treasury
6. mexican population      13. invaded texas
7. texan troops            14. veteran troops

TOT (Mexican War):
1. mexico        8. territory
2. texas         9. army
3. war          10. peace
4. mexican      11. act
5. united       12. policy
6. country      13. foreign
7. government   14. citizens


Qualitative Results: Topic changes over time

[Figure: histograms over the years 1800-2000 of the "Panama Canal" topic under our model and under TOT.]

Our Model (Panama Canal):
1. panama canal               8. united states senate
2. isthmian canal             9. french canal company
3. isthmus panama            10. caribbean sea
4. republic panama           11. panama canal bonds
5. united states government  12. panama
6. united states             13. american control
7. state panama              14. canal

TOT (Panama Canal):
1. government      8. spanish
2. cuba            9. island
3. islands        10. act
4. international  11. commission
5. powers         12. officers
6. gold           13. spain
7. action         14. rico


Qualitative Results: Topic changes over time - TOT

NIPS-1987: cells, cell, model, response, firing, activity, input, neurons, stimulus, figure
NIPS-1988: network, learning, input, units, training, output, layer, hidden, weights, networks
NIPS-1995: data, model, algorithm, method, probability, models, problem, distribution, information
NIPS-1996: function, data, set, distribution, model, models, neural, probability, parameters, networks
NIPS-2004: algorithm, state, learning, time, algorithms, step, action, node, policy
NIPS-2005: learning, data, set, training, algorithm, test, number, kernel, classification, class, sequence

Figure: Top ten probable words from the posterior inference on NIPS, year-wise (TOT).


Qualitative Results: Topic changes over time - Our Model

NIPS-1987: orientation map, firing threshold, time delay, neural state, low conduction safety, correlogram peak, centric models, long channel, synaptic chip, frog sciatic nerve
NIPS-1988: neural networks, hidden units, hidden layer, neural network, training set, mit press, hidden unit, learning algorithm, output units, output layer
NIPS-1995: linear algebra, input signals, gaussian filters, optical flow, model matching, resistive line, input signal, analog vlsi, depth map, temporal precision
NIPS-1996: probability vector, relevant documents, continuous embedding, doubly stochastic matrix, probability vectors, binding energy, energy costs, variability index, learning bayesian, polynomial time
NIPS-2004: optimal policy, build stack, reinforcement learning, nash equilibrium, suit stack, synthetic items, compressed map, reward function, td networks, intrinsic reward
NIPS-2005: kernel cca, empirical risk, training sample, data clustering, random selection, gaussian regression, online hypothesis, linear separators, covariance operator, line algorithm

Figure: Top ten probable phrases from the posterior inference on NIPS, year-wise (our model).

Qualitative Results: Topic changes over time

[Figure: histogram over the years 1990-2005 of a topic related to recurrent neural networks.]

Our Model:
1. hidden unit          6. learning algorithms
2. neural net           7. error signals
3. input layer          8. recurrent connections
4. recurrent network    9. training pattern
5. hidden layers       10. recurrent cascade

TOT:
1. state       6. sequences
2. time        7. recurrent
3. sequence    8. models
4. states      9. markov
5. model      10. transition

Figure: A topic related to “recurrent NNs” comprising n-gram words obtained from both models. Histograms depict the way topics are distributed over time and are fitted with Beta probability density functions.


Quantitative Results: Predicting the decade on the State-of-the-Union dataset

1 We computed the time-stamp prediction performance.
2 We learn a model on a subset of the data randomly sampled from the collection.
3 Given a new document, we compute the likelihood of each candidate decade and predict the most likely one.

            L1 Error   E(L1)   Accuracy
Our Model   1.60       1.65    0.25
TOT         1.95       1.99    0.20

Table: Results of decade prediction on the State-of-the-Union speeches dataset.
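A simple sketch of one way such decade prediction could be scored from a fitted model; the scoring rule, the helper names, and the decade midpoints are assumptions for illustration rather than the paper's exact protocol:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def predict_decade(theta_doc, Omega, decade_midpoints):
    """theta_doc: (T,) inferred topic mixture of a held-out document;
    Omega: (T, 2) per-topic Beta time parameters;
    decade_midpoints: candidate decades mapped to normalized times in (0, 1)."""
    scores = []
    for t in decade_midpoints:
        # Mixture of per-topic Beta densities evaluated at the candidate time.
        scores.append(np.dot(theta_doc, beta_dist.pdf(t, Omega[:, 0], Omega[:, 1])))
    return int(np.argmax(scores))          # index of the most likely decade

# Example: 22 decades spanning 1790-2002, mapped to midpoints on (0, 1).
decades = np.linspace(0.02, 0.98, 22)
```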


Conclusions and Future Work

1 We have presented an n-gram topic model which can capture both temporal structure and n-gram words in time-stamped documents.

2 Topics found by our model are more interpretable, with better qualitative and quantitative performance on two publicly available datasets.

3 We have derived a collapsed Gibbs sampler for faster posterior inference.

4 An advantage of our model is that it does away with ambiguities that might appear among the words in topics.

Future Work
Explore non-parametric methods for n-gram topics over time.


References

David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proc. of ICML, 113-120.
Knights, D., Mozer, M., and Nicolov, N. 2009. Detecting topic drift with compound topic models. In Proc. of ICWSM.
Noriaki Kawamae. 2011. Trend analysis model: trend consists of temporal words, topics, and timestamps. In Proc. of WSDM, 317-326.
Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proc. of ICML, 977-984.
Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. 2007. Topics in semantic representation. Psychological Review, 114(2), 211.
Wang, X., McCallum, A., and Wei, X. 2007. Topical n-grams: phrase and topic discovery, with an application to information retrieval. In Proc. of ICDM, 697-702.
Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In Proc. of KDD, 424-433.
