An N-gram Topic Model for Time-Stamped Documents
Shoaib Jameel and Wai Lam
The Chinese University of Hong Kong
Shoaib Jameel and Wai Lam ECIR-2013, Moscow, Russia
Outline

- Introduction and Motivation
  - The Bag-of-Words (BoW) assumption
  - Temporal nature of data
- Related Work
  - Temporal Topic Models
  - N-gram Topic Models
- Overview of our model
  - Background
    - Topics Over Time (TOT) model, proposed earlier
    - Our proposed n-gram model
- Empirical Evaluation
- Conclusions and Future Directions
The ‘popular’ Bag-of-Words Assumption

Many works in the topic modeling literature assume exchangeability among the words, and as a result generate ambiguous words in topics. For example, consider a few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1: architecture, recurrent, network, module, modules
Topic 2: order, first, second, analysis, small
Topic 3: connectionist, role, binding, structures, distributed
Topic 4: potential, membrane, current, synaptic, dendritic
Topic 5: prior, bayesian, data, evidence, experts

The problem with the LDA model: words in topics are not insightful.
The problem with the bag-of-words assumption

1. The logical structure of the document is lost. For example, we cannot tell whether "the cat saw a dog" or "a dog saw the cat".
2. Computational models cannot exploit the extra word-order information inherent in the text, which hurts their performance.
3. The usefulness of maintaining word order has also been demonstrated in Information Retrieval, Computational Linguistics, and many other fields.
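The information loss described above is easy to demonstrate. The following minimal sketch (ours, not from the paper) shows that two sentences with opposite meanings collapse to the same bag-of-words representation:

```python
from collections import Counter

def bag_of_words(sentence):
    """Reduce a sentence to an unordered multiset of tokens."""
    return Counter(sentence.lower().split())

# Under the bag-of-words assumption these two sentences become identical,
# even though they describe opposite events.
print(bag_of_words("the cat saw a dog") == bag_of_words("a dog saw the cat"))  # prints True
```

Any model that sees only the `Counter` cannot distinguish the two readings; this is exactly the exchangeability assumption the n-gram models below relax.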
Why capture topics over time?

1. We know that data evolves over time.
2. What people are talking about today, they may not be talking about tomorrow or a year later.

[Figure: word clouds of trending Wikipedia topics for the years 2010, 2011, and 2012, e.g. Burj Khalifa, volcano, Manila hostage, Iraq War, N.Z. earthquake, Osama bin Laden, Higgs boson, Gaza Strip, Sachin Tendulkar, China, Apple Inc.]

3. Models such as LDA do not capture such temporal characteristics in data.
Related Work: Temporal Topic Models

Discrete-time models
- Blei and Lafferty (David M. Blei and John D. Lafferty. 2006.), Dynamic Topic Models: assume that the topics in one year depend on the topics of the previous year.
- Knights et al. (Knights, D., Mozer, M., and Nicolov, N. 2009.), Compound Topic Model: train a topic model on the most recent K months of data.

The problem here: one needs to select an appropriate time-slice value manually. Which time slice should be chosen: day, month, year, etc.?
Related Work: Temporal Topic Models

Continuous-time topic models
- Noriaki (Noriaki Kawamae. 2011.), Trend Analysis Model: the model has a probability distribution over temporal words and topics, and a continuous distribution over time.
- Nodelman et al. (Uri Nodelman, Christian R. Shelton, and Daphne Koller. 2002.), Continuous Time Bayesian Networks: build a graph in which each node holds a variable whose value changes over time.

The problem with the above models: all assume the notion of exchangeability and thus lose important collocation information inherent in the document.
Related Work: N-gram Topic Models

1. Wallach's (Hanna M. Wallach. 2006.) Bigram Topic Model: maintains word order during the topic generation process, but generates only bigrams in topics.
2. Griffiths et al. (Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. 2007.), LDA Collocation Model: introduces binary random variables that decide when to generate a unigram and when a bigram.
3. Wang et al. (Wang, X., McCallum, A., and Wei, X. 2007.), Topical N-gram Model: extends the LDA Collocation Model and gives a topic assignment to every word in a phrase.

The problem with the above models: they cannot capture the temporal dynamics in the data.
Topics Over Time (TOT) (Wang et al., 2006)

1. Our model extends this model.
2. It assumes the notion of word and topic exchangeability.

Generative process
1. Draw T multinomials \phi_z from a Dirichlet prior \beta, one for each topic z.
2. For each document d, draw a multinomial \theta^{(d)} from a Dirichlet prior \alpha; then for each word w_i^{(d)} in the document d:
   1. Draw a topic z_i^{(d)} from Multinomial(\theta^{(d)});
   2. Draw a word w_i^{(d)} from Multinomial(\phi_{z_i^{(d)}});
   3. Draw a timestamp t_i^{(d)} from Beta(\Omega_{z_i^{(d)}}).

[Fig. 1: graphical model of TOT. Hyperparameters \alpha, \beta; per-document topic distribution \theta; per-topic word distributions \phi and Beta time parameters \Omega; observed words w and timestamps t; plates over D documents with N_d words each and T topics.]
Topics Over Time Model (TOT)

1. The model assumes a continuous distribution over time associated with each topic.
2. Topics are responsible for generating both the observed timestamps and the words.
3. The model does not capture the sequence of state changes with a Markov assumption.
Topics Over Time Model (TOT): Posterior Inference

1. In Gibbs sampling, compute the conditional:

P(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \Omega) \quad (1)

2. We can thus write the updating equation as:

P(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \Omega) \propto
(m_{z_i^{(d)}} + \alpha_{z_i^{(d)}} - 1)
\times \frac{n_{z_i^{(d)} w_i^{(d)}} + \beta_{w_i^{(d)}} - 1}{\sum_{v=1}^{W} (n_{z_i^{(d)} v} + \beta_v) - 1}
\times \frac{(1 - t_i^{(d)})^{\Omega_{z_i^{(d)} 1} - 1} \, (t_i^{(d)})^{\Omega_{z_i^{(d)} 2} - 1}}{B(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2})} \quad (2)
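A small sketch (ours, not from the paper) of how the per-topic sampling weights of Eq. (2) could be computed for one token. The counts here are assumed to already exclude the token being resampled, which absorbs the "-1" terms of Eq. (2); all names are ours.

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def tot_topic_weights(m_d, n, w, t, alpha, beta, Omega):
    """Unnormalized sampling weights over topics for one token (cf. Eq. 2).

    m_d[z]  : tokens in the current document assigned to topic z
    n[z][v] : tokens of word v assigned to topic z
    Counts are assumed to already exclude the token being resampled.
    """
    T, W = len(m_d), len(n[0])
    weights = []
    for z in range(T):
        doc_term = m_d[z] + alpha[z]
        word_term = (n[z][w] + beta[w]) / sum(n[z][v] + beta[v] for v in range(W))
        a, b = Omega[z]
        # timestamp likelihood: (1 - t)^(a-1) * t^(b-1) / B(a, b)
        time_term = math.exp((a - 1) * math.log(1 - t)
                             + (b - 1) * math.log(t) - log_beta(a, b))
        weights.append(doc_term * word_term * time_term)
    return weights
```

A Gibbs sweep would normalize these weights and draw the token's new topic from them; note how the Beta term lets a topic whose time distribution covers the document's timestamp dominate, even when word counts are tied.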
Our Model: N-gram Topics Over Time

1. The model assumes a continuous distribution over time associated with each topic.
2. Topics are responsible for generating both the observed timestamps and the words.
3. The model does not capture the sequence of state changes with a Markov assumption.
4. It maintains the order of words during the topic generation process.
5. It generates words as unigrams, bigrams, etc. in topics.
6. This results in more interpretable topics.
Graphical Model: N-gram Topics Over Time

[Fig. 1: graphical model of our model. For each word position i there is a bigram indicator x_i, a topic z_i, a word w_i, and a timestamp t_i. \theta is drawn per document from Dirichlet(\alpha); \phi (from Dirichlet(\beta)), \psi (from Beta(\gamma)), and \sigma (from Dirichlet(\delta)) are per-topic and per topic-word distributions; \Omega holds the per-topic Beta parameters over time. Plates: D documents, T topics, TW topic-word pairs.]
Generative Process: N-gram Topics Over Time

Draw Discrete(\phi_z) from Dirichlet(\beta) for each topic z;
Draw Bernoulli(\psi_{zw}) from Beta(\gamma) for each topic z and each word w;
Draw Discrete(\sigma_{zw}) from Dirichlet(\delta) for each topic z and each word w;
For every document d, draw Discrete(\theta^{(d)}) from Dirichlet(\alpha);
foreach word w_i^{(d)} in document d do
    Draw x_i^{(d)} from Bernoulli(\psi_{z_{i-1}^{(d)} w_{i-1}^{(d)}});
    Draw z_i^{(d)} from Discrete(\theta^{(d)});
    if x_i^{(d)} = 1 then draw w_i^{(d)} from Discrete(\sigma_{z_i^{(d)} w_{i-1}^{(d)}});
    otherwise draw w_i^{(d)} from Discrete(\phi_{z_i^{(d)}});
    Draw a timestamp t_i^{(d)} from Beta(\Omega_{z_i^{(d)}});
end
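The generative story above can be forward-simulated in a few lines. This is a toy sketch under our own assumptions: the parameters are hand-picked rather than learned, and each token's timestamp is drawn directly from the topic's Beta distribution.

```python
import random

def sample_document(length, theta, phi, sigma, psi, Omega, vocab, rng=random):
    """Forward-sample one document from the n-gram TOT generative story.

    theta      : per-document topic distribution
    phi[z]     : unigram word distribution for topic z
    sigma[z][u]: word distribution for topic z given previous word u
    psi[z][u]  : probability that the next word extends an n-gram after
                 word u under topic z
    Omega[z]   : Beta parameters for topic z's timestamp distribution
    """
    words, stamps = [], []
    prev_w = prev_z = None
    for _ in range(length):
        # x = 1 means this word continues an n-gram begun by the previous word
        x = 0 if prev_w is None else int(rng.random() < psi[prev_z][prev_w])
        z = rng.choices(range(len(theta)), weights=theta)[0]
        dist = sigma[z][prev_w] if x == 1 else phi[z]
        w = rng.choices(vocab, weights=[dist[v] for v in vocab])[0]
        words.append(w)
        stamps.append(rng.betavariate(*Omega[z]))
        prev_w, prev_z = w, z
    return words, stamps
```

The bigram distribution \sigma is only consulted when the switch x fires, which is exactly how the model lets phrases like "hidden units" form while still emitting standalone unigrams.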
Posterior Inference: Collapsed Gibbs Sampling

P(z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{x}_{\neg i}^{(d)}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \gamma, \delta, \Omega) \propto
(\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i} - 1)
\times (\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1)
\times \frac{(1 - t_i^{(d)})^{\Omega_{z_i^{(d)} 1} - 1} \, (t_i^{(d)})^{\Omega_{z_i^{(d)} 2} - 1}}{B(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2})}
\times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 0 \\[2ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 1
\end{cases} \quad (3)

Posterior Estimates

\theta_z^{(d)} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T} (\alpha_t + q_{dt})} \quad (4)

\phi_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W} (\beta_v + n_{zv})} \quad (5)

\psi_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k=0}^{1} (\gamma_k + p_{zwk})} \quad (6)

\sigma_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v=1}^{W} (\delta_v + m_{zwv})} \quad (7)

\Omega_{z1} = \bar{t}_z \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \quad (8)

\Omega_{z2} = (1 - \bar{t}_z) \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \quad (9)
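Equations (8) and (9) are the standard method-of-moments fit of a Beta distribution to the timestamps assigned to a topic: both parameters share the factor \bar{t}_z (1 - \bar{t}_z) / s_z^2 - 1. A minimal sketch (function name ours):

```python
def fit_beta_moments(timestamps):
    """Method-of-moments Beta fit as in Eqs. (8)-(9).

    Returns (Omega1, Omega2) from the sample mean tbar and sample
    variance s2 of the topic's timestamps.
    """
    n = len(timestamps)
    tbar = sum(timestamps) / n
    s2 = sum((t - tbar) ** 2 for t in timestamps) / (n - 1)  # sample variance
    common = tbar * (1 - tbar) / s2 - 1
    return tbar * common, (1 - tbar) * common
```

The fit is only meaningful when the sample variance is below \bar{t}(1 - \bar{t}); otherwise the shared factor, and hence both parameters, would be non-positive, so an implementation would guard that case.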
Inference Algorithm

Input: \gamma, \delta, \alpha, T, \beta, Corpus, MaxIteration
Output: topic assignments for all the n-gram words, with temporal information

Initialization: randomly initialize the n-gram topic assignment for all words;
Zero all count variables;
for iteration ← 1 to MaxIteration do
    for d ← 1 to D do
        for w ← 1 to N_d, following word order, do
            Draw z_w^{(d)}, x_w^{(d)} as defined in Equation 3;
            if x_w^{(d)} = 0 then update n_{zw};
            else update m_{zw};
            Update q_{dz}, p_{zw};
        end
    end
    for z ← 1 to T do
        Update \Omega_z by the method of moments, as in Equations 8 and 9;
    end
end
Compute the posterior estimates of \theta, \phi, \psi, \sigma as defined in Equations 4, 5, 6, 7;
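To make the loop structure concrete, here is a much-simplified, runnable sketch: a unigram-only collapsed Gibbs sampler with one timestamp per document, dropping the n-gram switch variables x of the full algorithm. It shows how the count updates interleave with the method-of-moments \Omega update; all names and toy choices are ours, not the paper's.

```python
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def gibbs_tot(docs, stamps, T, W, iters=20, alpha=0.5, beta=0.1, seed=0):
    """Collapsed Gibbs sampling for a unigram topics-over-time model.

    docs[d]  : list of word ids; stamps[d]: document timestamp in (0, 1).
    Keeps only topic resampling plus the Omega update (cf. Eqs. 8-9).
    """
    rng = random.Random(seed)
    z = [[rng.randrange(T) for _ in doc] for doc in docs]
    q = [[0] * T for _ in docs]        # q[d][k]: topic counts per document
    n = [[0] * W for _ in range(T)]    # n[k][v]: word counts per topic
    Omega = [(1.0, 1.0)] * T           # start from a uniform Beta
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            q[d][z[d][i]] += 1
            n[z[d][i]][w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            t = stamps[d]
            for i, w in enumerate(doc):
                k = z[d][i]
                q[d][k] -= 1; n[k][w] -= 1       # remove current assignment
                weights = []
                for k in range(T):
                    a, b = Omega[k]
                    time_term = math.exp((a - 1) * math.log(t)
                                         + (b - 1) * math.log(1 - t)
                                         - log_beta(a, b))
                    weights.append((q[d][k] + alpha)
                                   * (n[k][w] + beta) / (sum(n[k]) + W * beta)
                                   * time_term)
                k = rng.choices(range(T), weights=weights)[0]
                z[d][i] = k
                q[d][k] += 1; n[k][w] += 1
        for k in range(T):                       # method-of-moments update
            ts = [stamps[d] for d, doc in enumerate(docs)
                  for i in range(len(doc)) if z[d][i] == k]
            if len(ts) > 1:
                tbar = sum(ts) / len(ts)
                s2 = sum((t - tbar) ** 2 for t in ts) / (len(ts) - 1)
                if 0 < s2 < tbar * (1 - tbar):   # guard: fit must be valid
                    c = tbar * (1 - tbar) / s2 - 1
                    Omega[k] = (tbar * c, (1 - tbar) * c)
    return z, n, Omega
```

The full algorithm would additionally sample the switch x jointly with z (Equation 3) and route each word's count into either n or m accordingly.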
Empirical Evaluation: Data Sets

We conducted experiments on two datasets:
1. U.S. Presidential State-of-the-Union speeches[1] from 1790 to 2002.
2. NIPS conference papers: the original raw NIPS dataset[2] consists of 17 years of conference papers. We supplemented it with some newer raw NIPS documents[3], for 19 years of papers in total.

Preprocessing
1. Removed stopwords.
2. Did not perform word stemming.

[1] http://infomotions.com/etexts/gutenberg/dirs/etext04/suall11.txt
[2] http://www.cs.nyu.edu/~roweis/data.html
[3] http://ai.stanford.edu/~gal/Data/NIPS/
Qualitative Results

[Histograms of the "Mexican War" topic over 1800-2000 for our model and TOT, fitted with Beta density curves.]

Our Model (Mexican War): 1. east bank, 2. american coins, 3. mexican flag, 4. separate independent, 5. american commonwealth, 6. mexican population, 7. texan troops, 8. military, 9. general herrera, 10. foreign coin, 11. military usurper, 12. mexican treasury, 13. invaded texas, 14. veteran troops

TOT (Mexican War): 1. mexico, 2. texas, 3. war, 4. mexican, 5. united, 6. country, 7. government, 8. territory, 9. army, 10. peace, 11. act, 12. policy, 13. foreign, 14. citizens
Qualitative Results: Topics change over time

[Histograms of the "Panama Canal" topic over 1800-2000 for our model and TOT.]

Our Model (Panama Canal): 1. panama canal, 2. isthmian canal, 3. isthmus panama, 4. republic panama, 5. united states government, 6. united states, 7. state panama, 8. united states senate, 9. french canal company, 10. caribbean sea, 11. panama canal bonds, 12. panama, 13. american control, 14. canal

TOT (Panama Canal): 1. government, 2. cuba, 3. islands, 4. international, 5. powers, 6. gold, 7. action, 8. spanish, 9. island, 10. act, 11. commission, 12. officers, 13. spain, 14. rico
Qualitative Results: Topics change over time (TOT)

NIPS-1987: cells, cell, model, response, firing, activity, input, neurons, stimulus, figure
NIPS-1988: network, learning, input, units, training, output, layer, hidden, weights, networks
NIPS-1995: data, model, algorithm, method, probability, models, problem, distribution, information
NIPS-1996: function, data, set, distribution, model, models, neural, probability, parameters, networks
NIPS-2004: algorithm, state, learning, time, algorithms, step, action, node, policy, sequence
NIPS-2005: learning, data, set, training, algorithm, test, number, kernel, classification, class

Figure: Top ten probable words from the posterior inference in NIPS, year-wise.
Qualitative Results: Topics change over time (Our Model)

NIPS-1987: orientation map, firing threshold, time delay, neural state, low conduction safety, correlogram peak, centric models, long channel, synaptic chip, frog sciatic nerve
NIPS-1988: neural networks, hidden units, hidden layer, neural network, training set, mit press, hidden unit, learning algorithm, output units, output layer
NIPS-1995: linear algebra, input signals, gaussian filters, optical flow, model matching, resistive line, input signal, analog vlsi, depth map, temporal precision
NIPS-1996: probability vector, relevant documents, continuous embedding, doubly stochastic matrix, probability vectors, binding energy, energy costs, variability index, learning bayesian, polynomial time
NIPS-2004: optimal policy, build stack, reinforcement learning, nash equilibrium, suit stack, synthetic items, compressed map, reward function, td networks, intrinsic reward
NIPS-2005: kernel cca, empirical risk, training sample, data clustering, random selection, gaussian regression, online hypothesis, linear separators, covariance operator, line algorithm

Figure: Top ten probable phrases from the posterior inference in NIPS, year-wise.
Qualitative Results: Topics change over time

[Histogram of the topic's weight over 1990-2005.]

Our Model (recurrent NNs): 1. hidden unit, 2. neural net, 3. input layer, 4. recurrent network, 5. hidden layers, 6. learning algorithms, 7. error signals, 8. recurrent connections, 9. training pattern, 10. recurrent cascade

TOT (recurrent NNs): 1. state, 2. time, 3. sequence, 4. states, 5. model, 6. sequences, 7. recurrent, 8. models, 9. markov, 10. transition

Figure: A topic related to "recurrent NNs" comprising n-gram words obtained from both models. Histograms depict how the topics are distributed over time; they are fitted with Beta probability density functions.
Quantitative Results: Predicting the decade on the State-of-the-Union dataset

1. We computed time-stamp prediction performance.
2. A model is learned on a subset of the data randomly sampled from the collection.
3. Given a new document, we compute the likelihood of each decade and predict the most likely one.

            L1 Error   E(L1)   Accuracy
Our Model   1.60       1.65    0.25
TOT         1.95       1.99    0.20

Table: Results of decade prediction on the State-of-the-Union speeches dataset.
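As a toy illustration of how such scores could be computed (our assumption of the scoring shape, not necessarily the paper's exact protocol), with decades encoded as indices (1790s → 0, 1800s → 1, ...):

```python
def decade_prediction_scores(predicted, actual):
    """L1 error (in decades) and exact-decade accuracy for lists of
    predicted vs. true decade indices."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    l1 = sum(errors) / len(errors)
    accuracy = sum(e == 0 for e in errors) / len(errors)
    return l1, accuracy
```

An L1 error of 1.60 thus means the predicted decade is off by 1.6 decades on average, and an accuracy of 0.25 means the exact decade is recovered for a quarter of the held-out documents.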
Conclusions and Future Work

1. We have presented an n-gram topic model that captures both the temporal structure and the n-gram words in time-stamped documents.
2. The topics found by our model are more interpretable, with better qualitative and quantitative performance on two publicly available datasets.
3. We have derived a collapsed Gibbs sampler for faster posterior inference.
4. An advantage of our model is that it does away with ambiguities that may appear among the words in topics.

Future Work: explore non-parametric methods for n-gram topics over time.
References

- David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proc. of ICML, 113-120.
- D. Knights, M. Mozer, and N. Nicolov. 2009. Detecting topic drift with compound topic models. In Proc. of ICWSM.
- Noriaki Kawamae. 2011. Trend analysis model: trend consists of temporal words, topics, and timestamps. In Proc. of WSDM, 317-326.
- Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proc. of ICML, 977-984.
- T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2), 211.
- X. Wang, A. McCallum, and X. Wei. 2007. Topical n-grams: phrase and topic discovery, with an application to information retrieval. In Proc. of ICDM, 697-702.
- Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In Proc. of KDD, 424-433.