Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Outline
• Background: statistical topic models
• Labeling a topic model
  – Criteria and challenge
• Our approach: a probabilistic framework
• Experiments
• Summary
Statistical Topic Models for Text Mining
Text Collections → Probabilistic Topic Modeling → Topic models (multinomial distributions)
Example topics:
… web 0.21, search 0.10, link 0.08, graph 0.05 …
… term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03 …
Applications: subtopic discovery, opinion comparison, summarization, topical pattern analysis, …
Models:
PLSA [Hofmann 99]
LDA [Blei et al. 03]
Author-Topic [Steyvers et al. 04]
CPLSA [Mei & Zhai 06]
Pachinko allocation [Li & McCallum 06]
Topic over time [Wang et al. 06]
…
Topic Models: Hard to Interpret
• Use top words
  – Automatic, but hard to make sense of
  – e.g., term, relevance, weight, feedback
• Human-generated labels
  – Make sense, but cannot scale up
  – e.g., Retrieval Models
Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …
Question: Can we automatically generate understandable labels for topics?
insulin, foraging, foragers, collected, grains, loads, collection, nectar, … → ?
What is a Good Label?
• Semantically close (relevance)
• Understandable – phrases?
• High coverage inside topic
• Discriminative across topics
• …
Example topic: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Candidate labels:
iPod Nano
Pseudo-feedback
Information Retrieval
Retrieval models
じょうほうけんさく (Japanese for "information retrieval")
– Mei and Zhai 06: a topic in SIGIR
Our Method
Collection (e.g., SIGIR)
Topics:
term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
filtering 0.21, collaborative 0.15, …
trec 0.18, evaluation 0.10, …
NLP Chunker / N-gram Stat.
(1) Candidate label pool: information retrieval, retrieval model, index structure, relevance feedback, …
(2) Relevance score: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
(3) Discrimination: information retrieval 0.26 → 0.01, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …
(4) Coverage: retrieval models 0.20, IR models 0.18 → 0.02, pseudo feedback 0.09, …, information retrieval 0.01
Relevance (Task 2): the Zero-Order Score
• Intuition: prefer phrases well covering top words
Latent Topic θ: clustering, dimensional, algorithm, birch, shape, …, body, …
p(w|θ):
p(“clustering”|θ) = 0.4
p(“dimensional”|θ) = 0.3
p(“body”|θ) = 0.001
p(“shape”|θ) = 0.01
Good Label (l1): “clustering algorithm”
Bad Label (l2): “body shape”
$$\frac{p(\text{clustering algorithm} \mid \theta)}{p(\text{clustering algorithm})} > \frac{p(\text{body shape} \mid \theta)}{p(\text{body shape})}$$
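The zero-order intuition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the words of a label are independent, so that log p(l|θ)/p(l) becomes a sum of per-word log ratios. The background probabilities, the topic probability of "algorithm", and the `floor` smoothing value are all invented for the example; only the four p(w|θ) values for "clustering", "dimensional", "body" and "shape" come from the slide.

```python
import math

def zero_order_score(label, topic, background, floor=1e-6):
    """Sum of log p(w|theta) / p(w) over the words of a label.

    Assumes the label's words are independent (an assumption of this
    sketch). Words missing from a distribution get the probability `floor`.
    """
    score = 0.0
    for word in label.split():
        p_topic = topic.get(word, floor)
        p_background = background.get(word, floor)
        score += math.log(p_topic / p_background)
    return score

# p(w|theta): four values are from the slide; "algorithm" is assumed.
topic = {
    "clustering": 0.4,
    "dimensional": 0.3,
    "algorithm": 0.2,   # assumed, not given on the slide
    "shape": 0.01,
    "body": 0.001,
}

# Hypothetical background model p(w): every word equally likely.
background = {word: 0.01 for word in topic}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
print(f"clustering algorithm: {good:.3f}")
print(f"body shape: {bad:.3f}")
```

With these toy numbers the phrase built from high-probability topic words scores higher than "body shape", which matches the ordering the slide asks for.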
Relevance (Task 2): the First-Order Score
• Intuition: prefer phrases with similar context (distribution)
Topic θ, P(w|θ): clustering, dimension, partition, algorithm, hash, …
Good Label (l1): “clustering algorithm”
p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, …
l2: “hash join”
p(w | hash join): clustering, hash, dimension, key, algorithm, …
Context of “hash join” in the collection: key … hash join … code … hash table … search … hash join … map key … hash … algorithm … key … hash … key table … join …
$$\mathrm{Score}(l, \theta) = -D(\theta \,\|\, l) \;\overset{\mathrm{rank}}{\approx}\; \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C)$$
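A rough sketch of the first-order idea: score a label by the topic-weighted pointwise mutual information, Σ_w p(w|θ) PMI(w, l | C), between topic words and the label in a context collection C. Everything concrete here is an assumption made for illustration: the six-document toy collection, the topic probabilities, document-level co-occurrence as the estimator, and the additive smoothing constant `alpha` that avoids log(0).

```python
import math

def contains_label(doc, label):
    """True if the label occurs as a whole phrase in the document."""
    return f" {label} " in f" {doc} "

def pmi(word, label, docs, alpha=0.1):
    """Document-level PMI(word, label | C) with crude additive smoothing."""
    n = len(docs)
    has_word = [word in doc.split() for doc in docs]
    has_label = [contains_label(doc, label) for doc in docs]
    c_word = sum(has_word)
    c_label = sum(has_label)
    c_joint = sum(w and l for w, l in zip(has_word, has_label))
    p_word = (c_word + alpha) / (n + alpha)
    p_label = (c_label + alpha) / (n + alpha)
    p_joint = (c_joint + alpha) / (n + alpha)
    return math.log(p_joint / (p_word * p_label))

def first_order_score(label, topic, docs):
    """Sum over topic words of p(w|theta) * PMI(w, label | C)."""
    return sum(p * pmi(word, label, docs) for word, p in topic.items())

# Hypothetical context collection C.
docs = [
    "clustering algorithm for high dimension data partition",
    "a clustering algorithm with dimension reduction",
    "partition based clustering algorithm",
    "hash join uses a hash key",
    "hash join algorithm with key lookup",
    "hash key table search",
]

# Hypothetical topic distribution p(w|theta).
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.15, "hash": 0.05}

score_good = first_order_score("clustering algorithm", topic, docs)
score_l2 = first_order_score("hash join", topic, docs)
print(f"clustering algorithm: {score_good:.3f}")
print(f"hash join: {score_l2:.3f}")
```

In this toy collection "clustering algorithm" co-occurs with the heavily weighted topic words, while "hash join" mostly appears alongside "hash" and "key", so the first label receives the higher score.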
Discrimination and Coverage (Tasks 3 & 4)
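• Discriminative across topics:
  – High relevance to target topic, low relevance to other topics
$$\mathrm{Score}'(l, \theta_i) = \mathrm{Score}(l, \theta_i) - \mu\, \mathrm{Score}(l, \theta_{1,\ldots,i-1,i+1,\ldots,k})$$
• High coverage inside topic:
  – Use MMR strategy
$$\hat{l} = \arg\max_{l \in L - S} \left[ \lambda\, \mathrm{Score}(l, \theta) - (1 - \lambda) \max_{l' \in S} \mathrm{Sim}(l', l) \right]$$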
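The two adjustments above can be sketched as follows. This is a minimal illustration, not the authors' code. Assumptions: the score against "the other topics" is taken as the average score over those topics, Sim is a Jaccard overlap of label words (the slide does not specify Sim), and the raw scores in `raw` are invented. The label scores passed to the MMR step reuse the post-discrimination values shown in the Our Method slide.

```python
def discriminative_scores(scores, mu=1.0):
    """Score'(l, theta_i) = Score(l, theta_i) - mu * mean_{j != i} Score(l, theta_j).

    `scores` is a list with one {label: score} dict per topic. Needs at
    least two topics, and every label must be scored for every topic.
    """
    k = len(scores)
    adjusted = []
    for i, own in enumerate(scores):
        row = {}
        for label, s in own.items():
            others = [scores[j][label] for j in range(k) if j != i]
            row[label] = s - mu * sum(others) / len(others)
        adjusted.append(row)
    return adjusted

def jaccard(a, b):
    """Word-overlap similarity between two labels (assumed Sim)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def mmr_select(score, k=2, lam=0.5, sim=jaccard):
    """Greedily pick k labels, trading off score against redundancy."""
    selected = []
    pool = list(score)
    while pool and len(selected) < k:
        def mmr(label):
            redundancy = max((sim(label, s) for s in selected), default=0.0)
            return lam * score[label] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

# Hypothetical raw scores of two labels against three topics.
raw = [
    {"information retrieval": 0.26, "retrieval models": 0.20},
    {"information retrieval": 0.25, "retrieval models": 0.01},
    {"information retrieval": 0.25, "retrieval models": 0.01},
]
adjusted = discriminative_scores(raw, mu=1.0)
print(adjusted[0])

# Scores after the discrimination step, as listed on the method slide.
after_discrimination = {
    "retrieval models": 0.20,
    "IR models": 0.18,
    "pseudo feedback": 0.09,
    "information retrieval": 0.01,
}
picked = mmr_select(after_discrimination, k=2, lam=0.5)
print(picked)  # ['retrieval models', 'pseudo feedback']
```

"IR models" is skipped in the second round because it shares a word with the label already selected, which is the redundancy penalty MMR is meant to apply.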
Variations and Applications
• Labeling document clusters
  – Document cluster → unigram language model
  – Applicable to any task with a unigram language model
• Context-sensitive labels
  – Label of a topic is sensitive to the context
  – An alternative way to approach contextual text mining
tree, prune, root, branch → “tree algorithms” in CS; ? in horticulture; ? in marketing
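To make the document-cluster variation concrete, here is a minimal sketch of turning a cluster into a unigram language model using maximum-likelihood word frequencies. The function name, the whitespace tokenization, and the three toy documents are assumptions made for illustration. The resulting distribution could then be labeled in the same way as a topic.

```python
from collections import Counter

def cluster_language_model(docs):
    """Maximum-likelihood unigram model p(w) over all words in a cluster."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical document cluster.
cluster = ["tree prune root", "tree branch root", "prune tree"]
lm = cluster_language_model(cluster)

# Print words from most to least probable.
for word, p in sorted(lm.items(), key=lambda item: -item[1]):
    print(f"{word} {p:.3f}")
```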
Experiments
• Datasets:
  – SIGMOD abstracts; SIGIR abstracts; AP news data
  – Candidate labels: significant bigrams; NLP chunks
• Topic models:
  – PLSA, LDA
• Evaluation:
  – Human annotators compare labels generated by anonymous systems
  – Order of systems randomly perturbed; scores averaged over all sample topics
Result Summary
• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks
  – Bigrams work better with literature; NLP chunks work better with news
• System labels << human labels
  – Scientific literature is an easier task
Results: Sample Topic Labels
Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
Labels: r tree, b tree, …; indexing methods
Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007
Labels: iran contra, …
Topic: the, of, a, and, to, data, … > 0.02; clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005
Labels: clustering algorithm, clustering structure, …; large data, data quality, high data, data application, …
Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
Context: Database (SIGMOD Proceedings)
  – selectivity estimation; random sampling; approximate answers
Context: IR (SIGIR Proceedings)
  – distributed retrieval; parameter estimation; mixture models
• Explore the different meanings of a topic in different contexts (content switch)
• An alternative approach to contextual text mining
Summary
• Labeling: A postprocessing step of all multinomial topic models
• A probabilistic approach to generate good labels
  – Understandable, relevant, high coverage, discriminative
• Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive
• Future work:
  – Labeling hierarchical topic models
  – Incorporating priors