Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University...

Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai

University of Illinois at Urbana-Champaign

http://www.cs.uiuc.edu/

Outline

• Background: statistical topic models

• Labeling a topic model
  – Criteria and challenge

• Our approach: a probabilistic framework

• Experiments

• Summary


Statistical Topic Models for Text Mining

Text Collections

Probabilistic Topic Modeling

… web 0.21, search 0.10, link 0.08, graph 0.05, …

…

Subtopic discovery

Opinion comparison

Summarization

Topical pattern analysis

…

term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …

Topic models (Multinomial distributions)

PLSA [Hofmann 99]

LDA [Blei et al. 03]

Author-Topic [Steyvers et al. 04]

CPLSA [Mei & Zhai 06]

…

Pachinko allocation [Li & McCallum 06]

Topic over time [Wang et al. 06]


Topic Models: Hard to Interpret

• Use top words
  – Automatic, but hard to make sense of

• Human generated labels
  – Make sense, but cannot scale up

term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …

Retrieval Models

Question: Can we automatically generate understandable labels for topics?

Term, relevance, weight, feedback

insulin, foraging, foragers, collected, grains, loads, collection, nectar, …

?
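The top-words baseline from this slide can be sketched in a few lines. This is only an illustration: the function name `top_words` and the cutoff `k=4` are assumptions, and the probabilities are the ones shown for the example topic.

```python
def top_words(topic, k=4):
    """Return the k highest-probability words of a topic.

    `topic` maps each word w to p(w | theta). The name and the default
    k are illustrative choices, not taken from the slides.
    """
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Probabilities copied from the example topic on this slide.
topic = {
    "term": 0.16, "relevance": 0.08, "weight": 0.07, "feedback": 0.04,
    "independence": 0.03, "model": 0.03, "frequent": 0.02,
    "probabilistic": 0.02, "document": 0.02,
}

baseline_label = ", ".join(top_words(topic))
print(baseline_label)  # term, relevance, weight, feedback
```

The output matches the "Term, relevance, weight, feedback" baseline: it is cheap to produce, but a reader still has to infer that the topic is about retrieval models.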


What is a Good Label?

• Semantically close (relevance)
• Understandable – phrases?
• High coverage inside topic
• Discriminative across topics
• …

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

iPod Nano

Pseudo-feedback

Information Retrieval

Retrieval models

じょうほうけんさく

– Mei and Zhai 06: a topic in SIGIR


Our Method

Collection (e.g., SIGIR)

Topic models:
term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
filtering 0.21, collaborative 0.15, …
trec 0.18, evaluation 0.10, …

1. Candidate label pool (NLP Chunker, Ngram Stat.): information retrieval, retrieval model, index structure, relevance feedback, …

2. Relevance Score: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …

3. Discrimination: information retriev. 0.26 → 0.01, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …

4. Coverage: retrieval models 0.20, IR models 0.18 → 0.02, pseudo feedback 0.09, …, information retrieval 0.01
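The first step of the pipeline, building a candidate label pool from n-gram statistics, could look roughly like the sketch below. The slides do not say which statistic is used, so the Student's t-score for bigrams, the `min_count` threshold, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def candidate_bigrams(docs, min_count=2, top_n=5):
    """Rank bigrams by a t-score against the independence hypothesis.

    Assumption: the t-test stands in for the unspecified "Ngram Stat."
    step; min_count and top_n are hypothetical parameters.
    """
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to be a reliable phrase
        observed = count / n
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        t_score = (observed - expected) / math.sqrt(observed / n)
        scored.append((t_score, f"{w1} {w2}"))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_n]]


# Toy collection (hypothetical, not from the SIGIR data).
docs = [
    "relevance feedback improves information retrieval",
    "information retrieval with relevance feedback",
    "a retrieval model for information retrieval",
]
pool = candidate_bigrams(docs)
print(pool)
```

On this toy input only "information retrieval" and "relevance feedback" occur often enough to enter the pool; the later steps (relevance, discrimination, coverage) then rank such candidates per topic.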


Relevance (Task 2): the Zero-Order Score

• Intuition: prefer phrases that cover the top words well

Latent topic θ, with word distribution p(w|θ): clustering, dimensional, algorithm, birch, shape, …, body, …

p(“clustering”|θ) = 0.4
p(“dimensional”|θ) = 0.3
p(“body”|θ) = 0.001
p(“shape”|θ) = 0.01

Good Label (l1): “clustering algorithm”
Bad Label (l2): “body shape”

$$\frac{p(\text{clustering algorithm} \mid \theta)}{p(\text{clustering algorithm})} > \frac{p(\text{body shape} \mid \theta)}{p(\text{body shape})}$$
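A minimal sketch of the zero-order comparison follows. It scores a label by the log of the ratio between its probability under the topic and its probability in the collection, and it assumes the words of a label are independent so the ratio factorizes per word. The value for “algorithm” and all background probabilities are hypothetical; only the four topic probabilities come from the slide.

```python
import math


def zero_order_score(label, topic, background, floor=1e-6):
    """log [p(l | theta) / p(l)], assuming independent label words.

    `floor` is a hypothetical smoothing value for words the topic
    does not list.
    """
    score = 0.0
    for word in label.split():
        score += math.log(topic.get(word, floor) / background[word])
    return score


topic = {
    "clustering": 0.4, "dimensional": 0.3, "body": 0.001, "shape": 0.01,
    "algorithm": 0.1,  # hypothetical: not given on the slide
}
# Hypothetical collection-wide probabilities p(w).
background = {word: 0.01 for word in topic}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
print(good > bad)  # True
```

Under these assumptions “clustering algorithm” scores above “body shape” because both of its words carry far more probability mass in the topic than in the collection.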


Relevance (Task 2): the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

Topic θ, P(w|θ): clustering, dimension, partition, algorithm, hash, …

Good Label (l1): “clustering algorithm”
p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, …

l2: “hash join”
p(w | hash join): clustering, hash, dimension, key, algorithm, …
Context of “hash join”: key … hash join … code … hashtable … search … hash join … map key … hash … algorithm … key … hash … key table … join …

$$\mathrm{Score}(l, \theta) = -D(\theta \,\|\, l) \overset{\text{rank}}{=} \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C)$$
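The first-order score, a topic-weighted sum of pointwise mutual information between each word and the label, can be sketched as below. The estimates are assumptions: co-occurrence is counted at the document level, probabilities use a simple add-0.5 smoothing so the logarithm is always defined, and the toy collection and topic weights are invented for illustration.

```python
import math


def first_order_score(label, topic, docs):
    """Sum over w of p(w | theta) * PMI(w, label | C).

    Assumptions: C is the list `docs`, a word or label "occurs" if it
    appears anywhere in a document, and counts are smoothed by 0.5.
    """
    n = len(docs)
    token_sets = [set(doc.split()) for doc in docs]
    has_label = [label in doc for doc in docs]  # phrase match on raw text

    def prob(count):
        return (count + 0.5) / (n + 1)

    p_label = prob(sum(has_label))
    score = 0.0
    for word, p_word_topic in topic.items():
        has_word = [word in tokens for tokens in token_sets]
        p_word = prob(sum(has_word))
        p_joint = prob(sum(a and b for a, b in zip(has_word, has_label)))
        score += p_word_topic * math.log(p_joint / (p_word * p_label))
    return score


# Toy context collection and topic weights (both hypothetical).
docs = [
    "clustering algorithm for high dimension data partition",
    "a clustering algorithm can partition points by dimension",
    "hash join uses a hash table keyed on the join key",
    "hash join algorithm builds a hash table",
    "clustering algorithm with hash partition",
]
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.1, "hash": 0.1}

s_good = first_order_score("clustering algorithm", topic, docs)
s_bad = first_order_score("hash join", topic, docs)
print(s_good > s_bad)  # True
```

“hash join” shares the word “hash” with the topic, but its context rarely contains the high-probability words “clustering”, “dimension”, and “partition”, so its weighted PMI is lower.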


Discrimination and Coverage (Tasks 3 & 4)

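• Discriminative across topics:
  – High relevance to target topic, low relevance to other topics

$$\mathrm{Score}'(l, \theta_i) = \mathrm{Score}(l, \theta_i) - \mu\, \mathrm{Score}(l, \theta_{1,\dots,i-1,i+1,\dots,k})$$

• High Coverage inside topic:
  – Use MMR strategy

$$\hat{l} = \arg\max_{l \in L - S} \left[ \lambda\, \mathrm{Score}(l, \theta) - (1 - \lambda) \max_{l' \in S} \mathrm{Sim}(l', l) \right]$$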
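Both adjustments can be sketched together. The code below is an illustration under stated assumptions: relevance to the other topics is approximated by the mean score over those topics, the similarity between labels is a Jaccard word overlap (a stand-in, since the slide does not define Sim), and `mu`, `lam`, and the other-topic scores are hypothetical values.

```python
def discriminative_score(label, scores_by_topic, i, mu=1.0):
    """Score(l, theta_i) minus mu times the relevance to other topics.

    Assumption: relevance to the other topics is their mean score.
    """
    others = [s[label] for j, s in enumerate(scores_by_topic) if j != i]
    return scores_by_topic[i][label] - mu * sum(others) / len(others)


def jaccard(a, b):
    """Word-overlap similarity between two labels (stand-in for Sim)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def mmr_select(scores, sim=jaccard, n=3, lam=0.5):
    """Greedy MMR: pick relevant labels that are not redundant."""
    selected, pool = [], list(scores)
    while pool and len(selected) < n:
        def mmr(label):
            redundancy = max((sim(s, label) for s in selected), default=0.0)
            return lam * scores[label] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected


# Topic 0 uses scores from the "Our Method" slide; topics 1 and 2 are hypothetical.
by_topic = [
    {"information retrieval": 0.26, "retrieval models": 0.19},
    {"information retrieval": 0.24, "retrieval models": 0.01},
    {"information retrieval": 0.26, "retrieval models": 0.03},
]
print(discriminative_score("information retrieval", by_topic, 0))
print(discriminative_score("retrieval models", by_topic, 0))

# Scores after the discrimination step, as shown on the "Our Method" slide.
adjusted = {"retrieval models": 0.20, "IR models": 0.18,
            "pseudo feedback": 0.09, "information retrieval": 0.01}
picked = mmr_select(adjusted)
print(picked)  # ['retrieval models', 'pseudo feedback', 'IR models']
```

With these settings “information retrieval” is penalized for being relevant to every topic, and “pseudo feedback” is selected before “IR models” because the latter overlaps with the already chosen “retrieval models”.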


Variations and Applications

• Labeling document clusters
  – Document cluster → unigram language model
  – Applicable to any task with unigram language model

• Context sensitive labels
  – Label of a topic is sensitive to the context
  – An alternative way to approach contextual text mining

tree, prune, root, branch: “tree algorithms” in CS; ? in horticulture; ? in marketing
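To label a document cluster with the same machinery, the cluster first has to be turned into a unigram language model. A minimal sketch, assuming whitespace tokenization and a maximum-likelihood estimate (both assumptions, as are the function name and the toy documents):

```python
from collections import Counter


def cluster_language_model(docs):
    """Maximum-likelihood unigram model p(w | cluster) from raw text."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy cluster (hypothetical documents).
cluster = ["prune the tree branch", "tree root and branch"]
lm = cluster_language_model(cluster)
print(lm["tree"])  # 0.25
```

The resulting dictionary has the same shape as a topic's word distribution, so any label scoring function that takes p(w|θ) as input can be applied to it unchanged.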


Experiments

• Datasets:
  – SIGMOD abstracts; SIGIR abstracts; AP news data
  – Candidate labels: significant bigrams; NLP chunks

• Topic models:
  – PLSA, LDA

• Evaluation:
  – Human annotators compare labels generated by anonymous systems
  – Order of systems randomly perturbed; scores averaged over all sample topics


Result Summary

• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigram > NLP chunks
  – Bigrams work better with literature; NLP chunks work better with news

• System labels << human labels
  – Scientific literature is an easier task


Results: Sample Topic Labels

tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
  Labels: r tree, b tree, …; indexing methods

north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007
  Labels: iran contra, …

the, of, a, and, to, data, > 0.02; …; clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005
  Labels: clustering algorithm, clustering structure, …; large data, data quality, high data, data application, …


Results: Context-Sensitive Labeling

Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …

Context: Database (SIGMOD Proceedings)
  selectivity estimation; random sampling; approximate answers

Context: IR (SIGIR Proceedings)
  distributed retrieval; parameter estimation; mixture models

• Explore the different meanings of a topic in different contexts (content switch)

• An alternative approach to contextual text mining


Summary

• Labeling: A postprocessing step of all multinomial topic models

• A probabilistic approach to generate good labels
  – Understandable, relevant, high coverage, discriminative

• Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive

• Future work:
  – Labeling hierarchical topic models
  – Incorporating priors


Thanks!

Please come to our poster tonight (#40)

