Page 1

Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai

University of Illinois at Urbana-Champaign

Page 2

Outline

• Background: statistical topic models

• Labeling a topic model
  – Criteria and challenge

• Our approach: a probabilistic framework

• Experiments

• Summary

Page 3

Statistical Topic Models for Text Mining

• Text collections → probabilistic topic modeling → topic models (multinomial distributions)

• Example topics:
  – web 0.21, search 0.10, link 0.08, graph 0.05, …
  – term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …

• Applications: subtopic discovery, opinion comparison, summarization, topical pattern analysis

• Models:
  – PLSA [Hofmann 99]
  – LDA [Blei et al. 03]
  – Author-Topic [Steyvers et al. 04]
  – CPLSA [Mei & Zhai 06]
  – Pachinko allocation [Li & McCallum 06]
  – Topic over time [Wang et al. 06]
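As a minimal sketch of what "a topic is a multinomial distribution" means in practice, a topic can be held as a word-to-probability mapping and summarized by its top words. The helper names `normalize` and `top_words` and the `<other>` bucket are illustrative assumptions, not part of the talk; the probabilities are the truncated values shown on the slide.

```python
from typing import Dict, List, Tuple


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so that they sum to 1 (a multinomial)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {word: value / total for word, value in weights.items()}


def top_words(topic: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable words of a topic, highest first."""
    return sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Truncated topic as shown on the slide; the words hidden behind the
# ellipsis are lumped into a hypothetical "<other>" bucket so that the
# distribution sums to 1.
shown = {"term": 0.16, "relevance": 0.08, "weight": 0.07,
         "feedback": 0.04, "independ.": 0.03, "model": 0.03}
topic = dict(shown)
topic["<other>"] = 1.0 - sum(shown.values())

print(top_words(shown, k=4))
```

Reading a topic through `top_words` is exactly the "use top words" baseline that the following slides argue is hard to interpret.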

Page 4

Topic Models: Hard to Interpret

• Use top words
  – Automatic, but hard to make sense of

• Human-generated labels
  – Make sense, but cannot scale up

Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …
  – Top words: term, relevance, weight, feedback
  – Human label: Retrieval Models

Example topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, …
  – Label: ?

Question: Can we automatically generate understandable labels for topics?

Page 5

What is a Good Label?

• Semantically close (relevance)
• Understandable – phrases?
• High coverage inside topic
• Discriminative across topics
• …

Example topic (Mei and Zhai 06: a topic in SIGIR): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Candidate labels:
  – iPod Nano
  – Pseudo-feedback
  – Information Retrieval
  – Retrieval models
  – じょうほうけんさく (Japanese for "information retrieval")

Page 6

Our Method

Input: a collection (e.g., SIGIR) and the topics extracted from it
  – term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
  – filtering 0.21, collaborative 0.15, …
  – trec 0.18, evaluation 0.10, …

1. Candidate label pool (NLP chunker, n-gram statistics): information retrieval, retrieval model, index structure, relevance feedback, …

2. Relevance score: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …

3. Discrimination: information retrieval 0.26 → 0.01, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …

4. Coverage: retrieval models 0.20, IR models 0.18 → 0.02, pseudo feedback 0.09, …, information retrieval 0.01
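The candidate-pool step can be sketched as follows. The slides only say that candidates come from an NLP chunker or n-gram statistics, so the raw-count threshold, the tiny stopword list, and the function name `candidate_bigrams` below are assumptions made for illustration; a real system would more likely apply a statistical significance test to the bigrams.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = frozenset({"a", "the", "of", "for", "with"})


def candidate_bigrams(docs: Iterable[str], min_count: int = 2) -> List[str]:
    """Collect bigrams that occur at least min_count times and contain
    no stopword, ordered by descending count and then alphabetically."""
    counts: Counter = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    kept = [
        (count, " ".join(bigram))
        for bigram, count in counts.items()
        if count >= min_count and not (set(bigram) & STOPWORDS)
    ]
    return [label for count, label in sorted(kept, key=lambda x: (-x[0], x[1]))]


docs = [
    "relevance feedback improves a retrieval model",
    "a retrieval model with relevance feedback",
    "index structure for a retrieval model",
]
pool = candidate_bigrams(docs)
print(pool)
```

Every phrase in the pool is then scored against each topic in the later steps, so the pool only needs to be reasonably complete, not precise.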

Page 7

Relevance (Task 2): the Zero-Order Score

• Intuition: prefer phrases well covering top words

Latent topic θ, p(w|θ): clustering, dimensional, algorithm, birch, shape, body, …
  – p(“clustering”|θ) = 0.4
  – p(“dimensional”|θ) = 0.3
  – p(“body”|θ) = 0.001
  – p(“shape”|θ) = 0.01

Good Label (l1): “clustering algorithm”

Bad Label (l2): “body shape”

$$\log \frac{p(\text{clustering algorithm} \mid \theta)}{p(\text{clustering algorithm})} > \log \frac{p(\text{body shape} \mid \theta)}{p(\text{body shape})}$$
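A minimal sketch of the zero-order comparison, assuming the words of a label are independent so that the log ratio of p(l|θ) to p(l) becomes a sum over the label's words. The probabilities for "clustering", "dimensional", "body" and "shape" are the ones on the slide; the topic probability of "algorithm", the whole `background` distribution, and the smoothing `floor` are hypothetical values chosen only to make the example run.

```python
import math
from typing import Dict


def zero_order_score(label: str, topic: Dict[str, float],
                     background: Dict[str, float],
                     floor: float = 1e-6) -> float:
    """Sum of log(p(w|topic) / p(w)) over the words of the label.

    Words missing from either distribution get the probability `floor`
    (an assumption; the slides do not discuss smoothing).
    """
    score = 0.0
    for word in label.lower().split():
        p_topic = topic.get(word, floor)
        p_background = background.get(word, floor)
        score += math.log(p_topic / p_background)
    return score


# Slide values, plus a hypothetical p("algorithm"|topic) = 0.1.
topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "shape": 0.01, "body": 0.001}
# Hypothetical collection-wide word probabilities p(w).
background = {"clustering": 0.01, "dimensional": 0.01, "algorithm": 0.02,
              "shape": 0.005, "body": 0.005}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
print(good > bad)
```

Under these assumed numbers the good label scores log(40) + log(5) and the bad label scores log(0.2) + log(2), so the ordering matches the slide's intuition.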

Page 8

Relevance (Task 2): the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

Topic θ, P(w|θ): clustering, dimension, partition, algorithm, hash, …

Good Label (l1): “clustering algorithm”
  – p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, …

l2: “hash join”
  – p(w | hash join): clustering, hash, dimension, key, algorithm, …
  – Contexts of “hash join”: key … hash join … code … hash table … search … hash join … map key … hash … algorithm … key … hash … key table … join …

$$Score(l, \theta) = -D(\theta \,\|\, l) \overset{rank}{=} \sum_{w} p(w \mid \theta)\, PMI(w, l \mid C)$$

Here C denotes the collection that provides the context of each label.
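The first-order idea can be sketched by ranking labels with the expected pointwise mutual information between the topic's words and the label, i.e. the sum over w of p(w|θ) · PMI(w, l | C). In the sketch below PMI is estimated from document-level co-occurrence with add-`eps` smoothing; that estimator, the toy collection, the toy topic, and the function names are all assumptions for illustration.

```python
import math
from typing import Dict, List


def doc_pmi(word: str, label: str, docs: List[str], eps: float = 0.5) -> float:
    """Smoothed PMI(word, label | C) from document-level co-occurrence."""
    n = len(docs)
    n_word = n_label = n_both = 0
    for doc in docs:
        tokens = doc.lower().split()
        has_word = word in tokens
        # Pad with spaces so the label only matches whole tokens.
        has_label = f" {label} " in f" {' '.join(tokens)} "
        n_word += has_word
        n_label += has_label
        n_both += has_word and has_label
    p_word = (n_word + eps) / (n + eps)
    p_label = (n_label + eps) / (n + eps)
    p_both = (n_both + eps) / (n + eps)
    return math.log(p_both / (p_word * p_label))


def first_order_score(label: str, topic: Dict[str, float],
                      docs: List[str]) -> float:
    """Expected PMI between the topic's words and the label."""
    return sum(p * doc_pmi(w, label, docs) for w, p in topic.items())


# Toy collection and topic, both hypothetical.
docs = [
    "clustering algorithm partition dimension clustering",
    "a clustering algorithm for high dimension data partition",
    "hash join key table hash",
    "hash join algorithm key code search",
]
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.15, "hash": 0.05}

good = first_order_score("clustering algorithm", topic, docs)
bad = first_order_score("hash join", topic, docs)
print(good > bad)
```

Because "hash join" mostly co-occurs with words such as "key" and "table" that carry no mass in the topic, its expected PMI is lower even though it shares "hash" and "algorithm" with the topic.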

Page 9

Discrimination and Coverage (Tasks 3 & 4)

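• Discriminative across topics:
  – High relevance to target topic, low relevance to other topics

$$Score'(l, \theta_i) = Score(l, \theta_i) - \mu\, Score(l, \theta_{1,\ldots,i-1,i+1,\ldots,k})$$

• High coverage inside topic:
  – Use MMR strategy

$$\hat{l} = \arg\max_{l \in L - S} \left[ \lambda\, Score(l, \theta) - (1 - \lambda) \max_{l' \in S} Sim(l', l) \right]$$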
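Both adjustments can be sketched in a few lines. The discrimination step subtracts a weighted relevance to the other topics, and the coverage step greedily picks labels by maximal marginal relevance (MMR), trading relevance against similarity to labels already chosen. Averaging over the other topics, the Jaccard word-overlap similarity, the parameter defaults, and the function names are assumptions for this sketch; the relevance values are the ones shown on the "Our Method" slide.

```python
from typing import Callable, Dict, List


def discriminative_score(scores: List[float], i: int, mu: float = 0.7) -> float:
    """scores[j] is the relevance of one label to topic j. Penalize the
    mean relevance to all topics other than i (the mean is an assumption)."""
    others = [s for j, s in enumerate(scores) if j != i]
    penalty = sum(others) / len(others) if others else 0.0
    return scores[i] - mu * penalty


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two labels (a stand-in for Sim)."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def mmr_select(relevance: Dict[str, float], k: int, lam: float = 0.5,
               sim: Callable[[str, str], float] = jaccard) -> List[str]:
    """Greedy MMR: repeatedly add the label that maximizes
    lam * relevance - (1 - lam) * (max similarity to selected labels)."""
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def gain(label: str) -> float:
            redundancy = max((sim(s, label) for s in selected), default=0.0)
            return lam * relevance[label] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=gain)  # sorted for determinism
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"information retrieval": 0.26, "retrieval models": 0.19,
             "IR models": 0.17, "pseudo feedback": 0.06}
labels = mmr_select(relevance, k=3)
print(labels)
```

With word overlap as the similarity, "retrieval models" is pushed down after "information retrieval" is chosen, which is the kind of redundancy the coverage step is meant to avoid.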

Page 10

Variations and Applications

• Labeling document clusters
  – Document cluster → unigram language model
  – Applicable to any task with a unigram language model

• Context-sensitive labels
  – Label of a topic is sensitive to the context
  – An alternative way to approach contextual text mining
  – Example: tree, prune, root, branch → “tree algorithms” in CS; ? in horticulture; ? in marketing
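For the document-cluster variation, one simple way to obtain the unigram language model is a maximum-likelihood estimate over the pooled tokens of the cluster. The sketch below assumes whitespace tokenization and no smoothing, and the name `cluster_language_model` is hypothetical; the resulting word distribution can then be labeled like any other topic.

```python
from collections import Counter
from typing import Dict, Iterable


def cluster_language_model(docs: Iterable[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram model over all tokens in the cluster."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cluster contains no tokens")
    return {word: count / total for word, count in counts.items()}


# Toy cluster built from the example words on the slide.
cluster = ["tree prune root", "root branch tree tree"]
lm = cluster_language_model(cluster)
print(sorted(lm.items(), key=lambda kv: -kv[1]))
```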

Page 11

Experiments

• Datasets:
  – SIGMOD abstracts; SIGIR abstracts; AP news data
  – Candidate labels: significant bigrams; NLP chunks

• Topic models:
  – PLSA, LDA

• Evaluation:
  – Human annotators compare labels generated by anonymous systems
  – Order of systems randomly perturbed; scores averaged over all sample topics

Page 12

Result Summary

• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigram > NLP chunks
  – Bigram works better with literature; NLP better with news
• System labels << human labels
  – Scientific literature is an easier task

Page 13

Results: Sample Topic Labels

• Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
  – Labels: r tree, b tree, …
  – Labels: indexing methods

• Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007
  – Labels: iran contra, …

• Topic: the, of, a, and, to, data (each > 0.02), …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005
  – Labels: clustering algorithm, clustering structure, …
  – Labels: large data, data quality, high data, data application, …

Page 14

Results: Context-Sensitive Labeling

Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …

Context: Database (SIGMOD Proceedings)
  – selectivity estimation; random sampling; approximate answers

Context: IR (SIGIR Proceedings)
  – distributed retrieval; parameter estimation; mixture models

• Explore the different meanings of a topic in different contexts (context switch)

• An alternative approach to contextual text mining

Page 15

Summary

• Labeling: a postprocessing step for all multinomial topic models

• A probabilistic approach to generate good labels
  – Understandable, relevant, high coverage, discriminative

• Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive

• Future work:
  – Labeling hierarchical topic models
  – Incorporating priors

Page 16

Thanks!

– Please come to our poster tonight (#40)