
Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai

University of Illinois at Urbana-Champaign

2

Outline

• Background: statistical topic models

• Labeling a topic model – criteria and challenge

• Our approach: a probabilistic framework

• Experiments

• Summary

3

Statistical Topic Models for Text Mining

Text Collections

Probabilistic Topic Modeling

web 0.21, search 0.10, link 0.08, graph 0.05, …

Subtopic discovery

Opinion comparison

Summarization

Topical pattern analysis

term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …

Topic models (multinomial distributions; a minimal sketch follows the list below)

PLSA [Hofmann 99]

LDA [Blei et al. 03]

Author-Topic [Steyvers et al. 04]

CPLSA [Mei & Zhai 06]

Pachinko allocation [Li & McCallum 06]

Topic over time [Wang et al. 06]
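
All of these models represent a topic the same way: as a multinomial distribution over the vocabulary. A minimal sketch of that data structure, using the illustrative probabilities from the example above (not any particular library's API):

```python
# A topic is a multinomial distribution over words: p(w | theta).
# Probabilities below are the illustrative values from the slide.
web_topic = {"web": 0.21, "search": 0.10, "link": 0.08, "graph": 0.05}

# The familiar "top words" view is just the highest-probability entries.
top_words = sorted(web_topic, key=web_topic.get, reverse=True)[:4]
print(top_words)  # ['web', 'search', 'link', 'graph']
```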

4

Topic Models: Hard to Interpret

• Use top words – automatic, but hard to make sense of

• Human-generated labels – make sense, but cannot scale up

Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …

Human label: "Retrieval Models"; top-words label: "term, relevance, weight, feedback"

Question: Can we automatically generate understandable labels for topics?

Another topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, … Label: ?

5

What is a Good Label?

• Semantically close (relevance)
• Understandable – phrases?
• High coverage inside topic
• Discriminative across topics
• …

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

iPod Nano

Pseudo-feedback

Information Retrieval

Retrieval models

じょうほうけんさく ("information retrieval" in Japanese)

– Mei and Zhai 06: a topic in SIGIR

6

Our Method

Collection (e.g., SIGIR)

Topic 1: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
Topic 2: filtering 0.21, collaborative 0.15, …
Topic 3: trec 0.18, evaluation 0.10, …

NLP Chunker / N-gram statistics

information retrieval, retrieval model, index structure, relevance feedback, …

Step 1: Candidate label pool (above)

Step 2: Relevance score
information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …

Step 3: Discrimination
information retrieval 0.26 → 0.01, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …

Step 4: Coverage
retrieval models 0.20, IR models 0.18 → 0.02, pseudo feedback 0.09, …, information retrieval 0.01
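
A minimal sketch of Step 1, assuming the collection is plain-text abstracts. It uses a raw frequency cutoff in place of the paper's significance test for bigrams or NLP chunking; the function name and threshold are illustrative:

```python
from collections import Counter

def candidate_bigrams(docs, min_count=5):
    """Collect frequent bigrams from the collection as candidate labels.
    A crude stand-in for significant-bigram tests or an NLP chunker."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()   # naive tokenization
        counts.update(zip(tokens, tokens[1:]))
    return [" ".join(bigram) for bigram, c in counts.items() if c >= min_count]
```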

7

Relevance (Task 2): the Zero-Order Score

• Intuition: prefer phrases that cover the topic's top words well

[Figure: a latent topic over words such as clustering, dimensional, algorithm, birch, shape, body, with its distribution p(w|θ)]

Good label (l1): "clustering algorithm"
Bad label (l2): "body shape"

p("clustering" | θ) = 0.4
p("dimensional" | θ) = 0.3
p("body" | θ) = 0.001
p("shape" | θ) = 0.01

Zero-order score: compare log [ p(l | θ) / p(l) ] across candidate labels, e.g.

log [ p("clustering algorithm" | θ) / p("clustering algorithm") ] > log [ p("body shape" | θ) / p("body shape") ] ?  ✓
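
A minimal sketch of the zero-order score under the usual independence assumption, so that log p(l | θ) decomposes into a sum over the label's words. The smoothing constant and dictionary representation are illustrative assumptions:

```python
import math

def zero_order_score(label_words, topic, background, eps=1e-9):
    """Zero-order relevance: log [ p(l | theta) / p(l) ], approximated
    word by word. `topic` maps word -> p(w | theta); `background` maps
    word -> p(w) in the whole collection."""
    score = 0.0
    for w in label_words:
        score += math.log(topic.get(w, eps) / background.get(w, eps))
    return score

topic = {"clustering": 0.4, "dimensional": 0.3, "body": 0.001, "shape": 0.01}
background = {w: 0.01 for w in topic}  # made-up background model
print(zero_order_score(["clustering", "algorithm"], topic, background))  # high
print(zero_order_score(["body", "shape"], topic, background))            # low
```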

8

Relevance (Task 2): the First-Order Score

• Intuition: prefer phrases whose context distribution (in the collection) is similar to the topic distribution

[Figure: the topic p(w|θ), peaked on clustering, dimension, partition, algorithm, hash, compared against two context distributions estimated from text snippets around each candidate label: p(w | "clustering algorithm") is also peaked on clustering, hash, dimension, algorithm, partition, while p(w | "hash join"), estimated from snippets like "…key …hash join… code …hash table …", is peaked on hash, key, table, join instead]

Good label (l1): "clustering algorithm" – its context distribution matches the topic
Bad label (l2): "hash join" – its context distribution does not

Score(l, θ) = Σ_w p(w | θ) · PMI(w, l | C), which ranks labels (up to label-independent terms) by −D(θ ‖ l)
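
A minimal sketch of the first-order score, assuming co-occurrence statistics have already been counted over text windows of the collection C. The counting scheme, container names, and back-off for unseen pairs are illustrative assumptions:

```python
import math

def pmi(w, label, cooc, occ, n_windows):
    """PMI(w, label | C) from window counts: cooc[(w, label)] is the
    number of windows containing both, occ[x] the number containing x."""
    p_wl = cooc.get((w, label), 0) / n_windows
    p_w = occ.get(w, 0) / n_windows
    p_l = occ.get(label, 0) / n_windows
    if p_wl == 0 or p_w == 0 or p_l == 0:
        return 0.0                      # back off for unseen pairs
    return math.log(p_wl / (p_w * p_l))

def first_order_score(label, topic, cooc, occ, n_windows):
    """First-order relevance: expected PMI of the label with a word
    drawn from the topic, sum_w p(w | theta) * PMI(w, label | C)."""
    return sum(p * pmi(w, label, cooc, occ, n_windows)
               for w, p in topic.items())
```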

9

Discrimination and Coverage (Tasks 3 & 4)

• Discriminative across topics: high relevance to the target topic, low relevance to the other topics

• High coverage inside the topic: use the MMR (Maximal Marginal Relevance) strategy, sketched below

Discrimination: Score′(l, θ_i) = Score(l, θ_i) − μ · Score(l, θ_1,…,θ_{i−1}, θ_{i+1},…,θ_k)

Coverage (MMR): l̂ = argmax_{l ∈ L∖S} [ λ · Score′(l, θ) − (1 − λ) · max_{l′ ∈ S} Sim(l, l′) ]
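
A minimal sketch of the MMR selection loop; `score` and `sim` are passed in as functions, and the λ and k settings are illustrative rather than values from the paper:

```python
def select_labels(candidates, score, sim, k=3, lam=0.7):
    """Greedy MMR: repeatedly pick the label that balances discriminative
    relevance (score) against redundancy with already-selected labels (sim)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool,
                   key=lambda l: lam * score(l)
                   - (1 - lam) * max((sim(l, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```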

10

Variations and Applications

• Labeling document clusters
– Estimate a unigram language model from the cluster, then label it like a topic
– Applicable to any task that produces a unigram language model (a sketch follows below)

• Context-sensitive labels
– The label of a topic depends on the context it is viewed in
– An alternative way to approach contextual text mining

Example: top words tree, prune, root, branch suggest "tree algorithms" in a CS context, but what in horticulture? What in marketing?
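
A minimal sketch of the document-cluster variation: turn a cluster into a unigram language model and reuse the same labeling machinery (the tokenization and the reference to the earlier scoring sketches are illustrative):

```python
from collections import Counter

def cluster_unigram_lm(docs):
    """Estimate p(w | cluster) by maximum likelihood over a document
    cluster; naive whitespace tokenization, purely illustrative."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# The resulting distribution plays the role of p(w | theta) and can be
# fed to the zero_order_score / first_order_score sketches above.
```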

11

Experiments

• Datasets:
– SIGMOD abstracts; SIGIR abstracts; AP news data
– Candidate labels: significant bigrams; NLP chunks

• Topic models: PLSA, LDA

• Evaluation:
– Human annotators compare labels generated by anonymized systems
– Order of systems randomly perturbed; scores averaged over all sample topics

12

Result Summary

• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks
– Bigrams work better on scientific literature; NLP chunks work better on news
• System labels << human labels
– Scientific literature is the easier task

13

Results: Sample Topic Labels

Topic 1: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
→ labels: "r tree", "b tree", …, "indexing methods"

Topic 2: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007
→ label: "iran contra"

Topic 3: the, of, a, and, to, data (each > 0.02), …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005
→ good labels: "clustering algorithm", "clustering structure"; weaker candidates: "large data", "data quality", "high data", "data application", …

14

Results: Context-Sensitive Labeling

Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …

Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models

• Explores how the meaning of a topic shifts across contexts (content switch)

• An alternative approach to contextual text mining

15

Summary

• Labeling: a post-processing step for any multinomial topic model

• A probabilistic framework for generating good labels – understandable, relevant, high-coverage, discriminative

• Broadly applicable to mining tasks involving multinomial word distributions; supports context-sensitive labeling

• Future work:
– Labeling hierarchical topic models
– Incorporating priors

16

Thanks! Please come to our poster tonight (#40).