
Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai

University of Illinois at Urbana-Champaign

2

Outline

• Background: statistical topic models

• Labeling a topic model – criteria and challenge

• Our approach: a probabilistic framework

• Experiments

• Summary

3

Statistical Topic Models for Text Mining

Text Collections

Probabilistic Topic Modeling

web 0.21, search 0.10, link 0.08, graph 0.05, …

Subtopic discovery

Opinion comparison

Summarization

Topical pattern analysis

term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …

Topic models (multinomial distributions; a minimal sketch follows the list below)

PLSA [Hofmann 99]

LDA [Blei et al. 03]

Author-Topic [Steyvers et al. 04]

CPLSA [Mei & Zhai 06]

Pachinko allocation [Li & McCallum 06]

Topic over time [Wang et al. 06]
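
All of these models represent a topic the same way: as a multinomial distribution over the vocabulary. A minimal sketch of that data structure, using the illustrative probabilities from the example above (not any particular library's API):

```python
# A topic is a multinomial distribution over words: p(w | theta).
# Probabilities below are the illustrative values from the slide.
web_topic = {"web": 0.21, "search": 0.10, "link": 0.08, "graph": 0.05}

# The familiar "top words" view is just the highest-probability entries.
top_words = sorted(web_topic, key=web_topic.get, reverse=True)[:4]
print(top_words)  # ['web', 'search', 'link', 'graph']
```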

4

Topic Models: Hard to Interpret

• Use top words – automatic, but hard to make sense of

• Human-generated labels – make sense, but cannot scale up

Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …

Human label: "Retrieval Models"; top-words label: "term, relevance, weight, feedback"

Question: Can we automatically generate understandable labels for topics?

Another topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, … Label: ?

5

What is a Good Label?

• Semantically close (relevance)
• Understandable – phrases?
• High coverage inside topic
• Discriminative across topics
• …

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

iPod Nano

Pseudo-feedback

Information Retrieval

Retrieval models

じょうほうけんさく ("information retrieval" in Japanese)

– Mei and Zhai 06: a topic in SIGIR

6

Our Method

Collection (e.g., SIGIR)

Topic 1: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
Topic 2: filtering 0.21, collaborative 0.15, …
Topic 3: trec 0.18, evaluation 0.10, …

NLP Chunker / N-gram statistics

information retrieval, retrieval model, index structure, relevance feedback, …

Step 1: Candidate label pool (above)

Step 2: Relevance score
information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …

Step 3: Discrimination
information retrieval 0.26 → 0.01, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …

Step 4: Coverage
retrieval models 0.20, IR models 0.18 → 0.02, pseudo feedback 0.09, …, information retrieval 0.01
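
A minimal sketch of Step 1, assuming the collection is plain-text abstracts. It uses a raw frequency cutoff in place of the paper's significance test for bigrams or NLP chunking; the function name and threshold are illustrative:

```python
from collections import Counter

def candidate_bigrams(docs, min_count=5):
    """Collect frequent bigrams from the collection as candidate labels.
    A crude stand-in for significant-bigram tests or an NLP chunker."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()   # naive tokenization
        counts.update(zip(tokens, tokens[1:]))
    return [" ".join(bigram) for bigram, c in counts.items() if c >= min_count]
```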

7

Relevance (Task 2): the Zero-Order Score

• Intuition: prefer phrases that cover the topic's top words well

[Figure: a latent topic over words such as clustering, dimensional, algorithm, birch, shape, body, with its distribution p(w|θ)]

Good label (l1): "clustering algorithm"
Bad label (l2): "body shape"

p("clustering" | θ) = 0.4
p("dimensional" | θ) = 0.3
p("body" | θ) = 0.001
p("shape" | θ) = 0.01

Zero-order score: compare log [ p(l | θ) / p(l) ] across candidate labels, e.g.

log [ p("clustering algorithm" | θ) / p("clustering algorithm") ] > log [ p("body shape" | θ) / p("body shape") ] ?  ✓
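
A minimal sketch of the zero-order score under the usual independence assumption, so that log p(l | θ) decomposes into a sum over the label's words. The smoothing constant and dictionary representation are illustrative assumptions:

```python
import math

def zero_order_score(label_words, topic, background, eps=1e-9):
    """Zero-order relevance: log [ p(l | theta) / p(l) ], approximated
    word by word. `topic` maps word -> p(w | theta); `background` maps
    word -> p(w) in the whole collection."""
    score = 0.0
    for w in label_words:
        score += math.log(topic.get(w, eps) / background.get(w, eps))
    return score

topic = {"clustering": 0.4, "dimensional": 0.3, "body": 0.001, "shape": 0.01}
background = {w: 0.01 for w in topic}  # made-up background model
print(zero_order_score(["clustering", "algorithm"], topic, background))  # high
print(zero_order_score(["body", "shape"], topic, background))            # low
```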

8

Relevance (Task 2): the First-Order Score

• Intuition: prefer phrases whose context distribution (in the collection) is similar to the topic distribution

[Figure: the topic p(w|θ), peaked on clustering, dimension, partition, algorithm, hash, compared against two context distributions estimated from text snippets around each candidate label: p(w | "clustering algorithm") is also peaked on clustering, hash, dimension, algorithm, partition, while p(w | "hash join"), estimated from snippets like "…key …hash join… code …hash table …", is peaked on hash, key, table, join instead]

Good label (l1): "clustering algorithm" – its context distribution matches the topic
Bad label (l2): "hash join" – its context distribution does not

Score(l, θ) = Σ_w p(w | θ) · PMI(w, l | C), which ranks labels (up to label-independent terms) by −D(θ ‖ l)
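
A minimal sketch of the first-order score, assuming co-occurrence statistics have already been counted over text windows of the collection C. The counting scheme, container names, and back-off for unseen pairs are illustrative assumptions:

```python
import math

def pmi(w, label, cooc, occ, n_windows):
    """PMI(w, label | C) from window counts: cooc[(w, label)] is the
    number of windows containing both, occ[x] the number containing x."""
    p_wl = cooc.get((w, label), 0) / n_windows
    p_w = occ.get(w, 0) / n_windows
    p_l = occ.get(label, 0) / n_windows
    if p_wl == 0 or p_w == 0 or p_l == 0:
        return 0.0                      # back off for unseen pairs
    return math.log(p_wl / (p_w * p_l))

def first_order_score(label, topic, cooc, occ, n_windows):
    """First-order relevance: expected PMI of the label with a word
    drawn from the topic, sum_w p(w | theta) * PMI(w, label | C)."""
    return sum(p * pmi(w, label, cooc, occ, n_windows)
               for w, p in topic.items())
```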

9

Discrimination and Coverage (Tasks 3 & 4)

• Discriminative across topics: high relevance to the target topic, low relevance to the other topics

• High coverage inside the topic: use the MMR (Maximal Marginal Relevance) strategy, sketched below

Discrimination: Score′(l, θ_i) = Score(l, θ_i) − μ · Score(l, θ_1,…,θ_{i−1}, θ_{i+1},…,θ_k)

Coverage (MMR): l̂ = argmax_{l ∈ L∖S} [ λ · Score′(l, θ) − (1 − λ) · max_{l′ ∈ S} Sim(l, l′) ]
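
A minimal sketch of the MMR selection loop; `score` and `sim` are passed in as functions, and the λ and k settings are illustrative rather than values from the paper:

```python
def select_labels(candidates, score, sim, k=3, lam=0.7):
    """Greedy MMR: repeatedly pick the label that balances discriminative
    relevance (score) against redundancy with already-selected labels (sim)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool,
                   key=lambda l: lam * score(l)
                   - (1 - lam) * max((sim(l, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```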

10

Variations and Applications

• Labeling document clusters
– Estimate a unigram language model from the cluster, then label it like a topic
– Applicable to any task that produces a unigram language model (a sketch follows below)

• Context-sensitive labels
– The label of a topic depends on the context it is viewed in
– An alternative way to approach contextual text mining

Example: top words tree, prune, root, branch suggest "tree algorithms" in a CS context, but what in horticulture? What in marketing?
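
A minimal sketch of the document-cluster variation: turn a cluster into a unigram language model and reuse the same labeling machinery (the tokenization and the reference to the earlier scoring sketches are illustrative):

```python
from collections import Counter

def cluster_unigram_lm(docs):
    """Estimate p(w | cluster) by maximum likelihood over a document
    cluster; naive whitespace tokenization, purely illustrative."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# The resulting distribution plays the role of p(w | theta) and can be
# fed to the zero_order_score / first_order_score sketches above.
```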

11

Experiments

• Datasets:
– SIGMOD abstracts; SIGIR abstracts; AP news data
– Candidate labels: significant bigrams; NLP chunks

• Topic models: PLSA, LDA

• Evaluation:
– Human annotators compare labels generated by anonymized systems
– Order of systems randomly perturbed; scores averaged over all sample topics

12

Result Summary

• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks
– Bigrams work better on scientific literature; NLP chunks work better on news
• System labels << human labels
– Scientific literature is the easier task

13

Results: Sample Topic Labels

Topic 1: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
→ labels: "r tree", "b tree", …, "indexing methods"

Topic 2: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007
→ label: "iran contra"

Topic 3: the, of, a, and, to, data (each > 0.02), …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005
→ good labels: "clustering algorithm", "clustering structure"; weaker candidates: "large data", "data quality", "high data", "data application", …

14

Results: Context-Sensitive Labeling

Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …

Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models

• Explores how the meaning of a topic shifts across contexts (content switch)

• An alternative approach to contextual text mining

15

Summary

• Labeling: a post-processing step for any multinomial topic model

• A probabilistic framework for generating good labels – understandable, relevant, high-coverage, discriminative

• Broadly applicable to mining tasks involving multinomial word distributions; supports context-sensitive labeling

• Future work:
– Labeling hierarchical topic models
– Incorporating priors

16

Thanks! Please come to our poster tonight (#40).