KDD 2011 Summary of Text Mining sessions Hongbo Deng.


3 Text Mining Sessions, 9 Papers

• Beyond Keyword Search: Discovering Relevant Scientific Literature
– Khalid El-Arini (Carnegie Mellon University), Carlos Guestrin

• Collaborative Topic Modeling for Recommending Scientific Articles
– Chong Wang (Princeton University), David M. Blei

• Partially Labeled Topic Models for Interpretable Text Mining
– Daniel Ramage (Stanford University), Christopher D. Manning, Susan Dumais

• Refining Causality: Who Copied from Whom?
– Tristan Snowsill (University of Bristol), Nick Fyson, Tijl De Bie, Nello Cristianini

• Conditional Topical Coding: An Efficient Topic Model Conditioned on Rich Features
– Jun Zhu (Carnegie Mellon University), Ni Lao, Ning Chen, Eric P. Xing

• Tracking Trends: Incorporating Term Volume into Temporal Topic Models
– Liangjie Hong (Lehigh University), Dawei Yin, Jian Guo, Brian D. Davison

• Latent Topic Feedback for Information Retrieval
– David Andrzejewski (Lawrence Livermore National Laboratory), David Buttler

• Localized Factor Models for Multi-Context Recommendation
– Deepak Agarwal (Yahoo! Labs), Bee-Chung Chen, Bo Long

• Latent Aspect Rating Analysis without Aspect Keyword Supervision
– Hongning Wang (University of Illinois at Urbana-Champaign), Yue Lu, ChengXiang Zhai

Topic Model

Recommendation

Topic models are widely used in other sessions, e.g., user modeling, query log analysis, ad …

Collaborative Topic Modeling for Recommending Scientific Articles

• Problem:
– To recommend scientific articles to users of an online community

• Input:
– Users' libraries from CiteULike
– The content of the articles

• Output:
– Articles relevant to the users' interests

• Three traditional ways
– Follow citations in other articles they are interested in
– Keyword search
– Using recommendation methods (CiteULike)

• Several criteria
– Recommending older articles is important
– Recommending new articles is also important
– Exploratory variables can be valuable in online scientific archives and communities

Collaborative Filtering + Topic Modeling


• Two types of data
– The other users' libraries [collaborative filtering]
  • Like latent factor models, use information from other users' libraries
  • For a particular user, it can recommend articles from other users who liked similar articles
  • Latent factor models work well for recommending known articles, but cannot generalize to previously unseen articles
– The content of the articles [topic modeling]
  • To generalize to unseen articles, the authors use topic modeling
  • Can recommend articles that have similar content to other articles that a user likes


• Intuition: Combine collaborative filtering and probabilistic topic modeling for recommending scientific articles

The key property in CTR lies in how the item latent vector $v_j$ is generated

We assume the item latent vector $v_j$ is close to topic proportion $\theta_j$, but could diverge from it if it has to
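A minimal sketch of this idea, assuming the item vector is the topic proportion plus a Gaussian offset, $v_j = \theta_j + \epsilon_j$ with $\epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I)$. The precision `lambda_v`, the user vector `u_i`, the helper names, and all numbers are illustrative assumptions, not values from the slides.

```python
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sample_item_vector(theta_j, lambda_v, rng):
    """Draw v_j = theta_j + eps_j with eps_j ~ N(0, (1/lambda_v) * I).

    A large lambda_v keeps v_j close to the topic proportions;
    a small lambda_v lets v_j diverge from them.
    """
    std = (1.0 / lambda_v) ** 0.5
    return [t + rng.gauss(0.0, std) for t in theta_j]

rng = random.Random(0)
theta_j = [0.7, 0.2, 0.1]  # hypothetical topic proportions of article j
u_i = [1.0, 0.0, 0.5]      # hypothetical latent vector of user i

v_tight = sample_item_vector(theta_j, lambda_v=1e6, rng=rng)  # stays near theta_j
v_loose = sample_item_vector(theta_j, lambda_v=1.0, rng=rng)  # free to diverge

# For an article nobody has saved yet, only the content is available,
# so the score falls back to the topic proportions alone.
cold_start_score = dot(u_i, theta_j)
# For an article with library data, the learned item vector is used.
known_item_score = dot(u_i, v_tight)
```

The offset is what lets collaborative data pull an article away from its pure content representation, while unseen articles can still be scored from content alone.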

Latent Topic Feedback for Information Retrieval

• Problem: a user navigating an unfamiliar corpus of text documents where document metadata is limited or unavailable

• Intuition: To augment keyword search with user feedback on latent topics

• Key point: A new method for obtaining and exploiting user feedback at the latent topic level

Latent Topic Feedback for Information Retrieval

• Method:
– To learn latent topics from the corpus and construct meaningful representations of these topics
– At query time, decide which latent topics are potentially relevant and present the appropriate topic representations alongside keyword search results
– When a user selects a latent topic, the vocabulary terms most strongly associated with that topic are then used to augment the original query
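The last step can be sketched as below, assuming each topic is available as a word-to-probability mapping and that "most strongly associated" means highest probability under the topic. The function names, the cutoff `k`, and the toy topic are hypothetical.

```python
def top_terms(topic_word_probs, k):
    """Return the k words with the highest probability under a topic."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def augment_query(query_terms, topic_word_probs, k=3):
    """Append the topic's top-k terms to the query, skipping duplicates."""
    expanded = list(query_terms)
    for term in top_terms(topic_word_probs, k):
        if term not in expanded:
            expanded.append(term)
    return expanded

# Toy topic the user is assumed to have selected (illustrative numbers only).
selected_topic = {"court": 0.30, "judge": 0.25, "trial": 0.20, "law": 0.05}

query = ["trial"]
expanded = augment_query(query, selected_topic, k=3)
print(expanded)  # ['trial', 'court', 'judge']
```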

Beyond Keyword Search: Discovering Relevant Scientific Literature

• Problem: As the number of publications has grown, it has become difficult for scientists to find relevant prior work for their particular research

• Input: a set of papers as a query

• Output: a set of highly relevant articles

• Method:

– Modeling scientific influence between documents: optimize an objective function

– Select a set of papers A with maximum influence to/from the query set Q

– Incorporate trust and personalization: as scientists trust some authors more than others, results can be personalized to individual preferences
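The slide does not give the objective function, so the following is only a sketch of the selection step under stated assumptions: pairwise influence scores are precomputed, the objective is a simple coverage sum (each query paper is credited with its best-matching selected paper), and the set A is built greedily. The objective, the function names, and the scores are all hypothetical stand-ins, not the authors' formulation.

```python
def coverage(selected, influence, query):
    """Assumed objective: each query paper counts its strongest link into the set."""
    return sum(
        max((influence[a].get(q, 0.0) for a in selected), default=0.0)
        for q in query
    )

def greedy_select(candidates, influence, query, k):
    """Greedily add the candidate that most increases the assumed objective."""
    selected = []
    for _ in range(min(k, len(candidates))):
        remaining = [c for c in candidates if c not in selected]
        best = max(remaining, key=lambda c: coverage(selected + [c], influence, query))
        selected.append(best)
    return selected

# Hypothetical influence scores between candidate papers and query papers.
influence = {
    "p1": {"q1": 0.9, "q2": 0.1},
    "p2": {"q1": 0.7, "q2": 0.2},
    "p3": {"q1": 0.1, "q2": 0.7},
}
query = ["q1", "q2"]
picked = greedy_select(["p1", "p2", "p3"], influence, query, k=2)
print(picked)  # ['p1', 'p3']
```

Under this toy objective the second pick is p3 rather than p2, because p2 mostly repeats the influence p1 already covers. Trust and personalization could enter by reweighting the influence scores per user, which this sketch leaves out.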

Partially Labeled Topic Models for Interpretable Text Mining

• Problem: make use of the unsupervised learning of topic modeling, with constraints that align some learned topics with a human-provided label

• Input: a collection of documents, partial labels

Graphical model for PLDA

• Observed: each document’s words w and labels Λ

• Output: θ (per-doc-label topic distribution), Φ (per-topic word distribution), ψ (per-doc label distribution)

Extend the generative story of LDA to incorporate labels, and of Labeled LDA to incorporate per-label latent topics

Each topic is a multinomial distribution over the $V$ vocabulary words that tend to co-occur with each other and with some label
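A simplified sketch of this generative story for a single document: each word first picks one of the document's observed labels, then a topic belonging to that label, then a word from that topic. The distributions are fixed by hand here rather than drawn from priors, and every name and number is an illustrative assumption.

```python
import random

def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(labels, psi, theta, phi, vocab, n_words, rng):
    """Generate (label, topic, word) triples for one document.

    labels: the document's observed labels
    psi:    per-doc label distribution, aligned with `labels`
    theta:  per-doc-label topic distribution, label -> (topic names, probabilities)
    phi:    per-topic word distribution, topic name -> probabilities over vocab
    """
    triples = []
    for _ in range(n_words):
        label = labels[sample_index(psi, rng)]
        topic_names, topic_probs = theta[label]
        topic = topic_names[sample_index(topic_probs, rng)]
        word = vocab[sample_index(phi[topic], rng)]
        triples.append((label, topic, word))
    return triples

vocab = ["gene", "cell", "model", "data"]
phi = {
    "bio_0": [0.6, 0.3, 0.05, 0.05],
    "bio_1": [0.2, 0.7, 0.05, 0.05],
    "ml_0": [0.05, 0.05, 0.5, 0.4],
}
doc_labels = ["bio", "ml"]
psi = [0.6, 0.4]
theta = {"bio": (["bio_0", "bio_1"], [0.5, 0.5]), "ml": (["ml_0"], [1.0])}

rng = random.Random(1)
doc = generate_document(doc_labels, psi, theta, phi, vocab, n_words=8, rng=rng)
```

Because a topic is only ever drawn from the set attached to the sampled label, every learned topic stays tied to a human-provided label, which is what makes the result interpretable.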

Latent Aspect Rating Analysis without Aspect Keyword Supervision

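[Figure: two-step pipeline. Reviews + overall ratings → Aspect Segmentation → Aspect segments → Latent Rating Regression → Term Weights, Aspect Rating, Aspect Weight]

• Aspect segments (term:count)
– location:1, amazing:1, walk:1, anywhere:1
– nice:1, accommodating:1, smile:1, friendliness:1, attentiveness:1
– room:1, nicely:1, appointed:1, comfortable:1

• Term Weights
– 0.1, 1.7, 0.1, 3.9
– 0.0, 2.9, 0.1, 0.9
– 2.1, 1.2, 1.7, 2.2, 0.6

• Aspect Rating: 3.9, 4.8, 5.8

• Aspect Weight: 0.2, 0.2, 0.6

• Gap ???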
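As a small worked example of the regression step, assuming the overall rating is modeled as the weighted sum of the latent aspect ratings (the usual reading of Latent Rating Regression), the aspect ratings 3.9, 4.8, 5.8 and aspect weights 0.2, 0.2, 0.6 from the figure combine as follows. The function name is hypothetical.

```python
def overall_rating(aspect_ratings, aspect_weights):
    """Weighted sum of aspect ratings; the weights are assumed to sum to 1."""
    return sum(w * r for w, r in zip(aspect_weights, aspect_ratings))

aspect_ratings = [3.9, 4.8, 5.8]  # values shown in the figure
aspect_weights = [0.2, 0.2, 0.6]  # values shown in the figure

predicted = overall_rating(aspect_ratings, aspect_weights)
print(round(predicted, 2))  # 5.22
```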

Latent Aspect Rating Analysis without Aspect Keyword Supervision

• LARAM
– Jointly model aspects and aspect ratings/weights

• LRR (Wang et al., 2010)
– Segmented aspects from previous step

Some Observations

• Text mining is very hot

• Topic modeling has been widely used in text analysis and many other applications, e.g., query understanding, advertisement …
– Combine topic modeling with other models, e.g., collaborative filtering
– Integrate more information into topic modeling, e.g., labeled and unlabeled information (partially labeled)
– Two-step solution -> unified way

Thanks!
