Topic Distributions over Links on Web

Jie Tang 1, Jing Zhang 1, Jeffrey Xu Yu 2, Zi Yang 1, Keke Cai 3, Rui Ma 3, Li Zhang 3, and Zhong Su 3

1 Tsinghua University
2 Chinese University of Hong Kong
3 IBM, China Research Lab

Dec. 7th 2009



Motivation

• Web users create links with significantly different intentions

• Understanding the category and the influence of each link can benefit many applications, e.g.,
– Expert finding
– Collaborator finding
– New friends recommendation
– …


Examples – Topic distribution analysis over citations

Researcher A: an in-depth understanding of the research field?

[Figure: original citation network vs. semantic citation network]

Papers shown in the network:
– Self-Indexing Inverted Files for Fast Text Retrieval
– Static Index Pruning for Information Retrieval Systems
– Signature Files: An Access Method for Documents and its Analytical Performance Evaluation
– Filtered Document Retrieval with Frequency-Sorted Indexes
– Vector-space Ranking with Effective Early Termination
– Efficient Document Retrieval in Main Memory
– A Document-centric Approach to Static Index Pruning in Text Retrieval Systems
– An Inverted Index Implementation
– Parameterised Compression for Sparse Bitmaps
– Introduction of Modern Information Retrieval
– Memory Efficient Ranking

Topics:
– Topic 31: Ranking and Inverted Index
– Topic 27: Information retrieval
– Topic 1: Theory
– Topic 21: Framework
– Topic 22: Compression
– Topic 23: Index method
– Topic 34: Parallel computing
– Other

Citation relationship types:
– Basic theory
– Comparable work
– Other


Problem: Link Semantic Analysis

[Figure labels]
– Topic modeling over links
– Citation context words
– Link semantics


Outline

• Previous Work

• Our Approach
– Pairwise Restricted Boltzmann Machines (PRBMs)

• Experimental Results

• Conclusion & Future Work


Previous Work

Link influence analysis
• Citation influence topic [Dietz, 07]
• Social influence analysis [Crandall, 08; Tang, 09]

Graphical model
• Probabilistic LSI [Hofmann, 99]
• Latent Dirichlet Allocation [Blei, 03]
• Restricted Boltzmann machines [Welling, 01]

Social network analysis
• Social network analysis [Wasserman, 94]
• Web community discovery [Newman, 04]
• ‘Small world’ networks [Watts, 98]


Pairwise Restricted Boltzmann Machines (PRBMs)

[Figure: example PRBM]
– Link context words
– Topic distribution
– Link category
– Latent variables defined over the link to bridge the two pages


Formalization of PRBMs

Obj. Func: [equation not captured]


Model Learning

• Generative learning
• Discriminative learning
• Hybrid learning

Obj. Func: [equation not captured]
– Expectation w.r.t. the data distribution
– Expectation w.r.t. the distribution defined by the model

We use Contrastive Divergence to learn the model distribution P_M
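To make the Contrastive Divergence step concrete, here is a minimal sketch of CD-1 for a plain binary RBM. It is not the PRBM from the slides: the class `TinyRBM`, the helper `train`, the learning rate, and the toy data are all hypothetical, chosen only to show how the update contrasts an expectation under the data with an expectation under (an approximation of) the model.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


class TinyRBM:
    """Plain binary RBM (hypothetical helper, not the PRBM from the slides)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: statistics under the data distribution.
        ph0 = self.hidden_probs(v0)
        h0 = sample_bernoulli(ph0, self.rng)
        # Negative phase: one Gibbs step stands in for the model distribution.
        v1 = sample_bernoulli(self.visible_probs(h0), self.rng)
        ph1 = self.hidden_probs(v1)
        # Update = data-side expectation minus model-side expectation.
        for i in range(len(self.b)):
            for j in range(len(self.c)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(self.c)):
            self.c[j] += lr * (ph0[j] - ph1[j])

    def reconstruct(self, v):
        # Mean-field pass: visible -> hidden probabilities -> visible probabilities.
        return self.visible_probs(self.hidden_probs(v))


def train(rbm, data, epochs=200, lr=0.1):
    for _ in range(epochs):
        for v in data:
            rbm.cd1_update(v, lr)


if __name__ == "__main__":
    toy = [[1, 1, 0, 0], [0, 0, 1, 1]]  # toy binary "context word" vectors
    model = TinyRBM(n_visible=4, n_hidden=2)
    train(model, toy)
    print(model.reconstruct([1, 1, 0, 0]))
```

A single Gibbs step keeps each update cheap; running more steps would bring the negative phase closer to the true model distribution at higher cost.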


Link Semantic Analysis

• Link category annotation
– First we calculate [equation not captured]
– Then we estimate the probability p(c|e) by a mean field algorithm

• Link influence estimation
– Estimate influence by KL divergence
– An alternative way is to generate the influence score by a Gaussian distribution, thus [equation not captured]
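As a rough illustration of the KL-divergence option, the sketch below computes KL(p || q) between two discrete topic distributions. Which two distributions the slides compare is not stated in the transcript, so treating them as the topic distributions of the two linked pages is an assumption, and `kl_divergence` and the example vectors are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    Assumption: p and q are topic distributions of two linked pages.
    eps guards against division by zero when q assigns a topic zero mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


if __name__ == "__main__":
    source_topics = [0.5, 0.5]  # hypothetical topic distribution of one page
    target_topics = [0.9, 0.1]  # hypothetical topic distribution of the other
    print(kl_divergence(source_topics, target_topics))
```

KL divergence is asymmetric, so the order of the arguments matters; a symmetric score would need both directions combined.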


Experimental Setting

• Data sets
– Arnetminer data: 978,504 papers, 14M citations
– Wikipedia: 14K “article” pages and 25K links

• Evaluation measures
– Link categorization accuracy
– Topical analysis

• Baselines
– SVM+LDA
– SVM+RBM


Accuracy of Link Categorization

gPRBM: our approach with generative learning

dPRBM: our approach with discriminative learning

hPRBM: our approach with hybrid learning


Category-Topic Mixture


Example Analysis


Conclusion & Future Work

• Concluding remarks
– Investigate the problem of quantifying link semantics on the Web
– Propose Pairwise Restricted Boltzmann Machines (PRBMs) to solve this problem

• Future Work
– Semantic analysis over social relationships
– Correlation between link semantics and information propagation


Thanks!

Q&A

HP: http://keg.cs.tsinghua.edu.cn/persons/tj/