
Transcript of "Graph-based Text Classification: Learn from Your Neighbors" by Ralitsa Angelova and Gerhard Weikum (Max Planck Institute for Informatics)

Page 1

Graph-based Text Classification: Learn from Your Neighbors

Ralitsa Angelova, Gerhard Weikum: Max Planck Institute for Informatics, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany

Presented by Chia-Hao Lee

Page 2

Outline

• Introduction
• Graph-based Classification
• Incorporating Metric Label Distances
• Experiments
• Conclusion

Page 3

Introduction

• Automatic classification is a supervised learning technique for assigning thematic categories to data items such as customer records, gene-expression data records, Web pages, or text documents.

• The standard approach is to represent each data item by a feature vector and learn parameters of mathematical decision models.

• Context-free : the decision is based only on the feature vector of a given data item, disregarding the other data items in the test set.
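To make this standard, context-free setup concrete, here is a minimal Python sketch using scikit-learn; the example documents, labels, and choice of model are illustrative assumptions and are not taken from the paper.

# Minimal sketch of a context-free text classifier: each document is reduced to
# a feature vector (TF-IDF here) and a decision model is learned from labeled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["stock markets fell sharply today", "the home team won the championship"]
train_labels = ["finance", "sports"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Each test document is classified from its own feature vector alone,
# disregarding any links to other documents in the test set.
print(model.predict(["the striker scored twice in the final"]))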

Page 4

Introduction

• In many settings, this “context-free” approach does not exploit the available information about relationships between data items.

• Using the relationship information, we can construct a graph G in which each data item is a node and each relationship instance forms an edge between the corresponding nodes.

• In the following we will mostly focus on text documents with links to and from other documents.

Page 5

Introduction

• A straightforward approach to capturing a document's neighbors would be to incorporate the features and feature weights of the neighbors into the feature vector of the given document itself (a small sketch of this follows below).

• A more advanced approach is to model the mutual influence between neighboring documents, aiming to estimate the class labels of all test documents simultaneously.
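As a minimal sketch of the first, straightforward approach, the feature vectors of a document's neighbors can simply be folded into its own vector; the dictionary-based representation and the weighting factor below are illustrative assumptions.

# Sketch of neighbor-feature aggregation: merge the (down-weighted) feature
# vectors of the graph neighbors into the document's own feature vector.
def augment_with_neighbors(doc_vec, neighbor_vecs, neighbor_weight=0.5):
    """Return a new term -> weight vector mixing the document with its neighbors."""
    augmented = dict(doc_vec)
    for nbr_vec in neighbor_vecs:
        for term, weight in nbr_vec.items():
            augmented[term] = augmented.get(term, 0.0) + neighbor_weight * weight
    return augmented

# Example: a document linked to two neighbors.
d = {"soccer": 1.0, "match": 0.7}
nbrs = [{"league": 0.9, "match": 0.4}, {"goal": 0.8}]
print(augment_with_neighbors(d, nbrs))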

Page 6

Introduction

• A simple example for RL (relaxation labeling) is shown in Figure 1.

• Let C be our set of classes.

• We wish to assign to every document marked "?" its most probable label.

• Let the contingency matrix in Figure 1b) be estimated from the training data.
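Since Figure 1 is not reproduced in this transcript, the following toy Python sketch illustrates the idea of relaxation labeling with a contingency matrix; the two classes "A" and "B", the matrix values, and the neighborhood are made-up assumptions.

# Toy relaxation-labeling step for one unlabeled document "?" whose neighbors
# already carry labels. The contingency matrix says how often classes co-occur
# across links in the training data. All numbers are invented for illustration.
CLASSES = ["A", "B"]
CONTINGENCY = {("A", "A"): 0.8, ("A", "B"): 0.2,
               ("B", "A"): 0.2, ("B", "B"): 0.8}

neighbor_labels = ["A", "A", "B"]          # labels observed at the neighbors of "?"
prior = {"A": 0.5, "B": 0.5}               # content-only prior for "?"

# Score each candidate class by its prior times its agreement with the
# neighbors' labels according to the contingency matrix, then normalize.
scores = {c: prior[c] for c in CLASSES}
for c in CLASSES:
    for l in neighbor_labels:
        scores[c] *= CONTINGENCY[(c, l)]
z = sum(scores.values())
print({c: s / z for c, s in scores.items()})   # "?" leans toward "A" (0.8 vs. 0.2)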

Page 7

Introduction

• The theory paper by Kleinberg and Tardos views the classification problem for nodes in an undirected graph as a metric labeling problem where we aim to optimize a combinatorial function consisting of assignment costs and separation costs.
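For reference, the metric labeling objective of Kleinberg and Tardos can be written as follows; the notation (assignment cost c(u, ·), edge weights w, label metric d) is the standard one, chosen here for illustration rather than copied from the slides.

% Metric labeling: choose a labeling f : V -> C that minimizes
% the assignment costs plus the metric separation costs over the edges.
\min_{f : V \to C} \;\; \sum_{u \in V} c\bigl(u, f(u)\bigr)
  \;+\; \sum_{(u,v) \in E} w_{uv}\, d\bigl(f(u), f(v)\bigr)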

Page 8

Graph-Based Classification

• Our approach is based on a probabilistic formulation of the classification problem and uses a relaxation labeling technique to derive two major approaches for finding the most likely labeling λ of the given test graph: hard and soft labeling.

• D: a set of documents.

• G: a graph whose vertices correspond to the documents d ∈ D and whose edges represent the link structure of D.

• λ(u): the label of node u.

• τ(d): the feature vector that locally captures the content of document d ∈ D.
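To make the later sketches concrete, one possible in-memory representation of D, G, λ(u), and τ(d) is shown below; the document ids, terms, and class names are illustrative assumptions.

# Illustrative representation of the notation above.
# tau(d): a sparse term -> weight vector per document d in D.
documents = {
    "d1": {"soccer": 1.0, "match": 0.7},
    "d2": {"league": 0.9, "goal": 0.8},
    "d3": {"election": 1.0},
}

# G: the link structure of D as an adjacency list; N(d) are the immediate neighbors.
neighbors = {
    "d1": ["d2"],
    "d2": ["d1", "d3"],
    "d3": ["d2"],
}

# lambda(u): labels known from training; "A" and "B" are placeholder class names.
labels = {"d2": "A"}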

Page 9

Graph-Based Classification

• Taking into account the underlying link structure and document d's content-based feature vector, the probability that label c is assigned to d is:

  \Pr[\lambda(d) = c \mid G] = \Pr[\lambda(d) = c \mid \tau(d), \lambda(d_1), \ldots, \lambda(d_l)], \quad c \in C

• In the spirit of the introduction's discussion on emphasizing the influence of the immediate neighbors N(d) of each document d, we obtain

  \Pr[\lambda(d) = c \mid G] \approx \Pr[\lambda(d) = c \mid \tau(d), \lambda(N(d))]

  and denote this probability by \pi(c, d).

• The label of d is thus treated as independent of the labels of all other nodes in the graph, given the labels of its immediate neighbors.

Page 10

Graph-Based Classification

• We abbreviate \Pr[\lambda(d) = c \mid \tau(d)], the graph-unaware probability based only on d's local content, by \pi_0(c, d).

• Under the additional independence assumption that there is no direct coupling between the content of a document and the labels of its neighbors, the following central equation holds for the total probability \pi(c, d), summing up the posterior probabilities over all possible labelings \lambda(N(d)) of the neighborhood:

  \pi(c, d) = \sum_{\lambda(N(d))} \Pr[\lambda(d) = c \mid \tau(d), \lambda(N(d))] \cdot \Pr[\lambda(N(d))]

Page 11

Graph-Based Classification

• In the same vein, if we further assume independence among all neighbor labels of the same node, we reach the following formulation for our neighborhood-conscious classification problem:

  \pi(c, d) = \sum_{\lambda(N(d))} \Pr[\lambda(d) = c \mid \tau(d), \lambda(N(d))] \cdot \prod_{d' \in N(d)} \Pr[\lambda(d')]

• This can be computed in an iterative manner as follows:

  \pi^{(r+1)}(c, d) = \sum_{\lambda(N(d))} \Pr[\lambda(d) = c \mid \tau(d), \lambda(N(d))] \cdot \prod_{d' \in N(d)} \pi^{(r)}(\lambda(d'), d')
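A minimal Python sketch of this iterative soft-labeling computation is given below. The helper cond_prob is a purely illustrative stand-in for Pr[λ(d) = c | τ(d), λ(N(d))] (it simply mixes the content prior with neighbor agreement); it is not the estimator used in the paper, and the class set is the toy one from the earlier example.

# One soft-labeling round: pi[(c, d)] holds the current probability pi^(r)(c, d);
# neighbors maps each document to N(d); content_prob[d][c] is the graph-unaware prior.
from itertools import product

CLASSES = ["A", "B"]   # placeholder class names

def cond_prob(c, d, nbr_labels, content_prob, alpha=0.5):
    # Illustrative stand-in for Pr[label(d) = c | tau(d), labels of N(d)]:
    # mix the content-only prior with the fraction of neighbors labeled c.
    if nbr_labels:
        agreement = sum(1 for lab in nbr_labels if lab == c) / len(nbr_labels)
    else:
        agreement = 1.0 / len(CLASSES)
    return alpha * content_prob[d][c] + (1 - alpha) * agreement

def soft_labeling_step(pi, neighbors, content_prob):
    new_pi = {}
    for d, nbrs in neighbors.items():
        for c in CLASSES:
            total = 0.0
            # Sum over all labelings of N(d), each weighted by the product of the
            # neighbors' current label probabilities pi^(r)(label, neighbor).
            for labeling in product(CLASSES, repeat=len(nbrs)):
                weight = 1.0
                for nbr, lab in zip(nbrs, labeling):
                    weight *= pi[(lab, nbr)]
                total += cond_prob(c, d, labeling, content_prob) * weight
            new_pi[(c, d)] = total
        z = sum(new_pi[(c, d)] for c in CLASSES)
        for c in CLASSES:
            new_pi[(c, d)] /= z    # renormalize over the classes of d
    return new_pi

One would typically start from the graph-unaware probabilities, pi^(0)(c, d) = content_prob[d][c], and repeat soft_labeling_step until the probabilities stabilize.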

Page 12

Graph-Based Classification

• Hard labeling:

  In contrast to the soft labeling approach presented above, we also consider a method that takes into account only the most probable label assignments in the test document's neighborhood as significant for the computation:

    \pi^{(r)}(c, d) = \Pr[\lambda(d) = c \mid \tau(d), c_{max}(N(d))]

  Let c_{max} be the most probable label of a neighbor d' in the previous round:

    c_{max}(d') = \arg\max_{c} \pi^{(r-1)}(c, d')
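A matching sketch of the hard-labeling variant, reusing CLASSES and cond_prob from the soft-labeling sketch above; again this only illustrates the idea, not the paper's exact computation.

def hard_labeling_step(pi, neighbors, content_prob):
    # Each round, every neighbor is collapsed to its single most probable label
    # (c_max) before the label probabilities of d are recomputed.
    new_pi = {}
    for d, nbrs in neighbors.items():
        hard_labels = [max(CLASSES, key=lambda c: pi[(c, n)]) for n in nbrs]
        for c in CLASSES:
            new_pi[(c, d)] = cond_prob(c, d, hard_labels, content_prob)
        z = sum(new_pi[(c, d)] for c in CLASSES)
        for c in CLASSES:
            new_pi[(c, d)] /= z
    return new_pi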

Page 13

Graph-Based Classification

• Soft Labeling :

The soft labeling approach aims to achieve better accuracy of the classification by avoiding the overly eager “rounding” that the hard labeling approach does.

Page 14

Incorporating Metric Label Distance

• Intuitively, neighboring documents should receive similar class labels.

• For example, suppose we have a set of classes C = {Culture (C), Entertainment (E), Science (S)} and we wish to find the most probable label for a test document d.

• Documents about culture (C) and entertainment (E) are thematically close to each other, whereas a document discussing scientific problems (S) would be much farther away from both C and E.

• So, a similarity metric imposed on the set of labels C would have high values for the pair (C, E) and small values for the class pairs (C, S) and (E, S).

Page 15

Incorporating Metric Label Distance

• This is why introducing a metric should help improve the classification result. In this metric, similar classes are separated by a shorter distance and impose smaller separation cost on an edge labeling.

• Our approach, on the other hand, is general, and we construct the metric Γ automatically from the training data.

• We incorporate the label metric Γ(c_i, c_j) into the iterations for computing the probability of an edge labeling by treating 1/Γ(c_i, c_j) as a scaling factor.

Page 16

Incorporating Metric Label Distance

• This way, we magnify the impact of edges between nodes with similar labels and scale down the impact of edges between dissimilar ones:

  w(d, d_i) = \frac{1 / \Gamma\bigl(c, \lambda(d_i)\bigr)}{\sum_{d_e \in N(d)} 1 / \Gamma\bigl(c, \lambda(d_e)\bigr)}

  where w(d, d_i) is the weight with which neighbor d_i ∈ N(d) contributes to the label probability π(c, d) of document d.
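The following sketch shows one way such label-distance scaling could enter the neighbor influence, using a hypothetical distance matrix GAMMA and the same illustrative cond_prob style as before; the exact weighting scheme used in the paper may differ.

# Hypothetical label-distance matrix Gamma: small distance = similar classes.
GAMMA = {("A", "A"): 0.1, ("A", "B"): 1.0,
         ("B", "A"): 1.0, ("B", "B"): 0.1}

def metric_weights(c, nbr_labels):
    # Normalized inverse label distances: neighbors whose labels are close to the
    # candidate class c get a larger share of the influence.
    inverse = [1.0 / GAMMA[(c, lab)] for lab in nbr_labels]
    total = sum(inverse)
    return [x / total for x in inverse]

def metric_cond_prob(c, d, nbr_labels, content_prob, alpha=0.5):
    # Variant of cond_prob in which neighbor agreement is weighted by label similarity.
    if not nbr_labels:
        return content_prob[d][c]
    weights = metric_weights(c, nbr_labels)
    agreement = sum(w for w, lab in zip(weights, nbr_labels) if lab == c)
    return alpha * content_prob[d][c] + (1 - alpha) * agreement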

Page 17

Experiments

• We have tested our graph-based classifier on three different data sets.

• The first one includes approximately 16,000 scientific publications chosen from the DBLP database.

• The second dataset has been selected from the Internet Movie Database (IMDB).

• The third dataset used in the experiments was the online encyclopedia Wikipedia.

Pages 18-22

Experiments

(No text was transcribed for these slides.)

Page 23

Conclusion

• The presented GC method for graph-based classification is a way of exploiting context relationships of data items.

• Incorporating metric distances among different labels contributed to the very good performance of the GC method.

• This is one new form of exploiting knowledge about the relationships among category labels and thus the structure of the classifier’s target space.