Dual Active Feature and Sample Selection for Graph Classification

Xiangnan Kong¹, Wei Fan², Philip S. Yu¹; ¹ Department of Computer Science, University of Illinois at Chicago; ² IBM T. J. Watson Research; KDD 2011


Page 1: Dual Active Feature and Sample Selection for Graph Classification

Dual Active Feature and Sample Selection for Graph Classification

Xiangnan Kong¹, Wei Fan², Philip S. Yu¹

¹ Department of Computer Science, University of Illinois at Chicago

² IBM T. J. Watson Research

KDD 2011

Page 2: Dual Active Feature and Sample Selection for Graph Classification

Graph Classification

Traditional classification: feature vector x (input) -> label (output)

Graph classification: graph object (input) -> label (output)

Page 3: Dual Active Feature and Sample Selection for Graph Classification

Cheminformatics: Drug Discovery

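Training data: chemical compounds with known labels (+ / -)

Testing data: chemical compounds with unknown labels (?)

Graph object: chemical compound

Label: anti-cancer activity (+/-)

[Figure: a chemical compound drawn as a graph of atoms (C, H, N, O) and bonds.]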

Page 4: Dual Active Feature and Sample Selection for Graph Classification

Applications:

XML documents; label: category

Program flows; label: error?

System call graphs; label: normal software / virus?

Page 5: Dual Active Feature and Sample Selection for Graph Classification

Graph Classification

Given a set of graph objects with class labels, how do we predict the labels of unlabeled graphs?

Subgraph Feature Mining

Challenges: complex structure, lack of features

Page 6: Dual Active Feature and Sample Selection for Graph Classification

Subgraph Features

[Figure: graph objects G1 and G2 are matched against subgraph features F1, F2, F3; each graph becomes a binary feature vector (x1, x2) whose entries mark which subgraphs it contains, and the vectors are passed to a classifier.]

Graph Objects -> Feature Vectors -> Classifiers

How to extract a set of subgraph features for graph classification?
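The mapping from graph objects to feature vectors can be illustrated with a small sketch. It assumes tiny node-labeled graphs and uses a brute-force containment test, which is only practical for toy sizes. The helper names (make_graph, contains_subgraph, feature_vector) and the example graphs are hypothetical and do not come from the slides.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Build a toy graph: node id -> atom label, plus a set of undirected edges."""
    return labels, {frozenset(e) for e in edges}

def contains_subgraph(graph, pattern):
    """Return True if `pattern` occurs in `graph` with matching node labels.

    Brute force over all injective node mappings; toy sizes only.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Node labels must agree under the mapping.
        if any(p_labels[n] != g_labels[mapping[n]] for n in p_nodes):
            continue
        # Every pattern edge must map onto an edge of the graph.
        if all(frozenset(mapping[n] for n in e) in g_edges for e in p_edges):
            return True
    return False

def feature_vector(graph, features):
    """Binary vector: entry k is 1 if the graph contains subgraph feature k."""
    return [int(contains_subgraph(graph, f)) for f in features]

# Hypothetical example graphs and subgraph features (single bonds).
G1 = make_graph({0: "C", 1: "O", 2: "N"}, [(0, 1), (0, 2)])
G2 = make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
FEATURES = [
    make_graph({0: "C", 1: "O"}, [(0, 1)]),  # F1: C-O
    make_graph({0: "C", 1: "N"}, [(0, 1)]),  # F2: C-N
    make_graph({0: "C", 1: "C"}, [(0, 1)]),  # F3: C-C
]

x1 = feature_vector(G1, FEATURES)  # [1, 1, 0]
x2 = feature_vector(G2, FEATURES)  # [1, 0, 1]
```

Once every graph is a fixed-length vector, any standard vector classifier can be trained on the result; the open question raised on this slide is which subgraphs to use as the columns.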

Page 7: Dual Active Feature and Sample Selection for Graph Classification

Subgraph Feature Selection: Existing Methods

Mining discriminative subgraph features for a graph classification task

[Figure: labeled graphs (+ / -) and the subgraph features F1, F2 mined from them.]

Focused on supervised settings

Page 8: Dual Active Feature and Sample Selection for Graph Classification

Labeling Cost

Supervised settings require a large number of labeled graphs

Labeling cost is high; we can only afford to label a few graph objects

The number of labeled graphs affects feature selection, which in turn affects classification accuracy

Page 9: Dual Active Feature and Sample Selection for Graph Classification

Active Sample Selection

Given a set of candidate graph samples, we want to select the most important graph to query for its label

[Figure: a pool of unlabeled graphs (?) alongside a few labeled graphs (+, +, -).]

Page 10: Dual Active Feature and Sample Selection for Graph Classification

Active Sample Selection

Given a set of candidate graph samples, we want to select the most important graph to query for its label

[Figure: labeled graphs (+, +, -) and unlabeled graphs (?), shown in a different arrangement.]

Page 11: Dual Active Feature and Sample Selection for Graph Classification

Two parts of the problem

Active Sample Selection: select the most important graph in the pool to query for its label

Subgraph Feature Selection: select features relevant to the classification task

The two parts are correlated!

[Figure: unlabeled graphs (?) shown next to candidate subgraph features.]

Page 12: Dual Active Feature and Sample Selection for Graph Classification

Active Sample Selection

No features are given

Subgraph enumeration is NP-hard

Representative

Informative

Page 13: Dual Active Feature and Sample Selection for Graph Classification

Active Sample Selection View

The view depends on which subgraph features are used

Page 14: Dual Active Feature and Sample Selection for Graph Classification

Example

[Figure: graphs G1 and G2 represented by subgraph features F1 and F2.]

Very Similar

Page 15: Dual Active Feature and Sample Selection for Graph Classification

Example

[Figure: graphs G1 and G2 represented by subgraph features F1 and F2.]

Very Different

Page 16: Dual Active Feature and Sample Selection for Graph Classification

Subgraph Feature Selection

Graph Object

Subgraph Feature

Feature Selection View

Active Sample Selection View

Page 17: Dual Active Feature and Sample Selection for Graph Classification

Dual Active Feature and Sample Selection

Perform active sample selection & feature selection simultaneously

[Diagram: Labeled Graphs (+ / -), Unlabeled Graphs (?), Active Sample Selection, Subgraph Feature Selection, Query & Label.]

Page 18: Dual Active Feature and Sample Selection for Graph Classification

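gActive Method

Max-min Active Sample Selection

Maximizing the reward for querying a graph

[Figure: for each candidate query, the worst case is the minimum reward over its possible labels (+ / -); the query with the maximum worst-case reward is chosen.]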
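The max-min rule on this slide can be sketched generically as follows. The reward function is left abstract here because the slides do not give its formula; the name max_min_query and the toy reward table are hypothetical, for illustration only.

```python
def max_min_query(candidates, reward, labels=(+1, -1)):
    """Pick the candidate whose worst-case reward over possible labels is largest.

    `reward(g, y)` is an assumed callable: the reward of querying graph `g`
    if its true label turned out to be `y`.
    """
    def worst_case(g):
        # Inner minimum: the least favorable label for this candidate.
        return min(reward(g, y) for y in labels)
    # Outer maximum: the candidate with the best worst case.
    return max(candidates, key=worst_case)

# Toy reward table (made-up numbers, not from the paper).
TOY_REWARD = {
    "g1": {+1: 0.9, -1: 0.1},  # worst case 0.1
    "g2": {+1: 0.5, -1: 0.4},  # worst case 0.4
    "g3": {+1: 0.3, -1: 0.6},  # worst case 0.3
}

chosen = max_min_query(TOY_REWARD, lambda g, y: TOY_REWARD[g][y])  # "g2"
```

Note that g1 has the highest best-case reward but is not chosen, because the label is unknown at query time and its worst case is the lowest of the three.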

Page 19: Dual Active Feature and Sample Selection for Graph Classification

gActive Method

Dependence Maximization: graphs' features match with their labels

Informative: query graph far away from labeled graphs

Representative: query graph close to unlabeled graphs

Max-min Active Sample Selection: maximize the reward

Feature Selection: maximize a utility function
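One way to make the informative and representative criteria concrete is a distance-based score over binary subgraph feature vectors. This is only a sketch under stated assumptions: Hamming distance, the additive combination, and the tradeoff parameter are illustrative choices, not the reward defined in the paper.

```python
def hamming(a, b):
    """Number of positions where two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def query_score(i, unlabeled, labeled, tradeoff=1.0):
    """Score unlabeled graph i: far from labeled graphs, close to unlabeled ones."""
    x = unlabeled[i]
    # Informative: distance to the nearest labeled graph (larger is better).
    informative = min(hamming(x, z) for z in labeled)
    # Representative: negative mean distance to the other unlabeled graphs.
    others = [z for j, z in enumerate(unlabeled) if j != i]
    representative = -sum(hamming(x, z) for z in others) / len(others) if others else 0.0
    return informative + tradeoff * representative

# Toy data: rows are graphs, columns are selected subgraph features.
labeled = [[1, 0, 0, 0]]
unlabeled = [[1, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1]]

scores = [query_score(i, unlabeled, labeled) for i in range(len(unlabeled))]
best = max(range(len(unlabeled)), key=scores.__getitem__)  # index 1
```

Because the distances are computed in the space of the selected subgraph features, changing the feature set changes which graph scores highest, which is the coupling between the two selection problems that the slides emphasize.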

Page 20: Dual Active Feature and Sample Selection for Graph Classification

Example:

[Figure: example with labeled graphs (+ / -).]

More details in the paper: Branch & Bound Subgraph Mining (speed up)

Page 21: Dual Active Feature and Sample Selection for Graph Classification

Experiments: Data Sets

Anti-Cancer Activity datasets (NCI & AIDS)

▪ Graph: chemical compounds

▪ Label: anti-cancer activities

Balanced with 500 positive + 500 negative samples

Page 22: Dual Active Feature and Sample Selection for Graph Classification

Experiments: Compared Methods

Freq. + Random: unsupervised feature selection (frequent subgraphs) + random sampling (random query)

IG + Random: supervised feature selection (information gain) + random sampling (random query)

Freq. + Margin: unsupervised feature selection (frequent subgraphs) + margin-based sampling (close to margin)

Freq. + TED: unsupervised feature selection (frequent subgraphs) + transductive experimental design (TED)

IG + Margin: supervised feature selection (information gain) + margin-based sampling (close to margin)

gActive: dual active feature and sample selection, the proposed method in this paper
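For the IG baselines, the information gain of a single binary subgraph feature can be computed with the standard entropy formula. The sketch below assumes labels in {+1, -1} and a 0/1 column saying whether each labeled graph contains the subgraph; the function names and toy data are illustrative.

```python
from math import log2

def entropy(labels):
    """Binary entropy (in bits) of a list of +1 / -1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == +1) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(feature_column, labels):
    """Reduction in label entropy from splitting on one binary subgraph feature."""
    present = [y for f, y in zip(feature_column, labels) if f]
    absent = [y for f, y in zip(feature_column, labels) if not f]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

# Toy data: four labeled graphs and two candidate subgraph features.
y = [+1, +1, -1, -1]
f1 = [1, 1, 0, 0]  # occurs exactly in the positive graphs
f2 = [1, 0, 1, 0]  # occurs equally often in both classes

ig1 = information_gain(f1, y)  # 1.0 bit: perfectly discriminative
ig2 = information_gain(f2, y)  # 0.0 bits: uninformative
```

This score only uses labeled graphs, so with very few labels the estimates are noisy, which is the weakness of supervised feature selection that motivates the comparison on these slides.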

Page 23: Dual Active Feature and Sample Selection for Graph Classification

Experiment Results

Page 24: Dual Active Feature and Sample Selection for Graph Classification

Experiment Results (NCI-47)

[Figure: accuracy (higher is better) versus # queried graphs (#features = 200, NCI-47). Curves: gActive (Dual Active Feature & Sample Selection), I.G. + Random, Freq. + Margin, Freq. + TED, I.G. + Margin, Freq. + Random. Plot annotations: "Supervised < Unsupervised" and "Supervised > Unsupervised".]

Page 25: Dual Active Feature and Sample Selection for Graph Classification

Experiment Results

Page 26: Dual Active Feature and Sample Selection for Graph Classification

Experiment Results

gActive wins consistently

Page 27: Dual Active Feature and Sample Selection for Graph Classification

Conclusions

Dual Active Feature and Sample Selection for Graph Classification

Perform subgraph feature selection and active sample selection simultaneously

Future work: other data and applications

▪ itemset and sequence data

Thank you!