Dual Active Feature and Sample Selection for Graph Classification



Xiangnan Kong1, Wei Fan2, Philip S. Yu1

1 Department of Computer Science, University of Illinois at Chicago; 2 IBM T. J. Watson Research

KDD 2011

Graph Classification

Traditional classification: the input is a feature vector x; the output is a class label.
Graph classification: the input is a graph object; the output is a class label.

Cheminformatics: Drug Discovery

Graph object: a chemical compound. Label: anti-cancer activity (+/-).
Training data: compounds with known labels (+/-). Testing data: compounds with unknown labels (?).

[Figure: example chemical compound structures with +/- and ? labels]
Other Applications

▪ XML documents (label: category)
▪ Program flows (label: error?)
▪ System call graphs (label: normal software or virus?)

Graph classification: given a set of graph objects with class labels, how do we predict the labels of unlabeled graphs?

Subgraph Feature Mining

Challenge: graphs have complex structure and no natural feature representation.
Approach: mine a set of subgraph features (e.g., small chemical substructures) and represent each graph as a binary vector, where entry i is 1 if the graph contains subgraph feature Fi and 0 otherwise.

[Figure: graphs G1 and G2 matched against subgraph features F1, F2, F3, yielding binary feature vectors]

Pipeline: graph objects -> feature vectors -> classifier.

Key question: how do we extract a good set of subgraph features for graph classification?
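As a concrete illustration, the binary subgraph-feature representation can be sketched as follows. Treating graphs and features as plain edge sets is a deliberate simplification (real subgraph matching requires subgraph isomorphism testing), and all names here are illustrative:

```python
# A minimal sketch of binary subgraph-indicator features. Graphs and
# candidate features are both given as sets of undirected edges - a
# simplification of true subgraph isomorphism matching.

def edge_set(edges):
    """Normalize an edge list into a set of undirected edges."""
    return frozenset(frozenset(e) for e in edges)

def feature_vector(graph, features):
    """Entry i is 1 if feature subgraph i's edges all occur in the graph."""
    return [1 if f <= graph else 0 for f in features]

# Two toy "molecules" over labeled atoms, e.g. ("C1", "C2") is a C-C bond.
g1 = edge_set([("C1", "C2"), ("C2", "O1"), ("C2", "N1")])
g2 = edge_set([("C1", "C2"), ("C2", "O1")])

# Two candidate subgraph features: a C-O bond and a C-N bond.
f_co = edge_set([("C2", "O1")])
f_cn = edge_set([("C2", "N1")])

print(feature_vector(g1, [f_co, f_cn]))  # [1, 1]
print(feature_vector(g2, [f_co, f_cn]))  # [1, 0]
```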

Subgraph Feature Selection: Existing Methods

Existing methods mine discriminative subgraph features for a graph classification task, but they focus on supervised settings.

[Figure: discriminative subgraph features F1, F2 separating + and - graphs]

Labeling Cost in Supervised Settings

Supervised methods require a large number of labeled graphs, but labeling cost is high: we can only afford to label a few graph objects. This affects both feature selection and classification accuracy.

Active Sample Selection

Given a pool of candidate graph samples, we want to select the most important graph and query its label.

[Figure: a pool of unlabeled graphs (?) alongside a few labeled graphs (+/-); one graph is selected for labeling]

Two Parts of the Problem

▪ Active sample selection: select the most important graph in the pool and query its label. The queried graph should be both informative and representative. Challenge: with no features available, it is unclear how to compare graphs, and subgraph enumeration is NP-hard.
▪ Subgraph feature selection: select the subgraph features relevant to the classification task.

The two parts are correlated: the active sample selection view depends on which subgraph features are used.

Example

Under subgraph features F1 and F2, graphs G1 and G2 have the same feature values and look very similar.

[Figure: G1 and G2 with subgraph features F1 and F2 highlighted]

Example (continued)

Under a different set of subgraph features, the same graphs G1 and G2 look very different.

[Figure: G1 and G2 with a different pair of subgraph features highlighted]
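The point of the two examples, that graph similarity depends on which subgraph features are chosen, can be sketched with toy data. The graphs, features, and Hamming distance here are illustrative, not the paper's:

```python
# Under one feature set two graphs look identical; under another they
# differ. Graphs and features are sets of undirected edges (toy data).

def indicator(graph, features):
    """Binary feature vector: 1 if the feature's edges occur in the graph."""
    return tuple(1 if f <= graph else 0 for f in features)

def hamming(u, v):
    """Number of positions where two feature vectors disagree."""
    return sum(a != b for a, b in zip(u, v))

# Toy graphs as sets of undirected edges (hypothetical molecules).
g1 = {frozenset(e) for e in [("C1", "C2"), ("C2", "O1"), ("O1", "H1")]}
g2 = {frozenset(e) for e in [("C1", "C2"), ("C2", "N1"), ("N1", "H1")]}

# Feature set A: only the shared C-C bond -> the graphs look identical.
fa = [{frozenset(("C1", "C2"))}]
# Feature set B: the C-O bond vs the C-N bond -> the graphs look different.
fb = [{frozenset(("C2", "O1"))}, {frozenset(("C2", "N1"))}]

print(hamming(indicator(g1, fa), indicator(g2, fa)))  # 0
print(hamming(indicator(g1, fb), indicator(g2, fb)))  # 2
```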

Subgraph Feature Selection

The feature selection view and the active sample selection view are coupled: the subgraph features chosen for the graph objects determine how samples are compared for active selection.

Dual Active Feature and Sample Selection

Perform active sample selection and subgraph feature selection simultaneously: maintain a set of labeled graphs (+/-) and a pool of unlabeled graphs (?). Active sample selection picks a graph to query and label; subgraph feature selection updates the feature set as new labels arrive.
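A high-level sketch of this simultaneous loop follows. The helper names (select_features, select_query, oracle_label) are hypothetical stand-ins for the paper's actual criteria, which are the max-min reward and a feature utility function:

```python
# Sketch of a dual active loop: alternate between querying a label and
# re-selecting subgraph features, since new labels may change which
# subgraphs are discriminative. All helpers are caller-supplied stand-ins.

def dual_active_loop(labeled, unlabeled, candidate_features,
                     select_features, select_query, oracle_label, budget):
    features = select_features(labeled, candidate_features)
    for _ in range(budget):
        if not unlabeled:
            break
        g = select_query(labeled, unlabeled, features)  # pick one graph
        unlabeled.remove(g)
        labeled[g] = oracle_label(g)                    # query its label
        # Re-select features: new labels may change which subgraphs help.
        features = select_features(labeled, candidate_features)
    return labeled, features

# Toy run with trivial stand-ins (hypothetical graph/feature names).
true_labels = {"g1": 1, "g2": 0, "g3": 1}
done, feats = dual_active_loop(
    labeled={}, unlabeled={"g1", "g2", "g3"},
    candidate_features=["f1", "f2"],
    select_features=lambda lab, cand: cand,
    select_query=lambda lab, unl, fs: sorted(unl)[0],
    oracle_label=lambda g: true_labels[g],
    budget=2)
print(sorted(done), feats)  # ['g1', 'g2'] ['f1', 'f2']
```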

gActive Method: Max-Min Active Sample Selection

Maximize the reward for querying a graph in the worst case: for each candidate graph, consider the minimum reward over its possible labels (+/-), and query the graph for which this worst-case reward is maximal.
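The max-min principle can be sketched generically. The reward table below is a toy stand-in; the paper's actual reward function is not reproduced here:

```python
# Max-min query selection: for each candidate graph, take the worst-case
# reward over its two possible labels (+1/-1), then query the graph
# whose worst case is largest.

def max_min_query(candidates, labeled, reward):
    def worst_case(g):
        return min(reward(labeled, g, y) for y in (+1, -1))
    return max(candidates, key=worst_case)

# Toy rewards: graph "a" helps a lot if labeled +, little if labeled -;
# graph "b" helps moderately either way, so it is the safer query.
toy = {("a", +1): 5, ("a", -1): 1,
       ("b", +1): 3, ("b", -1): 3}

choice = max_min_query(["a", "b"], [], lambda lab, g, y: toy[(g, y)])
print(choice)  # "b": its worst case (3) beats "a"'s worst case (1)
```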

gActive Method: Dependence Maximization

▪ Dependence maximization: the graphs' features should match their labels.
▪ Informative: the query graph should be far away from the labeled graphs.
▪ Representative: the query graph should be close to the unlabeled graphs.

Max-min active sample selection maximizes the reward; feature selection maximizes a utility function.

[Figure: example of an informative and representative query among labeled (+/-) and unlabeled graphs]
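The informative and representative criteria can be sketched as a simple score over feature vectors. The additive score and Euclidean distance here are illustrative simplifications, not the paper's formulation:

```python
# Score a candidate query as: distance to the nearest labeled graph
# (informative) minus average distance to the rest of the pool
# (representative). Graphs are given as feature vectors (toy data).

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def score(g, labeled, pool):
    informative = min(dist(g, l) for l in labeled) if labeled else 0.0
    others = [u for u in pool if u != g]
    representative = -sum(dist(g, u) for u in others) / max(len(others), 1)
    return informative + representative

labeled = [(5.0, 5.0)]
pool = [(1.0, 1.0), (1.1, 1.0), (6.0, 6.0)]
best = max(pool, key=lambda g: score(g, labeled, pool))
print(best)  # (1.0, 1.0): far from the labeled graph, central in the pool
```

Note that (6.0, 6.0) scores poorly despite being an outlier: it is close to the labeled graph (uninformative) and far from the rest of the pool (unrepresentative).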

More details in the paper: branch-and-bound subgraph mining (to speed up the search).

Experiments: Data Sets

Anti-cancer activity datasets (NCI & AIDS)
▪ Graph: chemical compounds
▪ Label: anti-cancer activity
▪ Balanced: 500 positive + 500 negative samples

Experiments: Compared Methods

▪ Freq. + Random: unsupervised feature selection (frequent subgraphs) + random query
▪ IG + Random: supervised feature selection (information gain) + random query
▪ Freq. + Margin: frequent subgraphs + query closest to the margin
▪ Freq. + TED: frequent subgraphs + transductive experimental design
▪ IG + Margin: information gain + query closest to the margin
▪ gActive: the proposed dual active feature and sample selection method
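For context, the margin-based baselines query the unlabeled graph whose score lies closest to the decision boundary. A minimal sketch, with an assumed linear classifier (the experiments' actual classifier is not specified here):

```python
# Margin-based sampling: among unlabeled graphs (as binary subgraph
# feature vectors), query the one with the smallest absolute score
# under the current linear classifier - the most uncertain sample.

def margin_query(pool, weights):
    def margin(x):
        return abs(sum(w * xi for w, xi in zip(weights, x)))
    return min(pool, key=margin)

pool = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]   # toy binary feature vectors
weights = (0.9, -0.8, 0.1)                 # assumed linear classifier

print(margin_query(pool, weights))  # (1, 1, 0): score 0.1, nearest the boundary
```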

Experiment Results (NCI-47)

[Figure: accuracy (higher is better) vs. number of queried graphs, with #features = 200, comparing gActive, I.G. + Random, I.G. + Margin, Freq. + Random, Freq. + Margin, and Freq. + TED. With few labeled graphs, supervised feature selection underperforms unsupervised; with more labels, supervised overtakes unsupervised.]

Experiment Results

Across the remaining experiments, gActive wins consistently.

Conclusions

Dual active feature and sample selection for graph classification: perform subgraph feature selection and active sample selection simultaneously.

Thank you!

Future work: other data types and applications
▪ itemset and sequence data