Dual Active Feature and Sample Selection for Graph Classification
Xiangnan Kong1, Wei Fan2, Philip S. Yu1
1 Department of Computer Science, University of Illinois at Chicago; 2 IBM T. J. Watson Research
KDD 2011
Graph Classification
Traditional classification: input = feature vector x, output = label
Graph classification: input = graph object, output = label
Cheminformatics: Drug Discovery
[Figure: training data of chemical compounds labeled + / -, and unlabeled testing data marked "?"]
Graph object: chemical compound
Label: anti-cancer activity (+ / -)
Applications:
XML documents (label: category)
Program flows (label: error?)
System call graphs (label: normal software / virus?)
Graph classification: given a set of graph objects with class labels, how do we predict the labels of unlabeled graphs?
Subgraph Feature Mining
Challenge: complex structure, lack of features
[Figure: graphs G1 and G2 are mapped to binary vectors over subgraph features F1, F2, F3, where 1 means the subgraph occurs in the graph and 0 means it does not]
Pipeline: graph objects -> feature vectors -> classifiers
How do we extract a set of subgraph features for graph classification?
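The mapping above can be sketched in a few lines. This is a deliberate simplification: graphs and subgraph features are reduced to sets of labeled edges and presence is tested by set containment, whereas a real implementation needs subgraph isomorphism testing (NP-hard in general). All molecule and feature values are made up for illustration.

```python
# Sketch: mapping graphs to binary subgraph-feature vectors.
# Edge multiplicity and atom positions are ignored in this toy encoding.

def edge_set(edges):
    """Canonicalize an edge list of (atom_a, atom_b) label pairs."""
    return frozenset(frozenset(e) for e in edges)

def feature_vector(graph, features):
    """1 if all of the feature's edges appear in the graph, else 0."""
    return [1 if f <= graph else 0 for f in features]

# Two toy "molecules" (hypothetical)
g1 = edge_set([("C", "O"), ("C", "N"), ("C", "C")])
g2 = edge_set([("C", "C"), ("C", "H")])

# Subgraph features: F1 = a C-O bond, F2 = a C-H bond
f1 = edge_set([("C", "O")])
f2 = edge_set([("C", "H")])

print(feature_vector(g1, [f1, f2]))  # [1, 0]
print(feature_vector(g2, [f1, f2]))  # [0, 1]
```

The resulting 0/1 vectors are what any standard classifier consumes in the pipeline above.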
Subgraph Feature Selection: Existing Methods
Mine discriminative subgraph features for a graph classification task.
[Figure: labeled graphs (+ / -) and two candidate subgraph features F1, F2]
Existing methods focus on supervised settings.

Labeling Cost in Supervised Settings
Supervised methods require a large number of labeled graphs, but labeling cost is high: we can often afford to label only a few graph objects. This hurts both feature selection and classification accuracy.
Active Sample Selection
Given a pool of candidate graph samples, we want to select the most important graph and query its label.
[Figure: a pool of unlabeled graphs ("?") alongside a few labeled graphs (+ / -)]
Two Parts of the Problem
Active sample selection: select the most important graph in the pool to query its label.
Subgraph feature selection: select features relevant to the classification task.
The two parts are correlated!

Challenges for active sample selection:
- No features are available up front, and subgraph enumeration is NP-hard.
- "Representative" and "informative" depend on the view: which graphs look important depends on which subgraph features are used.
Example
[Figure: under one set of subgraph features F1 and F2, graphs G1 and G2 have very similar feature vectors]
Example
[Figure: under a different set of subgraph features F1 and F2, the same graphs G1 and G2 have very different feature vectors]
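The two examples can be reproduced numerically: the same pair of graphs is near-identical under one feature set and maximally different under another. The feature vectors below are made-up stand-ins for the slide's examples.

```python
# Sketch: graph similarity depends on the chosen subgraph features.

def hamming(u, v):
    """Number of positions where two binary feature vectors disagree."""
    return sum(a != b for a, b in zip(u, v))

# Feature vectors of G1 and G2 under one feature set: identical
g1_a, g2_a = [1, 0, 1], [1, 0, 1]
# The same graphs under another feature set: fully disjoint
g1_b, g2_b = [1, 1, 0], [0, 0, 1]

print(hamming(g1_a, g2_a))  # 0 -> "very similar"
print(hamming(g1_b, g2_b))  # 3 -> "very different"
```

This is why active sample selection cannot be decoupled from feature selection: any distance-based notion of "informative" or "representative" changes with the feature set.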
Subgraph Feature Selection
[Diagram: a graph object is seen through two coupled views, the subgraph feature selection view and the active sample selection view]
Dual Active Feature and Sample Selection
Perform active sample selection and feature selection simultaneously:
[Diagram: a loop between labeled graphs (+ / -) and unlabeled graphs ("?"): active sample selection picks a graph from the pool, its label is queried, and subgraph feature selection is rerun with the new label]
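The loop in the diagram can be sketched as an alternation between the two selection steps. `select_query`, `oracle`, and `select_features` are hypothetical stand-ins for the paper's components, and the toy run below uses trivial implementations of each.

```python
# Sketch of the dual loop: alternate active sample selection and
# subgraph feature selection until the labeling budget runs out.

def dual_active_learning(pool, oracle, budget, select_query, select_features):
    labeled, features = {}, []
    for _ in range(budget):
        g = select_query(pool, labeled, features)   # active sample selection
        labeled[g] = oracle(g)                      # query & label
        pool.remove(g)
        features = select_features(labeled)         # subgraph feature selection
    return labeled, features

# Toy run with trivial stand-ins
pool = ["g1", "g2", "g3"]
labeled, feats = dual_active_learning(
    pool,
    oracle=lambda g: "+" if g == "g1" else "-",
    budget=2,
    select_query=lambda p, l, f: p[0],
    select_features=lambda l: sorted(l),
)
print(labeled)  # {'g1': '+', 'g2': '-'}
```

The key design point is that `select_features` runs inside the loop, so each new label can change which subgraph features (and therefore which distances) the next query decision sees.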
gActive Method: Max-min Active Sample Selection
Maximize the reward for querying a graph under the worst case: for each candidate, take the minimum reward over its possible labels (+ / -), then query the graph whose worst-case reward is largest.
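The max-min rule reads directly as a max over a min. The `reward` function and the numbers in the table below are hypothetical stand-ins, not the paper's reward formulation.

```python
# Sketch of max-min query selection: pick the candidate whose
# worst-case (over the unknown label) reward is largest.

def max_min_query(candidates, reward):
    def worst_case(g):
        return min(reward(g, +1), reward(g, -1))
    return max(candidates, key=worst_case)

# Toy reward table (hypothetical numbers)
table = {("g1", +1): 0.9, ("g1", -1): 0.1,
         ("g2", +1): 0.6, ("g2", -1): 0.5}

best = max_min_query(["g1", "g2"], lambda g, y: table[(g, y)])
print(best)  # "g2": its worst case (0.5) beats g1's (0.1)
```

Note that g1 has the highest best-case reward (0.9) but loses under the pessimistic criterion, which is exactly the robustness the worst-case min provides.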
gActive Method: Dependence Maximization
- Graphs' features should match their labels.
- Informative: query a graph far away from the labeled graphs.
- Representative: query a graph close to the unlabeled graphs.
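The two criteria can be sketched on subgraph-feature vectors with any similarity measure; an RBF similarity is used here as an illustrative choice, and the gamma value and scoring formulas are assumptions rather than the paper's exact formulation.

```python
import math

# Sketch: informativeness and representativeness scores for a candidate
# query, computed on the current binary subgraph-feature vectors.

def rbf(u, v, gamma=1.0):
    """RBF similarity: 1 for identical vectors, decaying with distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * d2)

def informativeness(x, labeled):
    # High when x is far from every labeled graph.
    return 1.0 - max(rbf(x, l) for l in labeled)

def representativeness(x, unlabeled):
    # High when x is close to the other unlabeled graphs.
    others = [u for u in unlabeled if u is not x]
    return sum(rbf(x, u) for u in others) / len(others)

labeled = [[1, 0], [1, 1]]               # feature vectors of labeled graphs
unlabeled = [[0, 1], [0, 0], [0, 1]]     # feature vectors of the pool
for x in unlabeled:
    print(informativeness(x, labeled), representativeness(x, unlabeled))
```

Both scores are recomputed whenever the feature set changes, which is how the feature-selection view feeds back into the sample-selection view.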
Max-min active sample selection: maximize the worst-case reward.
Feature selection: maximize a utility function.
[Example figure: labeled graphs (+ / -)]
More details in the paper: branch-and-bound subgraph mining (to speed up the search).
Experiments: Data Sets
Anti-cancer activity datasets (NCI & AIDS)
- Graph: chemical compounds
- Label: anti-cancer activity
- Balanced with 500 positive + 500 negative samples
Experiments: Compared Methods
- Freq. + Random: unsupervised feature selection (frequent subgraphs) + random query
- IG + Random: supervised feature selection (information gain) + random query
- Freq. + Margin: frequent subgraphs + query closest to the margin
- Freq. + TED: frequent subgraphs + transductive experimental design
- IG + Margin: information gain + query closest to the margin
- gActive: dual active feature and sample selection (the proposed method)
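For reference, the "IG" baselines score each binary subgraph feature by the information gain of the class label given feature presence or absence. A minimal sketch (the feature and label values below are made up):

```python
import math

# Sketch of the information-gain feature score used by the IG baselines.

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

def info_gain(feature, labels):
    """feature: 0/1 presence per graph; labels: class per graph."""
    on = [y for f, y in zip(feature, labels) if f == 1]
    off = [y for f, y in zip(feature, labels) if f == 0]
    cond = (len(on) * entropy(on) + len(off) * entropy(off)) / len(labels)
    return entropy(labels) - cond

labels = ["+", "+", "-", "-"]
perfect = [1, 1, 0, 0]   # presence splits the classes exactly
useless = [1, 0, 1, 0]   # presence is independent of the class

print(info_gain(perfect, labels))  # 1.0
print(info_gain(useless, labels))  # 0.0
```

Ranking candidate subgraphs by this score and keeping the top ones gives the supervised feature-selection half of the IG + Random and IG + Margin baselines.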
Experiment Results (NCI-47)
[Figure: accuracy (higher is better) vs. number of queried graphs (#features = 200, NCI-47) for gActive, IG + Random, Freq. + Margin, Freq. + TED, IG + Margin, and Freq. + Random; annotations mark a regime where supervised feature selection trails unsupervised and a regime where it overtakes it]
gActive wins consistently.
Conclusions
Dual Active Feature and Sample Selection for Graph Classification: perform subgraph feature selection and active sample selection simultaneously.

Thank you!
Future work: other data and applications
- itemset and sequence data