Max-Margin Classification of Data with Absent Features Presented by Chunping Wang Machine Learning Group, Duke University July 3, 2008 by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008

Max-Margin Classification of Data with Absent Features

Presented by Chunping Wang

Machine Learning Group, Duke University

July 3, 2008

by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008

Outline

• Introduction

• Standard SVM

• Max-Margin Formulation for Missing Features

• Three Algorithms

• Experimental Results

• Conclusions

Introduction (1)

Pattern of missing features:

• due to measurement noise or corruption: existing but unknown

• due to the inherent properties of the instances: non-existing

Example 1: Two subpopulations of instances (animals and buildings) with few overlapping features (body parts, architectural aspects).

Example 2: In a web-page task, one useful feature of a given page may be the most common topic of the other sites that point to it; however, this particular page may have no such parents.

Introduction (2)

Common methods for handling missing features:

(Assume the features exist but their values are unknown)

• Single imputation: zeros, mean, kNN

• Imputation by building probabilistic generative models

Proposed method (Assume the features are structurally absent):

Each data instance resides in a lower-dimensional subspace of the feature space, determined by its own existing features. We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.
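A minimal sketch of measuring each instance's margin in its own subspace. It assumes absent features are encoded as NaN and that the classifier is linear with weights `w` and bias `b`; the function names are illustrative and do not come from the paper.

```python
import numpy as np

def instance_margins(X, y, w, b=0.0):
    """Geometric margin of each instance, measured only over its present features.

    X: (n, d) array with np.nan marking absent features (assumed encoding).
    y: (n,) labels in {-1, +1}.  w: (d,) weights.  b: bias.
    """
    present = ~np.isnan(X)                      # True where a feature exists
    X0 = np.where(present, X, 0.0)              # absent entries contribute nothing to the score
    scores = X0 @ w + b
    # Norm of w restricted to the features that exist for each instance.
    # Note: this is zero (division by zero) if an instance has no present feature with nonzero weight.
    norms = np.sqrt((present * w**2).sum(axis=1))
    return y * scores / norms

def worst_case_margin(X, y, w, b=0.0):
    """Smallest instance margin, i.e. the quantity the proposed method tries to maximize."""
    return instance_margins(X, y, w, b).min()
```

Usage: with `X = [[2, nan], [-1, -1]]`, `y = [1, -1]` and `w = [1, 1]`, the first instance is scored only against `w[0]`, so its margin is divided by `|w[0]|` rather than by the full norm of `w`.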

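Standard SVM (1)

Binary classification

Real-valued predictors: $x_i \in \mathbb{R}^d$

Binary response: $y_i \in \{-1, 1\}$

A classifier could be defined as

$y = \mathrm{sign}[f(x)]$

based on a linear function

$f(x) = w^T x + b$

Parameters: $(w, b) \in \mathbb{R}^{d+1}$

[Figure: the separating hyperplane $f(x) = 0$, with normal vector $w$ and distance $|b| / \|w\|$ from the origin]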

Standard SVM (2)

Functional margin for each instance: $y_i (w^T x_i + b)$

Geometric margin for each instance: $\rho_i(w) = y_i (w^T x_i + b) / \|w\|$

Geometric margin of a hyperplane $(w, b)$: $\rho(w) = \min_i \rho_i(w) = \min_i y_i (w^T x_i + b) / \|w\|$

SVM: $\max_w \rho(w)$, by fixing the functional margin to 1, i.e., $\min_i y_i (w^T x_i + b) = 1$

$\xi_i$'s: slack variables

C: cost

Quadratic Programming (QP)
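For reference, the slack variables and the cost C belong to the soft-margin SVM, which in its usual textbook form is the following QP. The notation here ($\xi_i$, $n$ training instances) is assumed rather than copied from the slide.

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i (w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

Maximizing the geometric margin with the functional margin fixed to 1 is the same as minimizing $\|w\|$, which is where the quadratic objective comes from.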

Max-Margin Formulation for Missing Features (1)

A 2-D case with missing data

$\rho_1$: margin in the subspace

$\rho_2$: margin in the full feature space

$\rho_1 > \rho_2$

The margin of instances with missing features is underestimated when it is measured in the full feature space.
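A small worked example of this effect, with numbers chosen purely for illustration: take $w = (3, 4)$, $b = 0$, and a positive instance whose only existing feature is the first one, with value 2.

```latex
\text{Subspace (first feature only):} \quad
\rho_1 = \frac{3 \cdot 2}{|3|} = 2
\qquad
\text{Full space, absent feature filled with } 0: \quad
\rho_2 = \frac{3 \cdot 2 + 4 \cdot 0}{\sqrt{3^2 + 4^2}} = \frac{6}{5} = 1.2
```

The numerator is the same in both cases, but the full-space norm also counts the weight of a feature the instance does not have, so the full-space margin can never exceed the subspace margin.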

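Max-Margin Formulation for Missing Features (2)

Instance margin: $\rho_i(w) = y_i (w^{(i)})^T x_i / \|w^{(i)}\|$, where $w^{(i)}$ keeps only the entries of $w$ for the features that exist in $x_i$

Optimization problem: $\max_w \min_i \rho_i(w)$

The instance margin is non-convex in $w$.

$\|w^{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization.

It is difficult to solve this optimization problem directly.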

Three Algorithms (1)

• A convex formulation for the linearly separable case

Introduce a lower bound $\gamma$ for the margin.

For a given $\gamma$, this is a second-order cone program (SOCP), which is convex and can be solved efficiently.

To find the optimal $\gamma$, do a bisection search over $\gamma \in \mathbb{R}$.

Unfortunately, extending it to the non-separable case is difficult.
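One way to write this step out, assuming the lower bound is called $\gamma$ and the instance margin is $y_i (w^{(i)})^T x_i / \|w^{(i)}\|$, with $w^{(i)}$ the entries of $w$ for the features present in $x_i$:

```latex
\max_{w,\, \gamma} \; \gamma
\quad \text{s.t.} \quad
\frac{y_i \, (w^{(i)})^T x_i}{\|w^{(i)}\|} \ge \gamma \;\; \forall i
\qquad \Longleftrightarrow \qquad
\|w^{(i)}\| \le \frac{1}{\gamma} \, y_i \, (w^{(i)})^T x_i \;\; \forall i
\quad (\gamma > 0)
```

For a fixed $\gamma > 0$, each constraint bounds the norm of a linear function of $w$ by another linear function of $w$, which is a second-order cone constraint. Feasibility can then be checked for each candidate $\gamma$, and bisection narrows down the largest feasible value.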

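Three Algorithms (2)

Average norm: a convex approximation for the non-separable case

Define an average norm over the instances to get rid of the instance dependence.

Non-separable case: with the averaged norm in place of the instance-specific norms, the problem becomes convex.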
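A minimal sketch of replacing the instance-specific norms with one shared value. The averaging rule used here (root mean square of the per-instance norms) is an assumption made for illustration, since the defining equation is not part of this transcript; the function names and the boolean `present` mask are also illustrative.

```python
import numpy as np

def instance_norms(w, present):
    """Norm of w restricted to each instance's present features.

    w: (d,) weight vector.  present: (n, d) boolean mask, True where a feature exists.
    """
    return np.sqrt((present * w**2).sum(axis=1))

def rms_norm(w, present):
    """One shared norm for all instances: root mean square of the per-instance norms.

    ASSUMPTION: root mean square is used as the average; another average
    would be coded the same way.
    """
    return np.sqrt(np.mean(instance_norms(w, present) ** 2))
```

Because the shared value no longer depends on the instance index, it can be pulled out of the minimization over instances, in the same way that the full norm of `w` is in the standard SVM.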

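Three Algorithms (3)

Geometric margin: an exact non-convex approach for the non-separable case

Define instance-specific scaling coefficients $s_i = \|w^{(i)}\| / \|w\|$, so that the instance margin becomes $y_i (w^{(i)})^T x_i / (s_i \|w\|)$.

Non-separable case: for a given set of $s_i$'s, the problem is a QP.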

Three Algorithms (4)

Geometric margin: the exact non-convex approach for the non-separable case

Pseudo-code: alternate between solving the QP for fixed $s_i$'s and updating the $s_i$'s from the new $w$.

Convergence is not always guaranteed. Cross-validation is used to choose an early stopping point.
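A rough sketch of such an alternating scheme, under several assumptions: absent features are encoded as NaN, the bias term is omitted, the scaling coefficients are $s_i = \|w^{(i)}\| / \|w\|$, and the QP step is replaced by plain subgradient descent on the equivalent hinge-loss objective so that the example needs only NumPy. It is not the authors' implementation, and all names are illustrative.

```python
import numpy as np

def fit_scaled_hinge(X0, y, s, C=1.0, lr=0.01, epochs=2000):
    """Stand-in for the QP step: subgradient descent on
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i) / s_i).
    X0 must have absent entries already set to 0."""
    w = np.zeros(X0.shape[1])
    for _ in range(epochs):
        active = y * (X0 @ w) / s < 1                 # instances violating the scaled margin
        pull = (y[active] / s[active]) @ X0[active]   # sum of y_i * x_i / s_i over violators
        w -= lr * (w - C * pull)                      # subgradient step
    return w

def geometric_margin_fit(X, y, n_iter=5, **inner):
    """Alternate: fit w for fixed s, then recompute s_i = ||w^(i)|| / ||w||.
    n_iter plays the role of the early stopping point."""
    present = ~np.isnan(X)
    X0 = np.where(present, X, 0.0)
    s = np.ones(len(y))                               # start as if no feature were absent
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = fit_scaled_hinge(X0, y, s, **inner)
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:
            break
        s = np.sqrt((present * w**2).sum(axis=1)) / norm_w
        s = np.maximum(s, 1e-8)                       # guard against division by zero
    return w, s

# Toy data: two features, some instances have only one of them.
X = np.array([[2.0, 1.0], [1.5, np.nan], [np.nan, 2.0],
              [-2.0, -1.0], [-1.5, np.nan], [np.nan, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, s = geometric_margin_fit(X, y)
```

In practice `n_iter` would be chosen by cross-validation, as the slide notes, because the outer loop is not guaranteed to converge.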

Experimental Results (1)

Zero. Missing values were set to zero.

Mean. Missing values were set to the average value of the feature over all data.

Flag. Additional features ("flags") were added, explicitly denoting whether a feature is missing for a given instance.

kNN. Missing features were set to the mean value obtained from the K nearest-neighbor instances.

EM. A Gaussian mixture model (GMM) is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample.

Averaged norm (avg |w|). Proposed approximate convex approach.

Geometric margin (geom). Proposed exact non-convex approach.
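The three simplest baselines above can be sketched as follows. This assumes missing values are encoded as NaN; the function names are illustrative, and filling with zero before appending the flags is an assumption, since the slide does not say which fill value accompanies the flags.

```python
import numpy as np

def impute_zero(X):
    """Zero: replace every missing value with 0."""
    return np.where(np.isnan(X), 0.0, X)

def impute_mean(X):
    """Mean: replace a missing value with the mean of the observed values of that feature."""
    col_means = np.nanmean(X, axis=0)       # ignores NaN entries column by column
    return np.where(np.isnan(X), col_means, X)

def add_flags(X):
    """Flag: append one indicator per feature (1.0 if missing, else 0.0).
    ASSUMPTION: the original columns are zero-filled."""
    missing = np.isnan(X)
    return np.hstack([np.where(missing, 0.0, X), missing.astype(float)])
```

All three treat the feature as existing but unknown, which is the assumption the proposed methods avoid.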

Experimental Results (2)

UCI data sets (missing at random): remove 90% of the features of each sample at random.

Digits 5 & 6 from MNIST: remove a patch covering 25% of the pixels, with the location of the patch sampled uniformly.
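A minimal sketch of the two removal schemes, assuming removed entries are marked with NaN and that the image patch is square; both are assumptions for illustration, as are the function names.

```python
import numpy as np

def drop_random_features(X, frac=0.9, rng=None):
    """Remove a fixed fraction of the features of each sample, chosen at random."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    n, d = X.shape
    k = int(round(frac * d))                         # number of features to remove per sample
    for i in range(n):
        X[i, rng.choice(d, size=k, replace=False)] = np.nan
    return X

def drop_square_patch(img, frac=0.25, rng=None):
    """Remove a square patch covering about `frac` of the pixels, at a uniform location."""
    rng = np.random.default_rng(rng)
    img = img.astype(float).copy()
    h, w = img.shape
    side = int(round(np.sqrt(frac * h * w)))         # 14 for a 28-by-28 image and frac=0.25
    r = rng.integers(0, h - side + 1)                # top-left corner, patch stays inside the image
    c = rng.integers(0, w - side + 1)
    img[r:r + side, c:c + side] = np.nan
    return img
```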

Experimental Results (3)

Visual object recognition

Task: to determine whether an automobile is present in a given image.

Feature extraction pipeline:

Local edge information → generative model → likelihood of patches to match each of 19 landmarks → set a threshold → (up to 10) candidate patches (21-by-21 pixels) for each landmark → PCA → first 10 principal components for each patch → concatenate → a feature vector (up to 1900 features)

If the number of candidates for a given landmark is less than ten, we consider the rest to be structurally absent.
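The size of the feature vector follows from 19 landmarks × 10 candidates × 10 principal components = 1900. A minimal sketch of the concatenation step, assuming absent candidates are marked with NaN and that the PCA projections are already computed (the function and constant names are illustrative):

```python
import numpy as np

N_LANDMARKS, MAX_CANDIDATES, N_COMPONENTS = 19, 10, 10

def build_feature_vector(candidates):
    """Concatenate per-landmark candidate descriptors into one vector of length 1900.

    candidates: list of 19 arrays, one per landmark, each of shape (k, 10)
                with k <= 10 (the PCA projections of the k candidate patches).
    Slots for candidates that do not exist are left as NaN (structurally absent).
    """
    feats = np.full((N_LANDMARKS, MAX_CANDIDATES, N_COMPONENTS), np.nan)
    for j, c in enumerate(candidates):
        c = np.asarray(c, dtype=float).reshape(-1, N_COMPONENTS)
        feats[j, :len(c)] = c                 # fill only the candidates that were found
    return feats.ravel()
```

An image with three candidates for every landmark would fill 19 × 3 × 10 = 570 entries and leave the remaining 1330 absent.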

Experimental Results (4)

An example image: the best 5 candidates matched to the front windshield landmark

Experimental Results (5)

Experimental Results (6)

Metabolic pathway reconstruction

A fragment of the full metabolic pathway network

Arrows: chemical reactions

Purple boxed names: enzymes

Experimental Results (7)

Three types of neighborhood relations between enzyme pairs:

Linear chains (ARO7, PHA2)

Forks (TRP2, ARO7): same input, different outputs

Funnels (ARO9, PHA2): same output, different inputs

One feature vector (represents an enzyme)

Features for linear chain neighbor

Features for fork neighbor

Features for funnel neighbor

A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors, e.g., PHA2 does not have a neighbor of type fork.

Experimental Results (8)

Task: to identify if a candidate enzyme is in the right “neighborhood”.

Data creation:

Positive samples: from the reactions with known enzymes (in the right “neighborhood”);

Negative samples: for each positive sample, replace the true enzyme with a random impostor, and calculate the features in such a wrong “neighborhood”. The impostor was uniformly chosen from the set of other enzymes.
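The negative-sample construction can be sketched as below. The data layout (a list of `(reaction, enzyme)` pairs) and the function name are hypothetical, and computing the neighborhood features for each impostor is left out.

```python
import random

def make_negative_samples(positives, all_enzymes, seed=0):
    """For each positive (reaction, true_enzyme) pair, draw an impostor
    uniformly from the set of other enzymes.

    positives:   list of (reaction_id, true_enzyme) pairs (hypothetical layout).
    all_enzymes: list of all enzyme names.
    """
    rng = random.Random(seed)
    negatives = []
    for reaction, true_enzyme in positives:
        others = [e for e in all_enzymes if e != true_enzyme]
        negatives.append((reaction, rng.choice(others)))   # uniform over the other enzymes
    return negatives
```

This yields one negative sample per positive sample, so the two classes are balanced by construction.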

Experimental Results (9)

Conclusions

1. The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain.

2. The authors directly classified instances by skipping the non-existing features, rather than filling them with hypothetical values.

3. The proposed model was competitive with a range of single imputation approaches when tested in missing-at-random (MAR) settings.

4. One variant (geometric margin) significantly outperformed other methods in two real problems with non-existing features.