Feature selection methods


Page 1: Feature selection methods

Feature selection methods

Isabelle Guyon [email protected]

IPAM summer school on Mathematics in Brain Imaging. July 2008

Page 2: Feature selection methods

Feature Selection

• Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.

[Figure: data matrix X of m examples by n features, reduced to n' selected features.]

Page 3: Feature selection methods

Applications

[Figure: application domains plotted by number of variables/features (10 to 10^5) and number of examples (10 to 10^6): Bioinformatics, Quality control, Machine vision, Customer knowledge, OCR/HWR, Market analysis, Text categorization, System diagnosis.]

Page 4: Feature selection methods

Nomenclature

• Univariate method: considers one variable (feature) at a time.

• Multivariate method: considers subsets of variables (features) together.

• Filter method: ranks features or feature subsets independently of the predictor (classifier).

• Wrapper method: uses a classifier to assess features or feature subsets.

Page 5: Feature selection methods

Univariate Filter

Methods

Page 6: Feature selection methods

Univariate feature ranking

• Normally distributed classes, equal variance σ² unknown; estimated from data as σ²within.

• Null hypothesis H0: μ+ = μ−

• T statistic: if H0 is true, t = (μ+ − μ−) / (σwithin √(1/m+ + 1/m−)) ~ Student, m+ + m− − 2 d.f.

[Figure: class-conditional densities P(Xi|Y=−1) and P(Xi|Y=1) along xi, with means μ− and μ+.]
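The two-sample t statistic above can be sketched in a few lines. This is an illustrative pure-Python stand-in (the function name and the toy data are invented for the example):

```python
import math

def t_statistic(pos, neg):
    """Two-sample t statistic with pooled (within-class) variance,
    as used for univariate feature ranking."""
    m_pos, m_neg = len(pos), len(neg)
    mu_pos = sum(pos) / m_pos
    mu_neg = sum(neg) / m_neg
    # Pooled within-class variance estimate (m+ + m- - 2 degrees of freedom)
    ss = sum((x - mu_pos) ** 2 for x in pos) + sum((x - mu_neg) ** 2 for x in neg)
    sigma_within = math.sqrt(ss / (m_pos + m_neg - 2))
    return (mu_pos - mu_neg) / (sigma_within * math.sqrt(1 / m_pos + 1 / m_neg))

# A feature with well-separated class means scores much higher
# than a feature whose class means coincide.
informative = t_statistic([5.1, 5.3, 4.9, 5.2], [1.0, 1.2, 0.8, 1.1])
uninformative = t_statistic([1.0, 2.0, 1.5, 1.8], [1.1, 1.9, 1.4, 1.9])
```

Ranking features by |t| then amounts to sorting them by this score.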

Page 7: Feature selection methods

• H0: X and Y are independent.

• Relevance index → test statistic.

• P-value → false positive rate FPR = nfp / nirr

• Multiple testing problem: use the Bonferroni correction pval ← n pval

• False discovery rate: FDR = nfp / nsc ≤ FPR n/nsc

• Probe method: FPR ≈ nsp / np

[Figure: null distribution of the relevance index r, with threshold r0 and the p-value as the tail area beyond r0.]

Statistical tests (chap. 2)
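The probe method above can be sketched as follows: random "probe" features known to be irrelevant are scored alongside the real ones, and the fraction of probes passing the selection threshold estimates the false positive rate. A pure-Python sketch with invented toy scores:

```python
import random

def probe_fpr(scores_probe, threshold):
    """FPR ~= n_sp / n_p: fraction of random probes whose relevance
    score exceeds the selection threshold."""
    n_sp = sum(1 for s in scores_probe if s >= threshold)
    return n_sp / len(scores_probe)

random.seed(0)
# Hypothetical relevance scores: three clearly informative features
# plus noise, and probe scores from purely random (fake) features.
real = [8.2, 7.5, 6.9] + [random.gauss(0, 1) for _ in range(97)]
probes = [random.gauss(0, 1) for _ in range(1000)]

fpr_at_2 = probe_fpr(probes, threshold=2.0)
n_selected_real = sum(1 for s in real if s >= 2.0)
```

Raising the threshold trades a lower estimated FPR against discarding weakly relevant real features.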

Page 8: Feature selection methods

Univariate Dependence

• Independence: P(X, Y) = P(X) P(Y)

• Measure of dependence (mutual information):

MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X)P(Y)) ] dX dY = KL( P(X,Y) || P(X)P(Y) )
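For discrete variables the integral becomes a sum over joint counts. A minimal pure-Python sketch (function name and toy data are illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X, Y) = sum_{x,y} P(x,y) log [ P(x,y) / (P(x) P(y)) ],
    i.e. the KL divergence between the joint distribution and the
    product of the marginals, estimated from empirical counts."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) / ((px/n)*(py/n)) simplifies to c*n / (px*py)
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# X identical to Y: MI equals H(Y) = log 2 for a balanced binary Y.
dependent = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
# X carries no information about Y: MI = 0.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```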

Page 9: Feature selection methods

The choice of a feature selection/ranking method depends on the nature of:

• the variables and the target (binary, categorical, continuous)

• the problem (dependencies between variables, linear/non-linear relationships between variables and target)

• the available data (number of examples and number of variables, noise in data)

• the available tabulated statistics.

Other criteria (chap. 3)

Page 10: Feature selection methods

Multivariate Methods

Page 11: Feature selection methods

Univariate selection may fail

Guyon-Elisseeff, JMLR 2004; Springer 2006

Page 12: Feature selection methods

Filters, Wrappers, and Embedded methods

• Filter: All features → Filter → Feature subset → Predictor

• Wrapper: All features → Wrapper (iterates over multiple feature subsets, scoring each with the Predictor) → Predictor

• Embedded method: All features → Embedded method (selection performed inside the training of the Predictor) → Feature subset + Predictor

Page 13: Feature selection methods

Relief

For each example, find its nearest hit (closest example of the same class, at distance Dhit) and its nearest miss (closest example of the other class, at distance Dmiss):

Relief = ⟨Dmiss / Dhit⟩

Kira and Rendell, 1992
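A minimal sketch of the ⟨Dmiss/Dhit⟩ criterion above. Note the original Kira-Rendell algorithm maintains per-feature weights; this simplified version scores a whole feature representation at once, and all names and data are illustrative:

```python
def euclid(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def relief_score(points, labels):
    """Average of D_miss / D_hit over all examples: for each point,
    D_hit is the distance to its nearest same-class neighbor and
    D_miss the distance to its nearest other-class neighbor.
    Assumes every class has at least two examples."""
    ratios = []
    for i, (p, y) in enumerate(zip(points, labels)):
        hits = [euclid(p, q) for j, (q, z) in enumerate(zip(points, labels))
                if j != i and z == y]
        misses = [euclid(p, q) for q, z in zip(points, labels) if z != y]
        ratios.append(min(misses) / min(hits))
    return sum(ratios) / len(ratios)

# Two well-separated classes score far higher than overlapping ones.
separated = relief_score([[0.0], [0.2], [5.0], [5.2]], [0, 0, 1, 1])
overlapping = relief_score([[0.0], [1.0], [0.5], [1.5]], [0, 0, 1, 1])
```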

Page 14: Feature selection methods

Wrappers for feature selection

N features, 2^N possible feature subsets!

Kohavi-John, 1997

Page 15: Feature selection methods

• Exhaustive search.
• Simulated annealing, genetic algorithms.
• Beam search: keep the k best paths at each step.
• Greedy search: forward selection or backward elimination.
• PTA(l,r): plus l, take away r – at each step, run SFS l times then SBS r times.
• Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far. At any time, if a better subset of the same size was already found, switch abruptly.

Search Strategies (chap. 4)

Page 16: Feature selection methods

Forward Selection (wrapper)

[Figure: search tree starting from the empty set (Start), with n candidate features at the first step, then n−1, n−2, …, down to 1.]

Also referred to as SFS: Sequential Forward Selection
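Sequential forward selection with a wrapped predictor can be sketched as follows. The nearest-centroid classifier and leave-one-out scoring are stand-ins for whatever predictor and validation scheme one actually wraps, and the toy data are invented:

```python
def nearest_centroid_errors(X, y, features):
    """Leave-one-out error count of a nearest-centroid classifier
    restricted to the given feature indices (the wrapped predictor).
    Assumes every class has at least two examples."""
    errors = 0
    for i in range(len(X)):
        cents = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            cents[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        dists = {label: sum((X[i][f] - c[k]) ** 2 for k, f in enumerate(features))
                 for label, c in cents.items()}
        if min(dists, key=dists.get) != y[i]:
            errors += 1
    return errors

def forward_selection(X, y, n_select):
    """SFS: greedily add the feature whose inclusion gives the lowest
    wrapped-predictor error, never revisiting alternative paths."""
    selected = []
    remaining = list(range(len(X[0])))
    while len(selected) < n_select:
        best = min(remaining,
                   key=lambda f: nearest_centroid_errors(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Feature 1 separates the classes; features 0 and 2 are noise.
X = [[0.1, 0.0, 0.9], [0.9, 0.1, 0.2], [0.2, 1.0, 0.8], [0.8, 1.1, 0.1]]
y = [0, 0, 1, 1]
```

Backward elimination (SBS) is the mirror image: start from all features and greedily remove the one whose removal hurts least.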

Page 17: Feature selection methods

Forward Selection (embedded)

[Figure: the same search tree as for the wrapper, but only a single path is explored.]

Guided search: we do not consider alternative paths. Typical ex.: Gram-Schmidt orthogonalization and tree classifiers.

Page 18: Feature selection methods

Backward Elimination (wrapper)

[Figure: search starting from all n features (Start), removing one feature per step: n, n−1, n−2, …, 1.]

Also referred to as SBS: Sequential Backward Selection

Page 19: Feature selection methods

Backward Elimination (embedded)

[Figure: the same elimination path as for the wrapper, but only a single path is explored.]

Guided search: we do not consider alternative paths. Typical ex.: “recursive feature elimination” RFE-SVM.
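RFE can be sketched with any linear model that yields per-feature weights. Here a plain least-squares fit by gradient descent stands in for the SVM of RFE-SVM; all names and the toy data are illustrative:

```python
def linear_weights(X, y, lr=0.1, iters=2000):
    """Fit w for y ~ X w by plain gradient descent on squared error
    (a stand-in for the linear SVM used in RFE-SVM)."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    for _ in range(iters):
        grad = [0.0] * n_feat
        for xi, yi in zip(X, y):
            err = sum(wf * xf for wf, xf in zip(w, xi)) - yi
            for f in range(n_feat):
                grad[f] += err * xi[f]
        for f in range(n_feat):
            w[f] -= lr * grad[f] / len(X)
    return w

def rfe(X, y, n_keep):
    """Recursive feature elimination: train, drop the feature with the
    smallest |w|, retrain on the survivors, repeat."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        Xa = [[row[f] for f in active] for row in X]
        w = linear_weights(Xa, y)
        worst = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[worst]
    return active

# Targets depend on feature 0 only (y = 2*x0); feature 1 is noise.
X = [[1.0, 0.3], [2.0, -0.2], [3.0, 0.1], [4.0, -0.3]]
y = [2.0, 4.0, 6.0, 8.0]
```

Retraining after each elimination matters: weights can change once a correlated feature is gone.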

Page 20: Feature selection methods

Scaling Factors

Idea: transform a discrete space into a continuous space.

• Discrete indicators of feature presence: σi ∈ {0, 1}

• Continuous scaling factors: σi ∈ ℝ

σ = [σ1, σ2, σ3, σ4]

Now we can do gradient descent!

Page 21: Feature selection methods

• Many learning algorithms are cast into a minimization of some regularized functional: empirical error + regularization (capacity control).

Formalism (chap. 5)

Next few slides: André Elisseeff

Justification of RFE and many other embedded methods.

Page 22: Feature selection methods

Embedded method

• Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms:
– Find a functional that represents your prior knowledge about what a good model is.
– Add the weights σ into the functional and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently.
– Optimize alternately over w and σ.
– Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.

• Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass, regression, etc.

Page 23: Feature selection methods

The l1 SVM

• A version of the SVM where the regularizer Ω(w) = ||w||² is replaced by the l1 norm Ω(w) = Σi |wi|

• Can be considered an embedded feature selection method:
– Some weights will be drawn to zero (tends to remove redundant features)
– Difference from the regular SVM, where redundant features are included

Bi et al., 2003; Zhu et al., 2003

Page 24: Feature selection methods

The l0 SVM

• Replace the regularizer ||w||² by the l0 norm ||w||0 (the number of nonzero weights).

• Further replace ||w||0 by Σi log(ε + |wi|).

• Boils down to a multiplicative update algorithm: alternately train the SVM and rescale the inputs by |w|.

Weston et al., 2003

Page 25: Feature selection methods

Causality

Page 26: Feature selection methods

What can go wrong?

Guyon-Aliferis-Elisseeff, 2007

[Figure: scatter plot of X2 versus X1.]

Page 27: Feature selection methods

What can go wrong?

[Figure: the same data shown as a scatter plot of X2 versus X1 and as separate one-dimensional projections onto X1 and X2.]

Page 28: Feature selection methods

What can go wrong?

Guyon-Aliferis-Elisseeff, 2007

[Figure: alternative causal graphs relating X1, X2, and Y that are consistent with the same data.]

Page 29: Feature selection methods

http://clopinet.com/causality

Page 30: Feature selection methods

Wrapping up

Page 31: Feature selection methods

Bilevel optimization

1) For each feature subset, train the predictor on the training data.

2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).

3) Test on the test data.

Split the data (M samples × N variables/features) into 3 sets: training (m1), validation (m2), and test (m3).

Page 32: Feature selection methods

Complexity of Feature Selection

Method                            Number of subsets tried    Complexity C
Exhaustive search wrapper         2^N                        N
Nested subsets / feature ranking  N(N+1)/2 or N              log N

With high probability:

Generalization_error ≤ Validation_error + ε(C/m2)

m2: number of validation examples, N: total number of features, n: feature subset size.

[Figure: error as a function of the feature subset size n.]

Try to keep C of the order of m2.

Page 33: Feature selection methods

CLOP http://clopinet.com/CLOP/

• CLOP = Challenge Learning Object Package.
• Based on the Matlab® Spider package developed at the Max Planck Institute.
• Two basic abstractions:
– Data object
– Model object
• Typical script:
– D = data(X,Y); % Data constructor
– M = kridge; % Model constructor
– [R, Mt] = train(M, D); % Train model => Mt
– Dt = data(Xt, Yt); % Test data constructor
– Rt = test(Mt, Dt); % Test the model

Page 34: Feature selection methods

NIPS 2003 FS challenge

http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html

Page 35: Feature selection methods

Conclusion

• Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.

• Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice.

• Taking a closer look at the type of dependencies in terms of causal relationships may help refine the notion of variable relevance.

Page 36: Feature selection methods

Acknowledgements and references

1) Feature Extraction: Foundations and Applications. I. Guyon et al., Eds. Springer, 2006. http://clopinet.com/fextract-book

2) Causal feature selection. I. Guyon, C. Aliferis, A. Elisseeff. To appear in “Computational Methods of Feature Selection”, Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press, 2007. http://clopinet.com/causality