Feature and Variable Selection in Classification

Aaron Karper

December 16, 2013

Abstract

The amount of information in the form of features and variables available to machine learning algorithms is ever increasing. This can lead to classifiers that are prone to overfitting in high dimensions; moreover, high-dimensional models do not lend themselves to interpretable results, and the CPU and memory resources necessary to run on high-dimensional datasets severely limit the applications of the approaches.

Variable and feature selection aim to remedy this by finding a subset of features that in some way captures the information provided best.

In this paper we present the general methodology and highlight some specific approaches.

1 Introduction

As machine learning develops as a field, it becomes clear that the issue of finding good features is often more difficult than the task of using the features to create a classification model. Often more features are available than can reasonably be expected to be used.

1.1 Overfitting

One of the reasons why more features can actually hinder accuracy is that the more features we have, the less we can depend on the measures of distance that many classifiers (e.g. SVM, linear regression, k-means, Gaussian mixture models, . . . ) require. This is known as the curse of dimensionality.

Accuracy might also be lost because we are prone to overfit the model if it incorporates all the features.

Example. In a study of the genetic causes of cancer, we might end up with 15 participants with cancer and 15 without. Each participant has 21'000 gene expressions. If we assume that any number of genes in combination can cause cancer, even if we underestimate the number of possible genomes by assuming the expressions to be binary, we end up with 2^(21'000) possible models. In this huge number of possible models, there is bound to be one arbitrarily complex model that fits the observation perfectly, but has little to no predictive power [Russell et al., 1995, Chapter 18, Noise and Overfitting]. If we were to limit the complexity of the model we fit in some way, for example by discarding nearly all possible variables, we would attain better generalisation.


1.2 Interpretability

If we take a classification task and want to gain some information from the trained model, model complexity can hinder any insights. If we take up the gene example, a small model might actually show which proteins (produced by the culprit genes) cause the cancer, and this might lead to a treatment.

1.3 Computational complexity

Often the solution to a problem needs to fulfil certain time constraints. If a robot takes more than a second to classify a ball flying at it, it will not be able to catch it. If the problem is of a lower dimensionality, the computational complexity goes down as well.

Sometimes this is only relevant for the prediction phase of the learner, but if the training is too complex, it might become infeasible.

1.3.1 Segmentation in Computer Vision

A domain that necessarily deals with a huge number of dimensions is computer vision. Even a plain VGA image, in which only the actual pixel values are taken into account, gives 480 × 640 = 307'200 datapoints.

For a segmentation task in document analysis, where pixels need to be classified into regions like border, image, and text, there is more to be taken into account than just the raw pixel values in order to incorporate spatial information, edges, etc. With up to 200 features to consider for each pixel, the evaluation would take too long to be useful.

1.4 Previous work

The article [Guyon and Elisseeff, 2003] gives a broad introduction to feature selection and creation, but as ten years have passed, the state of the art has moved on. In this paper we will discuss the conclusions of Guyon and Elisseeff in sections 2 and 3, adding some intuitions about the way to arrive at them, and showing some more recent developments in the field in section 4.

2 Classes of methods

Depending on the task to solve, different methodologies need to be considered [Guyon and Elisseeff, 2003]. They identify four general approaches:

Ranking orders the features according to some score.

Filters build a feature set according to some heuristic.

Wrappers build a feature set according to the predictive power of the classifier.

Embedded methods learn the classification model and the feature selection at the same time.

If the task is to predict as accurately as possible, an algorithm that has a safeguard against overfitting might be better than ranking. If a pipeline scenario is considered, something that treats the following phases as a black box would be more useful. If even the time to reduce the dimensionality is valuable, a ranking would help.

Figure 1: A ranking procedure would find that both features are equally useless to separate the data and would discard them. If taken together, however, the features would separate the data very well.

2.1 Ranking

Variables get a score in the ranking approach and the top n variables are selected. This has the advantage that n is simple to control and that the selection runs in linear time.
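To make the ranking idea concrete, the following is a minimal sketch (the scoring by absolute correlation, the helper name rank_features, and the random data are illustrative assumptions, not taken from any of the cited papers):

    import numpy as np

    def rank_features(X, y, n):
        """Score each column of X by its absolute correlation with the
        labels y and return the indices of the n best-scoring features."""
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = (y - y.mean()) / y.std()
        scores = np.abs(Xc.T @ yc) / len(y)   # |corr(x_j, y)| for each feature j
        return np.argsort(scores)[::-1][:n]   # indices of the top-n features

    # Hypothetical usage: 30 samples with 21'000 (here random) gene expressions.
    X = np.random.rand(30, 21000)
    y = np.random.randint(0, 2, size=30)
    top10 = rank_features(X, y, n=10)

Each feature is scored independently of the others, which is what keeps ranking cheap compared to the approaches below.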

In [Zhou et al., 2005], the authors try to find a discriminative subset of genes to find out whether a tumor is malignant or benign¹. In order to prune the feature base, they rank the variables according to their correlation with the classes and make a preliminary selection, which discards most of the genes in order to speed up the more sophisticated procedures that select the top 10 features.

There are however problems with this approach that show a basic limitation of ranking:

Take as an example two genes X and Y, such that if exactly one of them is mutated the tumor is malignant, which we denote by M, but if both mutate, the changes cancel each other out, so that no tumor grows. Each variable separately would be useless, because P(M = true | X = true) = P(Y = false), but P(M = true | X = true, Y = false) = 1. If both variables were selected, however, together they would form a perfectly discriminant feature. Since each on its own is useless, neither would rank high, and both would probably be discarded by the ranking procedure, as illustrated in figure 1.

¹ Actually they classify the tumors into 3 to 5 classes.
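A hypothetical simulation of this effect (with M = X xor Y): each gene alone is uncorrelated with the outcome, while both together determine it completely.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=10_000)   # gene X mutated?
    Y = rng.integers(0, 2, size=10_000)   # gene Y mutated?
    M = X ^ Y                             # malignant iff exactly one mutation

    print(np.corrcoef(X, M)[0, 1])        # ~ 0: X alone carries no signal
    print(M[X == 1].mean())               # P(M = true | X = true)  ~ 0.5
    print(M[(X == 1) & (Y == 0)].mean())  # P(M = true | X = true, Y = false) = 1.0

Any score computed on one variable at a time, as in a ranking, cannot see this kind of joint dependency.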


Figure 2: Features might be identically distributed, but using both can reduce variance and thus confusion by a factor of √n.

2.2 Filters

While ranking approaches ignore the value that a variable can have in connection with another, filters select a subset of features according to some predetermined criterion. This criterion is independent of the classifier that is used after the filtering step. On one hand this allows the following classifier to be trained only once, which again might be more cost-effective. On the other hand it also means that only some heuristics are available for how well the classifier will do afterwards.

Filtering methods typically try to reduce in-class variance and to boost inter-class distance. An example of this approach is a filter that would maximize the correlation between the variable set and the classification, but minimize the correlation between the variables themselves. This rests on the heuristic that variables that correlate with each other don't provide much additional information compared to just taking one of them.
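A sketch of such a filter in the spirit of max-relevance/min-redundancy selection (the greedy strategy, the trade-off weight alpha, and the helper name filter_select are assumptions made for illustration):

    import numpy as np

    def filter_select(X, y, n, alpha=1.0):
        """Greedily pick features that correlate with the classes (relevance)
        while penalising correlation with already chosen features (redundancy)."""
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        yz = (y - y.mean()) / y.std()
        relevance = np.abs(Z.T @ yz) / len(y)         # |corr(x_j, y)|
        corr = np.abs(np.corrcoef(Z, rowvar=False))   # |corr(x_j, x_k)|
        selected = [int(np.argmax(relevance))]
        while len(selected) < n:
            redundancy = corr[:, selected].mean(axis=1)
            score = relevance - alpha * redundancy
            score[selected] = -np.inf                 # never pick a feature twice
            selected.append(int(np.argmax(score)))
        return selected

Note that no classifier is trained here; the criterion is only a heuristic stand-in for the classifier's eventual performance.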

This is not necessarily the case: if a variable is noisy, a second, correlated variable can be used to get a better signal, as can be seen in figure 2.

A problem with the filtering approach is that the performance of the classifier might not depend as much as we would hope on the proxy measure that we used to find the subset. In this scenario it might be better to assess the accuracy of the classifier itself.

2.3 Wrappers

Wrappers allow the classifier to be viewed as a black box and therefore break the pipeline metaphor. They optimize some performance measure of the classifier as the objective function. While this gives superior results to the heuristics of filters, it also costs in computation time, since a classifier needs to be trained each time – though shortcuts might be available depending on the classifier trained.

Wrappers are in essence search procedures through the feature subset space – the atomic movements are to add or to remove a certain feature. This means that many combinatorial optimization procedures can be applied, such as simulated annealing, branch-and-bound, etc. The search can either start from the full feature set, where we try to reduce the number of features in an optimal way (backward elimination), or start with no features and add them in a smart way (forward selection).
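A sketch of forward selection as a wrapper (scikit-learn is assumed for the black-box classifier and the cross-validated accuracy; any classifier exposing fit and predict would work the same way):

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def forward_selection(X, y, n, estimator=None):
        """Greedily add the feature whose inclusion yields the best
        cross-validated accuracy of the wrapped classifier.
        X is assumed to be a NumPy array of shape (samples, features)."""
        estimator = estimator if estimator is not None else SVC(kernel="linear")
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n and remaining:
            best = max(
                remaining,
                key=lambda j: cross_val_score(
                    estimator, X[:, selected + [j]], y, cv=5).mean(),
            )
            selected.append(best)
            remaining.remove(best)
        return selected

Every candidate evaluation retrains the classifier, which is exactly the computational cost mentioned above; backward elimination works analogously, starting from the full set and removing features.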

2.4 Embedded

Wrappers treated the classifier as a black box; therefore a combinatorial optimization was necessary, with the classifier retrained in each step of the search. If the classifier allows feature selection as part of the learning step, the learning needs to be done only once and often more efficiently.

A simple way to allow this is to optimize in the classifier not only the likelihood of the data, but the posterior probability (MAP) under some prior on the model that makes less complex models more probable. An example of this can be found in section 4.2.

Somewhat similar is an SVM with an ℓ1 weight constraint². The 1:1 exchange means that non-discriminative variables will end up with a weight of exactly 0. It is also possible to take this a step further by optimizing for the number of variables directly, since ℓ0(w) = lim_{p→0} Σi |wi|^p is exactly the number of non-zero entries of the vector.
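As a sketch of the embedded ℓ1 idea (scikit-learn's ℓ1-penalised logistic regression serves as a stand-in here; X_train and y_train are assumed to be given):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # The l1 penalty drives the weights of non-discriminative variables to
    # exactly zero; C controls how strongly sparsity is enforced.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X_train, y_train)

    selected = np.flatnonzero(clf.coef_[0])   # features with non-zero weight
    print(len(selected), "of", X_train.shape[1], "variables kept")

Feature selection and training happen in one fit, which is what distinguishes embedded methods from wrappers.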

3 Feature creation

In the previous chapter the distinction between variables and features was not necessary, since both could be used as input to the classifier after feature selection. In this section, the features are the vector offered to the classifier and the variables are the vector handed to the feature creation step, i.e. the raw inputs collected. For much the same reasons that motivated feature selection, feature creation aims for a smaller number of features compared to the number of variables provided.

Essentially the information needs to be compressed in some way to be stored in fewer variables. Formally this can be expressed by mapping the high-dimensional space through a bottleneck, which we hope results in recovering the low-dimensional concepts that created the high-dimensional representation in the first place. In any case it means that typical features are created, with a similar intuition to efficient codes in compression: if a simple feature occurs often, giving it a representation will reduce the loss more than representing a less common feature. In fact, compression algorithms can be seen as a kind of feature creation [Argyriou et al., 2008].

² ℓp(w) = ‖w‖p = (Σi |wi|^p)^(1/p)

This is also related to the idea of manifold learning: while the variable space is big, the actual space in which the variables vary is much smaller – a manifold³ of hidden variables embedded in the variable space.

Example. In [Torresani et al., 2008] the human body is modelled as a low-dimensional model by probabilistic principal component analysis: it is assumed that the hidden variables are distributed as Gaussians in a low-dimensional space and are then linearly mapped to the high-dimensional space of positions of pixels in an image. This allows them to learn typical positions that a human body can be in and with that to track body shapes in 3D even if neither the camera nor the human is fixed.

4 Current examples

4.1 Nested subset methods

In nested subset methods the feature subset space is greedily examined by estimating the expected gain of adding one feature in forward selection or the expected loss of removing one feature in backward selection. This estimate is called the objective function. If it is possible to examine the objective function for a classifier directly, better performance is gained by embedding the search procedure in it. If that is not possible, training and evaluating the classifier is necessary in each step.

Example. Consider a model of a linear predictor p(y|x) with M input variables needing to be pruned to N input variables. This can be modeled by asserting that the real variables x*i are taken from R^N, but a linear transformation A ∈ R^(M×N) and a noise term ni ∼ N(0, σx²) are added:

xi = A x*i + ni

In a classification task⁴, we can model y = Ber(sigm(w · x*)). This can be seen as a generalisation of PCA⁵ to the case where the output variable is taken into account. Standard supervised PCA assumes that the output is distributed as a Gaussian, which is a dangerous simplification in the classification setting [Guo, 2008].

The procedure iterates over the eigenvectors of the natural parameters of the joint distribution of the input and the output and adds them if they show an improvement over the current model, in order to capture the influence of the input on the output optimally. If more than N variables are in the set, the one with the least favorable score is dropped. The algorithm iterates some fixed number of times over all features, so that hopefully the globally optimal feature subset is found.

³ A manifold is the mathematical generalization of a surface or a curve in 3D space: something smooth that can be mapped from a lower-dimensional space.

⁴ The optimization is a free interpretation of [Guo, 2008].
⁵ Principal component analysis reduces the dimensions of the input variables by taking only the directions of the largest eigenvalues.


4.2 Logistic regression using model complexity regularisation

In the paper Gene selection using logistic regressions based on AIC, BIC and MDL criteria [Zhou et al., 2005] by Zhou, Wang, and Dougherty, the authors describe the problem of classifying the gene expressions that determine whether a tumor is part of a certain class (think malignant versus benign). Since the feature vectors are huge (≈ 21'000 genes/dimensions in many expressions), so that the chance of overfitting is high, and since the domain requires an interpretable result, they discuss feature selection.

For this, they choose an embedded method, namely a regularised form of logistic regression, which we will describe in detail here:

Logistic regression can be understood as fitting

pw(x) = 1 / (1 + e^(−w·x)) = sigm(w · x)

with regard to the separation direction w, so that the confidence, or in other words the probability pw*(xdata), is maximal.

This corresponds to the assumption that the probability of each class is p(c | w) = Ber(c | sigm(w · x)), and it can easily be extended to incorporate some prior on w: p(c, w) = Ber(c | sigm(w · x)) p(w) [Murphy, 2012, p. 245].
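Written out, the resulting MAP objective is the penalised log-likelihood (a standard textbook form, assuming i.i.d. data points (xi, ci)):

    \hat{w}_{\mathrm{MAP}}
      = \arg\max_w \Big[ \sum_i \log \mathrm{Ber}\big(c_i \mid \mathrm{sigm}(w \cdot x_i)\big)
                         + \log p(w) \Big]

The criteria below differ only in the choice of the prior term log p(w).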

The paper discusses the priors of the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL):

AIC The Akaike information criterion penalizes degrees of freedom, the ℓ0 norm, so that the function optimized in the model is log L(w) − ℓ0(w). This corresponds to an exponential distribution for p(w) ∝ exp(−ℓ0(w)). This can be interpreted as minimizing the variance of the models, since the variance grows exponentially in the number of parameters.

BIC The Bayesian information criterion is similar, but takes the number of datapoints N into account: p(w) ∝ N^(−ℓ0(w)/2). This has an intuitive interpretation if we assume that a variable is either ignored, in which case the specific value does not matter, or taken into account, in which case the value influences the model. If we assume a uniform distribution on all such models, the ones that ignore variables become more probable, because they accumulate the probability weight of all possible values (a small numerical sketch of the AIC and BIC scores follows after these criteria).

MDL The minimum description length is related to algorithmic probability and states that the space necessary to store the descriptor gives the best heuristic for how complex the model is. This only implicitly causes variable selection. The approximation for this value can be seen in the paper itself.
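To make the AIC and BIC scores concrete, here is a minimal sketch (scikit-learn is assumed; the subset argument is a hypothetical list of selected variable indices, and a large C stands in for an unpenalised maximum-likelihood fit):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aic_bic(X, y, subset):
        """Fit a logistic regression on a feature subset and return (AIC, BIC);
        lower values indicate a better fit/complexity trade-off."""
        Xs = X[:, subset]
        model = LogisticRegression(C=1e6).fit(Xs, y)        # ~ unregularised fit
        p = np.clip(model.predict_proba(Xs)[:, 1], 1e-12, 1 - 1e-12)
        log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        k = len(subset) + 1                                  # weights plus intercept
        n = len(y)
        return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

The subset with the lowest score wins, which penalises large models even when they fit the training data slightly better.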

Since the fitting is computationally expensive, the authors start with a simple ranking on the variables to discard all but the best 5'000. They then repeatedly fit the respective models and collect the number of appearances of the variables to rank the best 5, 10, or 15 genes. This step can be seen as an additional ranking step, but it seems unnecessary, since the fitted model by construction would already have selected the best model. Even so, they still manage to avoid overfitting and to find a viable subset of discriminative variables.


Figure 3: An autoencoder network: the input (2000 units) is passed through successively smaller layers (1000, 500) to a 30-unit bottleneck, from which the mirrored decoder produces the reconstruction.

4.3 Autoencoders as feature creation

Autoencoders are deep neural networks⁶ that find a fitting information bottleneck (see section 3) by optimizing for the reconstruction of the signal using the inverse transformation⁷.
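A minimal PyTorch sketch of such a network (the layer sizes, the plain mean-squared-error loss, and the training loop are illustrative and not those of the cited work):

    import torch
    from torch import nn

    class Autoencoder(nn.Module):
        """The encoder squeezes the input through a low-dimensional bottleneck;
        the decoder tries to reconstruct the original signal from it."""
        def __init__(self, d_in=2000, d_bottleneck=30):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(d_in, 500), nn.ReLU(),
                nn.Linear(500, d_bottleneck))
            self.decoder = nn.Sequential(
                nn.Linear(d_bottleneck, 500), nn.ReLU(),
                nn.Linear(500, d_in))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(64, 2000)                  # hypothetical batch of inputs

    for _ in range(100):                      # minimise the reconstruction error
        optimiser.zero_grad()
        loss = nn.functional.mse_loss(model(x), x)
        loss.backward()
        optimiser.step()

    features = model.encoder(x)               # bottleneck values = created features

The bottleneck activations play the role of the created features; everything else exists only to force them to retain as much of the input as possible.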

Deep networks are difficult to train, since they show many local minima, many of which show poor performance [Murphy, 2012, p. 1000]. To get around this, Hinton and Salakhutdinov [Hinton and Salakhutdinov, 2006] propose pre-training the model as stacked restricted Boltzmann machines before applying a global optimisation like stochastic gradient descent.

Restricted Boltzmann machines are easy to train and can be understood as learning a probability distribution of the layer below. Stacking them means extracting probable distributions of features, somewhat similar to a distribution of histograms such as HoG or SIFT being representative of the visual form of an object.

It has long been speculated that only low-level features could be captured by such a setup, but [Le et al., 2011] show that, given enough resources, an autoencoder can learn high-level concepts like recognizing a cat face without any supervision, trained on roughly 10 million unlabeled images with a model of about one billion connections.

The impressive result beats the state of the art in supervised learning by adding a simple logistic regression on top of the bottleneck layer. This implies that the features learned by the network capture the concepts present in the image better than SIFT visual bag-of-words or other human-created features, and that it can learn a variety of concepts in parallel. Further, since the result of the single best neuron is already very discriminative, it gives evidence for the possibility of a grandmother neuron in the human brain – a neuron that recognizes exactly one object, in this case the grandmother. Using this single feature would also take feature selection to the extreme, but without the benefit of being more computationally advantageous.

⁶ Deep means multiple hidden layers.
⁷ A truly inverse transformation is of course not possible.

5 Discussion and outlook

Many of the concepts presented in [Guyon and Elisseeff, 2003] still apply; however, the examples fall short on statistical justification. Since then, applications for variable and feature selection and feature creation have been developed, some of which were driven by advances in computing power, such as high-level feature extraction with autoencoders, while others were motivated by integrating prior assumptions about the sparsity of the model, such as the usage of probabilistic principal component analysis for shape reconstruction.

The goals of variable and feature selection – avoiding overfitting, interpretability, and computational efficiency – are in our opinion problems best tackled by integrating them into the models learned by the classifier, and we expect the embedded approach to be best suited to ensure an optimal treatment of them. Since many popular and efficient classifiers, such as support vector machines, linear regression, and neural networks, can be extended to incorporate such constraints with relative ease, we expect the usage of ranking, filtering, and wrapping to be more of a pragmatic first step before sophisticated learners for sparse models are employed. Advances in embedded approaches will make the performance and accuracy advantages stand out even more.

Feature creation too has seen advances, especially in efficient generalisations of the principal component analysis algorithm, such as kernel PCA (1998) and supervised extensions. They predominantly rely on the Bayesian formulation of the PCA problem, and we expect this to drive more innovation in the field, as can be seen by the spin-off of reconstructing a shape from 2D images using a Bayesian network as discussed in [Torresani et al., 2008].

References

[Argyriou et al., 2008] Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3):243–272.

[Guo, 2008] Guo, Y. (2008). Supervised exponential family principal component analysis via convex optimization. In Advances in Neural Information Processing Systems, pages 569–576.

[Guyon and Elisseeff, 2003] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Le et al., 2011] Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. (2011). Building high-level features using large scale unsupervised learning. arXiv preprint arXiv:1112.6209.


[Murphy, 2012] Murphy, K. P. (2012). Machine learning: a probabilistic perspective. The MIT Press.

[Russell et al., 1995] Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and Edwards, D. D. (1995). Artificial intelligence: a modern approach, volume 74. Prentice Hall, Englewood Cliffs.

[Torresani et al., 2008] Torresani, L., Hertzmann, A., and Bregler, C. (2008). Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(5):878–892.

[Zhang et al., 2011] Zhang, X., Yu, Y., White, M., Huang, R., and Schuurmans, D. (2011). Convex sparse coding, subspace learning, and semi-supervised extensions. In AAAI.

[Zhou et al., 2005] Zhou, X., Wang, X., and Dougherty, E. R. (2005). Gene selection using logistic regressions based on AIC, BIC and MDL criteria. New Mathematics and Natural Computation, 1(01):129–145.
