Introduction to variable selection I
Qi Yu
Problems due to poor variable selection:
The input dimension is too large, so the curse of dimensionality problem may arise;
A poor model may be built with additional unrelated inputs or not enough relevant inputs;
Complex models that contain too many inputs are more difficult to understand.
Two broad classes of variable selection methods: filter and wrapper
The filter method is a pre-processing step that is independent of the learning algorithm.
The input subset is chosen by an evaluation criterion, which measures the relation of each subset of input variables with the output.
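A minimal sketch of a filter, assuming the evaluation criterion is the absolute Pearson correlation between each single input and the output (a simplification: the slide speaks of subsets, while this scores variables one at a time). The function name `filter_rank` and the synthetic data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def filter_rank(X, y):
    """Rank inputs by |Pearson correlation| with the output, best first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Correlation of every column of X with y, computed in one pass.
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    scores = np.abs(corr)
    return np.argsort(-scores), scores

# Synthetic data (assumption): only inputs 1 and 4 drive the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] + 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

ranking, scores = filter_rank(X, y)
top2 = sorted(ranking[:2].tolist())
print(top2)
```

No learning algorithm is trained here, which is what makes the approach a filter.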
Two broad classes of variable selection methods: filter and wrapper
In the wrapper method, the learning model is used as a part of the evaluation function and also to induce the final learning model.
The parameters of the model are optimized by measuring some cost function.
Finally, the set of inputs can be selected using leave-one-out (LOO), bootstrap or other re-sampling techniques.
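A minimal sketch of a wrapper, assuming a least-squares linear model as the learning model and the leave-one-out (LOO) squared error as the cost function. The exhaustive search over subsets is only practical for a handful of inputs; the names `loo_error` and `wrapper_select` and the synthetic data are illustrative assumptions.

```python
import itertools
import numpy as np

def loo_error(X, y):
    """Leave-one-out mean squared error of a least-squares fit (no intercept)."""
    m = len(y)
    errors = []
    for k in range(m):
        keep = np.arange(m) != k
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors.append((X[k] @ coef - y[k]) ** 2)
    return float(np.mean(errors))

def wrapper_select(X, y):
    """Return the non-empty input subset with the lowest LOO error."""
    n = X.shape[1]
    best_subset, best_err = None, np.inf
    for size in range(1, n + 1):
        for subset in itertools.combinations(range(n), size):
            err = loo_error(X[:, list(subset)], y)
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err

# Synthetic data (assumption): only inputs 0 and 2 drive the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=30)

best_subset, best_err = wrapper_select(X, y)
print(best_subset, best_err)
```

Every candidate subset requires refitting the model once per held-out point, which illustrates why wrappers can be very time consuming.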
Comparison of filter and wrapper:
The wrapper method tries to solve the real problem, hence the criterion can really be optimized; but it is potentially very time consuming, since it typically needs to evaluate a cross-validation scheme at every iteration.
The filter method is much faster, but it does not incorporate learning.
Embedded methods
In contrast to filter and wrapper approaches, in embedded methods the feature selection part cannot be separated from the learning part.
The structure of the class of functions under consideration plays a crucial role.
Existing embedded methods are reviewed based on a unifying mathematical framework.
Embedded methods
Forward-Backward Methods
Optimization of scaling factors
Sparsity term
Forward-Backward Methods
Forward selection methods: these methods start with one or a few features selected according to a method-specific selection criterion. More features are iteratively added until a stopping criterion is met.
Backward elimination methods: methods of this type start with all features and iteratively remove one feature or groups of features.
Nested methods: during an iteration features can be added as well as removed from the data.
Forward selection
Forward selection with Least squares
Grafting
Decision trees
Forward selection with Least squares
1. Start with $Y = (y_1, \dots, y_m)^T$ and $S = \emptyset$.
2. Find the component $i$ such that $\|P_{\{i\}} Y\|^2$ is minimal.
3. Add $i$ to $S$.
4. Recompute the residuals $Y$ with $P_S Y$.
5. Stop or go back to 2.
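The five steps can be sketched as follows. This is a hedged illustration that assumes $P_S$ denotes the projection onto the orthogonal complement of the selected feature columns (so $P_S Y$ is the least-squares residual), and that the stopping rule is a fixed number of features; `forward_selection_ls` and the synthetic data are hypothetical names and values.

```python
import numpy as np

def forward_selection_ls(X, y, n_select):
    """Greedy forward selection driven by least-squares residuals."""
    n = X.shape[1]
    selected = []                       # step 1: S is empty
    residual = y.astype(float).copy()   # step 1: Y holds the targets
    for _ in range(n_select):           # step 5: stop after n_select features
        best_i, best_norm = None, np.inf
        for i in range(n):              # step 2: try each unused feature
            if i in selected:
                continue
            xi = X[:, i]
            # Residual of Y after projecting out feature i alone.
            r = residual - xi * (xi @ residual) / (xi @ xi)
            norm = r @ r
            if norm < best_norm:
                best_i, best_norm = i, norm
        selected.append(best_i)         # step 3: add i to S
        # Step 4: residuals of the least-squares fit on all selected features.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected

# Synthetic data (assumption): only features 0 and 3 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)

selected = forward_selection_ls(X, y, n_select=2)
print(selected)
```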
Grafting
For fixed $\lambda_0, \lambda_1, \lambda_2 \geq 0$, Perkins suggested minimizing the function:
$$C(\theta) = \frac{1}{m} \sum_{k=1}^{m} L(f(\theta, x_k), y_k) + \lambda_2 \|\theta\|_2 + \lambda_1 \|\theta\|_1 + \lambda_0 l_0(\theta)$$
over the set of parameters $\theta$ which defines $F = \{ f(\theta, \cdot) \mid \theta \in \Theta \}$.
To solve this in a forward way:
In every iteration the working set of parameters is extended by one and the newly obtained objective function is minimized over the enlarged working set.
The selection criterion for new parameters is $|\partial C / \partial \theta_i|$.
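A simplified sketch of the forward procedure, under several assumptions that are not in the slides: a linear model with squared loss, only the $\ell_1$ penalty active (the other two weights set to zero), the new parameter chosen by the largest gradient magnitude of the loss, and the inner minimization done by proximal gradient steps. The names `grafting_l1` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grafting_l1(X, y, lam1, n_inner=500):
    """Grow a working set one parameter at a time (l1-only variant)."""
    m, n = X.shape
    theta = np.zeros(n)
    active = []
    while len(active) < n:
        # Gradient of the loss term (1/m) * ||X theta - y||^2.
        grad = 2.0 / m * X.T @ (X @ theta - y)
        inactive = [j for j in range(n) if j not in active]
        j = max(inactive, key=lambda idx: abs(grad[idx]))
        # If the gradient does not exceed the l1 weight, adding the
        # parameter cannot decrease the objective, so stop.
        if abs(grad[j]) <= lam1:
            break
        active.append(j)
        # Minimize the objective over the enlarged working set.
        XA = X[:, active]
        step = m / (2.0 * np.linalg.norm(XA, 2) ** 2)
        tA = theta[active]
        for _ in range(n_inner):
            gA = 2.0 / m * XA.T @ (XA @ tA - y)
            tA = soft_threshold(tA - step * gA, step * lam1)
        theta[active] = tA
    return theta, active

# Synthetic data (assumption): only parameters 2 and 5 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.05 * rng.normal(size=100)

theta, active = grafting_l1(X, y, lam1=0.5)
print(active, theta)
```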
Decision trees
Decision trees are iteratively built by splitting the data depending on the value of a specific feature.
A widely used criterion for the importance of a feature is the mutual information between feature $i$ and the outputs $Y$:
$$MI(X_i, Y) = H(Y) - H(Y \mid X_i)$$
where $H$ is the entropy and $X_i = (x_{1,i}, \dots, x_{m,i})$.
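A small sketch of this criterion for discrete features and outputs, computing the entropy of the outputs minus their conditional entropy given the feature, in bits. The function names and the toy arrays are illustrative assumptions.

```python
import numpy as np

def entropy(values):
    """Empirical entropy (in bits) of a discrete sample."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(y, x):
    """Empirical H(Y | X): entropy of y within each value of x, weighted."""
    h = 0.0
    for v in np.unique(x):
        mask = x == v
        h += mask.mean() * entropy(y[mask])
    return h

def mutual_information(x, y):
    """MI(X, Y) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(y, x)

y = np.array([0, 0, 1, 1])
x_informative = np.array([0, 0, 1, 1])   # determines y completely
x_irrelevant = np.array([0, 1, 0, 1])    # independent of y in this sample

mi_informative = mutual_information(x_informative, y)
mi_irrelevant = mutual_information(x_irrelevant, y)
print(mi_informative, mi_irrelevant)
```

A tree builder would split on the feature with the largest value of this score.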
Backward Elimination
Recursive Feature Elimination (RFE), given that one wishes to employ only $\sigma_0 < n$ input dimensions in the final decision rule, attempts to find the best subset of size $\sigma_0$ by a kind of greedy backward selection.
Algorithm of RFE in the linear case:
1: repeat
2: Find $w$ and $b$ by training a linear SVM.
3: Remove the feature with the smallest value $|w_i|$.
4: until $\sigma_0$ features remain.
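The loop can be sketched as below. To keep the example self-contained, the linear SVM is trained with plain full-batch subgradient descent on the regularized hinge loss, which is an assumption made for illustration and not a method named in the slides; `train_linear_svm`, `rfe`, and their hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Subgradient descent on (lam/2)*||w||^2 + mean hinge loss."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1            # points with non-zero hinge loss
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / m
        grad_b = -y[viol].sum() / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def rfe(X, y, n_keep):
    """Remove the feature with the smallest |w_i| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w, _ = train_linear_svm(X[:, remaining], y)
        worst = int(np.argmin(np.abs(w)))
        del remaining[worst]
    return remaining

# Synthetic data (assumption): the label depends on features 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)

kept = rfe(X, y, n_keep=2)
print(kept)
```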
Optimization of scaling factors
Scaling Factors for SVM
Automatic Relevance Determination
Variable Scaling: Extension to Maximum Entropy Discrimination
Scaling Factors for SVM
Feature selection is performed by scaling the input parameters by a vector $\sigma \in [0,1]^n$. Larger values of $\sigma_i$ indicate more useful features.
Thus the problem is now one of choosing the best kernel of the form:
$$k_\sigma(x, x') = k(\sigma * x, \sigma * x')$$
We wish to find the optimal parameters $\sigma$, which can be optimized by many criteria, e.g. gradient descent on the $R^2W^2$ bound, span bound or a validation error.
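A minimal sketch of such a scaled kernel, assuming the scaling is an elementwise product of the scaling vector with each input and using an RBF kernel as the base kernel (the choice of base kernel is an assumption). Optimizing the scaling factors themselves is not shown.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Base kernel: exp(-gamma * ||x - z||^2)."""
    diff = x - z
    return float(np.exp(-gamma * (diff @ diff)))

def scaled_kernel(x, z, sigma, base_kernel=rbf_kernel):
    """Kernel evaluated on inputs scaled elementwise by sigma."""
    return base_kernel(sigma * x, sigma * z)

x = np.array([1.0, 2.0])
z = np.array([1.0, 5.0])   # differs from x only in the second feature

k_full = scaled_kernel(x, z, np.array([1.0, 1.0]))   # both features used
k_drop = scaled_kernel(x, z, np.array([1.0, 0.0]))   # second feature switched off
print(k_full, k_drop)
```

Setting a scaling factor to zero makes the kernel blind to that feature, which is how the scaling vector performs selection.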
Automatic Relevance Determination
In a probabilistic framework, a model of the likelihood of the data, $P(y \mid w)$, is chosen as well as a prior on the weight vector, $P(w)$.
To predict the output of a test point $x$, the average of $f_w(x)$ over the posterior distribution $P(w \mid y)$ is computed; a common approximation is to predict with the function $f_{w_{MAP}}$, where $w_{MAP}$ is the vector of parameters called the Maximum a Posteriori (MAP), i.e.
$$w_{MAP} = \arg\max_w P(w \mid y) = \arg\min_w \; -\log P(y \mid w) - \log P(w)$$
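As a worked illustration of the MAP estimate, the sketch below assumes a linear model with Gaussian noise and a zero-mean Gaussian prior with one precision per weight (the vector `alpha`, a hypothetical name). Under these assumptions the negative log posterior is quadratic and its minimizer has a closed form; how the precisions would be learned is not shown.

```python
import numpy as np

def neg_log_posterior(w, X, y, alpha, noise_var=1.0):
    """-log P(y|w) - log P(w), up to constants, for the Gaussian case."""
    residual = y - X @ w
    return float(residual @ residual / (2.0 * noise_var)
                 + 0.5 * np.sum(alpha * w ** 2))

def map_weights(X, y, alpha, noise_var=1.0):
    """Closed-form minimizer of neg_log_posterior."""
    A = X.T @ X / noise_var + np.diag(alpha)
    return np.linalg.solve(A, X.T @ y / noise_var)

# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# A very large prior precision on the second weight pushes it to zero.
alpha = np.array([1e-6, 1e8])
w_map = map_weights(X, y, alpha)
print(w_map)
```

A weight whose prior precision grows very large is driven toward zero, which effectively switches off the corresponding input.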
Variable Scaling: Extension to Maximum Entropy Discrimination
The Maximum Entropy Discrimination (MED) framework is a probabilistic model in which one does not learn parameters of a model, but distributions over them.
Feature selection can be easily integrated in this framework. For this purpose, one has to specify a prior probability $p_0$ that a feature is active.
Variable Scaling: Extension to Maximum Entropy Discrimination
If $w_i$ is the weight associated with a given feature for a linear model, then the expectation of this weight is modified as follows:
$$\frac{w_i}{1 + \frac{1 - p_0}{p_0} \exp(-w_i^2)}$$
This has the effect of discarding the components for which
$$w_i^2 \ll \log \frac{1 - p_0}{p_0}$$
This algorithm ignores features whose weights are smaller than a threshold.
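A small numeric sketch of this shrinkage, assuming the modified expectation takes the form $w_i / (1 + \frac{1 - p_0}{p_0} \exp(-w_i^2))$ and the threshold on $w_i^2$ is $\log\frac{1 - p_0}{p_0}$. The function names are hypothetical.

```python
import math

def med_expected_weight(w, p0):
    """Weight after the shrinkage induced by the prior probability p0."""
    ratio = (1.0 - p0) / p0
    return w / (1.0 + ratio * math.exp(-w * w))

def med_threshold(p0):
    """Value of w^2 below which a feature is effectively discarded."""
    return math.log((1.0 - p0) / p0)

p0 = 0.001                                  # features are rarely active a priori
small = med_expected_weight(0.1, p0)        # w^2 far below the threshold
large = med_expected_weight(5.0, p0)        # w^2 far above the threshold
print(med_threshold(p0), small, large)
```

Weights far below the threshold are shrunk almost to zero, while weights far above it pass through nearly unchanged.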
Sparsity term
In the case of linear models, indicator variables are not necessary as feature selection can be enforced on the parameters of the model directly.
This is generally done by adding a sparsity term to the objective function that the model minimizes.
Feature Selection as an Optimization Problem
Concave Minimization
Feature Selection as an Optimization Problem
Most linear models that we consider can be understood as the result of the following minimization:
$$\min_{w,b} \frac{1}{m} \sum_{k=1}^{m} L(w \cdot x_k + b, y_k) + C\,\Omega(w)$$
where $L(f(x_k), y_k)$ measures the loss of the function $f(x) = w \cdot x + b$ on the training point $(x_k, y_k)$.
Feature Selection as an Optimization Problem
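Examples of empirical errors are:
1. $\ell_1$ hinge loss: $l_{hinge}(w \cdot x + b, y) = |1 - y(w \cdot x + b)|_+$
2. $\ell_2$ loss: $l_2(w \cdot x + b, y) = (w \cdot x + b - y)^2$
3. Logistic loss: $l_{Logistic}(w \cdot x + b, y) = \log(1 + e^{-y(w \cdot x + b)})$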
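The three losses, and the penalized objective they plug into, can be sketched as follows. The hinge loss is written as the positive part of one minus the margin; using the $\ell_1$ norm as the penalty on the weights is an assumption chosen for illustration, as are the function names and the toy data.

```python
import numpy as np

def hinge_loss(f, y):
    """Positive part of 1 - y*f."""
    return np.maximum(0.0, 1.0 - y * f)

def squared_loss(f, y):
    return (f - y) ** 2

def logistic_loss(f, y):
    return np.log1p(np.exp(-y * f))

def l1_penalty(w):
    return np.sum(np.abs(w))

def objective(w, b, X, y, loss, C, penalty):
    """Mean loss of f(x) = w.x + b on the training points plus C * penalty(w)."""
    f = X @ w + b
    return float(np.mean(loss(f, y)) + C * penalty(w))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([2.0, 0.0])
b = 0.0

value = objective(w, b, X, y, hinge_loss, C=0.5, penalty=l1_penalty)
print(value)
```

Swapping `hinge_loss` for `squared_loss` or `logistic_loss` changes the model family without touching the penalty, which is the point of writing the objective in this generic form.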
Concave Minimization
In the case of linear models, feature selection can be understood as the optimization problem:
$$\min_{w,b} \; l_0(w)$$
For example, Bradley proposed to approximate the function $l_0(w)$ as:
$$l_0(w) \approx \sum_{i=1}^{n} 1 - \exp(-\alpha |w_i|)$$
Weston et al. use a slightly different function. They replace the $l_0$ norm by:
$$l_0(w) \approx \sum_{i=1}^{n} \log(\epsilon + |w_i|)$$
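A short numeric comparison of the exact count of non-zero weights with the two smooth surrogates, assuming an exponential surrogate with a sharpness parameter `alpha` and a logarithmic surrogate with a small offset `eps` (both parameter names and their default values are assumptions).

```python
import numpy as np

def l0_norm(w):
    """Number of non-zero components of w."""
    return int(np.count_nonzero(w))

def bradley_approx(w, alpha=5.0):
    """Sum of 1 - exp(-alpha * |w_i|)."""
    return float(np.sum(1.0 - np.exp(-alpha * np.abs(w))))

def weston_approx(w, eps=1e-3):
    """Sum of log(eps + |w_i|)."""
    return float(np.sum(np.log(eps + np.abs(w))))

w = np.array([0.0, 2.0, -3.0])
exact = l0_norm(w)
approx = bradley_approx(w)

# Two vectors with the same l1 norm: the sparse one scores lower.
sparse_score = weston_approx(np.array([0.0, 2.0]))
dense_score = weston_approx(np.array([1.0, 1.0]))
print(exact, approx, sparse_score, dense_score)
```

Both surrogates are concave in each $|w_i|$, so minimizing them favors weight vectors with exact zeros.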
Summary of embedded methods
Embedded methods are built upon the concept of scaling factors. We discussed embedded methods according to how they approximate the proposed optimization problem:
Explicit removal or addition of features - the scaling factors are optimized over the discrete set $\{0, 1\}^n$ in a greedy iteration;
Optimization of scaling factors over the compact interval $[0, 1]^n$; and
Linear approaches, which directly enforce sparsity of the model parameters.
Thank you!