Feature Selection in Machine Learning
Data Analytics & Machine Learning - MCS4102
Assignment 2
Feature Selection with the Trajan Simulator
U.V. Vandebona
Content
• Feature Selection
• Dataset 1 - Iris Dataset
  • Forward Selection
  • Backward Selection
  • Genetic Algorithm
• Dataset 2 - Abalone Dataset
• Dataset 3 - Custom Dataset
Data Set (1) - Iris
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris flower.
Data Set (1) - Iris
Attribute Information (all measurements in centimeters):
› Sepal length
› Sepal width
› Petal length
› Petal width
› Flower class
Ex:
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
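As an aside, this comma-delimited format can be read with nothing more than Python's standard csv module. A minimal sketch, using the six sample records above; the variable names are our own choice, since the file itself has none:

```python
import csv
from io import StringIO

# Six sample records from the slide, in the raw comma-delimited format
# (no header row, no case names).
raw = """5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica"""

# Supply variable names ourselves, as the dataset includes none.
names = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Class"]
rows = [dict(zip(names, r)) for r in csv.reader(StringIO(raw))]

print(rows[0]["Class"])               # Iris-setosa
print(float(rows[2]["PetalLength"]))  # 4.7
```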
Iris Class Features
1. One of the classes (Iris Setosa) is linearly separable from the other two. However, the other two classes are not linearly separable.
2. There is some overlap between the Versicolor and Virginica classes, so it is impossible to achieve a perfect classification rate.
3. There is some redundancy in the four input variables, so it is possible to achieve a good solution with only three of them, or even (with difficulty) with two.
Import and Setup Data
Import and Setup Data
1. The Iris dataset is a simple file whose values are delimited by commas.
2. The dataset doesn't include any variable names or case names.
3. We can edit the dataset to give the variables proper names.
› The Class field is automatically turned into a nominal field, as it contains only three nominal values.
Feature Selection Analysis
From the available variables, set the dependent and independent (output and input) variables.
Dependent variable: Class
Independent variables: Sepal Length, Sepal Width, Petal Length, Petal Width
Why Feature Selection?
The dependent (output) variable states which flower class the record belongs to: either Virginica, Versicolor or Setosa.
The independent (input) variables are used to predict that decision.
Typically we do not have a strong idea of the relationship between the available variables and the desired prediction.
Why Feature Selection?
To an extent, some neural network architectures (e.g., multilayer perceptrons) can actually learn to ignore useless variables.
However, other architectures (e.g., radial basis functions) are adversely affected, and in all cases a larger number of inputs implies that a larger number of training cases is required.
As a rule of thumb, the number of training cases should be a good few times larger than the number of weights in the network, to prevent over-learning.
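To make the rule of thumb concrete, here is a quick back-of-the-envelope count for a hypothetical 4-6-3 multilayer perceptron on the Iris data (the layer sizes and the 5x factor are illustrative assumptions, not figures from this assignment):

```python
# Hypothetical MLP for the Iris problem: 4 inputs, 6 hidden units, 3 outputs.
inputs, hidden, outputs = 4, 6, 3

# Count the weights, including one bias per hidden and per output unit.
weights = (inputs + 1) * hidden + (hidden + 1) * outputs
print(weights)      # 51

# "A good few times" more training cases than weights; taking 5x as an
# illustrative factor:
print(5 * weights)  # 255 -- already more than the 150 cases Iris provides
```

This shows why fewer inputs help: each input removed shrinks the network and, with it, the number of training cases needed.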
Why Feature Selection?
As a consequence, the performance of a network can be improved by reducing the number of inputs, sometimes even at the cost of losing some input information.
› In many problem domains, a range of input variables is available which may be used to train a neural network, but it is not clear which of them are most useful, or indeed are needed at all.
Why Feature Selection?
In non-linear problems, there may be interdependencies and redundancies between variables;
› for example, a pair of variables may be of no value individually, but extremely useful in conjunction, or any one of a set of parameters may be useful.
› It is not possible, in general, to simply rank parameters in order of importance.
Why Feature Selection?
The "curse of dimensionality" means that it is sometimes actually better to discard some variables that do have genuine information content, simply to reduce the total number of input variables, and therefore the complexity of the problem and the size of the network.
Counter-intuitively, this can actually improve the network's generalization capabilities.
Why Feature Selection?
The only method that is guaranteed to select the best input set is to train networks with all possible input sets and all possible architectures, and to select the best.
› In practice, this is impossible for any significant number of candidate inputs.
If you wish to examine the selection of variables more closely yourself, Feature Selection is a good technique.
Feature Selection
The Feature Selection algorithms conduct a large number of experiments with different combinations of inputs, building probabilistic or generalized regression networks for each combination, evaluating the performance, and using this to further guide the search.
This is essentially a "brute force" technique, although the guided search may find good combinations much faster than exhaustive testing.
Feature Selection
These algorithms explicitly identify input variables that do not contribute significantly to the performance of the networks, and then suggest removing them.
The algorithms are either stepwise algorithms that progressively add or remove variables, or genetic algorithms.
Sampling - Random
Randomized subset assignment to the train, select and test subsets.
Sampling - Fixed
Fixed subset assignment to the train, select and test subsets.
› Add a column containing the nominal values "Train", "Select", "Test" and "Ignore". To generate the values, the support of a spreadsheet package may be needed. Name the column NNSET.
Sampling - Fixed
Run the Feature Selection once, and these subsets will be assigned thereafter.
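The randomized assignment can also be sketched in code: the function below labels each case Train, Select or Test, much like the NNSET column described above (the 50/25/25 proportions are an assumption for illustration, not Trajan's defaults):

```python
import random

random.seed(0)  # fixed seed so the assignment is reproducible

def assign_nnset(n_cases, p_train=0.5, p_select=0.25):
    """Randomly assign each case to the Train, Select or Test subset."""
    labels = []
    for _ in range(n_cases):
        r = random.random()
        if r < p_train:
            labels.append("Train")
        elif r < p_train + p_select:
            labels.append("Select")
        else:
            labels.append("Test")
    return labels

nnset = assign_nnset(150)  # e.g. the 150 Iris cases
print(nnset[:5])
```

Writing such labels out as an extra column reproduces the fixed-subset workflow with a spreadsheet, while regenerating them each run mimics the random option.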
Sampling
A major problem with neural networks is the generalization issue (the tendency to overfit the training data), accompanied by the difficulty in quantifying likely performance on new data.
It is important to have ways to estimate the performance of the models on new data, and to be able to select among them.
Most work on assessing performance in neural modeling concentrates on approaches to resampling.
Sampling
Typically the neural network is trained using a training subset.
The test subset is used to perform an unbiased estimation of the network's likely performance.
Sampling
Often, a separate subset (the selection subset) is used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. It keeps an independent check on the performance of the networks during training, with deterioration in the selection error indicating over-learning.
If over-learning occurs, training is stopped and the network is restored to the state with the minimum selection error.
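This stop-and-restore behaviour amounts to early stopping on the selection error. A minimal sketch, where `train_epoch` and `selection_error` are hypothetical stand-ins for one training pass and the selection-subset evaluation (the patience threshold is an illustrative assumption):

```python
def train_with_early_stopping(train_epoch, selection_error, max_epochs, patience=5):
    """Stop when the selection error has not improved for `patience` epochs,
    and return the state that achieved the minimum selection error."""
    best_err, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        state = train_epoch(epoch)      # one training pass; returns network state
        err = selection_error(state)    # error on the selection subset
        if err < best_err:
            best_err, best_state, since_best = err, state, 0
        else:
            since_best += 1             # selection error deteriorating
            if since_best >= patience:  # over-learning: stop training
                break
    return best_state, best_err         # i.e., restore the best state found

# Demo with a fake error curve that falls, then rises (over-learning):
errs = [5, 4, 3, 2, 3, 4, 5, 6, 7, 8, 9, 10]
state, err = train_with_early_stopping(lambda e: e, lambda s: errs[s], 12)
print(state, err)  # 3 2 -- the epoch at the minimum, not the last one
```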
Feature Selection – Results Configuration
Feature Selection – Results Configuration
6. In the results shown after the analysis, each row represents a particular test of a combination of inputs; together, the rows cover every combination of inputs tested.
7. It is sometimes a good idea to reduce the number of input variables to a network even at the cost of a little performance, as this improves the generalization capability and decreases the network's size and execution time.
Feature Selection – Results Configuration
You can apply some extra pressure to eliminate unwanted variables by assigning a Unit Penalty.
› This is multiplied by the number of units in the network and added to the error level in assessing how good a network is, and thus penalizes larger networks.
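A sketch of how such a penalty combines with the error; the unit counts and penalty value here are made-up illustrations, not Trajan's internals:

```python
def penalized_score(selection_error, n_units, unit_penalty=0.001):
    """Raw selection error plus penalty * number of units, so larger
    networks score worse (lower scores are better)."""
    return selection_error + unit_penalty * n_units

# Two hypothetical candidate networks: the smaller one wins once the
# penalty is applied, despite a slightly higher raw error.
small = penalized_score(0.050, n_units=10)  # 0.050 + 0.010 = 0.060
large = penalized_score(0.045, n_units=40)  # 0.045 + 0.040 = 0.085
print(small < large)  # True
```

Raising `unit_penalty` applies more pressure toward small input sets; setting it to zero ranks purely by error.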
Feature Selection – Results Configuration
If there are a large number of cases, the evaluations performed by the feature selection algorithms can be very time-consuming (the time taken is proportional to the number of cases).
› For this reason, you can specify a sub-sampling rate. (However, as we have very few cases here, the default sampling rate of 1.0 is fine.)
Forward Selection
Begins by locating the single input variable that, on its own, best predicts the output variable.
It then checks for the second variable that, added to the first, most improves the model.
The process repeats until either all variables have been selected or no further improvement is made.
Good for a larger number of variables.
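The stepwise procedure can be sketched as a greedy loop over candidate subsets. Here `score` stands in for training and evaluating a network on a subset (lower is better); the toy scoring function, in which only the petal measurements are informative, is an assumption for demonstration:

```python
def forward_selection(features, score):
    """Greedily add the feature that most reduces the score; stop when
    no addition improves on the current best."""
    selected, best_err = [], float("inf")
    while len(selected) < len(features):
        remaining = [f for f in features if f not in selected]
        # Try adding each remaining feature to the current set.
        trial_errs = {f: score(selected + [f]) for f in remaining}
        best_f = min(trial_errs, key=trial_errs.get)
        if trial_errs[best_f] >= best_err:
            break                      # no further improvement: stop
        selected.append(best_f)
        best_err = trial_errs[best_f]
    return selected, best_err

# Toy score: pretend only the petal measurements carry information,
# so the error drops by 0.4 for each one included.
def toy_score(subset):
    informative = {"PetalWidth", "PetalLength"}
    return 1.0 - 0.4 * len(informative & set(subset))

feats = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
selected, err = forward_selection(feats, toy_score)
print(selected, err)
```

On this toy score the search picks the two petal features and then halts, since adding either sepal feature leaves the error unchanged.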
Forward Selection
Generally faster. Much faster if there are few relevant variables, as it will locate them at the beginning of its search.
Behaves sensibly when the data set has a large number of variables, as it selects variables early on.
It may miss key variables if they are interdependent (that is, where two or more variables must be added at the same time in order to improve the model).
Results
The row label indicates the stage (e.g., 2.3 indicates the third test in stage 2). The final row replicates the best result found, for convenience. The first column is the selection error of the Probabilistic Neural Network (PNN) or Generalized Regression Neural Network (GRNN). Subsequent columns indicate which inputs were selected for that particular combination.
Results
Penalty values tested: 0, 0.001, 0.002, 0.003, 0.005 and 0.012.
Conclusion: Considering the span of the error values across these penalty settings, Petal Width and Petal Length are good features to keep if the number of input features needs to be reduced.
Backward Selection
The reverse process: starts with a model including all the variables, and then removes them one at a time.
At each stage, it finds the variable that, when removed, least degrades the model.
Good for a smaller number of variables (20 or fewer).
Backward Selection
Doesn't suffer from missing key variables.
As it starts with the whole set of variables, the initial evaluations are the most time-consuming.
Suffers with a large number of variables, especially if there are only a few weakly predictive ones in the set.
It does not cut out the irrelevant variables until the very end of its search.
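Backward elimination can be sketched the same way as forward selection: start from the full set and repeatedly drop the feature whose removal least degrades the score. As before, `score` and the toy scoring function are stand-ins assumed for demonstration:

```python
def backward_selection(features, score):
    """Greedily remove the feature whose removal least degrades the
    score; stop once every removal would make the score worse."""
    selected = list(features)
    best_err = score(selected)
    while len(selected) > 1:
        # Score the set with each feature removed in turn.
        trial_errs = {f: score([g for g in selected if g != f])
                      for f in selected}
        drop_f = min(trial_errs, key=trial_errs.get)
        if trial_errs[drop_f] > best_err:
            break                      # every removal hurts: stop
        selected.remove(drop_f)
        best_err = trial_errs[drop_f]
    return selected, best_err

# Toy score: only the petal measurements carry information.
def toy_score(subset):
    informative = {"PetalWidth", "PetalLength"}
    return 1.0 - 0.4 * len(informative & set(subset))

feats = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
selected, err = backward_selection(feats, toy_score)
print(selected, err)
```

Both sepal features are dropped without cost, and the search stops when removing either petal feature would raise the error.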
Results
Penalty values tested: 0, 0.001, 0.002, 0.003, 0.004 and 0.012.
Conclusion: Considering the span of the error values across these penalty settings, Petal Width, Petal Length and Sepal Length are good features to keep if the number of input features needs to be reduced.
Genetic Algorithm
An optimization algorithm.
Genetic algorithms are a particularly effective search technique for combinatorial problems (where a set of interrelated yes/no decisions needs to be made).
The method is time-consuming (it typically requires building and testing many thousands of networks).
Genetic Algorithm
For reasonably-sized problem domains (perhaps 50-100 possible input variables, and cases numbering in the low thousands), the algorithm can be run effectively overnight or over a weekend on a fast PC.
With sub-sampling, it can be applied in minutes or hours, although at the cost of reduced reliability for very large numbers of variables.
Genetic Algorithm
Run with the default settings, it would perform 10,000 evaluations (a population of 100 times 100 generations).
Since our problem has only 4 candidate inputs, the total number of possible combinations is only 16 (2 raised to the 4th power).
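The yes/no decision per input maps naturally onto a bit mask, which is what a genetic algorithm evolves. A minimal sketch; the toy fitness function and the small population/generation counts are assumptions for illustration, not Trajan's defaults (which, per the above, are a population of 100 over 100 generations):

```python
import random

random.seed(1)  # fixed seed for reproducibility
FEATS = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

def fitness(mask):
    # Toy score (lower is better): only the petal features (indices 2, 3)
    # are treated as informative, with a small penalty per selected input.
    chosen = {i for i, bit in enumerate(mask) if bit}
    return 1.0 - 0.4 * len({2, 3} & chosen) + 0.01 * len(chosen)

def evolve(pop_size=20, generations=30, p_mut=0.1):
    """Evolve bit masks over the candidate inputs by selection,
    one-point crossover and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in FEATS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]             # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(FEATS))  # one-point crossover
            child = [bit ^ (random.random() < p_mut)  # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children                   # elitist replacement
    return min(pop, key=fitness)

best = evolve()
print([f for f, bit in zip(FEATS, best) if bit])
```

With elitist replacement the best mask never worsens between generations. On this 4-input problem the search space is only 16 masks, so exhaustive evaluation would of course be simpler; the approach pays off with 50-100 candidate inputs.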
Results
Penalty values tested: 0, 0.003, 0.004 and 0.012.
Conclusion: Considering the span of the error values across these penalty settings, Petal Width, Petal Length and Sepal Length are good features to keep if the number of input features needs to be reduced.
Data Set (2) - Abalone
The age of an abalone can be determined by counting the number of rings on its shell.
The number of rings is the value to predict from physical measurements.
Data Set (2) - Abalone

| Attribute Name | Data Type | Measurement Unit | Description |
| --- | --- | --- | --- |
| Sex | nominal | | M, F, and I (infant) |
| Length | continuous | mm | longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | | +1.5 gives the age in years |
Results - Forward Selection
Conclusion: Sex, Whole Weight and Shell Weight are good features to keep if the number of input features needs to be reduced.
Height doesn't give any useful effect.
High penalty: 0.001
Low penalty: 0.0001
Results - Backward Selection
Conclusion: Sex, Whole weight and Shell weight are good features to keep if the number of input features needs to be reduced.
Height doesn't give any useful effect.
High penalty: 0.001
Low penalty: 0.0001
Results - Genetic Algorithm
Conclusion: Sex, Whole weight and Shell weight are good features to keep if the number of input features needs to be reduced.
Height doesn't give any useful effect.
Sampling rate: 0.1
High penalty: 0.001
Low penalty: 0.0001
Data Set (3) - Custom
4 Classes: C1, C2, C3, C4
17 Attribute Features: F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15, F17, F18
Forward Selection Results
High penalty: 0.001
Low penalty: 0.0001
Conclusion: Feature F14 is a good feature to keep if the number of input features needs to be reduced.
Features F2, F5, F6, F7, F9 and F11 don't give any useful effect.
Backward Selection Results
High penalty: 0.001
Low penalty: 0.0001
Conclusion: Feature F14 is a good feature to keep if the number of input features needs to be reduced.
Features F2, F5, F6, F7, F9 and F11 don't give any useful effect.
Genetic Algorithm Results
High penalty: 0.001
Low penalty: 0.0001
Conclusion: Feature F14 is a good feature to keep if the number of input features needs to be reduced.
Features F2, F5, F6, F7, F9 and F11 don't give any useful effect.
Reference
• http://archive.ics.uci.edu/ml/datasets/Iris [Online, 2015-10-25]
• http://archive.ics.uci.edu/ml/datasets/Abalone [Online, 2015-10-25]
• Trajan Neural Network Simulator Help