
Feature Subset Selection using Minimum Cost Spanning Trees

Mike Farah - 18548059
Supervisor: Dr. Sid Ray

Outline

• Introduction
  - Pattern Recognition
  - Feature subset selection
• Current methods
• Proposed method
• IFS
• Results
• Conclusion

Introduction: Pattern Recognition

• The classification of objects into groups by learning from a small sample of objects
• Example - apples and strawberries:
  - Classes: apples and strawberries
  - Features: colour, size, weight, texture
• Applications: character recognition, voice recognition, oil mining, weather prediction, …

Introduction: Pattern Recognition

• Pattern representation
  - Measuring and recording features: size, colour, weight, texture, …
• Feature set reduction
  - Reducing the number of features used
  - Selecting a subset
  - Transformations
• Classification
  - The resulting features are used for classification of unknown objects

Introduction: Feature subset selection

• Can be split into two processes:
  - Feature subset searching: it is not usually feasible to exhaustively try all feature subset combinations, so a search strategy is needed (a simple search wrapper is sketched below)
  - Criterion function: the main issue of feature subset selection (Jain et al. 2000) and the focus of our research
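To make the search/criterion split concrete, the sketch below shows a greedy sequential forward search wrapped around an arbitrary criterion function. This is an illustrative example only (the function names and the use of NumPy arrays are assumptions, not the thesis implementation):

```python
import numpy as np

def forward_selection(X: np.ndarray, y: np.ndarray, criterion, k: int):
    """Greedy sequential forward selection.

    Starts from an empty subset and repeatedly adds the single feature
    whose inclusion maximises the criterion, until k features are chosen.
    Evaluates on the order of F*k subsets instead of all 2^F combinations.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # Score every candidate subset formed by adding one more feature.
        scores = [(criterion(X[:, selected + [f]], y), f) for f in remaining]
        _best_score, best_feature = max(scores)
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```

Any of the criterion functions discussed on the following slides could be passed in as `criterion`.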

Current methods

• Euclidean distance
  - Statistical properties of the classes are not considered
  - $J(x) = \min_{i,j}\{(\mu_i - \mu_j)'(\mu_i - \mu_j)\}$
• Mahalanobis distance
  - Variances and co-variances of the classes are taken into account
  - $J(x) = \min_{i,j}\{(\mu_i - \mu_j)'\,\Sigma_{ij}^{-1}(\mu_i - \mu_j)\}$

A sketch of both criteria follows.
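A minimal sketch of the two criteria above, assuming X and y are NumPy arrays and taking $\Sigma_{ij}$ as the average of the two class covariance matrices (the slides do not spell out this detail, so treat it as an assumption):

```python
import numpy as np

def euclidean_criterion(X: np.ndarray, y: np.ndarray) -> float:
    """Minimum squared Euclidean distance between any two class means."""
    classes = np.unique(y)
    means = [X[y == c].mean(axis=0) for c in classes]
    best = np.inf
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = means[i] - means[j]
            best = min(best, float(d @ d))
    return best

def mahalanobis_criterion(X: np.ndarray, y: np.ndarray) -> float:
    """Minimum pairwise Mahalanobis distance between class means.

    Sigma_ij is approximated here by the average of the two class
    covariance matrices (an assumption made for this sketch).
    """
    classes = np.unique(y)
    groups = [X[y == c] for c in classes]
    means = [g.mean(axis=0) for g in groups]
    covs = [np.cov(g, rowvar=False) for g in groups]
    best = np.inf
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = means[i] - means[j]
            # atleast_2d keeps single-feature subsets invertible
            sigma = np.atleast_2d(0.5 * (covs[i] + covs[j]))
            best = min(best, float(d @ np.linalg.inv(sigma) @ d))
    return best
```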

Limitations of Current Methods

Friedman and Rafsky’s two sample test

• Minimum spanning tree approach for determining whether two sets of data originate from the same source

• An MST is built across the combined data from the two sources; edges which connect samples of different data sets are removed

• If many edges are removed, then the two sets of data are likely to originate from the same source

Friedman and Rafsky’s two sample test

• The method can be used as a criterion function
• An MST is built across the sample points
• Edges which connect samples of different classes are removed
• A good subset is one that provides discriminatory information about the classes; therefore the fewer edges removed the better (a sketch of this criterion follows)
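A minimal sketch of a criterion based on Friedman and Rafsky's test, assuming X and y are NumPy arrays and Euclidean distances between samples (function names are illustrative, not the thesis code):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def friedman_rafsky_criterion(X: np.ndarray, y: np.ndarray) -> float:
    """Fraction of MST edges whose endpoints share a class label.

    An MST is built over the samples of the candidate subset X; edges
    joining samples of different classes are the "removed" edges, so a
    higher score (fewer removed edges) suggests better class separation.
    """
    dist = squareform(pdist(X))                # dense pairwise distances
    mst = minimum_spanning_tree(dist)          # sparse matrix of N-1 edges
    rows, cols = mst.nonzero()
    between = int(np.sum(y[rows] != y[cols]))  # edges crossing classes
    return 1.0 - between / len(rows)
```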

Limitations of Friedman and Rafsky’s technique

Our Proposed Method

• Use the number of edges and edge lengths in determining the suitability of a subset

• A good subset will have a large number of short edges connecting samples of the same class

• And a small number of long edges connecting samples of different classes

Our Proposed Method

• We experimented with using the average edge length and a weighted average; the weighted average was expected to perform better

$$J(x) = 0.5\left(1 - \frac{\mathrm{betweenNum}}{\mathrm{betweenNum} + \mathrm{withinNum}} + \frac{\mathrm{Avg(betweenEdges)}}{\mathrm{Avg(betweenEdges)} + \mathrm{Avg(withinEdges)}}\right)$$
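A minimal sketch of the proposed criterion, assuming X and y are NumPy arrays, reusing the MST construction above, and using the term names from the formula (an illustrative implementation, not the thesis code):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edge_length_criterion(X: np.ndarray, y: np.ndarray) -> float:
    """MST criterion combining edge counts and average edge lengths.

    Scores are higher when few MST edges cross class boundaries and when
    the between-class edges are long relative to the within-class edges.
    """
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist)
    rows, cols = mst.nonzero()
    lengths = np.asarray(mst[rows, cols]).ravel()   # edge weights
    between = y[rows] != y[cols]                    # cross-class edges

    between_num = int(between.sum())
    within_num = len(rows) - between_num
    avg_between = lengths[between].mean() if between_num else 0.0
    avg_within = lengths[~between].mean() if within_num else 0.0

    count_term = 1.0 - between_num / (between_num + within_num)
    denom = avg_between + avg_within
    length_term = avg_between / denom if denom > 0 else 0.0
    return 0.5 * (count_term + length_term)
```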

IFS - Interactive Feature Selector

• Developed to allow users to experiment with various feature selection methods

• Automates the execution of experiments
• Allows visualisation of data sets and results
• Extensible: developers can easily add criterion functions, feature selectors and classifiers into the system (a hypothetical plug-in shape is sketched below)
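The following is only a hypothetical illustration of what such an extension point could look like; it is not the actual IFS interface, and all names are invented for the sketch:

```python
import numpy as np
from typing import Callable, Dict

# A criterion function maps the data for a candidate feature subset (X)
# and its class labels (y) to a score, where higher is better.
Criterion = Callable[[np.ndarray, np.ndarray], float]

def toy_within_class_variance_criterion(X: np.ndarray, y: np.ndarray) -> float:
    """Toy criterion included only to show the plug-in shape: lower
    average within-class variance gives a higher score."""
    return -float(np.mean([X[y == c].var(axis=0).mean() for c in np.unique(y)]))

# A registry a feature selector could consult by name (hypothetical).
CRITERIA: Dict[str, Criterion] = {
    "toy_within_variance": toy_within_class_variance_criterion,
}
```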

IFS - Screenshot

IFS - Screenshot

Experimental Framework

Data set         No. Samples   No. Features   No. Classes
Iris             150           4              3
Crab             200           7              2
Forensic Glass   214           9              7
Diabetes         332           8              2
Character        757           20             8
Synthetic        750           7              5

Experimental Framework

• Spearman’s rank correlation
  - A good criterion function will have good correlation with the classifier: subsets that are ranked highly should also achieve high accuracy levels (see the sketch after this list)
• Subset chosen
  - The final subsets selected by the criterion functions are compared to the optimal subset chosen by the classifier
• Time
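A minimal sketch of the rank-correlation check, assuming SciPy and that each candidate subset has both a criterion score and a classifier accuracy (variable names are illustrative):

```python
from scipy.stats import spearmanr

def rank_correlation(criterion_scores, classifier_accuracies):
    """Spearman rank correlation between a criterion's scores and the
    classifier's accuracies over the same list of candidate subsets.

    Values near 1 mean the criterion ranks subsets in roughly the same
    order as the classifier's accuracy does.
    """
    rho, _p_value = spearmanr(criterion_scores, classifier_accuracies)
    return rho

# Example usage with made-up numbers:
# rank_correlation([0.91, 0.75, 0.60], [0.88, 0.79, 0.55])  ->  1.0
```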

Forensic glass data set results

Synthetic data set

Algorithm completion times

Algorithm complexities

(N = number of samples, F = number of features, C = number of classes, K = number of neighbours)

• K-NN: $O(N^2 (F + \log N + K))$
• MST criterion functions: $O(N^2 F)$
• Mahalanobis distance: $O(C^2 F^2 (N + F))$
• Euclidean distance: $O(C^2 F)$

Conclusion

• MST-based approaches generally achieved higher accuracy values and rank correlation, in particular with the K-NN classifier

• The criterion function based on Friedman and Rafsky’s two sample test performed the best

Conclusion

• MST approaches are closely related to the K-NN classifier

• Mahalanobis criterion function suited to data sets with Gaussian distributions and strong feature interdependence

• Future work:
  - Construct a classifier based on K-NN which gives closer neighbours higher priority
  - Improve IFS