Minimum Redundancy and Maximum Relevance Feature Selection
![Page 1: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/1.jpg)
Minimum Redundancy and Maximum Relevance Feature Selection
Hang Xiao
![Page 2: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/2.jpg)
Background
• Feature
– A feature is an individual measurable heuristic property of a phenomenon being observed
– In character recognition: horizontal and vertical profiles, number of internal holes, stroke detection
– In speech recognition: noise ratios, length of sounds, relative power, filter matches
– In microarrays: gene expression levels
![Page 3: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/3.jpg)
Background
• Relevance between features
– Correlation
– F-statistic
– Mutual information I(x,y)
Independent: p(x,y) = p(x)p(y), so I(x,y) = 0
p(x,y): joint distribution function of X and Y
p(x), p(y): marginal probability distribution functions
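The mutual information formula itself did not survive in the transcript. The standard definition for discrete variables is I(x,y) = Σ p(x,y) log [p(x,y) / (p(x)p(y))], summed over all value pairs. A minimal sketch of estimating it from samples follows; the function name `mutual_information` and the choice of base-2 logarithms are assumptions made for illustration, not something the slides specify.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(x, y) in bits from two equal-length sequences of discrete values.

    Probabilities are plain empirical frequencies (an assumption; the slides
    do not say which estimator was used).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts behind p(x, y)
    px = Counter(xs)              # counts behind p(x)
    py = Counter(ys)              # counts behind p(y)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) = (c/n) / ((px/n) * (py/n)) = c*n / (px*py)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent variables
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # identical variables
```

Independent variables give zero, matching the slide's statement that p(x,y) = p(x)p(y) implies I(x,y) = 0.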
![Page 4: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/4.jpg)
Feature Selection Problem
• Maximal relevance
– Selects the features with the highest relevance to the target class c, based on mutual information, F-test, etc., without considering relationships among features
• Minimal redundancy
– Features selected by relevance alone tend to be correlated with one another
– Such features cover only narrow regions of the feature space
![Page 5: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/5.jpg)
mRMR: Discrete Variables
• Maximize Relevance:
• Minimal Redundancy:
S is the set of features
I(i,j) is the mutual information between features i and j
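The two formulas on this slide were images and are missing from the transcript. In the usual mRMR formulation they presumably read as follows, with h denoting the target class (the symbol used later in the deck) and V_I, W_I the relevance and redundancy terms:

```latex
\max V_I, \qquad V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i)

\min W_I, \qquad W_I = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)
```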
![Page 6: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/6.jpg)
mRMR: Continuous Variables
• Maximum relevance: F-statistic F(i,h)
• Minimum redundancy : Correlation cor(i,j)
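For continuous variables the two building blocks are a per-feature F-statistic against the class labels and a pairwise correlation between features. The sketch below uses the one-way ANOVA F-statistic and the Pearson correlation; the function names are illustrative, and whether the original work used exactly these variants is an assumption.

```python
import math


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a continuous feature against class labels.

    Assumes at least two classes and non-zero within-class variance.
    """
    n = len(values)
    groups = {}
    for v, h in zip(values, labels):
        groups.setdefault(h, []).append(v)
    k = len(groups)
    grand_mean = sum(values) / n
    # Between-class sum of squares, weighted by class size
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    # Within-class sum of squares
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


if __name__ == "__main__":
    print(f_statistic([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]))
    print(pearson([1, 2, 3], [3, 2, 1]))
```

Redundancy is normally computed from the absolute value of the correlation, so that strongly negative correlation also counts as redundant.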
![Page 7: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/7.jpg)
Combine Relevance and Redundancy
• Additive combination
• Multiplicative combination
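The formulas for the two combinations are missing from the transcript. Writing V for the relevance term and W for the redundancy term (V_I and W_I in the discrete case, their F-statistic and correlation counterparts in the continuous case), the two criteria are presumably the difference and the quotient:

```latex
\text{Additive combination:} \qquad \max_{S} \; (V - W)

\text{Multiplicative combination:} \qquad \max_{S} \; (V / W)
```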
![Page 8: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/8.jpg)
Most Related Methods
• The most widely used feature selection methods pick the top-ranking features without considering relationships among features.
• Yu & Liu, 2003/2004: information gain, an essentially similar approach
• Wrappers: not a filter approach; the classifier is involved in selection, so the features do not generalize well
• PCA and ICA: features are orthogonal or independent, but not in the original feature space
![Page 9: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/9.jpg)
Class Prediction Methods
• Naive Bayes (NB) classifier
{g1, g2, …, gm}: gene expression levels
p(gi|hk) is the conditional table (density)
• Support Vector Machine (SVM)
– Draws an optimal hyperplane in the feature vector space
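The Naive Bayes formula is not preserved in the transcript. Using the symbols above, the standard Naive Bayes posterior, which assumes the features are conditionally independent given the class, is presumably what the slide showed. The class prior p(h_k) is included here as an assumption.

```latex
p(h_k \mid g_1, \ldots, g_m) \;\propto\; p(h_k) \prod_{i=1}^{m} p(g_i \mid h_k)
```

The predicted class is the h_k with the largest posterior.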
![Page 10: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/10.jpg)
Class Prediction Methods
• Logistic Regression (LR)
– A linear combination of the feature variables
– Transformed into probabilities by a logistic function
• Linear Discriminant Analysis (LDA)
– Finds a linear combination of features
– Related to ANOVA and regression analysis
![Page 11: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/11.jpg)
Microarray Gene Expression Data Sets for Cancer Classification
![Page 12: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/12.jpg)
LOOCV: Leave-One-Out Cross Validation
Baseline features: selected based solely on maximum relevance
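Leave-one-out cross validation holds out each sample in turn, trains on the rest, and counts how many held-out samples are misclassified. A minimal sketch follows; `loocv_errors` and the toy `nearest_neighbor` classifier are illustrative stand-ins, not the NB, SVM, LR, or LDA classifiers used in the experiments.

```python
def loocv_errors(samples, labels, train_and_predict):
    """Count leave-one-out errors.

    samples: list of feature vectors; labels: list of class labels.
    train_and_predict(train_x, train_y, x) must return a predicted label.
    """
    errors = 0
    for i in range(len(samples)):
        # Train on everything except sample i, then test on sample i
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors


def nearest_neighbor(train_x, train_y, x):
    """Toy 1-nearest-neighbour classifier, only here to make the sketch runnable."""
    best = min(range(len(train_x)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)))
    return train_y[best]


if __name__ == "__main__":
    xs = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    ys = [0, 0, 0, 1, 1, 1]
    print(loocv_errors(xs, ys, nearest_neighbor))
```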
![Page 13: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/13.jpg)
The role of redundancy reduction
(a) Relevance V_I and (b) redundancy for mRMR features on the discretized NCI dataset. (c) The respective LOOCV errors obtained using the Naive Bayes classifier.
![Page 14: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/14.jpg)
Do mRMR Features Generalize Well on Unseen Data?
Child Leukemia data (7 classes, 215 training samples, 112 testing samples) testing errors. M is the number of features used in classification
![Page 15: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/15.jpg)
What is the Relationship of mRMR Features and Various Data Discretization Schemes?
LOOCV testing results (number of errors) for binarized NCI and Lymphoma data using the SVM classifier.
![Page 16: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/16.jpg)
Comparison with other work
![Page 17: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/17.jpg)
Theoretical basis of mRMR
• Maximum Dependency Criterion
– Statistical association
– Definition: mutual information I(Sm,h)
• Mutual Information
– For two variables x and y
– For multivariate variable Sm and the target h
![Page 18: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/18.jpg)
High-Dimensional Mutual Information
• For multivariate variable Sm and the target h
• Estimating high-dimensional I(Sm,h) is difficult
– Finding the inverse of a large covariance matrix is an ill-posed problem
– Insufficient number of samples
– Combinatorial time complexity O(C(|Ω|,|S|))
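To give a feel for the combinatorial term C(|Ω|,|S|), the sketch below counts candidate subsets with Python's `math.comb`. The pool and subset sizes are made-up illustrative numbers, not figures from the experiments.

```python
import math


def subset_count(pool_size, subset_size):
    """Number of distinct feature subsets of a given size, C(|Omega|, |S|)."""
    return math.comb(pool_size, subset_size)


if __name__ == "__main__":
    # Hypothetical sizes: choosing 10 features out of a pool of 1000
    print(subset_count(1000, 10))
```

Even this modest setting yields more than 10^20 subsets, which is why exhaustive evaluation of I(Sm,h) over all subsets is impractical.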
![Page 19: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/19.jpg)
Factorize the Mutual Information
• Mutual information for multivariate variable Sm and the target h
Define:
It can be proved:
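The definition and the result on this slide were images and are missing. Since a later slide refers to J(.), the content is presumably the following factorization, which can be checked by expanding both J terms into entropies. This reconstruction is an assumption about what the slide showed.

```latex
J(x_1, \ldots, x_n) = \int \!\cdots\! \int p(x_1, \ldots, x_n)
  \log \frac{p(x_1, \ldots, x_n)}{p(x_1) \cdots p(x_n)} \, dx_1 \cdots dx_n

I(S_m, h) = J(S_m, h) - J(S_m)
```

Here J(S_m, h) treats the target h as one more variable alongside the features in S_m.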
![Page 20: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/20.jpg)
Factorize I(Sm,h)
• Relevance of S={x1,x2, …} and h, or RL(S,h)
• Redundancy among variables {x1,x2,...}, or RD(S)
• For incremental search, max I(S,h) is “equivalent” to max [RL(S,h) – RD(S)], the so-called min-Redundancy-Max-Relevance (mRMR) criterion
![Page 21: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/21.jpg)
Advantages of mRMR
• Both relevance and redundancy estimation are low-dimensional problems (i.e., involving only 2 variables). This is much easier than directly estimating multivariate density or mutual information in the high-dimensional space!
• Fast speed
• More reliable estimation
• mRMR is an optimal first-order approximation of I(.) maximization
• Relevance-only ranking only maximizes J(.)!
![Page 22: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/22.jpg)
Search Algorithm of mRMR
• Greedy search algorithm
– In the pool Ω, find the variable x1 that has the largest I(x1,h). Exclude x1 from Ω
– Search for x2 so that it maximizes I(x2,h) - ∑I(.,x2)/|Ω|
– Iterate this process until an expected number of variables have been obtained, or other constraints are satisfied
• Complexity O(|S|*|Ω|)
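The greedy search above can be sketched for discrete features as follows. This is a minimal illustration under stated assumptions, not the authors' released program: the names `mrmr_select` and `mutual_information` are hypothetical, probabilities are empirical frequencies, ties are broken alphabetically, and the redundancy term is averaged over the variables already selected (the slide writes the normalizer as |Ω|).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_select(features, target, k):
    """Greedy mRMR with the additive (relevance minus redundancy) score.

    features: dict mapping feature name -> list of discrete values.
    target: list of class labels h.
    Returns up to k feature names in selection order.
    """
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    pool = set(features)                      # the pool, Omega
    selected = []                             # the selected set, S
    redundancy_sum = {name: 0.0 for name in features}

    def score(name):
        if not selected:
            return relevance[name]            # first pick: pure relevance
        # Assumption: average redundancy over already-selected variables
        return relevance[name] - redundancy_sum[name] / len(selected)

    while pool and len(selected) < k:
        best = max(sorted(pool), key=score)   # sorted() makes ties deterministic
        selected.append(best)
        pool.remove(best)
        # Update running redundancy sums incrementally: one MI per remaining
        # variable per step, which gives the O(|S| * |Omega|) cost
        for name in pool:
            redundancy_sum[name] += mutual_information(features[name],
                                                       features[best])
    return selected


if __name__ == "__main__":
    x1 = [0, 0, 0, 0, 1, 1, 1, 1]
    x2 = [0, 0, 1, 1, 0, 0, 1, 1]
    h = [2 * a + b for a, b in zip(x1, x2)]   # four classes built from x1, x2
    feats = {"a": x1, "b": list(x1), "c": x2}  # "b" duplicates "a"
    print(mrmr_select(feats, h, 2))
```

In the toy data, "b" is an exact copy of "a", so after "a" is chosen the redundancy penalty pushes the search to "c" even though all three features are equally relevant on their own.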
![Page 23: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/23.jpg)
Comparing Max-Dep and mRMR: Complexity of Feature Selection
![Page 24: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/24.jpg)
Comparing Max-Dep and mRMR: Accuracy of Feature Selected in Classification
• Leave-One-Out cross validation of feature classification accuracies of mRMR and MaxDep
![Page 25: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/25.jpg)
Use Wrappers to Refine Features
• mRMR is a filter approach
– Fast
– Features might be redundant
– Independent of the classifier
• Wrappers seek to minimize the number of errors directly
– Slow
– Features are less robust
– Dependent on classifier
– Better prediction accuracy
• Use mRMR first to generate a short feature pool and use wrappers to get a least redundant feature set with better accuracy
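The two-stage idea in the last bullet can be sketched with a forward (incremental) wrapper that runs over a short candidate pool, for example the top mRMR features. The function `forward_wrapper` and its callback `error_of` are hypothetical names; in practice `error_of` would be something like a cross-validation error count for a chosen classifier.

```python
def forward_wrapper(candidates, error_of):
    """Forward wrapper over a small candidate pool.

    candidates: feature names, e.g. a short pool produced by mRMR.
    error_of(subset): hypothetical callback returning the classification
    error of a classifier trained on that list of features.
    Returns (selected features, error achieved).
    """
    selected = []
    remaining = list(candidates)
    best_error = float("inf")
    while remaining:
        # Try adding each remaining feature and keep the one with lowest error
        err, name = min((error_of(selected + [f]), f) for f in remaining)
        if err >= best_error:
            break  # no candidate improves the error, so stop
        selected.append(name)
        remaining.remove(name)
        best_error = err
    return selected, best_error


if __name__ == "__main__":
    # Made-up error function for illustration: "a" and "c" help, "b" hurts
    def toy_error(subset):
        return 10 - 4 * ("a" in subset) - 3 * ("c" in subset) + 1 * ("b" in subset)

    print(forward_wrapper(["a", "b", "c"], toy_error))
```

Because the wrapper only searches the short mRMR pool rather than all of Ω, the number of classifier trainings stays small.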
![Page 26: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/26.jpg)
Use Wrappers to Refine Features
Forward wrappers (incremental selection)
Backward wrappers (decremental selection)
NCI Data
![Page 27: Minimum Redundancy and Maximum Relevance Feature Selection](https://reader033.fdocuments.us/reader033/viewer/2022061514/568166a8550346895dda9d0b/html5/thumbnails/27.jpg)
Conclusions
• The Max-Dependency feature selection can be efficiently implemented as the mRMR algorithm
• mRMR significantly outperforms the widely used max-relevance selection method: mRMR features cover a broader feature space with fewer features
• mRMR is very efficient and useful for gene selection and many other applications. The programs are ready!