Dimensionality Reduction by Feature Selection in Machine Learning
Dunja Mladenič, J. Stefan Institute, Slovenia
Reasons for dimensionality reduction
Dimensionality reduction in machine learning is usually performed to:
• improve the prediction performance
• improve learning efficiency
• provide faster predictors, possibly requiring less information on the original data
• reduce the complexity of the learned results and enable better understanding of the underlying process
Approaches to dimensionality reduction
Map the original features onto a reduced-dimensionality space by:
• selecting a subset of the original features: no feature transformation, just select a feature subset (the approach addressed here)
• constructing features to replace the original features, using methods from statistics such as PCA
• using background knowledge to construct new features, used in addition to or instead of the original features (can be followed by feature subset selection):
  - general background knowledge (sum or product of features, ...)
  - domain-specific background knowledge (a parser for text data to extract noun phrases, clustering of words, a user-specified function, ...)
Example of the problem
Data set: five Boolean features; C = F1 ∨ F2; F3 = ¬F2, F5 = ¬F4.
Optimal subset: {F1, F2} or {F1, F3}, found by optimization in the space of all feature subsets (2^5 = 32 possibilities).
(tutorial on genomics [Yu 2004])
F1 F2 F3 F4 F5 C
0 0 1 0 1 0
0 1 0 0 1 1
1 0 1 0 1 1
1 1 0 0 1 1
0 0 1 1 0 0
0 1 0 1 0 1
1 0 1 1 0 1
1 1 0 1 0 1
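To make the example concrete, here is a minimal Python sketch (all names are illustrative) that enumerates all 2^5 = 32 feature subsets of the table above, from the smallest size up, and reports the smallest sufficient ones:

```python
from itertools import combinations

# The eight examples from the table above: five Boolean features F1..F5 and
# the class C = F1 or F2 (F3 = not F2 and F5 = not F4 are redundant copies).
DATA = [
    (0, 0, 1, 0, 1, 0),
    (0, 1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1, 1),
    (1, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 0),
    (0, 1, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (1, 1, 0, 1, 0, 1),
]

def sufficient(subset):
    """A subset is sufficient if no two examples agree on every feature in
    the subset while carrying different class values."""
    seen = {}
    for row in DATA:
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, row[-1]) != row[-1]:
            return False
    return True

# Try all 2**5 = 32 subsets, smallest first; stop at the first size that
# contains a sufficient subset.
for size in range(6):
    found = [s for s in combinations(range(5), size) if sufficient(s)]
    if found:
        print([tuple('F%d' % (i + 1) for i in s) for s in found])
        break   # prints [('F1', 'F2'), ('F1', 'F3')], matching the slide
```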
Search for a feature subset
[Figure: an example of the search space of feature subsets (John & Kohavi 1997), traversed top-down by forward selection or bottom-up by backward elimination.]
Feature subset selection: commonly used search strategies:
• forward selection: FSubset = {}; greedily add features one at a time (see the sketch below)
• forward stepwise selection: FSubset = {}; greedily add or remove features one at a time
• backward elimination: FSubset = AllFeatures; greedily remove features one at a time
• backward stepwise elimination: FSubset = AllFeatures; greedily add or remove features one at a time
• random mutation: FSubset = RandomFeatures; greedily add or remove a randomly selected feature one at a time; stop after a given number of iterations
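A minimal sketch of greedy forward selection (illustrative names; `evaluate` stands for any subset-scoring function, e.g. a filter measure or a wrapper's cross-validated accuracy):

```python
def forward_selection(all_features, evaluate):
    """Greedy forward selection: start from the empty set and repeatedly add
    the single feature whose addition most improves the evaluation score."""
    subset, best = set(), evaluate(set())
    while subset != all_features:
        score, f = max((evaluate(subset | {f}), f) for f in all_features - subset)
        if score <= best:
            break                       # no single addition helps any more
        subset.add(f)
        best = score
    return subset, best
```

The stepwise and backward variants differ only in the starting subset and in whether each step may also undo an earlier choice.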
Approaches to feature subset selection
• Filters: the evaluation function is independent of the learning algorithm.
• Wrappers: evaluation uses model selection based on the machine learning algorithm.
• Embedded approaches: feature selection happens during learning.
• Simple filters: assume feature independence (used for problems with a large number of features, e.g., text classification).
Filtering
Evaluation independent of ML algorithm
Filters: Distribution-based [Koller & Sahami 1996]
Idea: select a minimal subset of features that keeps the class probability distribution close to the original distribution, i.e., P(C | FeatureSet) is close to P(C | AllFeatures).
1. Start with all the features.
2. Use backward elimination to eliminate a predefined number of features.
Evaluation: the next feature to be deleted is chosen using a cross-entropy measure (a simplified sketch follows below).
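The original algorithm uses Markov-blanket approximations; as a rough illustration only, the sketch below scores each feature by the expected KL divergence between P(C | subset) and P(C | subset without the feature), estimated from counts, and eliminates greedily (all names are my own; examples are (feature-vector, class) pairs):

```python
import math
from collections import Counter, defaultdict

def class_dist(rows, subset):
    """Empirical P(C | values of the features in `subset`)."""
    groups = defaultdict(Counter)
    for x, c in rows:
        groups[tuple(x[i] for i in subset)][c] += 1
    return {key: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for key, cnt in groups.items()}

def loss_when_dropping(rows, subset, f):
    """Average KL divergence between P(C|subset) and P(C|subset minus f):
    how much of the class distribution is disturbed by deleting feature f."""
    rest = [i for i in subset if i != f]
    full, red = class_dist(rows, subset), class_dist(rows, rest)
    total = 0.0
    for x, _ in rows:
        p = full[tuple(x[i] for i in subset)]
        q = red[tuple(x[i] for i in rest)]
        total += sum(pc * math.log(pc / q[cv]) for cv, pc in p.items())
    return total / len(rows)

def backward_eliminate(rows, n_features, n_to_delete):
    subset = list(range(n_features))
    for _ in range(n_to_delete):
        # greedily delete the feature whose removal disturbs P(C|.) the least
        f = min(subset, key=lambda g: loss_when_dropping(rows, subset, g))
        subset.remove(f)
    return subset
```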
Filters: Relief [Kira & Rendell 1992]
Evaluation of a feature subset (see the sketch below):
1. Represent the examples using the feature subset.
2. On a random subset of examples, calculate the average difference between the distance to the nearest example of a different class and the distance to the nearest example of the same class.

The per-feature difference is

$$\mathrm{diff}(Ex_1(F), Ex_2(F)) = \begin{cases} 0 & \text{if } Ex_1(F) = Ex_2(F) \\ 1 & \text{otherwise} \end{cases} \quad \text{for discrete } F,$$

$$\mathrm{diff}(Ex_1(F), Ex_2(F)) = \frac{|Ex_1(F) - Ex_2(F)|}{\max(F) - \min(F)} \quad \text{for continuous } F,$$

and the distance between two examples is the sum over the subset:

$$d(Ex_1, Ex_2) = \sum_{F_i \in FSubset} \mathrm{diff}(Ex_1(F_i), Ex_2(F_i)).$$

Some extensions, with empirical and theoretical analysis, are given in [Robnik-Sikonja & Kononenko 2003].
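A minimal sketch of this subset evaluation, assuming examples are (feature-vector, class) pairs and every class has at least two examples (all names illustrative):

```python
import random

def relief_quality(rows, subset, is_discrete, m=30, seed=0):
    """Relief-style subset evaluation: average, over m randomly sampled
    examples, of (distance to nearest miss) - (distance to nearest hit)."""
    rng = random.Random(seed)
    lo = {f: min(x[f] for x, _ in rows) for f in subset}
    hi = {f: max(x[f] for x, _ in rows) for f in subset}

    def diff(f, a, b):
        # 0/1 for discrete features, normalised |a - b| for continuous ones
        if is_discrete[f]:
            return 0.0 if a == b else 1.0
        return abs(a - b) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0

    def dist(x1, x2):
        return sum(diff(f, x1[f], x2[f]) for f in subset)

    sample = rng.sample(rows, min(m, len(rows)))
    total = 0.0
    for x, c in sample:
        hit = min(dist(x, x2) for x2, c2 in rows if c2 == c and x2 is not x)
        miss = min(dist(x, x2) for x2, c2 in rows if c2 != c)
        total += miss - hit
    return total / len(sample)   # higher is better
```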
Filters: FOCUS [Almuallim & Dietterich 1991]
Evaluation of a feature subset:
1. Represent the examples using the feature subset.
2. Count conflicts in the class value (two examples with the same feature values but different class values).
Search: all the (promising) subsets of the same (increasing) size are evaluated until a sufficient (conflict-free) subset is found.
• Assumes the existence of a small sufficient subset, so it is not appropriate for tasks with many features.
• Some extensions of the algorithm use heuristic search to avoid evaluating all the subsets of the same size.
Illustration of FOCUS

Full data set:
F1 F2 F3 F4 F5 | C
0 0 1 0 1 | 0
0 1 0 0 1 | 1
1 0 1 0 1 | 1
1 1 0 0 1 | 1
0 0 1 1 0 | 0
0 1 0 1 0 | 1
1 0 1 1 0 | 1
1 1 0 1 0 | 1

Projected onto {F4, F5}:
F4 F5 | C
0 1 | 0   <- Conflict! (same feature values as the C=1 rows below)
0 1 | 1
0 1 | 1
0 1 | 1
1 0 | 0   <- Conflict! (same feature values as the C=1 rows below)
1 0 | 1
1 0 | 1
1 0 | 1

{F4, F5} contains conflicts, so it is not a sufficient subset.
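A minimal sketch of the conflict count on the table above (names illustrative):

```python
from collections import defaultdict

def conflicts(rows, subset):
    """FOCUS evaluation: number of pairs of examples that agree on every
    feature in `subset` but have different class values."""
    groups = defaultdict(list)
    for x, c in rows:
        groups[tuple(x[f] for f in subset)].append(c)
    n = 0
    for labels in groups.values():
        for cls in set(labels):
            k = labels.count(cls)
            n += k * (len(labels) - k)   # pairs (cls, other class)
    return n // 2                        # each pair was counted from both sides

rows = [((0,0,1,0,1), 0), ((0,1,0,0,1), 1), ((1,0,1,0,1), 1), ((1,1,0,0,1), 1),
        ((0,0,1,1,0), 0), ((0,1,0,1,0), 1), ((1,0,1,1,0), 1), ((1,1,0,1,0), 1)]
print(conflicts(rows, (3, 4)))   # 6 conflicting pairs: {F4, F5} is insufficient
print(conflicts(rows, (0, 1)))   # 0: {F1, F2} is conflict-free
```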
Filters: Random [Liu & Setiono 1996]
Evaluation of a feature subset:
1. Represent the examples using the feature subset.
2. Calculate the inconsistency rate (the average difference between the number of examples with equal feature values and the number of those examples that carry the locally most frequent class value).
3. Select the smallest subset with an inconsistency rate below the given threshold.
Search: random sampling of the space of feature subsets; a predetermined number of subsets is evaluated. Noise is handled by setting the threshold > 0; if the threshold = 0, the evaluation is the same as in FOCUS (see the sketch below).
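A minimal sketch, assuming examples are (feature-vector, class) pairs; with threshold = 0 the acceptance test coincides with FOCUS's no-conflict criterion:

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(rows, subset):
    """For each group of examples with identical projected feature values,
    count the examples outside the group's most frequent class; divide the
    total by the number of examples."""
    groups = defaultdict(Counter)
    for x, c in rows:
        groups[tuple(x[f] for f in subset)][c] += 1
    bad = sum(sum(cnt.values()) - max(cnt.values()) for cnt in groups.values())
    return bad / len(rows)

def random_search(rows, n_features, threshold=0.0, n_trials=1000, seed=0):
    """Sample random subsets and keep the smallest one whose inconsistency
    rate stays below the threshold."""
    rng = random.Random(seed)
    best = list(range(n_features))
    for _ in range(n_trials):
        cand = rng.sample(range(n_features), rng.randint(1, n_features))
        if len(cand) < len(best) and inconsistency_rate(rows, cand) <= threshold:
            best = cand
    return best
```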
Filters: MDL-based [Pfahringer 1995]
Evaluation using the Minimum Description Length principle:
1. Represent the examples using the feature subset.
2. Calculate the MDL of a simple decision table representing the examples.
Search: start with a random feature subset and add or delete one feature at a time.
Performs at least as well as the wrapper approach applied to simple decision tables, and scales up better to a large number of training examples.
Wrapper
Evaluation uses the same ML algorithm that is used after the feature selection
Wrappers: Instance-based learning
Evaluation using instance-based learning:
• represent the examples using the feature subset
• estimate model quality using cross-validation
Search [Aha & Bankert 1994]: start with a random feature subset; use beam search with backward elimination.
Search [Skalak 1994]: start with a random feature subset; use random mutation.
Wrappers: Decision tree induction
Evaluation using decision tree induction:
• represent the examples using the feature subset
• estimate model quality using cross-validation
Search [Bala et al 1995], [Cherkauer & Shavlik 1996]: using a genetic algorithm.
Search [Caruana & Freitag 1994]: adding and removing features (backward stepwise elimination); additionally, at each step, remove all the features that were not used in the decision tree induced for the evaluation of the current feature subset. (A minimal wrapper sketch follows below.)
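To make the wrapper loop concrete, a minimal sketch using scikit-learn (assumed available) with one of its bundled datasets; the forward-selection search, stopping rule, and learner choice are illustrative:

```python
# Wrapper: every candidate subset is scored by cross-validating the same
# learner that will be used after the selection.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
learner = DecisionTreeClassifier(random_state=0)

selected, best = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # score each single-feature extension of the current subset
    scored = [(cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean(), f)
              for f in remaining]
    score, f = max(scored)
    if score <= best:
        break                           # no extension improves the estimate
    selected.append(f)
    remaining.remove(f)
    best = score
print(selected, round(best, 3))
```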
Metric-based model selection
Idea: poor models behave differently on training data than on other data.
Evaluation using a machine learning algorithm:
• represent the examples using the feature subset
• generate a model using some ML algorithm
• estimate model quality by comparing the performance of pairs of models on training data and on unlabeled data; choose the largest subset that satisfies the triangle inequality with all the smaller subsets (sketched below):

$$d_U(f_i, f_j) \;\geq\; \bigl|\, d_T(f_i, P_{Y|x}) - d_T(f_j, P_{Y|x}) \,\bigr|, \qquad 0 \leq j < i,$$

where $d_T$ is measured on the training data and $d_U$ on the unlabeled data.

Combining metric-based selection and cross-validation [Bengio & Chapados 2003], based on their disagreement on test examples (higher disagreement means lower trust in cross-validation). Intuition: cross-validation provides good results but has high variance, and should benefit from a combination with a model selection method of lower variance.
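A sketch of the subset-choice rule, under the assumption that the 0/1 prediction disagreement rate serves as both distances; `models` are assumed to be fitted scikit-learn-style predictors trained on nested feature subsets, ordered from smallest to largest:

```python
import numpy as np

def metric_select(models, X_train, y_train, X_unlab):
    """Return the index of the largest model whose disagreement with every
    smaller model on unlabeled data satisfies
    d_U(f_i, f_j) >= |d_T(f_i) - d_T(f_j)|."""
    d_T = [np.mean(m.predict(X_train) != y_train) for m in models]  # training error
    preds = [m.predict(X_unlab) for m in models]
    chosen = 0
    for i in range(1, len(models)):
        if all(np.mean(preds[i] != preds[j]) >= abs(d_T[i] - d_T[j])
               for j in range(i)):
            chosen = i
    return chosen
```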
Embedded
Feature selection as an integral part of model generation

Embedded
• [Perkins et al 2003]: at each iteration of the incremental optimization of the model, use a fast gradient-based heuristic to find the most promising feature to add.
• [Rakotomamonjy 2003]: idea: features that are relevant to the concept should affect the generalization error bound of a non-linear SVM more than irrelevant features; use backward elimination based on criteria derived from the generalization error bounds of SVM theory (the weight-vector norm, or upper bounds on the leave-one-out error); see the sketch below.
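The paper's criteria are derived for non-linear SVMs; the sketch below shows only the simpler linear weight-norm variant (recursively eliminating the feature with the smallest squared weight), with scikit-learn assumed available and the target subset size arbitrary:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
active = list(range(X.shape[1]))
while len(active) > 5:                   # arbitrary target subset size
    svm = LinearSVC(dual=False).fit(X[:, active], y)
    w2 = svm.coef_.ravel() ** 2          # per-feature contribution to ||w||^2
    active.pop(int(np.argmin(w2)))       # drop the least influential feature
print(active)
```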
Embedded: in filters [Cardie 1993]
Use embedded feature selection as pre-processing:
• evaluation and search using the process embedded in decision tree induction
• the final feature subset contains only the features that appear in the induced decision tree
• used for subsequent learning with the nearest-neighbor algorithm (see the sketch below)
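A minimal sketch of this pipeline with scikit-learn (assumed available; dataset and split are illustrative): keep only the features actually tested by an induced tree, then train a nearest-neighbor classifier on them:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# internal nodes store the tested feature index; leaves store -2
used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])

knn = KNeighborsClassifier().fit(X_tr[:, used], y_tr)
print(len(used), "features kept; kNN accuracy:", knn.score(X_te[:, used], y_te))
```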
Simple Filtering
Evaluation independent of ML algorithm
Feature subset selection on text data – commonly used methods
Simple filtering using a scoring measure to evaluate each individual feature:
• supervised measures: information gain, cross entropy for text (information gain restricted to a single feature value), mutual information for text
• supervised measures for a binary class: odds ratio (target class vs. the rest), bi-normal separation
• unsupervised measures: term frequency, document frequency
Simple filtering using an embedded approach to score the features:
• scoring measure equal to the weights in the normal to the hyperplane of a linear SVM trained on all the features [Brank et al 2002] (see the sketch below)
• learning using linear SVM, Perceptron, or Naïve Bayes
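A minimal sketch of the embedded scoring, with scikit-learn assumed available and a downloadable two-class text dataset standing in for the data used in the paper:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Train a linear SVM on all features, then rank features by the magnitude
# of their weight in the normal to the separating hyperplane.
train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
vec = TfidfVectorizer(min_df=3)
X = vec.fit_transform(train.data)

svm = LinearSVC(dual=False).fit(X, train.target)
scores = np.abs(svm.coef_.ravel())
top = np.argsort(scores)[::-1][:1000]   # keep the 1000 highest-|w| features
print([vec.get_feature_names_out()[i] for i in top[:10]])
```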
Scoring an individual feature (word $W$):

InformationGain: $\mathrm{IG}(W) = \sum_{F \in \{W, \neg W\}} \sum_{C \in \{pos, neg\}} P(F)\, P(C \mid F) \log \dfrac{P(C \mid F)}{P(C)}$

CrossEntropyTxt: $\mathrm{CE}(W) = P(W) \sum_{C \in \{pos, neg\}} P(C \mid W) \log \dfrac{P(C \mid W)}{P(C)}$

MutualInfoTxt: $\mathrm{MI}(W) = \sum_{C \in \{pos, neg\}} P(C) \log \dfrac{P(W \mid C)}{P(W)}$

OddsRatio: $\mathrm{OR}(W) = \log \dfrac{P(W \mid pos)\,(1 - P(W \mid neg))}{(1 - P(W \mid pos))\,P(W \mid neg)}$

Frequency: $\mathrm{Freq}(W)$

Bi-NormalSeparation: $\mathrm{BNS}(W) = F^{-1}(P(W \mid pos)) - F^{-1}(P(W \mid neg))$, where $F$ is the Normal distribution cumulative probability function.
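As a worked illustration, two of the measures computed from document counts (the smoothing, the helper names, and the example counts are all my own additions, the smoothing to avoid log(0)):

```python
import math

def _p(n, total, smooth=1e-6):
    """Probability estimate with tiny smoothing to avoid log(0)."""
    return (n + smooth) / (total + 2 * smooth)

def odds_ratio(n_w_pos, n_pos, n_w_neg, n_neg):
    """log[ P(W|pos)(1-P(W|neg)) / ((1-P(W|pos)) P(W|neg)) ]."""
    p, q = _p(n_w_pos, n_pos), _p(n_w_neg, n_neg)
    return math.log(p * (1 - q) / ((1 - p) * q))

def cross_entropy_txt(n_w_pos, n_pos, n_w_neg, n_neg):
    """P(W) * sum over C in {pos, neg} of P(C|W) log(P(C|W)/P(C))."""
    n, n_w = n_pos + n_neg, n_w_pos + n_w_neg
    score = 0.0
    for n_wc, n_c in ((n_w_pos, n_pos), (n_w_neg, n_neg)):
        p_c_w, p_c = _p(n_wc, n_w), n_c / n
        score += p_c_w * math.log(p_c_w / p_c)
    return (n_w / n) * score

# Example: a word occurring in 40 of 100 positive and 5 of 200 negative documents
print(odds_ratio(40, 100, 5, 200), cross_entropy_txt(40, 100, 5, 200))
```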
Influence of feature selection on the classification performance
Some ML algorithms are more sensitive to the feature subset than others:
• Naïve Bayes on document categorization is sensitive to the feature subset.
• Linear SVM has an embedded weighting of features that partially compensates for feature selection.
Illustration of feature selection
• Naïve Bayes on Yahoo! hierarchy data: comparison of different feature scoring measures in simple filtering.
• Linear SVM on standard Reuters-2000 news data: comparison of scoring measures, including embedded SVM-normal and perceptron-normal used as pre-processing.

Illustration on 5 datasets from the Yahoo! hierarchy using Naïve Bayes [Mladenic & Grobelnik 2003]
Yahoo category | # of sub-ctgs. (nodes) | # of words/phrases (1-grams + ... + 5-grams) | # of documents | Avg. node distance | Avg. # of words per document
Entertainment | 8,081 | 30,998 (15144+11211+2970+1059+505) | 79,011 | 7.56 | 60
Arts and Humanities | 3,085 | 11,473 (7380+3538+463+75+17) | 27,765 | 6.59 | 65
Computers and Internet | 2,652 | 7,631 (5049+2276+261+38+7) | 23,105 | 6.77 | 55
Education | 349 | 3,198 (1919+1061+184+28+6) | 5,406 | 4.54 | 100
Reference | 129 | 928 (701+196+28+3+0) | 1,995 | 4.29 | 45
[Figure: classification performance vs. feature subset size, comparing OddsRatio, CrossEntropy, MutualInf, Random, and InfGain scoring.]
• The feature subset size importantly influences the performance.
• Some measures are more sensitive to it than others.
Dataset | Scoring | Rank | F2 | Precision | Recall | Ctgs
Ent. | OddsRatio | 16 | 0.59 | 0.44 | 0.80 | 115
Ent. | TermFq | 49 | 0.49 | 0.41 | 0.71 | 756
Ent. | CrossEntTxt | 4103 | 0.39 | 0.35 | 0.69 | 1612
Ent. | MutInfTxt | 4197 | 0.27 | 0.57 | 0.38 | 1461
Ent. | InfGain | 4054 | 0.22 | 0.86 | 0.21 | 1423
Ent. | Rnd | 4050 | 0.002 | 0.99 | 0.001 | 30
Arts | OddsRatio | 9.9 | 0.59 | 0.40 | 0.83 | 120
Arts | TermFq | 15.2 | 0.58 | 0.48 | 0.77 | 299
Arts | CrossEntTxt | 1560 | 0.44 | 0.33 | 0.75 | 626
Arts | MutInfTxt | 1620 | 0.35 | 0.56 | 0.46 | 558
Arts | InfGain | 1592 | 0.21 | 0.94 | 0.20 | 343
Arts | Rnd | 1549 | 0.001 | 0.99 | 0.001 | 24
Comp | OddsRatio | 11.9 | 0.60 | 0.40 | 0.84 | 138
Comp | TermFq | 89.2 | 0.58 | 0.50 | 0.74 | 511
Comp | CrossEntTxt | 1355 | 0.49 | 0.35 | 0.75 | 757
Comp | MutInfTxt | 1448 | 0.38 | 0.62 | 0.45 | 739
Comp | InfGain | 1393 | 0.17 | 0.94 | 0.15 | 450
Comp | Rnd | 1333 | 0.01 | 0.99 | 0.004 | 27
Edu. | TermFq | 29.5 | 0.55 | 0.57 | 0.65 | 113
Edu. | OddsRatio | 9.4 | 0.48 | 0.36 | 0.81 | 35
Edu. | MutInfTxt | 43 | 0.46 | 0.48 | 0.59 | 158
Edu. | CrossEntTxt | 10.8 | 0.46 | 0.28 | 0.82 | 162
Edu. | InfGain | 113 | 0.11 | 0.98 | 0.11 | 144
Edu. | Rnd | 178 | 0.01 | 0.99 | 0.003 | 18
Ref. | OddsRatio | 3.8 | 0.64 | 0.51 | 0.81 | 15
Ref. | CrossEntTxt | 3.3 | 0.60 | 0.62 | 0.71 | 46
Ref. | MutInfTxt | 6.9 | 0.55 | 0.73 | 0.60 | 51
Ref. | TermFq | 3.8 | 0.53 | 0.78 | 0.57 | 38
Ref. | InfGain | 27 | 0.22 | 0.99 | 0.21 | 38
Ref. | Rnd | 65.8 | 0.06 | 0.99 | 0.05 | 8
Rank: the rank of the correct category in the list of all categories.
F2: the F2-measure, which combines precision and recall with the emphasis on recall.
Ctgs: the number of categories that look promising (a testing example needs to be classified by their models).
• Best results: odds ratio; using only a small number of features (50-100, i.e. 0.2%-5%) improves the performance of Naïve Bayes.
• Surprisingly good results: the unsupervised term frequency.
• Poor results: information gain, probably because it is not compatible with Naïve Bayes (it selects mostly features representative of the negative class, and features that are informative when they do not occur in the document).
Illustration on Reuters-2000 data [Brank et al 2002]
Reuters-2000 data used in the experiments:
• 16 categories covering the range of break-even points (estimated on a sample) and class distributions
• Training: a sample of 118,294 articles from the training period
• Testing: 302,323 articles from the test period
[Figure: the full corpus of ~810,000 news articles in 103 categories spans 20 Aug 1996 to 19 Aug 1997; training period 20 Aug 1996 - 14 April 1997 (504,468 articles), test period 14 April 1997 - 19 Aug 1997 (302,323 articles).]
Experiments with the Naïve Bayes classifier
[Figure: Naïve Bayes, macroaveraged F1 over the test set (about 0.34-0.54) vs. number of features kept (10 to 100,000), comparing odds ratio, information gain, SVM-normal, and perceptron-normal scoring.]
• Benefits from feature selection.
• SVM-normal gives the best performance.
Experiments with the Perceptron classifier
[Figure: Perceptron, macroaveraged F1 over the test set (about 0.05-0.55) vs. number of features kept (10 to 100,000), comparing odds ratio, information gain, SVM-normal, and perceptron-normal scoring.]
• Does not benefit from feature selection.
• Perceptron-normal and SVM-normal feature selection give comparable performance.
Experiments with the linear SVM classifier
[Figure: Linear SVM, macroaveraged F1 over the test set (about 0.35-0.7) vs. number of features kept (10 to 100,000), comparing odds ratio, information gain, SVM-normal, and perceptron-normal scoring.]
• Does not benefit from feature selection.
• SVM-normal gives the best performance.
Discussion: using discarded features can help
• Features that harm performance when used as input were found to improve performance when used as additional output.
• Additional information is obtained by introducing a mapping from the selected features to the discarded features (the multitask learning setting [Caruana & de Sa 2003]).
• Experiments on synthetic regression and classification problems and on real-world medical data have shown improvements in performance.
• Intuition: a transfer of information occurs inside the model when, in addition to the class value, it also models an additional output consisting of the discarded features.
Discussion
Feature subset selection as pre-processing ignores the interaction with the target learning algorithm:
• Simple filters: work for a large number of features; assume feature independence, which limits the results; the size of the feature subset still has to be determined.
• Filters: search a space of size 2^n for n features, so they cannot handle many features; rely on general data characteristics (consistency, distance, class distribution).
Feature subset selection as pre-processing, using the target learning algorithm for evaluation:
• Wrappers: high accuracy but computationally expensive; use model selection with cross-validation of the target algorithm, similar to metric-based model selection (e.g., comparing output on training and on unlabeled data).
Feature subset selection during learning, using the target learning algorithm during feature selection:
• Embedded: can also be used by filters to find the feature subset.