Feature Selection: Why?
Text collections have a large number of features: 10,000 – 1,000,000 unique words … and more.
Feature selection:
- May make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- Reduces training time: training time for some methods is quadratic or worse in the number of features.
- Can improve generalization (performance): eliminates noise features and avoids overfitting.
Think of the two NB models.
(IIR 13.5)
Basic feature selection algorithm
For a given class c, we compute a utility measure A(t,c) for each term of the vocabulary
Select the k terms that have the highest values of A(t,c)
SELECTFEATURES(D, c, k)
  V ← EXTRACTVOCABULARY(D)
  L ← []
  for each t ∈ V
    do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
       APPEND(L, ⟨A(t, c), t⟩)
  return FEATURESWITHLARGESTVALUES(L, k)
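A minimal Python sketch of this generic loop; the utility argument stands in for COMPUTEFEATUREUTILITY, and all names are illustrative rather than from the source:

```python
from heapq import nlargest

def select_features(docs, labels, c, k, utility):
    """Score every vocabulary term with a utility measure A(t, c) and
    keep the k highest-scoring terms. `docs` is a list of token lists,
    `labels` the class of each doc, and `utility` any function
    (docs, labels, t, c) -> float."""
    vocabulary = {t for doc in docs for t in doc}       # ExtractVocabulary(D)
    scored = [(utility(docs, labels, t, c), t) for t in vocabulary]
    return [t for score, t in nlargest(k, scored)]      # k largest values of A(t, c)
```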
Feature selection: how?
Three utility measures:
- Information theory: how much information does the value of one categorical variable give you about the value of another? → Mutual information
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? → Chi-square test
- Frequency
(IIR 13.5)
Feature selection via Mutual Information
In the training set, choose the k words which best discriminate (give the most information on) the categories.
The mutual information between a word w and a class c is:
I(w; c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_w, e_c) \log_2 \frac{P(e_w, e_c)}{P(e_w)\,P(e_c)}

With MLEs of the probabilities, this becomes:

I(U; C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}

where N_{11} is the number of documents that contain the term and are in the class, N_{10} the number that contain the term but are not in the class, N_{01} the number in the class without the term, and N_{00} the number with neither; N_{1.} = N_{11} + N_{10}, N_{.1} = N_{11} + N_{01}, and N is the total number of documents.
Feature selection via Mutual Information (example)
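As a worked illustration, here is a minimal Python sketch of the count-based formula above; the function name and the document counts in the usage line are made up for illustration:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from the four document counts: n11 = docs containing the
    term and in the class, n10 = term but not class, n01 = class but
    not term, n00 = neither."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00      # term present / absent marginals
    n_1, n_0 = n11 + n01, n10 + n00      # in class / not in class marginals

    def summand(n_ec, row, col):
        # (N_ec / N) * log2(N * N_ec / (row marginal * column marginal));
        # a zero cell contributes 0 by the usual convention.
        return 0.0 if n_ec == 0 else (n_ec / n) * math.log2(n * n_ec / (row * col))

    return (summand(n11, n1_, n_1) + summand(n01, n0_, n_1)
            + summand(n10, n1_, n_0) + summand(n00, n0_, n_0))

print(mutual_information(60, 40, 300, 9600))   # hypothetical counts
```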
Feature selection via Mutual Information
Mutual information measures how much information, in the information-theoretic sense, a term contains about the class.
If a term's distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0.
MI reaches its maximum value if the term is a perfect indicator of class membership, i.e., if the term is present in a document if and only if the document is in the class.
χ² statistic
In statistics, the χ² test is applied to test the independence of two events. In feature selection, the two events are the occurrence of the term and the occurrence of the class.
χ² statistic (example)
X² is a measure of how much expected counts E and observed counts N deviate from each other. A high value of X² indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect.
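A matching sketch for the χ² measure, using the same 2×2 contingency-table layout of document counts as the MI sketch above (names illustrative; assumes all four marginals are nonzero):

```python
def chi_squared(n11, n10, n01, n00):
    """X^2 = sum over the four cells of (observed - expected)^2 / expected,
    where the expected counts assume term and class occur independently."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}   # term present / absent totals
    col = {1: n11 + n01, 0: n10 + n00}   # in class / not in class totals
    total = 0.0
    for et in (0, 1):
        for ec in (0, 1):
            expected = row[et] * col[ec] / n   # E = row total * column total / N
            total += (observed[et, ec] - expected) ** 2 / expected
    return total
```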
Frequency-based feature selection
Selecting the terms that are most common in the class.
Frequency can be either document frequency (the number of documents in class c that contain the term t) or collection frequency (the number of tokens of t that occur in documents in c). A sketch of both variants follows below.
Discussion:
- Frequency-based feature selection selects some frequent terms that have no specific information about the class, e.g., the days of the week (Monday, Tuesday, …), which are frequent across classes in newswire text.
- When many thousands of features are selected, frequency-based feature selection often does well.
- If somewhat suboptimal accuracy is acceptable, frequency-based feature selection can be a good alternative to more complex methods.
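A short sketch of frequency-based selection covering both notions of frequency; docs is assumed to be a list of token lists with one class label per document, and all names are illustrative:

```python
from collections import Counter

def frequent_terms(docs, labels, c, k, frequency="document"):
    """Keep the k terms most common in class c, counted either by
    document frequency or by collection (token) frequency."""
    counts = Counter()
    for doc, label in zip(docs, labels):
        if label == c:
            # set(doc) counts each term once per document (document
            # frequency); doc counts every token (collection frequency).
            counts.update(set(doc) if frequency == "document" else doc)
    return [t for t, _ in counts.most_common(k)]
```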
Comparison of feature selection methods
χ² selects rarer terms than mutual information: the independence of term t and class c can sometimes be rejected with high confidence even if t carries little information about membership of a document in c.
Comparison of feature selection methods
All three methods – MI, χ², and frequency-based – are greedy methods.
Feature selection for NB
In general, feature selection is necessary for the multivariate Bernoulli NB.
"Feature selection" really means something different for multinomial NB: it means dictionary truncation.
Evaluating Categorization
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.
Classification accuracy is adequate if there is one class per document; otherwise, compute an F measure for each class.
Results can vary based on sampling error due to different training and test sets.
(IIR 13.6)
Classifier Accuracy Measures
- Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M.
- Error rate (misclassification rate) of M = 1 - acc(M).
- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j.
- Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos/pos            /* true positive recognition rate */
  specificity = t-neg/neg            /* true negative recognition rate */
  precision = t-pos/(t-pos + f-pos)
  accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)
- This model can also be used for cost-benefit analysis.
classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42

             C1 (classifier)   C2 (classifier)
C1 (true)    true positive     false negative
C2 (true)    false positive    true negative
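The measures above can be read straight off a 2×2 confusion matrix; a small sketch, checked against the buys_computer matrix above:

```python
def diagnostic_measures(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision, and accuracy from the
    four confusion-matrix cells."""
    pos, neg = t_pos + f_neg, f_pos + t_neg
    sensitivity = t_pos / pos            # true positive recognition rate
    specificity = t_neg / neg            # true negative recognition rate
    precision = t_pos / (t_pos + f_pos)
    # Accuracy decomposes into the class-weighted combination used above.
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, accuracy

print(diagnostic_measures(6954, 46, 412, 2588))
# (0.9934..., 0.8626..., 0.9440..., 0.9542) -- matches the table
```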
Evaluating the Accuracy of a Classifier
Holdout method:
- The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
Cross-validation (k-fold, where k = 10 is most popular):
- Randomly partition the data into k mutually exclusive subsets D_1, …, D_k, each of approximately equal size.
- At the i-th iteration, use D_i as the test set and the others as the training set (a sketch follows below).
- Leave-one-out: k folds where k = the number of tuples; for small-sized data.
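A minimal sketch of k-fold cross-validation; train_fn (builds a model from training tuples) and acc_fn (scores a model on test tuples) are hypothetical placeholders, not from the source:

```python
import random

def k_fold_accuracy(data, train_fn, acc_fn, k=10, seed=0):
    """Partition the data into k disjoint folds; at iteration i, use
    fold i as the test set and the rest as the training set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(acc_fn(train_fn(train), folds[i]))
    return sum(scores) / k                       # mean accuracy over the folds
```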
Evaluating the Accuracy of a Classifier or Predictor (II)
Bootstrap:
- Works well with small data sets.
- Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
- There are several bootstrap methods; a common one is the .632 bootstrap: suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is:

acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right)
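A sketch of the .632 bootstrap under the same hypothetical train_fn/acc_fn interface as the cross-validation sketch, averaging the weighted accuracy over the k repetitions:

```python
import random

def bootstrap_632_accuracy(data, train_fn, acc_fn, k=10, seed=0):
    """Sample d tuples with replacement for training; tuples never
    drawn form the test set (about 36.8% of the data on average)."""
    rng = random.Random(seed)
    d, total = len(data), 0.0
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]  # d draws with replacement
        chosen = set(idx)
        train = [data[i] for i in idx]
        test = [data[i] for i in range(d) if i not in chosen]
        model = train_fn(train)
        # Weight accuracy on the unseen tuples by 0.632 and on the
        # training tuples by 0.368.
        total += 0.632 * acc_fn(model, test) + 0.368 * acc_fn(model, train)
    return total / k
```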
Naïve Bayes on spam email
(IIR 13.6)
Violation of NB Assumptions
Conditional independence. Examples?
Example: Sensors
Two sensors, M1 and M2, report (+ or -) on the same weather: Raining (r) or Sunny (s).
NB factors: P(s) = 1/2, P(+|s) = 1/4, P(+|r) = 3/4
Reality: P(+,+,r) = 3/8, P(+,+,s) = 1/8, P(-,-,r) = 1/8, P(-,-,s) = 3/8
NB model predictions: P(r,+,+) = (1/2)(3/4)(3/4) = 9/32 and P(s,+,+) = (1/2)(1/4)(1/4) = 1/32, giving P(r|+,+) = 9/10 and P(s|+,+) = 1/10, whereas the true posterior is P(r|+,+) = (3/8)/(3/8 + 1/8) = 3/4: NB is overconfident because it treats the two perfectly correlated sensor readings as independent.
Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
Correct probability estimation implies accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (we just need the right ordering of the probabilities).
Naive Bayes is Not So Naive
- Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement (750,000 records).
- Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Decision trees, by contrast, can suffer heavily from this.
- Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially if there is little data.
- A good dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
- Very fast: learning takes one pass of counting over the data; testing is linear in the number of attributes and the document collection size.
- Low storage requirements.
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent.
It is a graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
Nodes: random variables. Links: dependencies. In the small example graph, X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P. The graph has no loops or cycles.
Bayesian Belief Network: An Example
The network (figure) relates six variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea.

The CPT for LungCancer, given its parents FamilyHistory and Smoker:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9
Bayesian Belief Networks
The conditional probability table (CPT) for the variable LungCancer (above) shows the conditional probability for each possible combination of values of its parents.
Derivation of the probability of a particular combination of values x_1, …, x_n of X from the CPTs:

P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))
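A sketch applying this product rule to the example network above; the priors for FamilyHistory and Smoker are assumed values (they are not given on the slide), and only the LungCancer branch is modeled:

```python
# CPT for LungCancer, copied from the table above: maps
# (FamilyHistory, Smoker) to P(LC = true | parents).
CPT_LC = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}
P_FH, P_S = 0.5, 0.5   # assumed priors for the two root nodes

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), an instance of
    P(x1, ..., xn) = prod_i P(xi | Parents(Yi))."""
    p_fh = P_FH if fh else 1 - P_FH
    p_s = P_S if s else 1 - P_S
    p_lc = CPT_LC[fh, s] if lc else 1 - CPT_LC[fh, s]
    return p_fh * p_s * p_lc

print(joint(True, True, True))   # 0.5 * 0.5 * 0.8 = 0.2
```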
Resources
IIR 13
Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997. (A clear, simple explanation of Naïve Bayes.)
Open Calais: automatic semantic tagging. Free (but they can keep your data); provided by Thomson Reuters.
Weka: a data mining software package that includes an implementation of Naive Bayes.
Reuters-21578: the most famous text classification evaluation set and still widely used by lazy people (but now it's too small for realistic experiments; you should use Reuters RCV1).
Classification by decision tree induction
Decision Tree Induction: Training Dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
This follows an example of Quinlan’s ID3
Output: A Decision Tree for “buys_computer”
age?
  <=30:   student?
            no:  no
            yes: yes
  31..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm; a minimal sketch follows this list):
- The tree is constructed in a top-down recursive divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
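A minimal sketch of the recursive procedure, for categorical attributes stored as dicts; select_attr is any heuristic with the shown signature (e.g., picking the attribute with the highest information gain, as defined on the next slides), and all names are illustrative:

```python
from collections import Counter

def build_tree(rows, attributes, target, select_attr):
    """Top-down recursive divide-and-conquer induction. Returns either
    a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:               # all samples in one class
        return classes[0]
    if not attributes:                       # nothing left to split on:
        return Counter(classes).most_common(1)[0][0]   # majority vote
    best = select_attr(rows, attributes, target)
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 rest, target, select_attr)
                   for v in {r[best] for r in rows}}}
```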
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Splitting on age:

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

where \frac{5}{14} I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
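The computation above takes only a few lines of Python (rows as dicts, as in the induction sketch earlier); with the 14-tuple training table, info_gain(rows, "age", "buys_computer") comes out to about 0.246:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i log2 p_i over the class distribution."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)."""
    info_a = sum(count / len(rows)
                 * entropy([r[target] for r in rows if r[attr] == v])
                 for v, count in Counter(r[attr] for r in rows).items())
    return entropy([r[target] for r in rows]) - info_a
```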
Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

GainRatio(A) = Gain(A) / SplitInfo_A(D), where

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

Ex.:

SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557

so gain_ratio(income) = 0.029/1.557 = 0.019.
The attribute with the maximum gain ratio is selected as the splitting attribute.
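Extending the information-gain sketch above (this reuses entropy/info_gain and the same row format):

```python
import math
from collections import Counter

def gain_ratio(rows, attr, target):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    n = len(rows)
    # SplitInfo: entropy of the partition sizes induced by attr.
    split_info = -sum(c / n * math.log2(c / n)
                      for c in Counter(r[attr] for r in rows).values())
    return info_gain(rows, attr, target) / split_info
```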
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the gini index of the split is defined as

gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)

Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (one needs to enumerate all the possible splitting points for each attribute).
Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D_1: {low, medium} and 4 in D_2: {high}:

gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443

The corresponding values for the {low, high} and {medium, high} splits are 0.458 and 0.450, so {low, medium} (and {high}) is the best income split since it has the lowest gini index.
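The same computation in code; gini_split evaluates one candidate binary split, so choosing the best split on an attribute means trying each subset of its values (an illustrative sketch, same row format as before):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, attr, subset, target):
    """|D1|/|D| gini(D1) + |D2|/|D| gini(D2) for the binary split
    D1 = {rows with attr value in subset}, D2 = the rest."""
    d1 = [r[target] for r in rows if r[attr] in subset]
    d2 = [r[target] for r in rows if r[attr] not in subset]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```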
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance, dist(X_1, X_2).
- The target function could be discrete- or real-valued.
- For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
Discussion on the k-NN Algorithm
- k-NN for real-valued prediction for a given unknown tuple: returns the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query x_q, giving greater weight to closer neighbors: w ≡ 1/d(x_q, x_i)^2 (a sketch follows below).
- Robust to noisy data by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome this, eliminate the least relevant attributes.
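A compact sketch of distance-weighted k-NN for discrete classes; train is assumed to be a list of (feature_vector, label) pairs, and all names are illustrative:

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=5):
    """Each of the k nearest neighbors votes for its class with weight
    w = 1 / d(x_q, x_i)^2; an exact match dominates the vote."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(x, query)
        votes[label] += float("inf") if d == 0 else 1 / d ** 2
    return max(votes, key=votes.get)
```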
Genetic Algorithms (GA)
Genetic Algorithm: based on an analogy to biological evolution (a toy sketch follows this list).
- An initial population is created consisting of randomly generated rules; each rule is represented by a string of bits. E.g., "if A1 and ¬A2 then C2" can be encoded as 100. If an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring. The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
- The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold.
- Slow but easily parallelizable.
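A toy sketch of the loop described above; the fitness function (mapping a bit-string rule to its training accuracy in [0, 1]) and all parameters are hypothetical:

```python
import random

def evolve(fitness, n_bits, pop_size=20, generations=100,
           p_mut=0.01, threshold=0.95, seed=0):
    """Keep the fittest half as parents; refill the population with
    offspring made by single-point crossover plus bit-flip mutation;
    stop once every rule meets the fitness threshold."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        if all(fitness(r) >= threshold for r in pop):
            break
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # survival of the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)       # single-point crossover
            child = [bit ^ (rng.random() < p_mut)   # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```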