Feature Selection: Why?
Text collections have a large number of features: 10,000 – 1,000,000 unique words … and more.
Feature selection:
- May make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- Reduces training time: training time for some methods is quadratic or worse in the number of features.
- Can improve generalization (performance): eliminates noise features and avoids overfitting.
Think of the two NB models.
(IIR 13.5)
Basic feature selection algorithm
For a given class c, we compute a utility measure A(t,c) for each term of the vocabulary
Select the k terms that have the highest values of A(t,c)
SELECTFEATURES(D, c, k)
  V ← EXTRACTVOCABULARY(D)
  L ← []
  for each t ∈ V
    do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
       APPEND(L, ⟨A(t, c), t⟩)
  return FEATURESWITHLARGESTVALUES(L, k)
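A minimal Python sketch of this generic loop; the utility argument stands in for COMPUTEFEATUREUTILITY, and all names are illustrative rather than from the source:

```python
from heapq import nlargest

def select_features(docs, labels, c, k, utility):
    """Score every vocabulary term with a utility measure A(t, c) and
    keep the k highest-scoring terms. `docs` is a list of token lists,
    `labels` the class of each doc, and `utility` any function
    (docs, labels, t, c) -> float."""
    vocabulary = {t for doc in docs for t in doc}       # ExtractVocabulary(D)
    scored = [(utility(docs, labels, t, c), t) for t in vocabulary]
    return [t for score, t in nlargest(k, scored)]      # k largest values of A(t, c)
```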
Feature selection: how?
Three utility measures:
- Information theory: how much information does the value of one categorical variable give you about the value of another? → Mutual information
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? → Chi-square test
- Frequency
(IIR 13.5)
Feature selection via Mutual Information
In the training set, choose the k words which best discriminate (give the most information on) the categories.
The mutual information between a word w and a class c is:
I(w; c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_w, e_c) \log_2 \frac{P(e_w, e_c)}{P(e_w)\,P(e_c)}

With MLEs of the probabilities, this becomes:

I(U; C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}

where N_{11} is the number of documents that contain the term and are in the class, N_{10} the number that contain the term but are not in the class, N_{01} the number in the class without the term, and N_{00} the number with neither; N_{1.} = N_{11} + N_{10}, N_{.1} = N_{11} + N_{01}, and N is the total number of documents.
Feature selection via Mutual Information (example)
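As a worked illustration, here is a minimal Python sketch of the count-based formula above; the function name and the document counts in the usage line are made up for illustration:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from the four document counts: n11 = docs containing the
    term and in the class, n10 = term but not class, n01 = class but
    not term, n00 = neither."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00      # term present / absent marginals
    n_1, n_0 = n11 + n01, n10 + n00      # in class / not in class marginals

    def summand(n_ec, row, col):
        # (N_ec / N) * log2(N * N_ec / (row marginal * column marginal));
        # a zero cell contributes 0 by the usual convention.
        return 0.0 if n_ec == 0 else (n_ec / n) * math.log2(n * n_ec / (row * col))

    return (summand(n11, n1_, n_1) + summand(n01, n0_, n_1)
            + summand(n10, n1_, n_0) + summand(n00, n0_, n_0))

print(mutual_information(60, 40, 300, 9600))   # hypothetical counts
```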
Feature selection via Mutual Information
Mutual information measures how much information, in the information-theoretic sense, a term contains about the class.
If a term's distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0.
MI reaches its maximum value if the term is a perfect indicator of class membership, i.e., if the term is present in a document if and only if the document is in the class.
χ² statistic
In statistics, the χ² test is applied to test the independence of two events. In feature selection, the two events are the occurrence of the term and the occurrence of the class.
χ² statistic (example)
X² is a measure of how much expected counts E and observed counts N deviate from each other. A high value of X² indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect.
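A matching sketch for the χ² measure, using the same 2×2 contingency-table layout of document counts as the MI sketch above (names illustrative; assumes all four marginals are nonzero):

```python
def chi_squared(n11, n10, n01, n00):
    """X^2 = sum over the four cells of (observed - expected)^2 / expected,
    where the expected counts assume term and class occur independently."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}   # term present / absent totals
    col = {1: n11 + n01, 0: n10 + n00}   # in class / not in class totals
    total = 0.0
    for et in (0, 1):
        for ec in (0, 1):
            expected = row[et] * col[ec] / n   # E = row total * column total / N
            total += (observed[et, ec] - expected) ** 2 / expected
    return total
```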
Frequency-based feature selection
Selecting the terms that are most common in the class.
Frequency can be either document frequency (the number of documents in class c that contain the term t) or collection frequency (the number of tokens of t that occur in documents in c). A sketch of both variants follows below.
Discussion:
- Frequency-based feature selection selects some frequent terms that have no specific information about the class, e.g., the days of the week (Monday, Tuesday, …), which are frequent across classes in newswire text.
- When many thousands of features are selected, frequency-based feature selection often does well.
- If somewhat suboptimal accuracy is acceptable, frequency-based feature selection can be a good alternative to more complex methods.
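A short sketch of frequency-based selection covering both notions of frequency; docs is assumed to be a list of token lists with one class label per document, and all names are illustrative:

```python
from collections import Counter

def frequent_terms(docs, labels, c, k, frequency="document"):
    """Keep the k terms most common in class c, counted either by
    document frequency or by collection (token) frequency."""
    counts = Counter()
    for doc, label in zip(docs, labels):
        if label == c:
            # set(doc) counts each term once per document (document
            # frequency); doc counts every token (collection frequency).
            counts.update(set(doc) if frequency == "document" else doc)
    return [t for t, _ in counts.most_common(k)]
```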
Comparison of feature selection methods
χ² selects rarer terms than mutual information: the independence of term t and class c can sometimes be rejected with high confidence even if t carries little information about membership of a document in c.
Comparison of feature selection methods
All three methods – MI, χ², and frequency-based – are greedy methods.
Feature selection for NB
In general, feature selection is necessary for the multivariate Bernoulli NB.
"Feature selection" really means something different for multinomial NB: it means dictionary truncation.
Evaluating Categorization
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.
Classification accuracy is adequate if there is one class per document; otherwise, compute an F measure for each class.
Results can vary based on sampling error due to different training and test sets.
(IIR 13.6)
Classifier Accuracy Measures
- Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M.
- Error rate (misclassification rate) of M = 1 - acc(M).
- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j.
- Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos/pos            /* true positive recognition rate */
  specificity = t-neg/neg            /* true negative recognition rate */
  precision = t-pos/(t-pos + f-pos)
  accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)
- This model can also be used for cost-benefit analysis.
classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42

             C1 (classifier)   C2 (classifier)
C1 (true)    true positive     false negative
C2 (true)    false positive    true negative
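The measures above can be read straight off a 2×2 confusion matrix; a small sketch, checked against the buys_computer matrix above:

```python
def diagnostic_measures(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision, and accuracy from the
    four confusion-matrix cells."""
    pos, neg = t_pos + f_neg, f_pos + t_neg
    sensitivity = t_pos / pos            # true positive recognition rate
    specificity = t_neg / neg            # true negative recognition rate
    precision = t_pos / (t_pos + f_pos)
    # Accuracy decomposes into the class-weighted combination used above.
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, accuracy

print(diagnostic_measures(6954, 46, 412, 2588))
# (0.9934..., 0.8626..., 0.9440..., 0.9542) -- matches the table
```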
Evaluating the Accuracy of a Classifier
Holdout method:
- The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
Cross-validation (k-fold, where k = 10 is most popular):
- Randomly partition the data into k mutually exclusive subsets D_1, …, D_k, each of approximately equal size.
- At the i-th iteration, use D_i as the test set and the others as the training set (a sketch follows below).
- Leave-one-out: k folds where k = the number of tuples; for small-sized data.
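A minimal sketch of k-fold cross-validation; train_fn (builds a model from training tuples) and acc_fn (scores a model on test tuples) are hypothetical placeholders, not from the source:

```python
import random

def k_fold_accuracy(data, train_fn, acc_fn, k=10, seed=0):
    """Partition the data into k disjoint folds; at iteration i, use
    fold i as the test set and the rest as the training set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(acc_fn(train_fn(train), folds[i]))
    return sum(scores) / k                       # mean accuracy over the folds
```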
Evaluating the Accuracy of a Classifier or Predictor (II)
Bootstrap:
- Works well with small data sets.
- Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
- There are several bootstrap methods; a common one is the .632 bootstrap: suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is:

acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right)
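A sketch of the .632 bootstrap under the same hypothetical train_fn/acc_fn interface as the cross-validation sketch, averaging the weighted accuracy over the k repetitions:

```python
import random

def bootstrap_632_accuracy(data, train_fn, acc_fn, k=10, seed=0):
    """Sample d tuples with replacement for training; tuples never
    drawn form the test set (about 36.8% of the data on average)."""
    rng = random.Random(seed)
    d, total = len(data), 0.0
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]  # d draws with replacement
        chosen = set(idx)
        train = [data[i] for i in idx]
        test = [data[i] for i in range(d) if i not in chosen]
        model = train_fn(train)
        # Weight accuracy on the unseen tuples by 0.632 and on the
        # training tuples by 0.368.
        total += 0.632 * acc_fn(model, test) + 0.368 * acc_fn(model, train)
    return total / k
```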
Naïve Bayes on spam email
(IIR 13.6)
Violation of NB Assumptions
Conditional independence. Examples?
Example: Sensors
Two sensors, M1 and M2, report (+ or -) on the same weather: Raining (r) or Sunny (s).
NB factors: P(s) = 1/2, P(+|s) = 1/4, P(+|r) = 3/4
Reality: P(+,+,r) = 3/8, P(+,+,s) = 1/8, P(-,-,r) = 1/8, P(-,-,s) = 3/8
NB model predictions: P(r,+,+) = (1/2)(3/4)(3/4) = 9/32 and P(s,+,+) = (1/2)(1/4)(1/4) = 1/32, giving P(r|+,+) = 9/10 and P(s|+,+) = 1/10, whereas the true posterior is P(r|+,+) = (3/8)/(3/8 + 1/8) = 3/4: NB is overconfident because it treats the two perfectly correlated sensor readings as independent.
Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
Correct probability estimation implies accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (we just need the right ordering of the probabilities).
Naive Bayes is Not So Naive
- Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement (750,000 records).
- Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Decision trees, by contrast, can suffer heavily from this.
- Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially if there is little data.
- A good dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
- Very fast: learning takes one pass of counting over the data; testing is linear in the number of attributes and the document collection size.
- Low storage requirements.
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent.
It is a graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
Nodes: random variables. Links: dependencies. In the small example graph, X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P. The graph has no loops or cycles.
Bayesian Belief Network: An Example
The network (figure) relates six variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea.

The CPT for LungCancer, given its parents FamilyHistory and Smoker:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9
Bayesian Belief Networks
The conditional probability table (CPT) for the variable LungCancer (above) shows the conditional probability for each possible combination of values of its parents.
Derivation of the probability of a particular combination of values x_1, …, x_n of X from the CPTs:

P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))
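A sketch applying this product rule to the example network above; the priors for FamilyHistory and Smoker are assumed values (they are not given on the slide), and only the LungCancer branch is modeled:

```python
# CPT for LungCancer, copied from the table above: maps
# (FamilyHistory, Smoker) to P(LC = true | parents).
CPT_LC = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}
P_FH, P_S = 0.5, 0.5   # assumed priors for the two root nodes

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), an instance of
    P(x1, ..., xn) = prod_i P(xi | Parents(Yi))."""
    p_fh = P_FH if fh else 1 - P_FH
    p_s = P_S if s else 1 - P_S
    p_lc = CPT_LC[fh, s] if lc else 1 - CPT_LC[fh, s]
    return p_fh * p_s * p_lc

print(joint(True, True, True))   # 0.5 * 0.5 * 0.8 = 0.2
```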
Resources
IIR 13
Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997. (A clear, simple explanation of Naïve Bayes.)
Open Calais: automatic semantic tagging. Free (but they can keep your data); provided by Thomson Reuters.
Weka: a data mining software package that includes an implementation of Naive Bayes.
Reuters-21578: the most famous text classification evaluation set and still widely used by lazy people (but now it's too small for realistic experiments; you should use Reuters RCV1).
Classification by decision tree induction
Decision Tree Induction: Training Dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
This follows an example of Quinlan’s ID3
Output: A Decision Tree for “buys_computer”
age?
  <=30:   student?
            no:  no
            yes: yes
  31..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm; a minimal sketch follows this list):
- The tree is constructed in a top-down recursive divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
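A minimal sketch of the recursive procedure, for categorical attributes stored as dicts; select_attr is any heuristic with the shown signature (e.g., picking the attribute with the highest information gain, as defined on the next slides), and all names are illustrative:

```python
from collections import Counter

def build_tree(rows, attributes, target, select_attr):
    """Top-down recursive divide-and-conquer induction. Returns either
    a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:               # all samples in one class
        return classes[0]
    if not attributes:                       # nothing left to split on:
        return Counter(classes).most_common(1)[0][0]   # majority vote
    best = select_attr(rows, attributes, target)
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 rest, target, select_attr)
                   for v in {r[best] for r in rows}}}
```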
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Splitting on age:

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

where \frac{5}{14} I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
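The computation above takes only a few lines of Python (rows as dicts, as in the induction sketch earlier); with the 14-tuple training table, info_gain(rows, "age", "buys_computer") comes out to about 0.246:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i log2 p_i over the class distribution."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)."""
    info_a = sum(count / len(rows)
                 * entropy([r[target] for r in rows if r[attr] == v])
                 for v, count in Counter(r[attr] for r in rows).items())
    return entropy([r[target] for r in rows]) - info_a
```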
Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

GainRatio(A) = Gain(A) / SplitInfo_A(D), where

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

Ex.:

SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557

so gain_ratio(income) = 0.029/1.557 = 0.019.
The attribute with the maximum gain ratio is selected as the splitting attribute.
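Extending the information-gain sketch above (this reuses entropy/info_gain and the same row format):

```python
import math
from collections import Counter

def gain_ratio(rows, attr, target):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    n = len(rows)
    # SplitInfo: entropy of the partition sizes induced by attr.
    split_info = -sum(c / n * math.log2(c / n)
                      for c in Counter(r[attr] for r in rows).values())
    return info_gain(rows, attr, target) / split_info
```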
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the gini index of the split is defined as

gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)

Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (one needs to enumerate all the possible splitting points for each attribute).
Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D_1: {low, medium} and 4 in D_2: {high}:

gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443

The corresponding values for the {low, high} and {medium, high} splits are 0.458 and 0.450, so {low, medium} (and {high}) is the best income split since it has the lowest gini index.
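The same computation in code; gini_split evaluates one candidate binary split, so choosing the best split on an attribute means trying each subset of its values (an illustrative sketch, same row format as before):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, attr, subset, target):
    """|D1|/|D| gini(D1) + |D2|/|D| gini(D2) for the binary split
    D1 = {rows with attr value in subset}, D2 = the rest."""
    d1 = [r[target] for r in rows if r[attr] in subset]
    d2 = [r[target] for r in rows if r[attr] not in subset]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```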
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance, dist(X_1, X_2).
- The target function could be discrete- or real-valued.
- For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
Discussion on the k-NN Algorithm
- k-NN for real-valued prediction for a given unknown tuple: returns the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query x_q, giving greater weight to closer neighbors: w ≡ 1/d(x_q, x_i)^2 (a sketch follows below).
- Robust to noisy data by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome this, eliminate the least relevant attributes.
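A compact sketch of distance-weighted k-NN for discrete classes; train is assumed to be a list of (feature_vector, label) pairs, and all names are illustrative:

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=5):
    """Each of the k nearest neighbors votes for its class with weight
    w = 1 / d(x_q, x_i)^2; an exact match dominates the vote."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(x, query)
        votes[label] += float("inf") if d == 0 else 1 / d ** 2
    return max(votes, key=votes.get)
```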
Genetic Algorithms (GA)
Genetic Algorithm: based on an analogy to biological evolution (a toy sketch follows this list).
- An initial population is created consisting of randomly generated rules; each rule is represented by a string of bits. E.g., "if A1 and ¬A2 then C2" can be encoded as 100. If an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring. The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
- The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold.
- Slow but easily parallelizable.
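A toy sketch of the loop described above; the fitness function (mapping a bit-string rule to its training accuracy in [0, 1]) and all parameters are hypothetical:

```python
import random

def evolve(fitness, n_bits, pop_size=20, generations=100,
           p_mut=0.01, threshold=0.95, seed=0):
    """Keep the fittest half as parents; refill the population with
    offspring made by single-point crossover plus bit-flip mutation;
    stop once every rule meets the fitness threshold."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        if all(fitness(r) >= threshold for r in pop):
            break
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # survival of the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)       # single-point crossover
            child = [bit ^ (rng.random() < p_mut)   # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```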