1 25 Feature
Transcript of 1 25 Feature
-
8/7/2019 1 25 Feature
1/46
Feature selection
LING 572
Fei XiaWeek 4: 1/25/2011
Creating attribute-value table
Choose features: Define feature templates
Instantiate the feature templates
Dimensionality reduction: feature selection
Feature weighting: the weight for f_k is a whole column; the weight for f_k in d_i is a cell.
[Attribute-value table: one row per instance x_1, x_2, …; one column per feature f_1, f_2, …, f_K, plus a label column y]
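The table-building steps above can be sketched in Python; the function name and the toy documents below are illustrative, not from the lecture:

```python
from collections import Counter

def build_table(docs, labels):
    """Instantiate the 'word' feature template over the documents and
    build an attribute-value table (one row per document, label last)."""
    # Instantiate the template: every word seen in the data is a feature.
    vocab = sorted({w for doc in docs for w in doc.split()})
    rows = []
    for doc, y in zip(docs, labels):
        counts = Counter(doc.split())
        # One cell per feature f_k; the last column is the label y.
        rows.append([counts.get(f, 0) for f in vocab] + [y])
    return vocab, rows

vocab, rows = build_table(["a good movie", "a bad movie"], ["pos", "neg"])
```

Feature selection would then drop columns from this table, and feature weighting would replace the raw counts in each cell.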
An example: text classification task
Define feature templates. Ex: one template only: word
Instantiate the feature templates. Ex: all the words that appear in the training (and test) data
Dimensionality reduction: feature selection. Ex: remove stop words
Feature weighting. Ex: feature value: term frequency (tf), or tf-idf
Outline
Dimensionality reduction
Some scoring functions **
Chi-square score and Chi-square test
Hw4
In this lecture, we will use term and feature interchangeably.
Dimensionality reduction (DR)
What is DR? Given a feature set r, create a new set r′, s.t.
r′ is much smaller than r, and the classification performance does not suffer too much.
Why DR? ML algorithms do not scale well. DR can reduce overfitting.
Types of DR
r is the original feature set; r′ is the one after DR.
Local DR vs. global DR:
Global DR: r′ is the same for every category
Local DR: a different r′ for each category
Term extraction vs. term selection
Term selection vs. extraction
Term selection: r′ is a subset of r
Wrapping methods: score terms by training and evaluating classifiers;
expensive and classifier-dependent
Filtering methods
Term extraction: terms in r′ are obtained by combinations or transformations of the terms in r. Term clustering; latent semantic indexing (LSI)
Term selection by filtering
Main idea: score terms with predetermined numerical functions that measure the importance of the terms.
It is fast and classifier-independent.
Scoring functions: information gain, mutual information, chi square
Quick summary so far
DR: to reduce the number of features
Local DR vs. global DR
Term extraction vs. term selection
Term extraction: term clustering, latent semantic indexing (LSI)
Term selection: wrapping method; filtering method with different functions
Some scoring functions
Calculating basic distributions
Term selection functions
Intuition: for a category c_i, the most valuable terms are those that are distributed most differently in the sets of positive and negative examples of c_i.
Term selection functions
Information gain
IG(Y, X): We must transmit Y. How many bits on average would it save us if both ends of the line knew X?
Definition:
IG(Y, X) = H(Y) − H(Y|X)
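The definition above can be computed directly from counts; a minimal sketch (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """H(Y) = -sum_y P(y) * log2 P(y), estimated from counts."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(pairs):
    """IG(Y, X) = H(Y) - H(Y|X), where pairs is a list of (x, y)."""
    ys = [y for _, y in pairs]
    n = len(pairs)
    h_y_given_x = 0.0
    for x in {x for x, _ in pairs}:
        sub = [y for xi, y in pairs if xi == x]
        # H(Y|X) = sum_x P(x) * H(Y | X=x)
        h_y_given_x += (len(sub) / n) * entropy(sub)
    return entropy(ys) - h_y_given_x

# X perfectly predicts Y, so knowing X saves the full H(Y) = 1 bit.
print(info_gain([(0, "a"), (0, "a"), (1, "b"), (1, "b")]))  # 1.0
```

When X is independent of Y, H(Y|X) = H(Y) and the gain is 0.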
Information gain
Information gain**
More term selection functions**
More term selection functions**
Global DR
For local DR, calculate f(t_k, c_i).
For global DR, calculate one of the following:
|C| is the number of classes
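The transcript drops the slide's formulas, but the usual combining functions (e.g., in Yang and Pedersen 1997) are the sum, the prior-weighted average, and the max of the per-category scores; assuming those are the ones intended, a sketch:

```python
def f_sum(scores):
    """f_sum(t) = sum_i f(t, c_i) over all |C| classes."""
    return sum(scores.values())

def f_avg(scores, priors):
    """f_avg(t) = sum_i P(c_i) * f(t, c_i)."""
    return sum(priors[c] * s for c, s in scores.items())

def f_max(scores):
    """f_max(t) = max_i f(t, c_i)."""
    return max(scores.values())

local = {"c1": 0.2, "c2": 0.8}    # local scores f(t, c_i) for one term t
priors = {"c1": 0.5, "c2": 0.5}   # class priors P(c_i)
print(f_sum(local), f_avg(local, priors), f_max(local))
```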
Which function works the best?
It depends on the classifier and the data.
According to (Yang and Pedersen 1997):
Feature weighting
Alternative feature values (need more work)
Binary features: 0 or 1.
Term frequency (TF): the number of times that t_k appears in d_i.
Inverse document frequency (IDF): log(|D| / d_k), where d_k is the number of documents that contain t_k.
TFIDF = TF * IDF
Normalized TFIDF:
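The normalized-TFIDF formula itself did not survive the transcript; the sketch below assumes the common L2 (cosine) normalization, and uses the natural log for IDF (the base is a convention):

```python
import math

def tf_idf(docs):
    """Per-document TF-IDF weights, L2-normalized (assumed convention).
    idf(t) = log(|D| / d_t), where d_t = # of docs containing t."""
    n = len(docs)
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) for t in set(doc)}
        w = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Normalized TF-IDF: divide by the L2 norm of the weight vector.
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})
    return weights

w = tf_idf([["a", "b", "a"], ["a", "c"]])
```

Note that a term appearing in every document gets IDF = log(1) = 0 and thus zero weight.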
Feature weights
Feature weight ∈ {0, 1}: same as DR
Feature weight ∈ ℝ: iterative approach. Ex: MaxEnt
Feature selection is a special case of feature weighting.
Summary so far
Curse of dimensionality → dimensionality reduction (DR)
DR: term extraction vs. term selection
Term selection: wrapping method; filtering method with different functions
Chi square
Chi square
An example: is gender a good feature for predicting footwear preference? A: gender, B: footwear preference
Bivariate tabular analysis: Is there a relationship between two random variables A and B in the data? How strong is the relationship? What is the direction of the relationship?
Raw frequencies
         sandal  sneaker  leather shoe  boots  others
male        6       17         13         9      5
female     13        5          7        16      9
Feature: male/female
Classes: {sandal, sneaker, leather shoe, boots, others}
Two distributions
Observed distribution (O):
         Sandal  Sneaker  Leather  Boot  Others  Total
Male        6       17       13      9      5      50
Female     13        5        7     16      9      50
Total      19       22       20     25     14     100
Expected distribution (E), to be filled in from the totals:
         Sandal  Sneaker  Leather  Boot  Others  Total
Male        ?        ?        ?      ?      ?     50
Female      ?        ?        ?      ?      ?     50
Total      19       22       20     25     14    100
Two distributions
Observed distribution (O):
         Sandal  Sneaker  Leather  Boot  Others  Total
Male        6       17       13      9      5      50
Female     13        5        7     16      9      50
Total      19       22       20     25     14     100
Expected distribution (E):
         Sandal  Sneaker  Leather  Boot  Others  Total
Male      9.5       11       10   12.5      7      50
Female    9.5       11       10   12.5      7      50
Total      19       22       20     25     14     100
Chi square
Expected value = row total * column total / table total
χ² = Σ_ij (O_ij − E_ij)² / E_ij
χ² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.026
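The computation above can be sketched directly from the observed table (the function name is illustrative):

```python
def chi_square(observed):
    """chi^2 = sum_ij (O_ij - E_ij)^2 / E_ij, with E_ij computed
    from the row and column totals under the independence assumption."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected value = row total * column total / table total
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Footwear table: rows = male/female, columns = sandal ... others
print(round(chi_square([[6, 17, 13, 9, 5], [13, 5, 7, 16, 9]]), 3))  # 14.027
```

(The slides round the running sum and report 14.026.)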
Calculating χ²
Fill out a contingency table of the observed values O
Compute the row totals and column totals
Calculate the expected value E for each cell, assuming no association
Compute chi square: χ² = Σ (O − E)² / E
When r=2 and c=2
[O and E are the 2×2 observed and expected matrices shown on the slide]
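The matrices on the slide did not survive the transcript, but for the 2×2 case the general formula reduces to a standard closed form, sketched here:

```python
def chi_square_2x2(a, b, c, d):
    """chi^2 for a 2x2 table [[a, b], [c, d]] via the closed form:
    N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d)), N = a+b+c+d.
    No need to compute the expected matrix E explicitly."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(chi_square_2x2(10, 20, 30, 40))  # ≈ 0.794
```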
χ² test
Requirements
The events are assumed to be independent and have the same distribution.
The outcomes of each event must be mutually exclusive.
At least 5 observations per cell.
Collect raw frequencies, not percentages.
Degree of freedom
Degrees of freedom: df = (r − 1)(c − 1), where r = # of rows and c = # of columns
In this example: df = (2 − 1)(5 − 1) = 4
χ² distribution table
df     p=0.10    0.05   0.025    0.01   0.001
1       2.706   3.841   5.024   6.635  10.828
2       4.605   5.991   7.378   9.210  13.816
3       6.251   7.815   9.348  11.345  16.266
4       7.779   9.488  11.143  13.277  18.467
5       9.236  11.070  12.833  15.086  20.515
6      10.645  12.592  14.449  16.812  22.458
With df = 4 and 14.026 > 13.277, p < 0.01.
χ² to P Calculator
http://faculty.vassar.edu/lowry/tabs.html#csq
Steps of χ² test
Select significance level p_0
Calculate χ²
Compute the degrees of freedom: df = (r − 1)(c − 1)
Calculate p given the χ² value (or get the χ²_0 for p_0)
If p < p_0 (or if χ² > χ²_0), then reject the null hypothesis.
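Using the critical values χ²_0 from the distribution table, the decision step can be sketched as (table hardcoded for df = 4 only, for illustration):

```python
# Critical values chi2_0 for selected p_0, from the chi^2 distribution table.
CRITICAL = {  # df -> {p_0: chi2_0}
    4: {0.10: 7.779, 0.05: 9.488, 0.025: 11.143, 0.01: 13.277, 0.001: 18.467},
}

def chi_square_test(chi2, df, p0):
    """Reject the null hypothesis iff chi2 > chi2_0 at significance p0."""
    return chi2 > CRITICAL[df][p0]

# Footwear example: chi2 = 14.026, df = (2-1)*(5-1) = 4
print(chi_square_test(14.026, 4, 0.01))   # True  -> reject at p_0 = 0.01
print(chi_square_test(14.026, 4, 0.001))  # False -> cannot reject at 0.001
```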
Summary of χ² test
A very common method for significance testing
Many good tutorials online. Ex: http://en.wikipedia.org/wiki/Chi-square_distribution
Hw4
Hw4
Q1-Q3: kNN
Q4: chi-square for feature selection
Q5-Q6: The effect of feature selection on kNN
Q7: Conclusion
Q1-Q3: kNN
The choice of k
The choice of similarity function: Euclidean distance (choose the smallest ones) or cosine (choose the largest ones)
Binary vs. real-valued features
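The two choices above can be sketched in one small kNN classifier (the names and toy data are illustrative, not the HW4 format):

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def knn(train, query, k, use_cosine=False):
    """Majority vote over the k nearest neighbors.
    Euclidean: keep the smallest distances; cosine: the largest similarities."""
    if use_cosine:
        ranked = sorted(train, key=lambda p: cosine(p[0], query), reverse=True)
    else:
        ranked = sorted(train, key=lambda p: euclidean(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [((1, 0), "A"), ((2, 0), "A"), ((0, 3), "B"), ((0, 4), "B")]
print(knn(train, (1.5, 0.1), k=3))  # A
```

Binary vs. real-valued features only changes the vectors fed in; the classifier is unchanged.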
Q4-Q6
Rank features by chi-square scores
Remove non-relevant features from the vector files
Run kNN using the newly processed data
Compare the results with and without feature selection
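The rank-and-filter steps above can be sketched as follows (the data layout is illustrative; HW4 specifies its own vector-file format):

```python
def select_features(scores, n):
    """Rank features by chi-square score and keep the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])

def filter_vectors(vectors, keep):
    """Drop non-selected features from each feature:value vector."""
    return [{f: v for f, v in vec.items() if f in keep} for vec in vectors]

scores = {"good": 9.3, "the": 0.1, "bad": 8.7, "a": 0.2}  # chi-square scores
keep = select_features(scores, 2)
vecs = filter_vectors([{"good": 2, "the": 5, "a": 1}], keep)
print(keep, vecs)
```

The filtered vectors are then fed to the same kNN classifier as before for the with/without comparison.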