1 25 Feature
Transcript of 1 25 Feature
-
8/7/2019 1 25 Feature
1/46
Feature selection
LING 572
Fei XiaWeek 4: 1/25/2011
Creating attribute-value table
Choose features: Define feature templates
Instantiate the feature templates
Dimensionality reduction: feature selection
Feature weighting: the weight for f_k is a whole column; the weight for f_k in d_i is a cell.
[Attribute-value table: one row per instance x_1, x_2, …; one column per feature f_1, f_2, …, f_K, plus a label column y]
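The table-building steps above can be sketched in Python; the function name and the toy documents below are illustrative, not from the lecture:

```python
from collections import Counter

def build_table(docs, labels):
    """Instantiate the 'word' feature template over the documents and
    build an attribute-value table (one row per document, label last)."""
    # Instantiate the template: every word seen in the data is a feature.
    vocab = sorted({w for doc in docs for w in doc.split()})
    rows = []
    for doc, y in zip(docs, labels):
        counts = Counter(doc.split())
        # One cell per feature f_k; the last column is the label y.
        rows.append([counts.get(f, 0) for f in vocab] + [y])
    return vocab, rows

vocab, rows = build_table(["a good movie", "a bad movie"], ["pos", "neg"])
```

Feature selection would then drop columns from this table, and feature weighting would replace the raw counts in each cell.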
An example: text classification task
Define feature templates. Ex: one template only: word
Instantiate the feature templates. Ex: all the words that appear in the training (and test) data
Dimensionality reduction: feature selection. Ex: remove stop words
Feature weighting. Ex: feature value: term frequency (tf), or tf-idf
Outline
Dimensionality reduction
Some scoring functions **
Chi-square score and Chi-square test
Hw4
In this lecture, we will use term and feature interchangeably.
Dimensionality reduction (DR)
What is DR? Given a feature set r, create a new set r′, s.t.
r′ is much smaller than r, and the classification performance does not suffer too much.
Why DR? ML algorithms do not scale well. DR can reduce overfitting.
Types of DR
r is the original feature set; r′ is the one after DR.
Local DR vs. global DR:
Global DR: r′ is the same for every category
Local DR: a different r′ for each category
Term extraction vs. term selection
Term selection vs. extraction
Term selection: r′ is a subset of r
Wrapping methods: score terms by training and evaluating classifiers;
expensive and classifier-dependent
Filtering methods
Term extraction: terms in r′ are obtained by combinations or transformations of the terms in r. Term clustering; latent semantic indexing (LSI)
Term selection by filtering
Main idea: score terms with predetermined numerical functions that measure the importance of the terms.
It is fast and classifier-independent.
Scoring functions: information gain, mutual information, chi square
Quick summary so far
DR: to reduce the number of features
Local DR vs. global DR
Term extraction vs. term selection
Term extraction: term clustering, latent semantic indexing (LSI)
Term selection: wrapping method; filtering method with different functions
Some scoring functions
Calculating basic distributions
Term selection functions
Intuition: for a category c_i, the most valuable terms are those that are distributed most differently in the sets of positive and negative examples of c_i.
Term selection functions
Information gain
IG(Y, X): We must transmit Y. How many bits on average would it save us if both ends of the line knew X?
Definition:
IG(Y, X) = H(Y) − H(Y|X)
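The definition above can be computed directly from counts; a minimal sketch (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """H(Y) = -sum_y P(y) * log2 P(y), estimated from counts."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(pairs):
    """IG(Y, X) = H(Y) - H(Y|X), where pairs is a list of (x, y)."""
    ys = [y for _, y in pairs]
    n = len(pairs)
    h_y_given_x = 0.0
    for x in {x for x, _ in pairs}:
        sub = [y for xi, y in pairs if xi == x]
        # H(Y|X) = sum_x P(x) * H(Y | X=x)
        h_y_given_x += (len(sub) / n) * entropy(sub)
    return entropy(ys) - h_y_given_x

# X perfectly predicts Y, so knowing X saves the full H(Y) = 1 bit.
print(info_gain([(0, "a"), (0, "a"), (1, "b"), (1, "b")]))  # 1.0
```

When X is independent of Y, H(Y|X) = H(Y) and the gain is 0.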
Information gain
Information gain**
More term selection functions**
More term selection functions**
Global DR
For local DR, calculate f(t_k, c_i).
For global DR, calculate one of the following:
|C| is the number of classes
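The transcript drops the slide's formulas, but the usual combining functions (e.g., in Yang and Pedersen 1997) are the sum, the prior-weighted average, and the max of the per-category scores; assuming those are the ones intended, a sketch:

```python
def f_sum(scores):
    """f_sum(t) = sum_i f(t, c_i) over all |C| classes."""
    return sum(scores.values())

def f_avg(scores, priors):
    """f_avg(t) = sum_i P(c_i) * f(t, c_i)."""
    return sum(priors[c] * s for c, s in scores.items())

def f_max(scores):
    """f_max(t) = max_i f(t, c_i)."""
    return max(scores.values())

local = {"c1": 0.2, "c2": 0.8}    # local scores f(t, c_i) for one term t
priors = {"c1": 0.5, "c2": 0.5}   # class priors P(c_i)
print(f_sum(local), f_avg(local, priors), f_max(local))
```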
Which function works the best?
It depends on the classifier and the data.
According to (Yang and Pedersen 1997):
Feature weighting
Alternative feature values (need more work)
Binary features: 0 or 1.
Term frequency (TF): the number of times that t_k appears in d_i.
Inverse document frequency (IDF): log(|D| / d_k), where d_k is the number of documents that contain t_k.
TFIDF = TF * IDF
Normalized TFIDF:
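The normalized-TFIDF formula itself did not survive the transcript; the sketch below assumes the common L2 (cosine) normalization, and uses the natural log for IDF (the base is a convention):

```python
import math

def tf_idf(docs):
    """Per-document TF-IDF weights, L2-normalized (assumed convention).
    idf(t) = log(|D| / d_t), where d_t = # of docs containing t."""
    n = len(docs)
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) for t in set(doc)}
        w = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Normalized TF-IDF: divide by the L2 norm of the weight vector.
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})
    return weights

w = tf_idf([["a", "b", "a"], ["a", "c"]])
```

Note that a term appearing in every document gets IDF = log(1) = 0 and thus zero weight.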
Feature weights
Feature weight ∈ {0, 1}: same as DR
Feature weight ∈ ℝ: iterative approach. Ex: MaxEnt
Feature selection is a special case of feature weighting.
Summary so far
Curse of dimensionality → dimensionality reduction (DR)
DR: term extraction vs. term selection
Term selection: wrapping method; filtering method with different functions
Chi square
Chi square
An example: is gender a good feature for predicting footwear preference? A: gender, B: footwear preference
Bivariate tabular analysis: Is there a relationship between two random variables A and B in the data? How strong is the relationship? What is the direction of the relationship?
Raw frequencies
         sandal  sneaker  leather shoe  boots  others
male        6       17         13         9      5
female     13        5          7        16      9
Feature: male/female
Classes: {sandal, sneaker, leather shoe, boots, others}
Two distributions
Observed distribution (O):
         Sandal  Sneaker  Leather  Boot  Others  Total
Male        6       17       13      9      5      50
Female     13        5        7     16      9      50
Total      19       22       20     25     14     100
Expected distribution (E), to be filled in from the totals:
         Sandal  Sneaker  Leather  Boot  Others  Total
Male        ?        ?        ?      ?      ?     50
Female      ?        ?        ?      ?      ?     50
Total      19       22       20     25     14    100
Two distributions
Observed distribution (O):
         Sandal  Sneaker  Leather  Boot  Others  Total
Male        6       17       13      9      5      50
Female     13        5        7     16      9      50
Total      19       22       20     25     14     100
Expected distribution (E):
         Sandal  Sneaker  Leather  Boot  Others  Total
Male      9.5       11       10   12.5      7      50
Female    9.5       11       10   12.5      7      50
Total      19       22       20     25     14     100
Chi square
Expected value = row total * column total / table total
χ² = Σ_ij (O_ij − E_ij)² / E_ij
χ² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.026
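The computation above can be sketched directly from the observed table (the function name is illustrative):

```python
def chi_square(observed):
    """chi^2 = sum_ij (O_ij - E_ij)^2 / E_ij, with E_ij computed
    from the row and column totals under the independence assumption."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected value = row total * column total / table total
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Footwear table: rows = male/female, columns = sandal ... others
print(round(chi_square([[6, 17, 13, 9, 5], [13, 5, 7, 16, 9]]), 3))  # 14.027
```

(The slides round the running sum and report 14.026.)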
Calculating χ²
Fill out a contingency table of the observed values O
Compute the row totals and column totals
Calculate the expected value E for each cell, assuming no association
Compute chi square: χ² = Σ (O − E)² / E
When r=2 and c=2
[O and E are the 2×2 observed and expected matrices shown on the slide]
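The matrices on the slide did not survive the transcript, but for the 2×2 case the general formula reduces to a standard closed form, sketched here:

```python
def chi_square_2x2(a, b, c, d):
    """chi^2 for a 2x2 table [[a, b], [c, d]] via the closed form:
    N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d)), N = a+b+c+d.
    No need to compute the expected matrix E explicitly."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(chi_square_2x2(10, 20, 30, 40))  # ≈ 0.794
```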
χ² test
Requirements
The events are assumed to be independent and have the same distribution.
The outcomes of each event must be mutually exclusive.
At least 5 observations per cell.
Collect raw frequencies, not percentages.
Degree of freedom
Degrees of freedom: df = (r − 1)(c − 1), where r = # of rows and c = # of columns
In this example: df = (2 − 1)(5 − 1) = 4
χ² distribution table
df     p=0.10    0.05   0.025    0.01   0.001
1       2.706   3.841   5.024   6.635  10.828
2       4.605   5.991   7.378   9.210  13.816
3       6.251   7.815   9.348  11.345  16.266
4       7.779   9.488  11.143  13.277  18.467
5       9.236  11.070  12.833  15.086  20.515
6      10.645  12.592  14.449  16.812  22.458
With df = 4 and 14.026 > 13.277, p < 0.01.
χ² to P Calculator
http://faculty.vassar.edu/lowry/tabs.html#csq
Steps of χ² test
Select significance level p_0
Calculate χ²
Compute the degrees of freedom: df = (r − 1)(c − 1)
Calculate p given the χ² value (or get the χ²_0 for p_0)
If p < p_0 (or if χ² > χ²_0), then reject the null hypothesis.
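Using the critical values χ²_0 from the distribution table, the decision step can be sketched as (table hardcoded for df = 4 only, for illustration):

```python
# Critical values chi2_0 for selected p_0, from the chi^2 distribution table.
CRITICAL = {  # df -> {p_0: chi2_0}
    4: {0.10: 7.779, 0.05: 9.488, 0.025: 11.143, 0.01: 13.277, 0.001: 18.467},
}

def chi_square_test(chi2, df, p0):
    """Reject the null hypothesis iff chi2 > chi2_0 at significance p0."""
    return chi2 > CRITICAL[df][p0]

# Footwear example: chi2 = 14.026, df = (2-1)*(5-1) = 4
print(chi_square_test(14.026, 4, 0.01))   # True  -> reject at p_0 = 0.01
print(chi_square_test(14.026, 4, 0.001))  # False -> cannot reject at 0.001
```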
Summary of χ² test
A very common method for significance testing
Many good tutorials online. Ex: http://en.wikipedia.org/wiki/Chi-square_distribution
Hw4
Hw4
Q1-Q3: kNN
Q4: chi-square for feature selection
Q5-Q6: The effect of feature selection on kNN
Q7: Conclusion
Q1-Q3: kNN
The choice of k
The choice of similarity function: Euclidean distance (choose the smallest ones) or cosine (choose the largest ones)
Binary vs. real-valued features
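The two choices above can be sketched in one small kNN classifier (the names and toy data are illustrative, not the HW4 format):

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def knn(train, query, k, use_cosine=False):
    """Majority vote over the k nearest neighbors.
    Euclidean: keep the smallest distances; cosine: the largest similarities."""
    if use_cosine:
        ranked = sorted(train, key=lambda p: cosine(p[0], query), reverse=True)
    else:
        ranked = sorted(train, key=lambda p: euclidean(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [((1, 0), "A"), ((2, 0), "A"), ((0, 3), "B"), ((0, 4), "B")]
print(knn(train, (1.5, 0.1), k=3))  # A
```

Binary vs. real-valued features only changes the vectors fed in; the classifier is unchanged.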
Q4-Q6
Rank features by chi-square scores
Remove non-relevant features from the vector files
Run kNN using the newly processed data
Compare the results with and without feature selection
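The rank-and-filter steps above can be sketched as follows (the data layout is illustrative; HW4 specifies its own vector-file format):

```python
def select_features(scores, n):
    """Rank features by chi-square score and keep the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])

def filter_vectors(vectors, keep):
    """Drop non-selected features from each feature:value vector."""
    return [{f: v for f, v in vec.items() if f in keep} for vec in vectors]

scores = {"good": 9.3, "the": 0.1, "bad": 8.7, "a": 0.2}  # chi-square scores
keep = select_features(scores, 2)
vecs = filter_vectors([{"good": 2, "the": 5, "a": 1}], keep)
print(keep, vecs)
```

The filtered vectors are then fed to the same kNN classifier as before for the with/without comparison.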