
    Feature selection

    LING 572

    Fei Xia

    Week 4: 1/25/2011


    Creating attribute-value table

    Choose features: Define feature templates

    Instantiate the feature templates

    Dimensionality reduction: feature selection

    Feature weighting:
      The weight for f_k: a whole column
      The weight for f_k in d_i: a cell

    (Attribute-value table: rows x1, x2, ...; columns f1, f2, ..., fK and the label y)


    An example: text classification task

    Define feature templates. Ex: one template only: word

    Instantiate the feature templates. Ex: all the words that appear in the training (and test) data

    Dimensionality reduction: feature selection. Ex: remove stop words

    Feature weighting. Ex: feature value = term frequency (tf) or tf-idf (see the sketch below)
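    A minimal sketch of this pipeline; the toy documents, the stop-word list, and the helper name are illustrative, not from the lecture:

      from collections import Counter

      # Toy training data: (document text, class label).
      train_docs = [
          ("the movie was great great fun", "pos"),
          ("the plot was dull and the acting was poor", "neg"),
      ]

      stop_words = {"the", "was", "and"}          # feature selection: remove stop words

      def doc_to_features(text):
          """Instantiate the 'word' feature template: one feature per word,
          with term frequency (tf) as the feature value."""
          words = [w for w in text.split() if w not in stop_words]
          return Counter(words)                   # {word: tf}

      # Build the attribute-value table: one row (feature vector + label) per document.
      table = [(doc_to_features(text), label) for text, label in train_docs]
      for features, label in table:
          print(label, dict(features))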


    Outline

    Dimensionality reduction

    Some scoring functions **

    Chi-square score and Chi-square test

    Hw4

    In this lecture, we will use term and feature interchangeably.


    Dimensionality reduction (DR)

    What is DR? Given a feature set r, create a new set r', such that
    r' is much smaller than r, and the classification performance does not suffer too much.

    Why DR? ML algorithms do not scale well with many features, and DR can reduce overfitting.


    Types of DR

    r is the original feature set; r' is the one after DR.

    Local DR vs. global DR:
      Global DR: the same r' for every category
      Local DR: a different r' for each category

    Term extraction vs. term selection

    Term selection vs. extraction

    Term selection: r' is a subset of r
      Wrapping methods: score terms by training and evaluating classifiers;
      expensive and classifier-dependent
      Filtering methods

    Term extraction: terms in r' are obtained by combinations or transformations of the terms in r.
      Term clustering
      Latent semantic indexing (LSI)


    Term selection by filtering

    Main idea: score terms according to predetermined numerical functions
    that measure the importance of the terms.

    It is fast and classifier-independent.

    Scoring functions: information gain, mutual information, chi-square
    (a generic filtering sketch follows)
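    A minimal sketch of filtering-based term selection, assuming some per-term scoring function has already been defined (for example one of the functions above); the function names and the cutoff are illustrative only:

      def select_terms(vocab, score, keep=1000):
          """Rank every term by its score and keep the top `keep` terms.
          `score` is any per-term importance function (IG, MI, chi-square, ...)."""
          ranked = sorted(vocab, key=score, reverse=True)
          return set(ranked[:keep])

      # Usage sketch: keep the 1000 highest-scoring terms, then drop the rest
      # from every feature vector before training the classifier.
      # selected = select_terms(vocab, chi_square_score, keep=1000)
      # reduced_doc = {t: v for t, v in doc.items() if t in selected}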


    Quick summary so far

    DR: to reduce the number of features
      Local DR vs. global DR
      Term extraction vs. term selection

    Term extraction
      Term clustering
      Latent semantic indexing (LSI)

    Term selection
      Wrapping method
      Filtering method: different scoring functions


    Some scoring functions


    Calculating basic distributions


    Term selection functions

    Intuition: for a category c_i, the most valuable terms are those that are
    distributed most differently in the sets of positive and negative examples of c_i.


    Term selection functions


    Information gain

    IG(Y, X): we must transmit Y. How many bits on average would it save us
    if both ends of the line knew X?

    Definition:

    IG(Y, X) = H(Y) - H(Y|X)
    (a small computational sketch follows)
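    A small sketch of this definition, computing H(Y) and H(Y|X) from raw (x, y) co-occurrence counts; the toy data and function names are mine, not from the lecture:

      import math
      from collections import Counter, defaultdict

      def entropy(counts):
          """H over a distribution given as raw counts."""
          total = sum(counts)
          return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

      def information_gain(pairs):
          """IG(Y, X) = H(Y) - H(Y|X), where `pairs` is a list of (x, y) observations."""
          y_counts = Counter(y for _, y in pairs)
          h_y = entropy(y_counts.values())

          by_x = defaultdict(Counter)
          for x, y in pairs:
              by_x[x][y] += 1
          n = len(pairs)
          h_y_given_x = sum(sum(c.values()) / n * entropy(c.values()) for c in by_x.values())
          return h_y - h_y_given_x

      # Toy example: X = whether a term occurs in a document, Y = document class.
      data = [(1, "pos")] * 30 + [(0, "pos")] * 20 + [(1, "neg")] * 5 + [(0, "neg")] * 45
      print(information_gain(data))   # > 0 means knowing X reduces uncertainty about Y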


    Information gain

    Information gain**


    More term selection functions**


    More term selection functions**


    Global DR

    For local DR, calculate f(t_k, c_i) for each category c_i.

    For global DR, calculate one of the following (|C| is the number of classes):


    Which function works the best?

    It depends on the classifier and the data.

    According to (Yang and Pedersen 1997):


    Feature weighting


    Alternative feature values (need more work)

    Binary features: 0 or 1.

    Term frequency (TF): the number of times that t_k appears in d_i.

    Inverse document frequency (IDF): log(|D| / d_k), where d_k is the number of documents that contain t_k.

    TF-IDF = TF * IDF (see the sketch below)

    Normalized TF-IDF:
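    A minimal sketch of these weights, assuming a small in-memory corpus; the documents and variable names are illustrative only:

      import math
      from collections import Counter

      # Toy corpus: list of tokenized documents (illustrative data only).
      docs = [
          "the cat sat on the mat".split(),
          "the dog chased the cat".split(),
          "dogs and cats are pets".split(),
      ]

      # d_k: number of documents containing term t_k
      doc_freq = Counter()
      for doc in docs:
          doc_freq.update(set(doc))

      num_docs = len(docs)

      def tfidf(doc):
          """TF-IDF weight for every term in one document: tf * log(|D| / d_k)."""
          tf = Counter(doc)
          return {t: tf[t] * math.log(num_docs / doc_freq[t]) for t in tf}

      print(tfidf(docs[0]))   # terms unique to this document get the largest weights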


    Feature weights

    Feature weight ∈ {0,1}: same as DR

    Feature weight ∈ R: iterative approach. Ex: MaxEnt

    Feature selection is a special case of feature weighting.


    Summary so far

    Curse of dimensionality → dimensionality reduction (DR)

    DR:
      Term extraction
      Term selection
        Wrapping method
        Filtering method: different scoring functions


    Chi-square


    Chi-square

    An example: is gender a good feature for predicting footwear preference?
      A: gender
      B: footwear preference

    Bivariate tabular analysis:
      Is there a relationship between two random variables A and B in the data?
      How strong is the relationship?
      What is the direction of the relationship?


    Raw frequencies

              sandal   sneaker   leather shoe   boots   others
    male         6        17          13          9        5
    female      13         5           7         16        9

    Feature: male/female

    Classes: {sandal, sneaker, ...}

    Two distributions

    Observed distribution (O):

              Sandal   Sneaker   Leather   Boot   Others   Total
    Male         6        17        13       9       5       50
    Female      13         5         7      16       9       50
    Total       19        22        20      25      14      100

    Expected distribution (E):

              Sandal   Sneaker   Leather   Boot   Others   Total
    Male         ?         ?         ?       ?       ?       50
    Female       ?         ?         ?       ?       ?       50
    Total       19        22        20      25      14      100


    Two distributions

    Observed distribution (O):

              Sandal   Sneaker   Leather   Boot   Others   Total
    Male         6        17        13       9       5       50
    Female      13         5         7      16       9       50
    Total       19        22        20      25      14      100

    Expected distribution (E):

              Sandal   Sneaker   Leather   Boot   Others   Total
    Male        9.5       11        10     12.5      7       50
    Female      9.5       11        10     12.5      7       50
    Total       19        22        20      25      14      100


    Chi-square

    Expected value = row total * column total / table total

    χ² = Σ_ij (O_ij - E_ij)² / E_ij

    χ² = (6 - 9.5)²/9.5 + (17 - 11)²/11 + ... = 14.026


    Calculating χ²

    Fill out a contingency table of the observed values O

    Compute the row totals and column totals

    Calculate the expected value E for each cell, assuming no association

    Compute chi-square: Σ (O - E)² / E
    (a sketch of these steps follows)
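    A minimal sketch of these four steps on the footwear example above (plain Python, no libraries; the variable names are my own):

      # Observed counts from the example: rows = gender, columns = footwear type.
      observed = [
          [6, 17, 13, 9, 5],     # male
          [13, 5, 7, 16, 9],     # female
      ]

      # Row totals, column totals, and the table total.
      row_totals = [sum(row) for row in observed]
      col_totals = [sum(col) for col in zip(*observed)]
      total = sum(row_totals)

      # Expected value for each cell under "no association":
      # row total * column total / table total.
      expected = [[r * c / total for c in col_totals] for r in row_totals]

      # Chi-square statistic: sum of (O - E)^2 / E over all cells.
      chi_square = sum(
          (o - e) ** 2 / e
          for o_row, e_row in zip(observed, expected)
          for o, e in zip(o_row, e_row)
      )
      print(round(chi_square, 3))   # 14.027 -- matches the slide's 14.026 up to rounding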


    When r=2 and c=2

    O =

    E =


    χ² test


    Requirements

    The events are assumed to be independent and have the same distribution.

    The outcomes of each event must be mutually exclusive.

    At least 5 observations per cell.

    Collect raw frequencies, not percentages.


    Degree of freedom

    df = (r - 1)(c - 1), where r = # of rows and c = # of columns

    In this example: df = (2-1)(5-1) = 4


    χ² distribution table

    df \ p     0.10     0.05     0.025    0.01     0.001
     1         2.706    3.841    5.024    6.635   10.828
     2         4.605    5.991    7.378    9.210   13.816
     3         6.251    7.815    9.348   11.345   16.266
     4         7.779    9.488   11.143   13.277   18.467
     5         9.236   11.070   12.833   15.086   20.515
     6        10.645   12.592   14.449   16.812   22.458

    df = 4 and 14.026 > 13.277, so p < 0.01


    χ² to p calculator

    http://faculty.vassar.edu/lowry/tabs.html#csq


    Steps of the χ² test

    Select a significance level p_0

    Calculate χ²

    Compute the degrees of freedom: df = (r-1)(c-1)

    Calculate p given the χ² value (or get the critical value χ²_0 for p_0)

    If p < p_0 (or if χ² > χ²_0), then reject the null hypothesis.
    (a sketch of these steps follows)
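    A minimal sketch of these steps using scipy (assumed to be available); the numbers come from the footwear example above:

      from scipy.stats import chi2

      p0 = 0.01                     # chosen significance level
      chi_square = 14.026           # statistic computed from the contingency table
      df = (2 - 1) * (5 - 1)        # (rows - 1) * (columns - 1) = 4

      p = chi2.sf(chi_square, df)        # P(X >= chi_square) under the null hypothesis
      critical = chi2.ppf(1 - p0, df)    # chi-square_0 for the chosen p0 (13.277 for df=4)
      print(f"critical value for p0={p0}, df={df}: {critical:.3f}")

      if p < p0:                    # equivalently: chi_square > critical
          print(f"p = {p:.4f} < {p0}: reject the null hypothesis")
      else:
          print(f"p = {p:.4f} >= {p0}: cannot reject the null hypothesis")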


    Summary of the χ² test

    A very common method for significance testing

    Many good tutorials online. Ex: http://en.wikipedia.org/wiki/Chi-square_distribution


    Hw4


    Hw4

    Q1-Q3: kNN

    Q4: chi-square for feature selection

    Q5-Q6: The effect of feature selection on kNN

    Q7: Conclusion

    Q1-Q3: kNN

    The choice of k

    The choice of similarity function:
      Euclidean distance: choose the neighbors with the smallest distances
      Cosine function: choose the neighbors with the largest similarities

    Binary vs. real-valued features
    (a kNN sketch follows)
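    A minimal kNN sketch with the cosine similarity option; the data structures and helper names are my own, not from the assignment:

      import math
      from collections import Counter

      def cosine(u, v):
          """Cosine similarity between two sparse vectors stored as dicts."""
          dot = sum(u[t] * v.get(t, 0.0) for t in u)
          norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
          return dot / norm if norm else 0.0

      def knn_classify(test_vec, train_data, k=5):
          """train_data: list of (feature_dict, label). Pick the k most similar
          training vectors (largest cosine) and return the majority label."""
          scored = sorted(train_data, key=lambda ex: cosine(test_vec, ex[0]), reverse=True)
          votes = Counter(label for _, label in scored[:k])
          return votes.most_common(1)[0][0]

      # Usage sketch:
      # train = [({"good": 1.0, "fun": 2.0}, "pos"), ({"dull": 1.0}, "neg")]
      # label = knn_classify({"fun": 1.0}, train, k=1)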


    Q4-Q6

    Rank features by chi-square scores

    Remove non-relevant features from the vector files

    Run kNN using the newly processed data

    Compare the results with and without feature selection.
    (a sketch of the ranking-and-filtering step follows)
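    A minimal sketch of the ranking-and-filtering step, computing a per-feature χ² score from a feature/class contingency table and keeping the top features; the data layout and cutoff are illustrative, not the actual vector-file format:

      from collections import Counter, defaultdict

      def chi_square_score(feature, docs):
          """Chi-square between 'feature present/absent' and the class label.
          docs: list of (feature_dict, label)."""
          # Observed counts: rows = feature present / absent, columns = classes.
          table = defaultdict(Counter)
          for vec, label in docs:
              table[feature in vec][label] += 1
          labels = {label for _, label in docs}
          rows = [[table[present][lab] for lab in labels] for present in (True, False)]

          row_tot = [sum(r) for r in rows]
          col_tot = [sum(c) for c in zip(*rows)]
          total = sum(row_tot)
          score = 0.0
          for i, r in enumerate(rows):
              for j, o in enumerate(r):
                  e = row_tot[i] * col_tot[j] / total
                  if e > 0:
                      score += (o - e) ** 2 / e
          return score

      def filter_vectors(docs, keep=1000):
          """Rank all features by chi-square, keep the top `keep`, drop the rest."""
          vocab = {f for vec, _ in docs for f in vec}
          selected = set(sorted(vocab, key=lambda f: chi_square_score(f, docs), reverse=True)[:keep])
          return [({f: v for f, v in vec.items() if f in selected}, label) for vec, label in docs]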
