Feature Selection
Advanced Statistical Methods in NLP
Ling 572, January 24, 2012
Roadmap
Feature representations:
  Features in attribute-value matrices
  Motivation: text classification
Managing features:
  General approaches
  Feature selection techniques
  Feature scoring measures
  Alternative feature weighting
  Chi-squared feature selection
Representing Input: Attribute-Value Matrix

             f1          f2          ...     fm
             Currency    Country             Date     Label
x1 = Doc1    1           1           0.3     0        Spam
x2 = Doc2    1           1           1.75    1        Spam
...
xn = Doc4    0           0           0       2        NotSpam

Choosing features:
• Define features, e.g. with feature templates
• Instantiate features
• Perform dimensionality reduction
Weighting features: increase/decrease feature importance
• Global feature weighting: weight a whole column
• Local feature weighting: weight an individual cell, conditionally
Feature Selection Example
Task: Text classification
Feature template definition: Word – just one template
Feature instantiation: Words from training (and test?) data
Feature selection: Stopword removal: remove top K (~100) highest-frequency words
  Words like: the, a, have, is, to, for, ...
Feature weighting: Apply tf*idf feature weighting
  tf = term frequency; idf = inverse document frequency
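As a concrete illustration of the stopword-removal step above, here is a minimal Python sketch; the toy corpus, the helper name remove_top_k_stopwords, and K=2 are invented for the example:

```python
from collections import Counter

def remove_top_k_stopwords(tokenized_docs, k=100):
    """Drop the k highest-frequency words from every document."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    stopwords = {w for w, _ in counts.most_common(k)}
    return [[w for w in doc if w not in stopwords] for doc in tokenized_docs]

# Toy usage: with k=2, the two most frequent words ('the', 'is') are filtered out.
docs = [["the", "cash", "offer", "is", "the", "best"],
        ["the", "meeting", "is", "at", "noon"]]
print(remove_top_k_stopwords(docs, k=2))
```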
The Curse of Dimensionality
Think of the instances as vectors of features: # of features = # of dimensions
Number of features potentially enormous: # of words in a corpus continues to increase with corpus size
High dimensionality is problematic:
  Leads to data sparseness
    Hard to create a valid model
    Hard to predict and generalize – think kNN
  Leads to high computational cost
  Leads to difficulty with estimation/learning
    More dimensions → more samples needed to learn the model
Breaking the Curse
Dimensionality reduction:
  Produce a representation with fewer dimensions, but with comparable performance
  More formally, given an original feature set r, create a new set r' with |r'| < |r| and comparable performance
Functionally, many ML algorithms do not scale well:
  Expensive: training cost, testing cost
  Poor prediction: overfitting, sparseness
Dimensionality Reduction
Given an initial feature set r, create a feature set r' s.t. |r'| < |r|
Approaches:
  r': same for all classes (aka global), vs. r': different for each class (aka local)
  Feature selection/filtering, vs. feature mapping (aka extraction)
Feature Selection
Feature selection: r' is a subset of r
How can we pick features?
  Extrinsic 'wrapper' approaches:
    For each subset of features: build and evaluate a classifier for some task
    Pick the subset of features with the best performance
  Intrinsic 'filtering' methods:
    Use some intrinsic (statistical?) measure
    Pick features with the highest scores
Feature Selection
Wrapper approach:
  Pros: Easy to understand and implement; clear relationship between selected features and task performance
  Cons: Computationally intractable: 2^|r| subsets * (training + testing); specific to task and classifier; ad hoc
Filtering approach:
  Pros: Theoretical basis; less task- and classifier-specific
  Cons: Doesn't always boost task performance
Feature Mapping
Feature mapping (extraction) approaches:
  Features in r' represent combinations/transformations of features in r
  Example: many words are near-synonyms, but are treated as unrelated
    Map them to a new concept representing all of them:
    big, large, huge, gigantic, enormous → concept of 'bigness'
Examples:
  Term classes: e.g. class-based n-grams, derived from term clusters
  Dimensions in Latent Semantic Analysis (LSA/LSI): result of Singular Value Decomposition (SVD) on the term-document matrix
    Produces the 'closest' rank-r' approximation of the original matrix
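A minimal numpy sketch of the SVD-based mapping just described; the toy term-document matrix and the target rank r' = 2 are made up for illustration:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

r_prime = 2  # number of dimensions to keep

# Singular Value Decomposition: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Map each term into the reduced r'-dimensional space
X_reduced = U[:, :r_prime] * s[:r_prime]

# Closest rank-r' approximation of the original matrix
X_approx = U[:, :r_prime] @ np.diag(s[:r_prime]) @ Vt[:r_prime, :]

print(X_reduced.shape)        # (4, 2): each term now lives in 2 dimensions
print(np.round(X_approx, 2))  # low-rank reconstruction of X
```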
Feature Mapping
Pros:
  Data-driven
  Theoretical basis – guarantees on matrix similarity
  Not bound by the initial feature space
Cons:
  Some ad-hoc factors, e.g. # of dimensions
  Resulting feature space can be hard to interpret
Feature Filtering
Filtering approaches:
  Applying some scoring method to features to rank their informativeness or importance w.r.t. some class
  Fairly fast and classifier-independent
  Many different measures: mutual information, information gain, chi-squared, etc.
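A minimal sketch of the filtering recipe, assuming features have already been scored by one of these measures; the function name, scores, and k are placeholders:

```python
def filter_features(feature_scores, k):
    """Keep the k features with the highest scores.

    feature_scores: dict mapping feature -> score from some measure
    (mutual information, information gain, chi-squared, ...).
    """
    ranked = sorted(feature_scores, key=feature_scores.get, reverse=True)
    return set(ranked[:k])

# Toy usage with made-up scores
scores = {"cash": 8.2, "the": 0.01, "offer": 12.5, "meeting": 3.4}
print(filter_features(scores, k=2))   # {'offer', 'cash'}
```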
Feature Scoring Measures
Basic Notation, Distributions
Assume a binary representation of terms and classes
  tk: term in T; ci: class in C
  P(tk): proportion of documents in which tk appears
  P(ci): proportion of documents of class ci
  Binary, so we have P(!tk) = 1 - P(tk) and P(!ci) = 1 - P(ci)
Setting Up

         !ci    ci
  !tk    a      b
  tk     c      d
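Reading the table in terms of the distributions defined above (N = a + b + c + d; these are the usual maximum-likelihood estimates from document counts, stated here for concreteness):

  N = a + b + c + d
  P(tk)      ≈ (c + d) / N
  P(ci)      ≈ (b + d) / N
  P(tk, ci)  ≈ d / N
  P(tk | ci) ≈ d / (b + d)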
Feature Selection Functions
Question: What makes a good feature?
Perspective: the best features are those that are most DIFFERENTLY distributed across classes
I.e. the best features are those that most effectively differentiate between classes
Term Selection Functions: DF
Document frequency (DF): number of documents in which tk appears
Applying DF: remove terms with DF below some threshold
Intuition: very rare terms won't help with categorization, or are not useful globally
Pros: easy to implement, scalable
Cons: ad hoc; low-DF terms can be 'topical'
Term Selection Functions: MI
Pointwise Mutual Information (MI)
  MI = 0 if t and c are independent
  Issue: can be heavily influenced by the marginals; problematic when comparing terms of differing frequencies
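For reference, the standard pointwise MI between a term and a class, written with the contingency-table counts from the Setting Up slide:

  MI(tk, ci) = log [ P(tk, ci) / (P(tk) P(ci)) ] ≈ log [ (d * N) / ((c + d)(b + d)) ]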
Term Selection Functions: IG
Information Gain
Intuition: transmitting Y, how many bits can we save if we know X?
  IG(Y, Xi) = H(Y) - H(Y | Xi)
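Spelled out with the usual entropy definitions (the textbook expansion, given here for reference; the actual derivation is on the following slide):

  H(Y)     = - Σ_y P(y) log P(y)
  H(Y | X) = - Σ_x P(x) Σ_y P(y | x) log P(y | x)
  IG(Y, X) = H(Y) - H(Y | X)

For term selection, Y is the class label and X indicates the presence/absence of term tk.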
Information Gain: Derivation
From F. Xia, ‘11
More Feature Selection
GSS coefficient:
NGL coefficient (N: # of docs):
Chi-square:
From F. Xia, '11
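These measures as they are commonly defined in the text-categorization literature, in terms of the 2x2 table counts a, b, c, d with N = a + b + c + d (standard reference forms, not necessarily the exact notation on the original slide):

  GSS(tk, ci) = P(tk, ci) P(!tk, !ci) - P(tk, !ci) P(!tk, ci) = (ad - bc) / N^2
  NGL(tk, ci) = sqrt(N) (ad - bc) / sqrt((a + b)(c + d)(a + c)(b + d))
  X2(tk, ci)  = N (ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d))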
More Term Selection
Relevancy score:
Odds Ratio:
From F. Xia, '11
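For reference, the odds ratio in its common form, again in table counts:

  OR(tk, ci) = [ P(tk | ci) (1 - P(tk | !ci)) ] / [ (1 - P(tk | ci)) P(tk | !ci) ] = (a d) / (b c)

In practice the log of this ratio is often used as the feature score.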
Global Selection
Previous measures compute class-specific selection
What if you want to filter across ALL classes?
  Compute an aggregate measure across classes:
  Sum:
  Average:
  Max:
From F. Xia, '11
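Written out, for any per-class score f(tk, ci) over classes c1 ... c|C| (these are the standard aggregation forms):

  f_sum(tk) = Σ_i f(tk, ci)
  f_avg(tk) = Σ_i P(ci) f(tk, ci)
  f_max(tk) = max_i f(tk, ci)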
What's the best?
Answer: it depends on...
  Classifiers
  Type of data
  ...
According to (Yang and Pedersen, 1997), on text classification tasks using kNN:
  {OR, NGL, GSS} > {X2_max, IG_sum} > {#_avg} >> {MI}
From F. Xia, ‘11
Feature Weighting
For text classification, typical weights include:
  Binary: weights in {0, 1}
  Term frequency (tf): # of occurrences of tk in document di
  Inverse document frequency (idf): dfk = # of docs in which tk appears; N = # of docs
    idf = log(N / (1 + dfk))
  tfidf = tf * idf
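A minimal Python sketch of these weights, using the idf variant given above (log(N / (1 + dfk))); the toy corpus and function name are invented for illustration:

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Return, for each document, a dict of term -> tf*idf weight."""
    n_docs = len(tokenized_docs)
    # dfk: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: tf[t] * math.log(n_docs / (1 + df[t])) for t in tf})
    return weighted

# Toy usage
docs = [["cash", "offer", "cash"], ["meeting", "offer"], ["meeting", "noon"]]
print(tfidf_weights(docs)[0])
```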
Chi Square
Tests for the presence/absence of a relation between random variables
  Bivariate analysis: tests 2 random variables
  Can test the strength of the relationship
  (Strictly speaking) doesn't test direction
Chi Square Example
Can gender predict shoe choice?
  A: male/female (features)
  B: shoe choice; classes: {sandal, sneaker, ...}

          sandal   sneaker   leather shoe   boot   other
Male      6        17        13             9      5
Female    13       5         7              16     9

Due to F. Xia
Comparing Distributions
Observed distribution (O):

          sandal   sneaker   leather shoe   boot   other
Male      6        17        13             9      5
Female    13       5         7              16     9

Expected distribution (E):

          sandal   sneaker   leather shoe   boot   other   Total
Male      9.5      11        10             12.5   7       50
Female    9.5      11        10             12.5   7       50
Total     19       22        20             25     14      100

Due to F. Xia
Computing Chi Square
Expected value for a cell = row_total * column_total / table_total
X2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026
Calculating X2
  Tabulate the contingency table of observed values: O
  Compute row and column totals
  Compute the table of expected values E, given the row/column totals, assuming no association
  Compute X2
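These steps, applied to the shoe example with plain numpy (scipy.stats.chi2_contingency computes the same statistic in one call):

```python
import numpy as np

# Observed counts: rows = {Male, Female},
# columns = {sandal, sneaker, leather shoe, boot, other}
O = np.array([[6., 17., 13., 9., 5.],
              [13., 5., 7., 16., 9.]])

row_totals = O.sum(axis=1, keepdims=True)   # [[50.], [50.]]
col_totals = O.sum(axis=0, keepdims=True)   # [[19., 22., 20., 25., 14.]]
N = O.sum()                                 # 100.0

# Expected counts under no association: row_total * column_total / table_total
E = row_totals @ col_totals / N

chi2_stat = ((O - E) ** 2 / E).sum()
print(round(chi2_stat, 2))   # 14.03 (the slide reports 14.026)
```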
For 2x2 Table
O:
          !ci               ci
  !tk     a                 b
  tk      c                 d

E:
          !ci               ci                Total
  !tk     (a+b)(a+c)/N      (a+b)(b+d)/N      a+b
  tk      (c+d)(a+c)/N      (c+d)(b+d)/N      c+d
  Total   a+c               b+d               N
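Plugging these expected values into X2 = Σ (O - E)^2 / E and simplifying gives the 2x2 closed form, the same chi-square expression used for term selection:

  X2(tk, ci) = N (ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d))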
X2 Test
Test whether the random variables are independent
  Null hypothesis: the 2 R.V.s are independent
  Compute the X2 statistic
  Compute the degrees of freedom: df = (# rows - 1)(# cols - 1)
    Shoe example: df = (2-1)(5-1) = 4
  Test the probability of the X2 statistic value against an X2 table
  If the probability is low – below some significance level – we can reject the null hypothesis
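The table lookup can also be done in code; a sketch with scipy, using the shoe-example values:

```python
from scipy.stats import chi2

chi2_stat = 14.026           # X2 statistic from the shoe example
df = (2 - 1) * (5 - 1)       # (# rows - 1)(# cols - 1) = 4

p_value = chi2.sf(chi2_stat, df)   # P(X2 >= chi2_stat) under the null hypothesis
print(round(p_value, 4))           # ~0.0072
print(p_value < 0.05)              # True: reject the null hypothesis of independence
```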
Requirements for X2 Test
  Events assumed independent and drawn from the same distribution
  Outcomes must be mutually exclusive
  Use raw frequencies, not percentages
  Sufficient counts per cell: > 5