Feature Selection
Advanced Statistical Methods in NLP
Ling 572, January 24, 2012
Roadmap
Feature representations:
  Features in attribute-value matrices
  Motivation: text classification
Managing features:
  General approaches
  Feature selection techniques
  Feature scoring measures
  Alternative feature weighting
  Chi-squared feature selection
Representing Input: Attribute-Value Matrix

             f1          f2          ...     fm
             Currency    Country             Date     Label
x1 = Doc1    1           1           0.3     0        Spam
x2 = Doc2    1           1           1.75    1        Spam
...
xn = Doc4    0           0           0       2        NotSpam

Choosing features:
• Define features, e.g. with feature templates
• Instantiate features
• Perform dimensionality reduction
Weighting features: increase/decrease feature importance
• Global feature weighting: weight a whole column
• Local feature weighting: weight an individual cell, conditionally
Feature Selection Example
Task: Text classification
Feature template definition: Word – just one template
Feature instantiation: Words from training (and test?) data
Feature selection: Stopword removal: remove top K (~100) highest-frequency words
  Words like: the, a, have, is, to, for, ...
Feature weighting: Apply tf*idf feature weighting
  tf = term frequency; idf = inverse document frequency
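As a concrete illustration of the stopword-removal step above, here is a minimal Python sketch; the toy corpus, the helper name remove_top_k_stopwords, and K=2 are invented for the example:

```python
from collections import Counter

def remove_top_k_stopwords(tokenized_docs, k=100):
    """Drop the k highest-frequency words from every document."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    stopwords = {w for w, _ in counts.most_common(k)}
    return [[w for w in doc if w not in stopwords] for doc in tokenized_docs]

# Toy usage: with k=2, the two most frequent words ('the', 'is') are filtered out.
docs = [["the", "cash", "offer", "is", "the", "best"],
        ["the", "meeting", "is", "at", "noon"]]
print(remove_top_k_stopwords(docs, k=2))
```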
The Curse of Dimensionality
Think of the instances as vectors of features: # of features = # of dimensions
Number of features potentially enormous: # of words in a corpus continues to increase with corpus size
High dimensionality is problematic:
  Leads to data sparseness
    Hard to create a valid model
    Hard to predict and generalize – think kNN
  Leads to high computational cost
  Leads to difficulty with estimation/learning
    More dimensions → more samples needed to learn the model
Breaking the Curse
Dimensionality reduction:
  Produce a representation with fewer dimensions, but with comparable performance
  More formally, given an original feature set r, create a new set r' with |r'| < |r| and comparable performance
Functionally, many ML algorithms do not scale well:
  Expensive: training cost, testing cost
  Poor prediction: overfitting, sparseness
Dimensionality Reduction
Given an initial feature set r, create a feature set r' s.t. |r'| < |r|
Approaches:
  r': same for all classes (aka global), vs. r': different for each class (aka local)
  Feature selection/filtering, vs. feature mapping (aka extraction)
Feature Selection
Feature selection: r' is a subset of r
How can we pick features?
  Extrinsic 'wrapper' approaches:
    For each subset of features: build and evaluate a classifier for some task
    Pick the subset of features with the best performance
  Intrinsic 'filtering' methods:
    Use some intrinsic (statistical?) measure
    Pick features with the highest scores
Feature Selection
Wrapper approach:
  Pros: Easy to understand and implement; clear relationship between selected features and task performance
  Cons: Computationally intractable: 2^|r| subsets * (training + testing); specific to task and classifier; ad hoc
Filtering approach:
  Pros: Theoretical basis; less task- and classifier-specific
  Cons: Doesn't always boost task performance
Feature Mapping
Feature mapping (extraction) approaches:
  Features in r' represent combinations/transformations of features in r
  Example: many words are near-synonyms, but are treated as unrelated
    Map them to a new concept representing all of them:
    big, large, huge, gigantic, enormous → concept of 'bigness'
Examples:
  Term classes: e.g. class-based n-grams, derived from term clusters
  Dimensions in Latent Semantic Analysis (LSA/LSI): result of Singular Value Decomposition (SVD) on the term-document matrix
    Produces the 'closest' rank-r' approximation of the original matrix
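A minimal numpy sketch of the SVD-based mapping just described; the toy term-document matrix and the target rank r' = 2 are made up for illustration:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

r_prime = 2  # number of dimensions to keep

# Singular Value Decomposition: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Map each term into the reduced r'-dimensional space
X_reduced = U[:, :r_prime] * s[:r_prime]

# Closest rank-r' approximation of the original matrix
X_approx = U[:, :r_prime] @ np.diag(s[:r_prime]) @ Vt[:r_prime, :]

print(X_reduced.shape)        # (4, 2): each term now lives in 2 dimensions
print(np.round(X_approx, 2))  # low-rank reconstruction of X
```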
Feature Mapping
Pros:
  Data-driven
  Theoretical basis – guarantees on matrix similarity
  Not bound by the initial feature space
Cons:
  Some ad-hoc factors, e.g. # of dimensions
  Resulting feature space can be hard to interpret
Feature Filtering
Filtering approaches:
  Applying some scoring method to features to rank their informativeness or importance w.r.t. some class
  Fairly fast and classifier-independent
  Many different measures: mutual information, information gain, chi-squared, etc.
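A minimal sketch of the filtering recipe, assuming features have already been scored by one of these measures; the function name, scores, and k are placeholders:

```python
def filter_features(feature_scores, k):
    """Keep the k features with the highest scores.

    feature_scores: dict mapping feature -> score from some measure
    (mutual information, information gain, chi-squared, ...).
    """
    ranked = sorted(feature_scores, key=feature_scores.get, reverse=True)
    return set(ranked[:k])

# Toy usage with made-up scores
scores = {"cash": 8.2, "the": 0.01, "offer": 12.5, "meeting": 3.4}
print(filter_features(scores, k=2))   # {'offer', 'cash'}
```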
Feature Scoring Measures
Basic Notation, Distributions
Assume a binary representation of terms and classes
  tk: term in T; ci: class in C
  P(tk): proportion of documents in which tk appears
  P(ci): proportion of documents of class ci
  Binary, so we have P(!tk) = 1 - P(tk) and P(!ci) = 1 - P(ci)
Setting Up

         !ci    ci
  !tk    a      b
  tk     c      d
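Reading the table in terms of the distributions defined above (N = a + b + c + d; these are the usual maximum-likelihood estimates from document counts, stated here for concreteness):

  N = a + b + c + d
  P(tk)      ≈ (c + d) / N
  P(ci)      ≈ (b + d) / N
  P(tk, ci)  ≈ d / N
  P(tk | ci) ≈ d / (b + d)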
Feature Selection Functions
Question: What makes a good feature?
Perspective: the best features are those that are most DIFFERENTLY distributed across classes
I.e. the best features are those that most effectively differentiate between classes
Term Selection Functions: DF
Document frequency (DF): number of documents in which tk appears
Applying DF: remove terms with DF below some threshold
Intuition: very rare terms won't help with categorization, or are not useful globally
Pros: easy to implement, scalable
Cons: ad hoc; low-DF terms can be 'topical'
Term Selection Functions: MI
Pointwise Mutual Information (MI)
  MI = 0 if t and c are independent
  Issue: can be heavily influenced by the marginals; problematic when comparing terms of differing frequencies
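For reference, the standard pointwise MI between a term and a class, written with the contingency-table counts from the Setting Up slide:

  MI(tk, ci) = log [ P(tk, ci) / (P(tk) P(ci)) ] ≈ log [ (d * N) / ((c + d)(b + d)) ]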
Term Selection Functions: IG
Information Gain
Intuition: transmitting Y, how many bits can we save if we know X?
  IG(Y, Xi) = H(Y) - H(Y | Xi)
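Spelled out with the usual entropy definitions (the textbook expansion, given here for reference; the actual derivation is on the following slide):

  H(Y)     = - Σ_y P(y) log P(y)
  H(Y | X) = - Σ_x P(x) Σ_y P(y | x) log P(y | x)
  IG(Y, X) = H(Y) - H(Y | X)

For term selection, Y is the class label and X indicates the presence/absence of term tk.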
Information Gain: Derivation
From F. Xia, ‘11
More Feature Selection
GSS coefficient:
NGL coefficient (N: # of docs):
Chi-square:
From F. Xia, '11
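These measures as they are commonly defined in the text-categorization literature, in terms of the 2x2 table counts a, b, c, d with N = a + b + c + d (standard reference forms, not necessarily the exact notation on the original slide):

  GSS(tk, ci) = P(tk, ci) P(!tk, !ci) - P(tk, !ci) P(!tk, ci) = (ad - bc) / N^2
  NGL(tk, ci) = sqrt(N) (ad - bc) / sqrt((a + b)(c + d)(a + c)(b + d))
  X2(tk, ci)  = N (ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d))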
More Term Selection
Relevancy score:
Odds Ratio:
From F. Xia, '11
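For reference, the odds ratio in its common form, again in table counts:

  OR(tk, ci) = [ P(tk | ci) (1 - P(tk | !ci)) ] / [ (1 - P(tk | ci)) P(tk | !ci) ] = (a d) / (b c)

In practice the log of this ratio is often used as the feature score.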
Global Selection
Previous measures compute class-specific selection
What if you want to filter across ALL classes?
  Compute an aggregate measure across classes:
  Sum:
  Average:
  Max:
From F. Xia, '11
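Written out, for any per-class score f(tk, ci) over classes c1 ... c|C| (these are the standard aggregation forms):

  f_sum(tk) = Σ_i f(tk, ci)
  f_avg(tk) = Σ_i P(ci) f(tk, ci)
  f_max(tk) = max_i f(tk, ci)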
What's the best?
Answer: it depends on...
  Classifiers
  Type of data
  ...
According to (Yang and Pedersen, 1997), on text classification tasks using kNN:
  {OR, NGL, GSS} > {X2_max, IG_sum} > {#_avg} >> {MI}
From F. Xia, ‘11
Feature Weighting
For text classification, typical weights include:
  Binary: weights in {0, 1}
  Term frequency (tf): # of occurrences of tk in document di
  Inverse document frequency (idf): dfk = # of docs in which tk appears; N = # of docs
    idf = log(N / (1 + dfk))
  tfidf = tf * idf
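A minimal Python sketch of these weights, using the idf variant given above (log(N / (1 + dfk))); the toy corpus and function name are invented for illustration:

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Return, for each document, a dict of term -> tf*idf weight."""
    n_docs = len(tokenized_docs)
    # dfk: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: tf[t] * math.log(n_docs / (1 + df[t])) for t in tf})
    return weighted

# Toy usage
docs = [["cash", "offer", "cash"], ["meeting", "offer"], ["meeting", "noon"]]
print(tfidf_weights(docs)[0])
```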
Chi Square
Tests for the presence/absence of a relation between random variables
  Bivariate analysis: tests 2 random variables
  Can test the strength of the relationship
  (Strictly speaking) doesn't test direction
Chi Square Example
Can gender predict shoe choice?
  A: male/female (features)
  B: shoe choice; classes: {sandal, sneaker, ...}

          sandal   sneaker   leather shoe   boot   other
Male      6        17        13             9      5
Female    13       5         7              16     9

Due to F. Xia
Comparing Distributions
Observed distribution (O):

          sandal   sneaker   leather shoe   boot   other
Male      6        17        13             9      5
Female    13       5         7              16     9

Expected distribution (E):

          sandal   sneaker   leather shoe   boot   other   Total
Male      9.5      11        10             12.5   7       50
Female    9.5      11        10             12.5   7       50
Total     19       22        20             25     14      100

Due to F. Xia
Computing Chi Square
Expected value for a cell = row_total * column_total / table_total
X2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026
Calculating X2
  Tabulate the contingency table of observed values: O
  Compute row and column totals
  Compute the table of expected values E, given the row/column totals, assuming no association
  Compute X2
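These steps, applied to the shoe example with plain numpy (scipy.stats.chi2_contingency computes the same statistic in one call):

```python
import numpy as np

# Observed counts: rows = {Male, Female},
# columns = {sandal, sneaker, leather shoe, boot, other}
O = np.array([[6., 17., 13., 9., 5.],
              [13., 5., 7., 16., 9.]])

row_totals = O.sum(axis=1, keepdims=True)   # [[50.], [50.]]
col_totals = O.sum(axis=0, keepdims=True)   # [[19., 22., 20., 25., 14.]]
N = O.sum()                                 # 100.0

# Expected counts under no association: row_total * column_total / table_total
E = row_totals @ col_totals / N

chi2_stat = ((O - E) ** 2 / E).sum()
print(round(chi2_stat, 2))   # 14.03 (the slide reports 14.026)
```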
For 2x2 Table
O:
          !ci               ci
  !tk     a                 b
  tk      c                 d

E:
          !ci               ci                Total
  !tk     (a+b)(a+c)/N      (a+b)(b+d)/N      a+b
  tk      (c+d)(a+c)/N      (c+d)(b+d)/N      c+d
  Total   a+c               b+d               N
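Plugging these expected values into X2 = Σ (O - E)^2 / E and simplifying gives the 2x2 closed form, the same chi-square expression used for term selection:

  X2(tk, ci) = N (ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d))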
X2 Test
Test whether the random variables are independent
  Null hypothesis: the 2 R.V.s are independent
  Compute the X2 statistic
  Compute the degrees of freedom: df = (# rows - 1)(# cols - 1)
    Shoe example: df = (2-1)(5-1) = 4
  Test the probability of the X2 statistic value against an X2 table
  If the probability is low – below some significance level – we can reject the null hypothesis
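The table lookup can also be done in code; a sketch with scipy, using the shoe-example values:

```python
from scipy.stats import chi2

chi2_stat = 14.026           # X2 statistic from the shoe example
df = (2 - 1) * (5 - 1)       # (# rows - 1)(# cols - 1) = 4

p_value = chi2.sf(chi2_stat, df)   # P(X2 >= chi2_stat) under the null hypothesis
print(round(p_value, 4))           # ~0.0072
print(p_value < 0.05)              # True: reject the null hypothesis of independence
```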
Requirements for X2 Test
  Events assumed independent and drawn from the same distribution
  Outcomes must be mutually exclusive
  Use raw frequencies, not percentages
  Sufficient counts per cell: > 5