Transcript of Introduction to Machine Learning (67577), Lecture 13 (shais/Lectures2014/lecture13.pdf, 2014-06-19)
Introduction to Machine Learning (67577)
Lecture 13
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Features
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 1 / 28
Feature Selection

How do we represent real-world objects (e.g. a papaya) as a feature vector?
Even if we already have a feature-vector representation, maybe there is a "better" representation?
What is "better"? It depends on the hypothesis class. Example: a regression problem with
    x1 ∼ U[−1, 1],  y = x1^2,  x2 ∼ U[y − 0.01, y + 0.01]
Which feature is better, x1 or x2?
If the hypothesis class is linear regressors, we should prefer x2. If the hypothesis class is quadratic regressors, we should prefer x1.
No-free-lunch ...
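The effect can be checked with a quick simulation (illustrative, not part of the slides): fit least squares on y using either feature, once with a linear model and once with a quadratic model in x1. The sample size, seed, and the helper `lsq_mse` are assumptions of this sketch.

```python
# Illustrative sketch: which feature is "better" depends on the hypothesis class.
import numpy as np

rng = np.random.default_rng(0)
m = 1000
x1 = rng.uniform(-1, 1, m)
y = x1 ** 2
x2 = rng.uniform(y - 0.01, y + 0.01)   # noisy copy of y

def lsq_mse(A, y):
    """Mean squared error of the least-squares fit of y on the columns of A."""
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ w - y) ** 2)

ones = np.ones(m)
mse_lin_x1 = lsq_mse(np.column_stack([x1, ones]), y)          # linear in x1: poor
mse_lin_x2 = lsq_mse(np.column_stack([x2, ones]), y)          # linear in x2: good
mse_quad_x1 = lsq_mse(np.column_stack([x1**2, x1, ones]), y)  # quadratic in x1: exact

print(mse_lin_x1, mse_lin_x2, mse_quad_x1)
```

With linear regressors, x2 wins by orders of magnitude; with quadratic regressors, x1 recovers y exactly.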
Outline
1 Feature Selection
  Filters
  Greedy selection
  ℓ1 norm
2 Feature Manipulation and Normalization
3 Feature Learning
Feature Selection
X = ℝ^d
We'd like to learn a predictor that relies on only k ≪ d features
Why?
Can reduce the estimation error
Reduces memory and runtime (both at train and test time)
Obtaining features may be costly (e.g. in medical applications)
Feature Selection
Optimal approach: try all subsets of k out of d features and choose the one that leads to the best-performing predictor
Problem: the runtime is d^k ... one can formally prove hardness in many situations
We describe three computationally efficient heuristics (some of them come with certain formal guarantees, but this is beyond our scope)
Filters

Filter method: assess individual features, independently of other features, according to some quality measure, and select the k features with the highest score

Score function: many possible score functions exist, e.g.:

Minimize loss: rank features according to
    − min_{a,b ∈ ℝ} Σ_{i=1}^{m} ℓ(a v_i + b, y_i)

Pearson correlation coefficient (obtained by minimizing the squared loss):
    |⟨v − v̄, y − ȳ⟩| / (‖v − v̄‖ ‖y − ȳ‖)

Spearman's rho: apply Pearson's coefficient to the ranks of v

Mutual information:
    Σ_i p(v_i, y_i) log( p(v_i, y_i) / (p(v_i) p(y_i)) )
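A minimal NumPy sketch of the Pearson filter (the helper names `pearson_scores` and `select_k` and the synthetic data are assumptions, not from the lecture): score every column of X against y and keep the k highest-scoring features.

```python
# Pearson-correlation filter: rank features individually, keep the top k.
import numpy as np

def pearson_scores(X, y):
    """|<v - v_bar, y - y_bar>| / (||v - v_bar|| ||y - y_bar||) for each column v of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = np.abs(Xc.T @ yc)
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return num / den

def select_k(X, y, k):
    """Indices of the k features with the highest Pearson score."""
    scores = pearson_scores(X, y)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)  # feature 2 drives y
print(select_k(X, y, 1))
```

Note that each feature is scored in isolation, which is exactly the weakness discussed next.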
Weakness of Filters

If Pearson's coefficient is zero, then v alone is useless for predicting y
This doesn't mean that v is a bad feature; together with other features it may be very useful
Example:
    y = x1 + 2 x2,  x1 ∼ U[±1],  x2 = (z − x1)/2,  z ∼ U[±1]
Then the Pearson coefficient of x1 is zero (indeed y = z, which is independent of x1), but no function can predict y without x1
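A quick simulation of this exact construction (the sample size and seed are arbitrary) confirms both claims: the empirical Pearson coefficient of x1 with y is near zero, while y = x1 + 2 x2 holds exactly.

```python
# The slide's example: x1 has zero correlation with y, yet y = x1 + 2*x2 exactly.
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x1 = rng.uniform(-1, 1, m)
z = rng.uniform(-1, 1, m)
x2 = (z - x1) / 2
y = x1 + 2 * x2          # equals z, so y is independent of x1

corr = np.corrcoef(x1, y)[0, 1]
print(abs(corr))                         # close to 0: a Pearson filter discards x1
print(np.max(np.abs(x1 + 2 * x2 - y)))   # 0: together with x2, x1 predicts y exactly
```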
Forward Greedy Selection
Start with an empty set of features, I = ∅
At each iteration, go over all i ∉ I and learn a predictor based on I ∪ {i}
Choose the i that led to the best predictor and update I = I ∪ {i}
Example: Orthogonal Matching Pursuit
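The loop above can be sketched generically for squared loss (illustrative; the helper `lsq_loss`, the synthetic data, and the choice of least squares as the learner are assumptions of this sketch):

```python
# Generic forward greedy selection: at each round, try every unused feature,
# refit, and keep the one with the lowest training loss.
import numpy as np

def lsq_loss(X, y, idx):
    """Least-squares training loss using only the feature columns in idx."""
    A = X[:, list(idx)]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((A @ w - y) ** 2)

def forward_select(X, y, k):
    I = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in I]
        j_best = min(rest, key=lambda j: lsq_loss(X, y, I + [j]))  # best single addition
        I.append(j_best)
    return I

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, 1] - 2 * X[:, 4]      # only features 1 and 4 matter
print(forward_select(X, y, 2))
```

This naive version refits from scratch for every candidate; OMP, described next, is the efficient variant for squared loss.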
Orthogonal Matching Pursuit (OMP)

Let X ∈ ℝ^{m,d} be a data matrix (instances in rows), and let y ∈ ℝ^m be the targets vector.

Let X_i denote the i'th column of X, and let X_I be the matrix whose columns are {X_i : i ∈ I}.

At iteration t, we add the feature

    j_t = argmin_j min_{w ∈ ℝ^t} ‖X_{I_{t−1} ∪ {j}} w − y‖²

An efficient implementation: let V_t be a matrix whose columns form an orthonormal basis of the columns of X_{I_t}. Clearly,

    min_w ‖X_{I_t} w − y‖² = min_{θ ∈ ℝ^t} ‖V_t θ − y‖²

Let θ_t be a minimizer of the right-hand side.
Orthogonal Matching Pursuit (OMP)

Given V_{t−1} and θ_{t−1}, write, for every j, X_j = V_{t−1} V_{t−1}^⊤ X_j + u_j, where u_j is orthogonal to the columns of V_{t−1}. Then:

    min_{θ,α} ‖V_{t−1} θ + α u_j − y‖²
    = min_{θ,α} [ ‖V_{t−1} θ − y‖² + α² ‖u_j‖² + 2α ⟨u_j, V_{t−1} θ − y⟩ ]
    = min_{θ,α} [ ‖V_{t−1} θ − y‖² + α² ‖u_j‖² + 2α ⟨u_j, −y⟩ ]      (since ⟨u_j, V_{t−1} θ⟩ = 0)
    = min_θ ‖V_{t−1} θ − y‖² + min_α [ α² ‖u_j‖² − 2α ⟨u_j, y⟩ ]
    = ‖V_{t−1} θ_{t−1} − y‖² + min_α [ α² ‖u_j‖² − 2α ⟨u_j, y⟩ ]
    = ‖V_{t−1} θ_{t−1} − y‖² − ⟨u_j, y⟩² / ‖u_j‖²

It follows that we should select the feature j_t = argmax_j ⟨u_j, y⟩² / ‖u_j‖².
Orthogonal Matching Pursuit (OMP)

input: data matrix X ∈ ℝ^{m,d}, labels vector y ∈ ℝ^m, budget of features T
initialize: I_1 = ∅
for t = 1, …, T:
    use SVD to find an orthonormal basis V ∈ ℝ^{m,t−1} of X_{I_t}
    (for t = 1, set V to be the all-zeros matrix)
    for each j ∈ [d] \ I_t, let u_j = X_j − V V^⊤ X_j
    let j_t = argmax_{j ∉ I_t : ‖u_j‖ > 0} ⟨u_j, y⟩² / ‖u_j‖²
    update I_{t+1} = I_t ∪ {j_t}
output I_{T+1}
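The pseudocode above translates almost line by line into NumPy. One stated implementation choice in this sketch: the orthonormal basis is computed with a reduced QR factorization rather than the SVD, which spans the same column space; the test data is an assumption.

```python
# OMP sketch: greedily add the feature whose residual component best explains y.
import numpy as np

def omp(X, y, T):
    d = X.shape[1]
    I = []
    for t in range(T):
        if I:
            V, _ = np.linalg.qr(X[:, I])   # orthonormal basis of X_I (QR, not SVD)
            U = X - V @ (V.T @ X)          # u_j: component of each column orthogonal to V
        else:
            U = X.copy()                   # for t = 1, V is the zero matrix
        norms = np.linalg.norm(U, axis=0)
        scores = np.full(d, -np.inf)
        ok = norms > 1e-12                 # skip features already spanned by X_I
        scores[ok] = (U[:, ok].T @ y) ** 2 / norms[ok] ** 2
        scores[I] = -np.inf                # never re-select a chosen feature
        I.append(int(np.argmax(scores)))
    return I

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = 2 * X[:, 3] + X[:, 6]
print(omp(X, y, 2))
```

Since u_j ⊥ V, the score ⟨u_j, y⟩²/‖u_j‖² equals ⟨u_j, residual⟩²/‖u_j‖², matching the derivation on the previous slide.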
Gradient-based Greedy Selection

Let R(w) be the empirical risk as a function of w

For the squared loss, R(w) = (1/m) ‖Xw − y‖², we can easily solve the problem

    argmin_j min_{w : supp(w) = I ∪ {j}} R(w)

For a general R, this may be expensive. An approximation is to optimize only over the weight of the new feature:

    argmin_j min_{η ∈ ℝ} R(w + η e_j)

An even simpler approach is to choose the feature that minimizes the above for an infinitesimal η, i.e., the one with the largest gradient magnitude:

    argmax_j |∇_j R(w)|
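For the squared loss the gradient is available in closed form, ∇R(w) = (2/m) X^⊤ (Xw − y), so a single selection step can be sketched as follows (the data and the helper name `grad_greedy_step` are illustrative). The rule picks the coordinate with the largest |∇_j R(w)|, since for an infinitesimal η that coordinate yields the steepest decrease in R.

```python
# One gradient-based selection step for squared loss R(w) = (1/m) ||Xw - y||^2.
import numpy as np

def grad_greedy_step(X, y, w, I):
    """Return the unused feature index with the largest gradient magnitude at w."""
    m = X.shape[0]
    grad = (2.0 / m) * X.T @ (X @ w - y)
    candidates = [j for j in range(X.shape[1]) if j not in I]
    return max(candidates, key=lambda j: abs(grad[j]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0]          # only feature 0 matters
w = np.zeros(5)          # start from the empty support
print(grad_greedy_step(X, y, w, set()))
```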
AdaBoost as Forward Greedy Selection

It is possible to show (left as an exercise) that the AdaBoost algorithm is in fact forward greedy selection for the objective function

  R(w) = log( Σ_{i=1}^m exp( −y_i Σ_{j=1}^d w_j h_j(x_i) ) ).
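The objective above is easy to evaluate numerically. A small sketch, assuming the hypothesis outputs h_j(x_i) are collected in a matrix H (the function name and the log-sum-exp stabilization are my additions):

```python
import numpy as np

def exp_loss_objective(H, y, w):
    """R(w) = log( sum_i exp( -y_i * sum_j w_j * h_j(x_i) ) ).

    H is an (m, d) matrix with H[i, j] = h_j(x_i); y has entries in {-1, +1}.
    Computed with the log-sum-exp trick for numerical stability.
    """
    z = -y * (H @ w)              # z_i = -y_i * <w, h(x_i)>
    zmax = z.max()
    return float(zmax + np.log(np.exp(z - zmax).sum()))

H = np.array([[ 1., -1.],
              [ 1.,  1.],
              [-1.,  1.],
              [-1., -1.]])       # H[i, j] = h_j(x_i)
y = np.array([1., 1., -1., -1.])
```

At w = 0 all margins vanish, so R(0) = log m.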
Outline
1. Feature Selection
   - Filters
   - Greedy selection
   - ℓ1 norm
2. Feature Manipulation and Normalization
3. Feature Learning
Sparsity Inducing Norms

Minimizing the empirical risk subject to a budget of k features can be written as:

  min_w  L_S(w)  s.t.  ‖w‖_0 ≤ k ,

Replace the non-convex constraint ‖w‖_0 ≤ k with a convex constraint ‖w‖_1 ≤ k_1.

Why ℓ1?
- "Closest" convex surrogate
- If ‖w‖_1 is small, we can construct w with ‖w‖_0 small and a similar value of L_S
- Often, ℓ1 "induces" sparse solutions
ℓ1 regularization

Instead of constraining ‖w‖_1, we can regularize:

  min_w ( L_S(w) + λ‖w‖_1 )

For the squared loss this is the Lasso method.

The ℓ1 norm often induces sparse solutions. Example:

  min_{w ∈ R} ( (1/2)w² − xw + λ|w| ).

It is easy to verify that the solution is "soft thresholding":

  w = sign(x) [ |x| − λ ]_+

Sparsity: w = 0 unless |x| > λ
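The scalar problem above has the closed-form soft-thresholding solution, which is easy to check in code (a minimal sketch; the function name is mine):

```python
import numpy as np

def soft_threshold(x, lam):
    """Minimizer of (1/2) w^2 - x*w + lam*|w|: w = sign(x) * [|x| - lam]_+ ."""
    return float(np.sign(x) * max(abs(x) - lam, 0.0))
```

For x = 3, λ = 1 the unconstrained minimizer 3 is shrunk to 2; for |x| ≤ λ the solution is exactly 0.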
ℓ1 regularization

One-dimensional Lasso:

  argmin_{w ∈ R} ( (1/2m) Σ_{i=1}^m (x_i w − y_i)² + λ|w| ).

Rewrite:

  argmin_{w ∈ R} ( (1/2) ( (1/m) Σ_i x_i² ) w² − ( (1/m) Σ_{i=1}^m x_i y_i ) w + λ|w| ).

Assume (1/m) Σ_i x_i² = 1, and denote ⟨x, y⟩ = Σ_{i=1}^m x_i y_i; then the optimal solution is

  w = sign(⟨x, y⟩) [ |⟨x, y⟩|/m − λ ]_+ .

Sparsity: w = 0 unless the correlation between x and y is larger than λ.

Exercise: Show that the ℓ2 norm doesn't induce a sparse solution for this case.
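Under the normalization (1/m) Σ_i x_i² = 1, the closed form above is a one-liner; this sketch (function name and example data are my own) illustrates the correlation threshold:

```python
import numpy as np

def lasso_1d(x, y, lam):
    """Closed-form 1-dim Lasso, assuming (1/m) * sum_i x_i^2 = 1:
    w = sign(<x, y>) * [ |<x, y>|/m - lam ]_+ ."""
    m = len(x)
    c = float(x @ y)
    return float(np.sign(c) * max(abs(c) / m - lam, 0.0))

# Example: (1/m) sum_i x_i^2 = 1 holds, and the correlation <x, y>/m = 1.
x = np.array([1., -1., 1., 1.])
y = np.array([2., 0., 1., 1.])
```

With λ = 0.25 the solution is 0.75; with λ = 2 the correlation falls below the threshold and w = 0.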
Outline
1. Feature Selection
   - Filters
   - Greedy selection
   - ℓ1 norm
2. Feature Manipulation and Normalization
3. Feature Learning
Feature Manipulation and Normalization
Simple transformations that we apply to each of our original features.

They may decrease the approximation or estimation errors of our hypothesis class, or can yield a faster algorithm.

As in feature selection, there are no absolute "good" and "bad" transformations; we need prior knowledge.
Example: The effect of Normalization

Consider a 2-dim ridge regression problem:

  argmin_w [ (1/m)‖Xw − y‖² + λ‖w‖² ] = (λmI + X⊤X)^{−1} X⊤y .

Suppose: y ∼ U({±1}), α ∼ U({±1}), x₁ = y + α/2, x₂ = 0.0001 y

The best weight vector is w⋆ = [0 ; 10000], and L_D(w⋆) = 0.

However, the objective of ridge regression at w⋆ is λ·10⁸, while the objective at w = [1 ; 0] is likely to be close to 0.25 + λ ⇒ we will choose the wrong solution unless λ is very small.

Crux of the problem: the features have completely different scales, while ℓ2 regularization treats them equally.

Simple solution: normalize the features to have the same range (dividing by the max, or by the standard deviation).
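The scale effect above can be reproduced numerically. A sketch under the slide's setup (the closed-form solver, the sample size, and λ = 0.1 are my choices, not the lecture's):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/m)||Xw - y||^2 + lam*||w||^2 via the normal equations."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

# Slide's setup: y, alpha uniform on {-1, +1}; x1 = y + alpha/2; x2 = 1e-4 * y.
rng = np.random.default_rng(0)
m = 1000
y = rng.choice([-1.0, 1.0], size=m)
alpha = rng.choice([-1.0, 1.0], size=m)
X = np.column_stack([y + alpha / 2, 1e-4 * y])

# Without normalization, regularization pushes all weight onto the noisy x1.
w_raw = ridge_closed_form(X, y, lam=0.1)

# After rescaling each feature to unit standard deviation, the perfect
# predictor x2 gets the dominant weight.
Xn = X / X.std(axis=0)
w_norm = ridge_closed_form(Xn, y, lam=0.1)
```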
Example: The effect of Transformations

Consider a 1-dim regression problem, y ∼ U({±1}), a ≫ 1, and

  x = { y    w.p. 1 − 1/a
      { ay   w.p. 1/a

It is easy to show that w⋆ = (2a − 1)/(a² + a − 1), so w⋆ → 0 as a → ∞.

It follows that L_D(w⋆) → 0.5.

But if we apply "clipping", x ↦ sign(x) min{1, |x|}, then L_D(1) = 0.

"Prior knowledge": features that take values larger than a predefined threshold give us no additional useful information, and therefore we can clip them to the predefined threshold.

Of course, this "prior knowledge" can be wrong, and it is easy to construct examples for which clipping hurts performance.
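The clipping fix above can be checked on simulated data. A sketch (the sample size and the choice a = 100 are mine): without clipping, the occasional scaled samples drag the least-squares slope toward 0; after clipping, x equals y exactly.

```python
import numpy as np

def clip_feature(x, b=1.0):
    """Clipping transformation from the slide: x -> sign(x) * min(b, |x|)."""
    return np.sign(x) * np.minimum(b, np.abs(x))

a = 100
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=10000)
scaled = rng.random(10000) < 1 / a          # each sample scaled w.p. 1/a
x = np.where(scaled, a * y, y)

w_star = (x @ y) / (x @ x)                   # empirical least-squares slope
x_clipped = clip_feature(x)                  # |ay| > 1, so this recovers y
```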
Some Examples of Feature Transformations

Denote by f = (f₁, . . . , f_m) ∈ R^m the values of the feature, and by f̄ the empirical mean.

Centering: f_i ← f_i − f̄
Unit Range: f_max = max_i f_i, f_min = min_i f_i, f_i ← (f_i − f_min)/(f_max − f_min)
Standardization: ν = (1/m) Σ_{i=1}^m (f_i − f̄)², f_i ← (f_i − f̄)/√ν
Clipping: f_i ← sign(f_i) min{b, |f_i|}
Sigmoidal transformation: f_i ← 1/(1 + exp(b f_i))
Logarithmic transformation: f_i ← log(b + f_i)
Unary representation for categorical features: f_i ↦ (1_[f_i=1], . . . , 1_[f_i=k])
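Each transformation above is essentially one line of NumPy. A minimal sketch (the function names are my own; b and k are the slide's parameters):

```python
import numpy as np

def center(f):
    return f - f.mean()

def unit_range(f):
    return (f - f.min()) / (f.max() - f.min())

def standardize(f):
    return (f - f.mean()) / f.std()

def clip(f, b):
    return np.sign(f) * np.minimum(b, np.abs(f))

def sigmoid(f, b):
    return 1.0 / (1.0 + np.exp(b * f))

def log_transform(f, b):
    return np.log(b + f)

def unary(f, k):
    """Categorical f_i in {1, ..., k} -> indicator vector per entry."""
    return (f[:, None] == np.arange(1, k + 1)[None, :]).astype(float)
```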
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 24 / 28
![Page 80: Introduction to Machine Learning (67577) Lecture 13shais/Lectures2014/lecture13.pdf · 2014-06-19 · At each iteration, go over all i=2Iand learn a predictor based on I[i Choose](https://reader033.fdocuments.us/reader033/viewer/2022050510/5f9b217642834004ee16db85/html5/thumbnails/80.jpg)
Some Examples of Feature Transformations
Denote f = (f1, . . . , fm) ∈ Rm the values of the feature and f theempirical mean
Centering: fi ← fi − f .
Unit Range: fmax = maxi fi, fmin = mini fi, fi ← fi−fminfmax−fmin
.
Standardization: ν = 1m
∑mi=1(fi − f)2, fi ← fi−f√
ν.
Clipping: fi ← sign(fi) max{b, |fi|}Sigmoidal transformation: fi ← 1
1+exp(b fi)
Logarithmic transformation: fi ← log(b+ fi)
Unary representation for categorical features:fi 7→ (1[fi=1], . . . ,1[fi=k])
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 24 / 28
![Page 81: Introduction to Machine Learning (67577) Lecture 13shais/Lectures2014/lecture13.pdf · 2014-06-19 · At each iteration, go over all i=2Iand learn a predictor based on I[i Choose](https://reader033.fdocuments.us/reader033/viewer/2022050510/5f9b217642834004ee16db85/html5/thumbnails/81.jpg)
Some Examples of Feature Transformations
Denote f = (f1, . . . , fm) ∈ Rm the values of the feature and f theempirical mean
Centering: fi ← fi − f .
Unit Range: fmax = maxi fi, fmin = mini fi, fi ← fi−fminfmax−fmin
.
Standardization: ν = 1m
∑mi=1(fi − f)2, fi ← fi−f√
ν.
Clipping: fi ← sign(fi) max{b, |fi|}Sigmoidal transformation: fi ← 1
1+exp(b fi)
Logarithmic transformation: fi ← log(b+ fi)
Unary representation for categorical features:fi 7→ (1[fi=1], . . . ,1[fi=k])
Outline
1 Feature Selection
    Filters
    Greedy selection
    ℓ1 norm
2 Feature Manipulation and Normalization
3 Feature Learning
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 25 / 28
Feature Learning
Goal: learn a feature mapping ψ : X → Rd, so that a linear predictor on top of ψ(x) yields a good hypothesis class

Example: we can think of the first layers of a neural network as computing ψ(x) and of the last layer as the linear predictor applied on top of it

We will describe an unsupervised learning approach to feature learning called dictionary learning
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 26 / 28
Dictionary Learning
Motivation: recall the description of a document as a "bag-of-words": ψ(x) ∈ {0, 1}k, where coordinate i of ψ(x) indicates whether word i appears in the document

What is the dictionary in general? For example, what would be a good dictionary for visual data? Can we learn ψ : X → {0, 1}k that captures "visual words", e.g., (ψ(x))i captures something like "there is an eye in the image"?

Using clustering: a clustering function c : X → {1, . . . , k} yields the mapping ψ(x)i = 1 iff x belongs to cluster i

Sparse auto-encoders: given x ∈ Rd and a dictionary matrix D ∈ Rd×k, let

ψ(x) = argmin_{v ∈ Rk} ‖x − Dv‖  s.t.  ‖v‖0 ≤ s
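The ℓ0-constrained problem above is NP-hard in general, so in practice it is approximated greedily. One standard heuristic (not spelled out in the lecture) is orthogonal matching pursuit: repeatedly add the dictionary atom most correlated with the residual, then refit by least squares on the selected atoms. A minimal NumPy sketch, with the hypothetical name `sparse_code`:

```python
import numpy as np

def sparse_code(x, D, s):
    """Greedy (orthogonal matching pursuit) approximation of
       argmin_v ||x - D v||  s.t.  ||v||_0 <= s."""
    d, k = D.shape
    v = np.zeros(k)
    residual = x.copy()
    support = []
    for _ in range(s):
        # pick the dictionary atom most correlated with the residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        v[:] = 0.0
        v[support] = coef
        residual = x - D @ v
    return v
```

With D fixed, `sparse_code` computes the encoding ψ(x); dictionary learning alternates steps of this kind with updates of D itself so that the training points admit good s-sparse codes.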
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 27 / 28
Summary
Feature selection
Feature normalization and manipulations
Feature learning
Shai Shalev-Shwartz (Hebrew U) IML Lecture 13 Features 28 / 28