Bias / Variance Analysis, Kernel Methods
Professor Ameet Talwalkar
Slide Credit: Professor Fei Sha
CS260 Machine Learning Algorithms, October 29, 2015
Outline
1 Administration
2 Review of last lecture
3 Bias/Variance Analysis
4 Kernel methods
Announcements
HW3 and HW4 due now
HW1 and HW2 can be picked up in class; grades available online
Project proposal due a week from today
Outline

1 Administration

2 Review of last lecture
  Basic ideas to combat overfitting
  Ridge Regression and MAP estimate

3 Bias/Variance Analysis

4 Kernel methods
Overfitting

Example with regression, using polynomial basis functions

φ(x) = (1, x, x², …, x^M)ᵀ  ⇒  f(x) = w_0 + ∑_{m=1}^{M} w_m x^m

[Figure: degree-9 polynomial fit (M = 9) to noisy data, t vs. x]

Being too adaptive can improve training accuracy at the expense of test accuracy
Visualizing overfitting

As the model becomes more complex, training error keeps improving while test error first improves then deteriorates

[Figure: E_RMS vs. M for the training and test sets]

Horizontal axis: measure of model complexity, e.g., the maximum degree of the polynomial basis functions.

Vertical axis: some notion of error, e.g., RSS for regression
How to prevent overfitting?

Use more training data

[Figure: degree-9 polynomial fit with N = 100 training points, t vs. x]

Regularization: adding a term to the objective function

λ‖w‖₂²

that favors a small parameter vector w.
Probabilistic interpretation of Ridge Regression

Ridge Regression model: Y = wᵀX + η
- Y ∼ N(wᵀX, σ₀²) is a Gaussian random variable (as before)
- w_d ∼ N(0, σ²) are i.i.d. Gaussian random variables (unlike before)
- Note that all w_d share the same variance σ²

w is random (Bayesian interpretation)

To find w given data D, we can compute the posterior distribution of w:

p(w|D) = p(D|w) p(w) / p(D)

Maximum a posteriori (MAP) estimate:

w_MAP = argmax_w p(w|D) = argmax_w p(D, w)

MAP reduces to MLE if we assume a uniform prior for p(w)
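To make the MAP interpretation concrete, here is a minimal numerical sketch (not from the slides; the toy data, dimensions, and variances are assumptions) showing that the MAP estimate under the Gaussian prior is the ridge solution with λ = σ₀²/σ²:

```python
import numpy as np

# A minimal sketch: the MAP estimate under the prior w_d ~ N(0, sigma^2) with
# observation noise variance sigma_0^2 is the ridge solution
# w = (Phi^T Phi + lam I)^{-1} Phi^T y with lam = sigma_0^2 / sigma^2.
rng = np.random.default_rng(0)
N, D = 50, 3
Phi = rng.normal(size=(N, D))          # design matrix of (transformed) features
w_true = np.array([1.0, -2.0, 0.5])
sigma0 = 0.3                           # observation noise std
y = Phi @ w_true + sigma0 * rng.normal(size=N)

sigma = 1.0                            # prior std of each w_d
lam = sigma0**2 / sigma**2             # regularization strength implied by the prior
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
print("MAP / ridge estimate:", w_map)
```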
Overfitting in terms of λ

Overfitting is reduced as we increase the regularizer

[Figure: three degree-9 (M = 9) polynomial fits: unregularized, ln λ = −18, and ln λ = 0]

Plotting λ vs. residual error shows the difference in the model's performance on the training and test datasets

[Figure: E_RMS vs. ln λ for the training and test sets]
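As a rough illustration of the plots above, the following sketch (assumed sinusoidal toy data, in the spirit of the slides' example) fits a degree-9 polynomial with different ridge strengths and prints training and test RMS error:

```python
import numpy as np

# Sketch: degree-9 polynomial ridge regression; training error worsens and test
# error improves as lambda grows from (near) zero, until lambda gets too large.
rng = np.random.default_rng(1)
M = 9

def poly_features(x, M):
    return np.vander(x, M + 1, increasing=True)   # columns 1, x, ..., x^M

def rms(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

x_train = rng.uniform(0, 1, size=10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=10)
x_test = rng.uniform(0, 1, size=100)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=100)

Phi_tr, Phi_te = poly_features(x_train, M), poly_features(x_test, M)
for lam in [0.0, np.exp(-18), 1.0]:
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(M + 1), Phi_tr.T @ y_train)
    print(f"lambda={lam:.2e}  train RMS={rms(Phi_tr @ w, y_train):.3f}  "
          f"test RMS={rms(Phi_te @ w, y_test):.3f}")
```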
Outline
1 Administration
2 Review of last lecture
3 Bias/Variance Analysis
4 Kernel methods
Basic and important machine learning concepts

Supervised learning: we aim to build a function h(x) to predict the true value y associated with x. If we make a mistake, we incur a loss

ℓ(h(x), y)

Example: quadratic loss function for regression when y is continuous

ℓ(h(x), y) = [h(x) − y]²

[Figure: quadratic loss as a function of h(x) when y = 0]
Other types of loss functions

For classification: cross-entropy loss (also called logistic loss)

ℓ(h(x), y) = −y log h(x) − (1 − y) log[1 − h(x)]

[Figure: cross-entropy loss as a function of h(x) when y = 1]
Measure how good our predictor is

Risk: assume we know the true distribution of the data p(x, y); the risk is

R[h(x)] = ∫_{x,y} ℓ(h(x), y) p(x, y) dx dy

However, we cannot compute R[h(x)], so we use the empirical risk, given a training dataset D:

R_emp[h(x)] = (1/N) ∑_n ℓ(h(x_n), y_n)

Intuitively, as N → +∞,

R_emp[h(x)] → R[h(x)]
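A small simulation can make the last point concrete. The sketch below (a toy distribution chosen purely for illustration) evaluates the empirical risk of a fixed predictor under squared loss and compares it against the analytically computed true risk as N grows:

```python
import numpy as np

# Toy setup: x ~ U(0,1), y = 3x + 0.1 * Gaussian noise, fixed predictor h(x) = 2x.
# True risk = E[(2x - y)^2] = E[x^2] + noise variance = 1/3 + 0.01.
rng = np.random.default_rng(2)

def h(x):
    return 2.0 * x

def empirical_risk(N):
    x = rng.uniform(0, 1, size=N)
    y = 3.0 * x + 0.1 * rng.normal(size=N)
    return np.mean((h(x) - y) ** 2)

true_risk = 1.0 / 3.0 + 0.1 ** 2
for N in [10, 100, 10_000, 1_000_000]:
    print(f"N={N:>8}  empirical risk={empirical_risk(N):.4f}  (true risk≈{true_risk:.4f})")
```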
How does this relate to what we have learned?

So far, we have been doing empirical risk minimization (ERM)
- For linear regression, h(x) = wᵀx, and we use squared loss
- For logistic regression, h(x) = σ(wᵀx), and we use cross-entropy loss

ERM might be problematic
- If h(x) is complicated enough, R_emp[h(x)] → 0
- But then h(x) is unlikely to do well in predicting things outside of the training dataset D. This is called poor generalization or overfitting. We have just discussed approaches to address this issue.

Let's try to understand why regularization might work from the perspective of the bias-variance tradeoff, focusing on regression / squared loss
Bias/variance tradeoff (Looking ahead)

Error decomposes into 3 terms

E_D R[h_D(x)] = variance + bias² + noise

We will prove this result, and interpret what it means...
Bias/variance tradeoff for regression

Goal: to understand the sources of prediction errors
- D: our training data
- h_D(x): our prediction function. We use the subscript D to indicate that the prediction function is learned on the specific set of training data D.
- ℓ(h(x), y): our square loss function for regression

ℓ(h_D(x), y) = [h_D(x) − y]²

- Unknown joint distribution p(x, y)
The effect of finite training samples

Every training sample D is a sample from the following joint distribution:

D ∼ P(D) = ∏_{n=1}^{N} p(x_n, y_n)

The prediction function h_D(x) is a random function with respect to this distribution, and thus its risk is as well:

R[h_D(x)] = ∫_x ∫_y [h_D(x) − y]² p(x, y) dx dy

Next we will disentangle the impact of the finite sample D when assessing the quality of h(·)
Average over the distribution of the training data

Averaged risk:

E_D R[h_D(x)] = ∫_D ∫_x ∫_y [h_D(x) − y]² p(x, y) dx dy P(D) dD

Namely, the randomness with respect to D is marginalized out.

Averaged prediction:

E_D h_D(x) = ∫_D h_D(x) P(D) dD

Namely, if we have seen many training datasets, we predict with the average of the trained models learned on each training dataset.
Variance

We can add and subtract the averaged prediction inside the averaged risk:

E_D R[h_D(x)] = ∫_D ∫_x ∫_y [h_D(x) − y]² p(x, y) dx dy P(D) dD

             = ∫_D ∫_x ∫_y [h_D(x) − E_D h_D(x) + E_D h_D(x) − y]² p(x, y) dx dy P(D) dD

             = ∫_D ∫_x ∫_y [h_D(x) − E_D h_D(x)]² p(x, y) dx dy P(D) dD   (variance)
             + ∫_D ∫_x ∫_y [E_D h_D(x) − y]² p(x, y) dx dy P(D) dD
Where does the cross-term go?

It is zero:

∫_D ∫_x ∫_y [h_D(x) − E_D h_D(x)][E_D h_D(x) − y] p(x, y) dx dy P(D) dD
  = ∫_x ∫_y { ∫_D [h_D(x) − E_D h_D(x)] P(D) dD } [E_D h_D(x) − y] p(x, y) dx dy
  = 0   (the integral within the braces vanishes, by definition of E_D h_D(x))
Analyzing the variance

How can we reduce the variance?

∫_D ∫_x ∫_y [h_D(x) − E_D h_D(x)]² p(x, y) dx dy P(D) dD

- Use a lot of data (i.e., increase the size of D)
- Use a simple h(·) so that h_D(x) does not vary much across different training datasets. Ex: h(x) = const
The remaining item

E_D R[h_D(x)] = ∫_D ∫_x ∫_y [h_D(x) − E_D h_D(x)]² p(x, y) dx dy P(D) dD
             + ∫_D ∫_x ∫_y [E_D h_D(x) − y]² p(x, y) dx dy P(D) dD

The second integrand has no dependency on D anymore and simplifies to

∫_x ∫_y [E_D h_D(x) − y]² p(x, y) dx dy

We will apply a similar trick, by using an averaged target y:

E_y[y] = ∫_y y p(y|x) dy
Bias and noise

Decompose again:

∫_x ∫_y [E_D h_D(x) − y]² p(x, y) dx dy
  = ∫_x ∫_y [E_D h_D(x) − E_y[y] + E_y[y] − y]² p(x, y) dx dy
  = ∫_x ∫_y [E_D h_D(x) − E_y[y]]² p(x, y) dx dy   (bias²)
  + ∫_x ∫_y [E_y[y] − y]² p(x, y) dx dy   (noise)
Where is the cross-term?

Left as a take-home exercise
Analyzing the noise

How can we reduce the noise?

∫_x ∫_y [E_y[y] − y]² p(x, y) dx dy = ∫_x ( ∫_y [E_y[y] − y]² p(y|x) dy ) p(x) dx

There is nothing we can do. This quantity depends on p(x, y) only; choosing h(·) or the training dataset D will not affect it. Note that the integral inside the parentheses is the variance (noise) of the distribution p(y|x) at the given x.

[Figure: two conditional distributions p(y|x) centered at E_y[y]: a wide one (more difficult) and a narrow one (easier)]
Analyzing the bias term

How can we reduce the bias?

∫_x ∫_y [E_D h_D(x) − E_y[y]]² p(x, y) dx dy = ∫_x [E_D h_D(x) − E_y[y]]² p(x) dx

It can be reduced by using more complex models: we can choose h(·) to be as flexible as possible so that it better approximates E_y[y], reducing the bias. However, this increased flexibility will increase the variance term (as we can potentially overfit the training data).
Bias/variance tradeoff

Error decomposes into 3 terms

E_D R[h_D(x)] = variance + bias² + noise

where the first and the second term are inherently in conflict in terms of choosing what kind of h(x) we should use (unless we have an infinite amount of data).

If we can compute all terms analytically, they will look like this:

[Figure: (bias)², variance, (bias)² + variance, and test error plotted against ln λ]
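The decomposition can also be estimated numerically. The sketch below (an assumed toy setup: a sinusoidal target, degree-9 polynomial ridge regression, and a few λ values) draws many training sets, averages the fitted predictors, and reports the resulting variance, bias², and noise terms:

```python
import numpy as np

# Estimate the three terms by Monte Carlo: draw many training sets D, fit a
# ridge-regularized degree-9 polynomial on each, then average
# [h_D(x) - E_D h_D(x)]^2 (variance) and [E_D h_D(x) - E_y[y]]^2 (bias^2)
# over test points; the noise term is the target noise variance.
rng = np.random.default_rng(3)
M, N, n_datasets, noise_std = 9, 25, 200, 0.3

def features(x):
    return np.vander(x, M + 1, increasing=True)

def target_mean(x):            # E_y[y] for this toy problem
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 100)
Phi_test = features(x_test)

for lam in [1e-8, 1e-2, 1.0]:
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, size=N)
        y = target_mean(x) + noise_std * rng.normal(size=N)
        Phi = features(x)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)
        preds.append(Phi_test @ w)
    preds = np.array(preds)                 # shape (n_datasets, n_test)
    avg_pred = preds.mean(axis=0)           # E_D h_D(x)
    variance = np.mean((preds - avg_pred) ** 2)
    bias2 = np.mean((avg_pred - target_mean(x_test)) ** 2)
    noise = noise_std ** 2
    print(f"lambda={lam:.0e}  variance={variance:.4f}  bias^2={bias2:.4f}  noise={noise:.4f}")
```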
Outline

1 Administration

2 Review of last lecture

3 Bias/Variance Analysis

4 Kernel methods
  Motivation
  Kernel matrix and kernel functions
  Kernelized machine learning methods
Motivation

How do we choose a nonlinear basis function for regression?

wᵀφ(x)

where φ(·) maps the original feature vector x to an M-dimensional new feature vector. In the following, we will show that we can sidestep the issue of choosing which φ(·) to use; instead, we will equivalently choose a kernel function.

Regularized least squares:

J(w) = (1/2) ∑_n (y_n − wᵀφ(x_n))² + (λ/2) ‖w‖₂²

Its solution w_MAP is given by setting the gradient to zero:

∂J(w)/∂w = ∑_n (y_n − wᵀφ(x_n)) (−φ(x_n)) + λw = 0
MAP Solution

The optimal parameter vector is a linear combination of features:

w_MAP = ∑_n (1/λ)(y_n − wᵀφ(x_n)) φ(x_n) = ∑_n α_n φ(x_n) = Φᵀα

where we have designated (1/λ)(y_n − wᵀφ(x_n)) as α_n, and the matrix Φ is the design matrix made of the transformed features. Its transpose is made of column vectors and is given by

Φᵀ = (φ(x₁) φ(x₂) · · · φ(x_N)) ∈ R^{M×N}

where M is the dimensionality of φ(x).

Of course, we do not know which α (the vector of all α_n) corresponds to w_MAP!
Dual formulation

We substitute w_MAP = Φᵀα into J(w), and obtain the following function of α:

J(α) = (1/2) αᵀΦΦᵀΦΦᵀα − (ΦΦᵀy)ᵀα + (λ/2) αᵀΦΦᵀα

Before we show how J(α) is derived, we make an important observation: we see the repeated structure ΦΦᵀ, which we refer to as the Gram matrix or kernel matrix

K = ΦΦᵀ =
  [ φ(x₁)ᵀφ(x₁)   φ(x₁)ᵀφ(x₂)   · · ·   φ(x₁)ᵀφ(x_N) ]
  [ φ(x₂)ᵀφ(x₁)   φ(x₂)ᵀφ(x₂)   · · ·   φ(x₂)ᵀφ(x_N) ]
  [     · · ·          · · ·      · · ·      · · ·     ]
  [ φ(x_N)ᵀφ(x₁)  φ(x_N)ᵀφ(x₂)   · · ·   φ(x_N)ᵀφ(x_N) ]   ∈ R^{N×N}
Properties of the matrix K

- Symmetric: K_mn = φ(x_m)ᵀφ(x_n) = φ(x_n)ᵀφ(x_m) = K_nm
- Positive semidefinite: for any vector a, aᵀKa = (Φᵀa)ᵀ(Φᵀa) ≥ 0
- Not the same as the second-moment (covariance) matrix C = ΦᵀΦ: C has size M × M (the dimensionality of φ(x)) while K is N × N.
The derivation of J(α)

J(w) = (1/2) ∑_n (y_n − wᵀφ(x_n))² + (λ/2) ‖w‖₂²
     = (1/2) ‖y − Φw‖₂² + (λ/2) ‖w‖₂²
     = (1/2) ‖y − ΦΦᵀα‖₂² + (λ/2) ‖Φᵀα‖₂²
     = (1/2) ‖y − Kα‖₂² + (λ/2) αᵀΦΦᵀα
     ∝ (1/2) αᵀKᵀKα − yᵀKα + (λ/2) αᵀKα
     = (1/2) αᵀK²α − (Ky)ᵀα + (λ/2) αᵀKα = J(α)

where we have used the property that K is symmetric.
Optimal α

∂J(α)/∂α = K²α − Ky + λKα = 0

which leads to (assuming that K is invertible)

α = (K + λI)⁻¹ y

Note that we only need to know K in order to compute α; the exact form of φ(·) is not essential, as long as we know how to get the inner products φ(x_m)ᵀφ(x_n). This observation gives rise to the use of kernel functions.

Note that computing the parameter vector does require knowledge of Φ:

w_MAP = Φᵀ(K + λI)⁻¹ y
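Putting the dual solution to work, here is a compact kernel ridge regression sketch (the Gaussian RBF kernel and the toy data are illustrative assumptions): it builds K, solves α = (K + λI)⁻¹y, and predicts with yᵀ(K + λI)⁻¹k_x:

```python
import numpy as np

# Kernel ridge regression in the dual: only kernel evaluations are needed.
rng = np.random.default_rng(4)

def rbf_kernel(A, B, sigma=0.2):
    # K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K + lam I)^{-1} y

X_test = np.array([[0.25], [0.5], [0.75]])
k_x = rbf_kernel(X, X_test)                             # column j holds phi(x_n)^T phi(x_test_j)
print("predictions:", alpha @ k_x)                      # = y^T (K + lam I)^{-1} k_x
```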
Computing predictions needs only inner products too!

Given a test point φ(x), we must compute:

wᵀφ(x) = yᵀ(K + λI)⁻¹ Φ φ(x)
       = yᵀ(K + λI)⁻¹ (φ(x₁)ᵀφ(x), φ(x₂)ᵀφ(x), …, φ(x_N)ᵀφ(x))ᵀ
       = yᵀ(K + λI)⁻¹ k_x

where we have used the property that (K + λI)⁻¹ is symmetric (as K is), and k_x is shorthand notation for the column vector of inner products.

Note that, to make a prediction, once again we only need to know how to get φ(x_n)ᵀφ(x).
Inner products between features

Due to their central role, let us examine more closely the inner products φ(x_m)ᵀφ(x_n) for a pair of data points x_m and x_n.

Polynomial-based nonlinear basis functions: consider the following φ(x):

φ : x → φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ

This gives rise to an inner product of a special form:

φ(x_m)ᵀφ(x_n) = x_{m1}² x_{n1}² + 2 x_{m1} x_{m2} x_{n1} x_{n2} + x_{m2}² x_{n2}²
              = (x_{m1} x_{n1} + x_{m2} x_{n2})² = (x_mᵀx_n)²

Namely, the inner product can be computed by a function (x_mᵀx_n)² defined in terms of the original features, without knowing φ(·).
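A quick numerical check (purely illustrative; the two points are arbitrary) confirms that the explicit map φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ reproduces the kernel (x_mᵀx_n)²:

```python
import numpy as np

# Verify phi(x_m)^T phi(x_n) == (x_m^T x_n)^2 for the explicit degree-2 feature map.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xm = np.array([0.3, -1.2])
xn = np.array([2.0, 0.7])
print(phi(xm) @ phi(xn))        # inner product in feature space
print((xm @ xn) ** 2)           # kernel evaluated on original features; same value
```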
Common kernel functions

Polynomial kernel function with degree d:

k(x_m, x_n) = (x_mᵀx_n + c)^d

for c ≥ 0 and d a positive integer.

Gaussian kernel, RBF kernel, or Gaussian RBF kernel:

k(x_m, x_n) = e^{−‖x_m − x_n‖₂² / 2σ²}

- Shift-invariant kernel (only depends on the difference between the two inputs)
- Corresponds to a feature space with infinitely many dimensions (but we can work directly with the original features)!

These kernels have hyperparameters to be tuned: d, c, σ²
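For reference, minimal sketches of these two kernels written as functions that return full kernel matrices (the default hyperparameter values are arbitrary illustrative choices):

```python
import numpy as np

# Polynomial and Gaussian RBF kernels, returning N x M kernel matrices for
# row-wise data in A (N x D) and B (M x D).
def polynomial_kernel(A, B, c=1.0, d=2):
    return (A @ B.T + c) ** d

def gaussian_rbf_kernel(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(5).normal(size=(4, 3))
print(polynomial_kernel(X, X).shape)     # (4, 4), symmetric and PSD
print(gaussian_rbf_kernel(X, X).shape)   # (4, 4), symmetric and PSD
```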
Kernel functions

Definition: a (positive semidefinite) kernel function k(·, ·) is a bivariate function that satisfies the following properties. For any x_m and x_n,

k(x_m, x_n) = k(x_n, x_m)   and   k(x_m, x_n) = φ(x_m)ᵀφ(x_n)

for some function φ(·).

Examples we have seen:

k(x_m, x_n) = (x_mᵀx_n)²
k(x_m, x_n) = e^{−‖x_m − x_n‖₂² / 2σ²}

Example that is not a kernel:

k(x_m, x_n) = ‖x_m − x_n‖₂²

(we'll see why later)
Conditions for being a positive semidefinite kernel function

Mercer's theorem (loosely): a bivariate function k(·, ·) is a positive semidefinite kernel function if and only if, for any N and any x₁, x₂, …, x_N, the matrix

K = [ k(x₁, x₁)   k(x₁, x₂)   · · ·   k(x₁, x_N) ]
    [ k(x₂, x₁)   k(x₂, x₂)   · · ·   k(x₂, x_N) ]
    [     ⋮            ⋮        ⋱          ⋮      ]
    [ k(x_N, x₁)  k(x_N, x₂)   · · ·   k(x_N, x_N) ]

is positive semidefinite. We also refer to k(·, ·) as a positive semidefinite kernel.
Why is ‖x_m − x_n‖₂² not a positive semidefinite kernel?

Use the definition of a positive semidefinite kernel function. Choose N = 2 and compute the matrix

K = [ 0              ‖x₁ − x₂‖₂² ]
    [ ‖x₁ − x₂‖₂²    0           ]

This matrix cannot be positive semidefinite (when x₁ ≠ x₂): it has both negative and positive eigenvalues. (The sum of the diagonal elements is called the trace of a matrix, which equals the sum of the matrix's eigenvalues; in our case the trace is zero, so a positive eigenvalue must be balanced by a negative one.)
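The argument can be verified numerically. In the sketch below (toy points chosen for illustration), the squared-distance "kernel" produces an indefinite 2 × 2 matrix, while the Gaussian RBF kernel on the same points does not:

```python
import numpy as np

# Check the eigenvalues of the 2 x 2 matrices built from k(x1, x2) = ||x1 - x2||^2
# (not a kernel) and from the Gaussian RBF kernel (a valid kernel).
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
d2 = np.sum((x1 - x2) ** 2)

K_bad = np.array([[0.0, d2], [d2, 0.0]])
print(np.linalg.eigvalsh(K_bad))          # one negative, one positive eigenvalue

sigma = 1.0
k12 = np.exp(-d2 / (2 * sigma ** 2))
K_rbf = np.array([[1.0, k12], [k12, 1.0]])
print(np.linalg.eigvalsh(K_rbf))          # both eigenvalues >= 0
```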
Flashback: why use kernel functions?

Without specifying φ(·), the kernel matrix

K = [ k(x₁, x₁)   k(x₁, x₂)   · · ·   k(x₁, x_N) ]
    [ k(x₂, x₁)   k(x₂, x₂)   · · ·   k(x₂, x_N) ]
    [     ⋮            ⋮        ⋱          ⋮      ]
    [ k(x_N, x₁)  k(x_N, x₂)   · · ·   k(x_N, x_N) ]

is exactly the same as

K = ΦΦᵀ =
  [ φ(x₁)ᵀφ(x₁)   φ(x₁)ᵀφ(x₂)   · · ·   φ(x₁)ᵀφ(x_N) ]
  [ φ(x₂)ᵀφ(x₁)   φ(x₂)ᵀφ(x₂)   · · ·   φ(x₂)ᵀφ(x_N) ]
  [     · · ·          · · ·      · · ·      · · ·     ]
  [ φ(x_N)ᵀφ(x₁)  φ(x_N)ᵀφ(x₂)   · · ·   φ(x_N)ᵀφ(x_N) ]
'Kernel trick'

Many learning methods depend on computing inner products between features; we have seen the example of regularized least squares. For those methods, we can use a kernel function in place of the inner products, i.e., "kernelizing" the methods and thereby introducing nonlinear features.

We will present one more example to illustrate this "trick" by kernelizing the nearest neighbor classifier.

When we talk about support vector machines, we will see the trick one more time.
Kernelized nearest neighbors

In nearest neighbor classification, the most important quantity to compute is the (squared) distance between two data points x_m and x_n:

d(x_m, x_n) = ‖x_m − x_n‖₂² = x_mᵀx_m + x_nᵀx_n − 2 x_mᵀx_n

We replace all the inner products in the distance with a kernel function k(·, ·), arriving at the kernel distance

d_kernel(x_m, x_n) = k(x_m, x_m) + k(x_n, x_n) − 2 k(x_m, x_n)

This distance is equivalent to computing the distance between φ(x_m) and φ(x_n):

d_kernel(x_m, x_n) = d(φ(x_m), φ(x_n))

where φ(·) is the nonlinear mapping function implied by the kernel function. The nearest neighbor of a point x is thus found with

argmin_n d_kernel(x, x_n)
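A minimal sketch of a kernelized 1-nearest-neighbor classifier built on the kernel distance above (the RBF kernel and the tiny dataset are illustrative assumptions):

```python
import numpy as np

# Kernelized 1-NN: compute d_kernel(x, x_n) for every training point and
# return the label of the closest one in the implied feature space.
def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_distance(a, b, k):
    return k(a, a) + k(b, b) - 2 * k(a, b)

def predict_1nn(x, X_train, y_train, k=rbf):
    dists = [kernel_distance(x, xn, k) for xn in X_train]
    return y_train[int(np.argmin(dists))]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(predict_1nn(np.array([2.5, 2.6]), X_train, y_train))   # -> 1
```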
There are infinitely many kernels to use!

Rules for composing kernels (this is just a partial list; a small numerical check follows below):

- if k(x_m, x_n) is a kernel, then c k(x_m, x_n) is also a kernel if c > 0.
- if both k₁(x_m, x_n) and k₂(x_m, x_n) are kernels, then αk₁(x_m, x_n) + βk₂(x_m, x_n) is also a kernel if α, β ≥ 0.
- if both k₁(x_m, x_n) and k₂(x_m, x_n) are kernels, then k₁(x_m, x_n) k₂(x_m, x_n) is also a kernel.
- if k(x_m, x_n) is a kernel, then e^{k(x_m, x_n)} is also a kernel.
- · · ·

In practice, choosing an appropriate kernel is an "art". People typically start with polynomial and Gaussian RBF kernels or incorporate domain knowledge.
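As referenced above, a quick numerical check (illustrative only, on random toy points) that sums, products, and exponentials of kernel matrices remain positive semidefinite up to round-off:

```python
import numpy as np

# Build two valid kernel matrices (linear and Gaussian RBF) on random points and
# check that their sum, elementwise product, and elementwise exponential still
# have (numerically) nonnegative eigenvalues.
rng = np.random.default_rng(6)
X = rng.normal(size=(6, 2))

lin = X @ X.T                                   # linear kernel matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
rbf = np.exp(-sq_dists / 2.0)                   # Gaussian RBF kernel matrix

for name, K in [("sum", lin + rbf), ("product", lin * rbf), ("exp", np.exp(lin))]:
    eigs = np.linalg.eigvalsh(K)
    print(f"{name:7s} min eigenvalue = {eigs.min():.2e}")   # >= 0 up to round-off
```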