
CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering, Department of Cybernetics

Bias-variance trade-off. Crossvalidation. Regularization.

Petr Pošík

Artificial Intelligence, © P. Pošík 2015


How to evaluate a predictive model?



Model evaluation


Fundamental question: What is a good measure of “model quality” from the machine-learning standpoint?

■ We have various measures of model error:

■ For regression tasks: MSE, MAE, . . .

■ For classification tasks: misclassification rate, measures based on the confusion matrix, . . .

■ Some of them can be regarded as finite approximations of the Bayes risk.

■ Are these measures good approximations when evaluated on the data the models were trained on?

[Figure: training data with two fitted models, f(x) = x and f(x) = x³ − 3x² + 3x.]

Using MSE on the training data only, both models are equivalent!

[Figure: noisy training data with two fitted models, the linear f(x) = −0.09 + 0.99x and the cubic f(x) = 0.00 − 0.31x + 1.67x² − 0.51x³.]

Using MSE on the training data only, the cubic model appears better than the linear one!

A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.
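A minimal sketch of this point in Python (not part of the original slides; the data and models are illustrative, with a truly linear relationship y = x + noise). The more flexible model always matches or beats the simpler one on training MSE, so only an independent test set can tell them apart.

```python
# Sketch: training MSE alone cannot distinguish a good model from an overfitted one.
import numpy as np

rng = np.random.default_rng(0)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

x_train = rng.uniform(-0.5, 2.5, size=10)
y_train = x_train + rng.normal(0, 0.1, size=10)
x_test = rng.uniform(-0.5, 2.5, size=200)          # independent data from the same source
y_test = x_test + rng.normal(0, 0.1, size=200)

for degree in (1, 3):
    coeffs = np.polyfit(x_train, y_train, deg=degree)        # least-squares polynomial fit
    print(f"degree {degree}: "
          f"training MSE {mse(y_train, np.polyval(coeffs, x_train)):.3f}, "
          f"testing MSE {mse(y_test, np.polyval(coeffs, x_test)):.3f}")
# The cubic fit has a training MSE at least as low as the linear fit (nested
# least-squares models), but its testing MSE is typically no better.
```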


Validation on testing data


Example: polynomial regression with varying degree:

X ∼ U(−1, 3)

Y ∼ X² + N(0, 1)

[Figure: polynomial models of varying degree fitted to the training data and evaluated on the testing data.]

Degree   Training MSE   Testing MSE
0        8.319          6.901
1        2.013          2.841
2        0.647          0.925
3        0.645          0.919
5        0.611          0.979
9        0.545          1.067
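The experiment can be reproduced with a few lines of NumPy. This is a sketch under the assumption of ordinary least-squares polynomial fits and the data-generating process above; the sample sizes are illustrative, so the exact errors will differ from the slide.

```python
# Sketch: fit polynomials of several degrees and compare training vs. testing MSE.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 3, size=n)          # X ~ U(-1, 3)
    y = x ** 2 + rng.normal(0, 1, size=n)   # Y ~ X^2 + N(0, 1)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(300)

for degree in (0, 1, 2, 3, 5, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    tr = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    te = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree}: training MSE {tr:.3f}, testing MSE {te:.3f}")
```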


Training and testing error


[Figure: training and testing MSE as a function of polynomial degree.]

■ The training error decreases with increasing model flexibility.

■ The testing error is minimal for a certain degree of model flexibility.



Overfitting


Definition of overfitting:

■ Let H be a hypothesis space.

■ Let h ∈ H and h′ ∈ H be two different hypotheses from this space.

■ Let Err_Tr(h) be the error of hypothesis h measured on the training dataset (training error).

■ Let Err_Tst(h) be the error of hypothesis h measured on the testing dataset (testing error).

■ We say that h is overfitted if there is another h′ for which

Err_Tr(h) < Err_Tr(h′) ∧ Err_Tst(h) > Err_Tst(h′)

[Figure: model error vs. model flexibility for training and testing data.]

■ “When overfitted, the model works well for the training data, but fails for new (testing) data.”

■ Overfitting is a general phenomenon affecting all kinds of inductive learning.

We want models and learning algorithms with a good generalization ability, i.e.

■ we want models that encode only the patterns valid in the whole domain, not ones that have learned the specifics of the training data,

■ we want algorithms able to find only the patterns valid in the whole domain and to ignore the specifics of the training data.
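The overfitting condition above translates directly into code. A tiny sketch (the helper function is ours, not from the slides); the two hypotheses are represented only by their measured errors:

```python
# Sketch: h is overfitted w.r.t. h' if it is better on the training data
# but worse on the testing data.
def is_overfitted(err_tr_h, err_tst_h, err_tr_h2, err_tst_h2):
    """True if hypothesis h is overfitted with respect to hypothesis h'."""
    return err_tr_h < err_tr_h2 and err_tst_h > err_tst_h2

# Degree-9 vs. degree-2 polynomial from the earlier example:
print(is_overfitted(0.545, 1.067, 0.647, 0.925))   # True
```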



Bias vs Variance


[Figure: three polynomial fits of the data from the previous example.]

■ Degree 1 (tr. err. 2.013, test. err. 2.841): high bias, the model is not flexible enough (underfit).

■ Degree 2 (tr. err. 0.647, test. err. 0.925): “just right” (good fit).

■ Degree 9 (tr. err. 0.545, test. err. 1.067): high variance, the model flexibility is too high (overfit).

High bias problem:

■ Err_Tr(h) is high

■ Err_Tst(h) ≈ Err_Tr(h)

High variance problem:

■ Err_Tr(h) is low

■ Err_Tst(h) ≫ Err_Tr(h)

[Figure: model error vs. model flexibility for training and testing data.]
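These two rules of thumb can be wrapped into a rough diagnostic. This is a sketch with hypothetical thresholds: what counts as an “acceptable” training error and how much larger the testing error must be are problem-dependent choices, not prescribed by the slides.

```python
# Sketch: crude bias/variance diagnostic from training and testing errors.
def diagnose(err_tr, err_tst, acceptable_err, ratio=1.5):
    # High bias: the training error itself is already too high.
    if err_tr > acceptable_err:
        return "high bias (underfit): try a more flexible model"
    # High variance: training error is fine, but the testing error is much larger.
    if err_tst > ratio * err_tr:
        return "high variance (overfit): reduce flexibility, regularize, or get more data"
    return "no obvious bias/variance problem"

# Polynomial example (irreducible MSE is the noise variance, roughly 1):
print(diagnose(2.013, 2.841, acceptable_err=1.0))  # degree 1 -> high bias
print(diagnose(0.647, 0.925, acceptable_err=1.0))  # degree 2 -> no obvious problem
print(diagnose(0.545, 1.067, acceptable_err=1.0))  # degree 9 -> high variance
```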



Crossvalidation


Simple crossvalidation:

■ Split the data into training and testing subsets.

■ Train the model on training data.

■ Evaluate the model error on testing data.

K-fold crossvalidation:

■ Split the data into k folds (k is usually 5 or 10).

■ In each iteration:

■ Use k − 1 folds to train the model.

■ Use 1 fold to test the model, i.e. measure error.

Iter. 1:   Training | Training | Testing
Iter. 2:   Training | Testing  | Training
  ...
Iter. k:   Testing  | Training | Training

■ Aggregate (average) the k error measurements to get the final error estimate.

■ Train the model on the whole data set.

Leave-one-out (LOO) crossvalidation:

■ k = |T|, i.e. the number of folds is equal to the training set size.

■ Time-consuming for large |T|.
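A minimal k-fold crossvalidation sketch, here using scikit-learn's KFold with an ordinary linear-regression estimator; the estimator, k = 5 and the synthetic data are illustrative choices, not prescribed by the slides.

```python
# Sketch: estimate model error by 5-fold crossvalidation, then train on all data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-1, 3, size=(60, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=60)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))  # test on 1 fold

print("CV estimate of MSE:", np.mean(errors))   # aggregate the k error measurements
final_model = LinearRegression().fit(X, y)      # finally, train on the whole data set
```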



How to determine a suitable model flexibility


Simply test models of varying complexities and choose the one with the best testing error, right?

■ The testing data are used here to tune a meta-parameter of the model.

■ The testing data are thus used to train (a part of) the model, and essentially become part of the training data.

■ The error on the testing data is then no longer an unbiased estimate of the model error; it underestimates it.

■ A new, separate data set is needed to estimate the model error.

Using simple crossvalidation:

1. Training data: use about 50 % of the data for model building.

2. Validation data: use about 25 % of the data to search for a suitable model flexibility.

3. Train the chosen model on the training + validation data.

4. Testing data: use about 25 % of the data for the final estimate of the model error.

Using k-fold crossvalidation:

1. Training data: use about 75 % of the data to find and train a suitable model using crossvalidation.

2. Testing data: use about 25 % of the data for the final estimate of the model error.

The ratios are not set in stone; other splits, e.g. 60:20:20, are also possible.
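A sketch of the simple-crossvalidation recipe above, with the polynomial degree as the tuned meta-parameter; the 50/25/25 split, the candidate degrees and the synthetic data are illustrative.

```python
# Sketch: select model flexibility on validation data, report error on testing data.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 3, size=200)
y = x ** 2 + rng.normal(0, 1, size=200)

# 1.+2. Split into ~50 % training, ~25 % validation, ~25 % testing data.
idx = rng.permutation(len(x))
tr, va, te = idx[:100], idx[100:150], idx[150:]

def mse(coeffs, i):
    return np.mean((y[i] - np.polyval(coeffs, x[i])) ** 2)

# Choose the degree with the smallest validation error.
best_deg = min(range(10), key=lambda d: mse(np.polyfit(x[tr], y[tr], d), va))

# 3. Retrain the chosen model on training + validation data.
trva = np.concatenate([tr, va])
coeffs = np.polyfit(x[trva], y[trva], best_deg)

# 4. Final, unbiased error estimate on the testing data.
print("chosen degree:", best_deg, " test MSE:", mse(coeffs, te))
```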


How to prevent overfitting?


1. Reduce the number of features.

■ Select manually which features to keep.

■ Try to identify a suitable subset of features during the learning phase.

2. Regularization.

■ Keep all features, but reduce the magnitudes of the parameters w.

■ Works well if we have many features, each of which contributes a bit to predicting y.


Regularization



Ridge regularization (a.k.a. Tikhonov regularization)


Ridge regularization penalizes the size of the model coefficients:

■ Modification of the optimization criterion:

J(w) = (1/|T|) ∑_{i=1}^{|T|} (y^(i) − h_w(x^(i)))² + α ∑_{d=1}^{D} w_d²

■ The solution is given by a modified normal equation:

w* = (XᵀX + αI)⁻¹ Xᵀ y

■ As α → 0, w_ridge → w_OLS.

■ As α → ∞, w_ridge → 0.

Training and testing errors as functions of the regularization parameter:

[Figure: training and testing MSE vs. the regularization factor α (10⁻⁸ to 10⁸).]

The values of the coefficients as functions of the regularization parameter:

[Figure: coefficient values vs. the regularization factor α (10⁻⁸ to 10⁸).]
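The modified normal equation shown above can be implemented directly. A sketch: whether the intercept term is penalized is a design choice (here all coefficients are treated alike), and np.linalg.solve is used instead of an explicit matrix inverse; the polynomial design matrix and the values of α are illustrative.

```python
# Sketch: ridge regression via the modified normal equation w* = (X^T X + alpha I)^-1 X^T y.
import numpy as np

def ridge_fit(X, y, alpha):
    """Return ridge-regression weights for design matrix X and targets y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 3, size=50)
y = x ** 2 + rng.normal(0, 1, size=50)
X = np.vander(x, N=4, increasing=True)        # columns 1, x, x^2, x^3

for alpha in (0.0, 1.0, 1e6):
    print(alpha, ridge_fit(X, y, alpha))      # alpha -> 0: OLS; alpha -> infinity: weights -> 0
```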


Lasso regularization


Lasso regularization penalizes the size of the model coefficients:

■ Modification of the optimization criterion:

J(w) = (1/|T|) ∑_{i=1}^{|T|} (y^(i) − h_w(x^(i)))² + α ∑_{d=1}^{D} |w_d|

■ The solution is usually found by quadratic programming.

■ As α grows, lasso regularization decreases the number of non-zero coefficients, i.e. it effectively performs feature selection.

Training and testing errors as functions of the regularization parameter:

[Figure: training and testing MSE vs. the regularization factor α (10⁻⁸ to 10²).]

The values of the coefficients as functions of the regularization parameter:

[Figure: coefficient values vs. the regularization factor α (10⁻⁸ to 10²).]
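A sketch using scikit-learn's Lasso, which minimizes a criterion of this form by coordinate descent rather than quadratic programming; the polynomial features, the standardization and the values of α are illustrative choices, not from the slides.

```python
# Sketch: larger alpha typically leaves fewer non-zero lasso coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
x = rng.uniform(-1, 3, size=100)
y = x ** 2 + rng.normal(0, 1, size=100)

X = np.vander(x, N=10, increasing=True)[:, 1:]   # features x, x^2, ..., x^9
X = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize features (usual practice)

for alpha in (1e-4, 1e-2, 1.0):
    w = Lasso(alpha=alpha, max_iter=50000).fit(X, y).coef_
    print(f"alpha={alpha}: non-zero coefficients: {np.count_nonzero(w)}")
```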