CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro

Transcript of PowerPoint presentation - Carnegie Mellon University, gdicaro/15381/slides/381-F18...

Page 1

CMU-Q 15-381 Lecture 24:

Supervised Learning 2

Teacher:

Gianni A. Di Caro

Page 2

SUPERVISED LEARNING

Given a collection of input features and outputs (x^(i), y^(i)), i = 1, …, m, and a hypothesis function h_θ, find parameter values θ that minimize the average empirical error:

\min_{\boldsymbol{\theta}} \; \frac{1}{m} \sum_{i=1}^{m} \ell\big( h_\theta(\boldsymbol{x}^{(i)}),\; y^{(i)} \big)

We need to specify:

1. The hypothesis class ℋ, with h_θ ∈ ℋ

2. The loss function ℓ

3. The algorithm for solving the optimization problem (often approximately)

4. A complete ML design: from data processing to learning to validation and testing

[Diagram annotations: Given — labeled data; Errors — performance criteria (loss ℓ); Hypotheses space ℋ; Hypothesis function h_θ]

Page 3

CLASSIFICATION AND REGRESSION

Complex boundaries, relations

Which hypothesis class ℋ?

Classification: (Width, Lightness) → {Salmon, Sea bass} (discrete)

h_θ(x): X ⊆ ℝ² → Y = {0, 1}

Regression: (Width, Lightness) → Weight (continuous)

h_θ(x): X ⊆ ℝ² → Y ⊆ ℝ

Features:

Width, Lightness

Page 4

PROBABILISTIC MODELS: DISCRIMINATIVE VS. GENERATIVE

Discriminative models:

Directly learn p(y | x)

Parametric hypothesis

Allow us to discriminate between classes / predicted outputs

Generative models / Probability distributions:

Learn p(x, y), the probabilistic model that describes the data, then use Bayes' rule

Allow us to generate any relevant data

Regression and classification problems can be stated in probabilistic terms (later)

The mapping y = h_θ(x) that we are learning can be naturally interpreted as the probability of the output being y given the input data x (under the selected hypothesis h and the learned parameter vector θ)

๐‘ ๐‘ฆ ๐’™) =๐‘ ๐’™ ๐‘ฆ)๐‘(๐‘ฆ)

๐‘(๐’™)=๐‘(๐’™, ๐‘ฆ)

๐‘(๐’™)๐‘ฅ2

๐‘ฅ1๐‘ฅ2

= salmon

= sea bass

๐‘ฅ1

Page 5

GENERATIVE MODELS

A generative approach would proceed as follows:

1. By looking at the feature data about salmons, build

a model of a salmon

2. By looking at the feature data about sea basses,

build a model of a sea bass

A discriminative model, which learns p(y | x; θ), can be used to label the data, to discriminate among the data, but not to generate the data

o E.g., a discriminative approach tries to find out

which (linear, in this case) decision boundary

allows for the best classification based on the

training data, and takes decisions accordingly

o Direct learning of the mapping from X to Y

3. To classify a new fish based on its features x, we can match it against the

salmon and the sea bass models, to see whether it looks more like the

salmons or more like the sea basses we had seen in the training set

Steps 1–3 are equivalent to modeling p(x | y), where y ∈ {ω₁, ω₂}: the conditional probability that the observed features x are those of a salmon or of a sea bass

Page 6

GENERATIVE MODELS

๐‘ ๐‘ฆ ๐’™) =๐‘ ๐’™ ๐‘ฆ)๐‘(๐‘ฆ)

๐‘(๐’™)=๐‘(๐’™, ๐‘ฆ)

๐‘(๐’™)

๐‘ ๐’™ ๐‘ฆ = ๐œ”1) models the distribution of salmonโ€™s features

๐‘ ๐’™ ๐‘ฆ = ๐œ”2) models the distribution of sea bassโ€™ features

๐‘(๐‘ฆ) can be derived from the dataset or from other sources

o E.g., ๐‘(๐œ”1) = ratio of salmons in the dataset, ๐‘(๐œ”2) = ratio of sea basses

Bayes rule needs the evidence p(x), obtained by total probability:

p(\boldsymbol{x}) = p(\boldsymbol{x} \mid y = \omega_1)\, p(y = \omega_1) + p(\boldsymbol{x} \mid y = \omega_2)\, p(y = \omega_2)

To make a prediction:

\arg\max_{y}\; p(y \mid \boldsymbol{x}) \;=\; \arg\max_{y}\; \frac{p(\boldsymbol{x} \mid y)\, p(y)}{p(\boldsymbol{x})} \;=\; \arg\max_{y}\; p(\boldsymbol{x} \mid y)\, p(y)

\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}

Equivalent to: decide ω₁ if p(ω₁ | x) > p(ω₂ | x), otherwise decide ω₂
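To make the recipe concrete, here is a minimal sketch (not from the slides; the synthetic (width, lightness) data and the function names are illustrative assumptions) of a generative classifier that fits one Gaussian per class as p(x|y), estimates the priors p(y) from class frequencies, and predicts with argmax_y p(x|y) p(y):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Fit a class-conditional Gaussian p(x|y) and a prior p(y) per class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),          # p(y = c): class frequency
            "mean": Xc.mean(axis=0),            # parameters of p(x | y = c)
            "cov": np.cov(Xc, rowvar=False),
        }
    return model

def predict(model, x):
    """Decide the class maximizing p(x|y) p(y); the evidence p(x) cancels."""
    scores = {c: multivariate_normal.pdf(x, m["mean"], m["cov"]) * m["prior"]
              for c, m in model.items()}
    return max(scores, key=scores.get)

# Toy (width, lightness) data: class 0 = salmon, class 1 = sea bass
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([3, 5], 0.5, (50, 2)),
               rng.normal([5, 3], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = fit_generative(X, y)
print(predict(model, np.array([3.2, 4.8])))   # expected: 0 (salmon-like)
```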

Page 7

GENERATIVE MODELS AND BAYES DECISION RULE

[Figure: class-conditional densities and the resulting decision regions; note the two disconnected regions for class 2]

Decide ๐œ”1 if ๐‘ ๐’™ ๐œ”1)๐‘ ๐œ”1 > ๐‘ ๐’™ ๐œ”2)๐‘(๐œ”2) otherwise decide ๐œ”2

Decide ๐œ”1 if ๐‘ ๐’™ ๐œ”1)

๐‘ ๐’™ ๐œ”2)>

๐‘(๐œ”2)

๐‘(๐œ”1)otherwise decide ๐œ”2

Page 8

GENERATIVE MODELS

Given the joint distribution we can obtain any conditional or marginal probability

Sample from p(x, y) to obtain labeled data points

Given the priors p(y), sample a class or a predictor value

Given the class y, sample instance data from p(x | y) for that class, or, given a predictor variable, sample an expected output (see the sketch below)
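A minimal sketch of this ancestral sampling for a two-class Gaussian model like the fish example; the helper name and the hard-coded parameter values are illustrative assumptions, not from the slides:

```python
import numpy as np

def sample_labeled_points(model, n, rng=np.random.default_rng(1)):
    """Ancestral sampling from p(x, y): first y ~ p(y), then x ~ p(x | y)."""
    classes = list(model.keys())
    priors = [model[c]["prior"] for c in classes]
    samples = []
    for _ in range(n):
        c = rng.choice(classes, p=priors)                  # sample a class from the prior
        x = rng.multivariate_normal(model[c]["mean"],      # sample features from p(x | y = c)
                                    model[c]["cov"])
        samples.append((x, c))
    return samples

# Hard-coded class-conditional Gaussians (illustrative values)
model = {
    0: {"prior": 0.5, "mean": np.array([3.0, 5.0]), "cov": 0.25 * np.eye(2)},   # salmon
    1: {"prior": 0.5, "mean": np.array([5.0, 3.0]), "cov": 0.25 * np.eye(2)},   # sea bass
}
for x, c in sample_labeled_points(model, 5):
    print(c, x)
```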

Downside: higher complexity, more parameters to learn

Density estimation problem:

Parametric (e.g., Gaussian densities)

Non-parametric (full density estimation)

Page 9

LET'S GO BACK TO LINEAR REGRESSION…

Linear model as hypothesis:

y = h(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \boldsymbol{w}^T \boldsymbol{x}, \qquad \boldsymbol{x} = (1, x_1, x_2, \cdots, x_d)

Find w that minimizes the deviation from the desired answers: y^(i) ≈ h(x^(i)), ∀i in the dataset

Loss function: Mean squared error (MSE)

\ell = \frac{1}{m} \sum_{i=1}^{m} \big( y^{(i)} - h(\boldsymbol{x}^{(i)}) \big)^2

The model does not try to explain the variation in the observed y's for the data
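A minimal sketch (synthetic data, illustrative names) of fitting this hypothesis by minimizing the MSE with the closed-form least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 2
X = rng.normal(size=(m, d))
true_w = np.array([1.5, -2.0, 0.7])                 # [w0, w1, w2], only for generating data
y = true_w[0] + X @ true_w[1:] + rng.normal(0, 0.1, m)

# Prepend the constant feature so that h(x; w) = w^T x with x = (1, x1, ..., xd)
X1 = np.hstack([np.ones((m, 1)), X])

# Minimize the MSE: closed-form least-squares solution
w_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
mse = np.mean((y - X1 @ w_hat) ** 2)
print(w_hat, mse)
```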

Page 10

STATISTICAL MODEL FOR LINEAR REGRESSION

A statistical model of linear regression:

y = \boldsymbol{w}^T \boldsymbol{x} + \varepsilon, \qquad E[y \mid \boldsymbol{x}] = \boldsymbol{w}^T \boldsymbol{x}

The model does explain the variation in the observed y's in terms of white Gaussian noise:

\varepsilon \sim N(0, \sigma^2) \quad\Rightarrow\quad y \sim N(\boldsymbol{w}^T \boldsymbol{x},\, \sigma^2)

The conditional distribution of y given x:

p(y \mid \boldsymbol{x}; \boldsymbol{w}, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{1}{2\sigma^2} \big( y - \boldsymbol{w}^T \boldsymbol{x} \big)^2 \right)

Probability of the output being y given the predictor x
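As a small sanity check (not from the slides), a sketch evaluating this density directly and against scipy's normal density; the numbers are arbitrary:

```python
import numpy as np
from scipy.stats import norm

def p_y_given_x(y, x, w, sigma):
    """Gaussian conditional density p(y | x; w, sigma) = N(y; w^T x, sigma^2)."""
    mu = w @ x
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

w = np.array([1.5, -2.0, 0.7])
x = np.array([1.0, 0.3, -1.2])                 # includes the constant feature
print(p_y_given_x(0.2, x, w, sigma=0.5))
print(norm.pdf(0.2, loc=w @ x, scale=0.5))     # same value
```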

Page 11

STATISTICAL MODEL FOR LINEAR REGRESSION

Let's consider the entire data set 𝔇, and let's assume that all samples are independent and identically distributed (i.i.d.) random variables

What is the joint probability of all the training data? That is, the probability of observing all the outputs y in 𝔇 given w and σ?

p\big(y^{(1)}, y^{(2)}, \cdots, y^{(m)} \mid \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(m)};\, \boldsymbol{w}, \sigma\big)

By i.i.d.:

p\big(y^{(1)}, y^{(2)}, \cdots, y^{(m)} \mid \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(m)};\, \boldsymbol{w}, \sigma\big) = \prod_{i=1}^{m} p\big(y^{(i)} \mid \boldsymbol{x}^{(i)};\, \boldsymbol{w}, \sigma\big)

Maximum likelihood estimation of the parameters w: the parameter values that maximize the likelihood of the predictions, i.e., the values such that the probability of observing the data in 𝔇 is maximized

Likelihood function of the predictions (the probability of observing the outputs y in 𝔇 given w and σ):

L(\mathfrak{D}, \boldsymbol{w}, \sigma) = \prod_{i=1}^{m} p\big(y^{(i)} \mid \boldsymbol{x}^{(i)};\, \boldsymbol{w}, \sigma\big)

\boldsymbol{w}^* = \arg\max_{\boldsymbol{w}}\; L(\mathfrak{D}, \boldsymbol{w}, \sigma)

Page 12

STATISTICAL MODEL FOR LINEAR REGRESSION

Log-likelihood:

l(\mathfrak{D}, \boldsymbol{w}, \sigma) = \log L(\mathfrak{D}, \boldsymbol{w}, \sigma) = \log \prod_{i=1}^{m} p\big(y^{(i)} \mid \boldsymbol{x}^{(i)};\, \boldsymbol{w}, \sigma\big) = \sum_{i=1}^{m} \log p\big(y^{(i)} \mid \boldsymbol{x}^{(i)};\, \boldsymbol{w}, \sigma\big)

Using the conditional density  p(y \mid \boldsymbol{x}; \boldsymbol{w}, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}\big(y - \boldsymbol{w}^T\boldsymbol{x}\big)^2\right):

l(\mathfrak{D}, \boldsymbol{w}, \sigma) = \sum_{i=1}^{m} \log\!\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}\big(y^{(i)} - \boldsymbol{w}^T\boldsymbol{x}^{(i)}\big)^2\right)\right] = \sum_{i=1}^{m}\left[-\frac{1}{2\sigma^2}\big(y^{(i)} - \boldsymbol{w}^T\boldsymbol{x}^{(i)}\big)^2\right] - c(\sigma)

= -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - \boldsymbol{w}^T\boldsymbol{x}^{(i)}\big)^2 - c(\sigma), \qquad \text{where } c(\sigma) = m \log\big(\sigma\sqrt{2\pi}\big) \text{ does not depend on } \boldsymbol{w}

Does it look familiar?

Maximizing the predictive log-likelihood with respect to w is equivalent to minimizing the MSE loss function:

\max_{\boldsymbol{w}}\; l(\mathfrak{D}, \boldsymbol{w}, \sigma) \;\sim\; \min_{\boldsymbol{w}}\; \mathrm{MSE}

More generally, a least-squares linear fit under Gaussian noise corresponds to the maximum likelihood estimator of the data
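As a quick numerical check of this equivalence (a sketch with synthetic data and illustrative names, not part of the slides), maximizing the Gaussian log-likelihood numerically gives the same parameters as the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m = 200
X1 = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])    # x = (1, x1, x2)
w_true, sigma = np.array([0.5, 2.0, -1.0]), 0.3
y = X1 @ w_true + rng.normal(0, sigma, m)

def neg_log_likelihood(w, sigma=0.3):
    """-l(D, w, sigma) for the Gaussian model y ~ N(w^T x, sigma^2)."""
    resid = y - X1 @ w
    return np.sum(resid ** 2) / (2 * sigma ** 2) + m * np.log(sigma * np.sqrt(2 * np.pi))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x        # maximize the likelihood
w_lsq, *_ = np.linalg.lstsq(X1, y, rcond=None)                # minimize the MSE
print(np.allclose(w_mle, w_lsq, atol=1e-3))                   # True: same estimator (up to tolerance)
```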

Page 13

NON-LINEAR, ADDITIVE REGRESSION MODELS

Page 14

NON-LINEAR PROBLEMS?

Design a non-linear regressor / classifier

Modify the input data to make the problem linear

Page 15

MAP DATA IN HIGHER DIMENSIONALITY FEATURE SPACES

Page 16

MAP DATA IN HIGHER DIMENSIONALITY FEATURE SPACES

A property of the SVM solution (it is expressed in terms of dot products between feature vectors) makes it easy to define a kernel function that implicitly performs the desired transformation, which allows us to keep using linear classifiers.

The hyperplane is found in z-space, then projected back into x-space, where it is an ellipse
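A minimal sketch of the underlying idea (not the SVM kernel machinery itself), assuming the explicit quadratic map z = (x₁², √2·x₁x₂, x₂²): an elliptical boundary in x-space becomes a hyperplane in z-space, so a linear rule suffices; the data and names are illustrative.

```python
import numpy as np

def quad_map(X):
    """Explicit quadratic feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2], axis=1)

# Points inside vs. outside the circle x1^2 + x2^2 = 1 are not linearly separable in x-space
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# In z-space the same boundary is the hyperplane z1 + z3 = 1, so a linear classifier works
Z = quad_map(X)
pred = (Z[:, 0] + Z[:, 2] < 1).astype(int)
print((pred == y).mean())   # 1.0: perfectly separated by a hyperplane after the mapping
```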

Page 17

NON-LINEAR, ADDITIVE REGRESSION MODELS

Main idea to model nonlinearities: replace the inputs to the linear units with p feature (basis) functions φ_j(x), j = 1, …, p, where φ_j(x) is an arbitrary function of x

y = h(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + w_2 \phi_2(\boldsymbol{x}) + \cdots + w_p \phi_p(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{\Phi}(\boldsymbol{x})

[Diagram annotations: original feature input → new input → linear model]

Page 18

EXAMPLES OF FEATURE FUNCTIONS

Higher-order polynomial with one-dimensional input, x = (x):

\phi_1(x) = x, \quad \phi_2(x) = x^2, \quad \phi_3(x) = x^3, \;\cdots

Quadratic polynomial with two-dimensional inputs, x = (x₁, x₂):

\phi_1(\boldsymbol{x}) = x_1, \quad \phi_2(\boldsymbol{x}) = x_1^2, \quad \phi_3(\boldsymbol{x}) = x_2, \quad \phi_4(\boldsymbol{x}) = x_2^2, \quad \phi_5(\boldsymbol{x}) = x_1 x_2

Transcendental functions:

\phi_1(x) = \sin(x), \quad \phi_2(x) = \cos(x)

…
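A minimal sketch (illustrative names and synthetic data) of building one such feature map and reusing the plain linear-regression machinery on it:

```python
import numpy as np

def poly_features(x, p):
    """Phi(x) = (1, x, x^2, ..., x^p) for a 1-D input x (array of samples)."""
    return np.vander(x, N=p + 1, increasing=True)

# Noisy samples of a non-linear target
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.normal(size=50)

Phi = poly_features(x, p=5)                    # new inputs for the linear model
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # same machinery as plain linear regression
print(np.mean((y - Phi @ w) ** 2))             # training MSE of the degree-5 fit
```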

Page 19

SOLUTION USING FEATURE FUNCTIONS

The same techniques (analytical gradient + system of equations, or gradient descent) used for the plain linear case, with MSE as the loss function:

\boldsymbol{\Phi}(\boldsymbol{x}^{(i)}) = \big(1, \phi_1(\boldsymbol{x}^{(i)}), \phi_2(\boldsymbol{x}^{(i)}), \cdots, \phi_p(\boldsymbol{x}^{(i)})\big)

h(\boldsymbol{x}^{(i)}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}^{(i)}) + w_2 \phi_2(\boldsymbol{x}^{(i)}) + \cdots + w_p \phi_p(\boldsymbol{x}^{(i)}) = \boldsymbol{w}^T \boldsymbol{\Phi}(\boldsymbol{x}^{(i)})

\ell = \frac{1}{m} \sum_{i=1}^{m} \big( y^{(i)} - h(\boldsymbol{x}^{(i)}) \big)^2

To find \min_{\boldsymbol{w}} \ell we have to look where \nabla_{\boldsymbol{w}} \ell = 0:

\nabla_{\boldsymbol{w}} \ell = -\frac{2}{m} \sum_{i=1}^{m} \big( y^{(i)} - h(\boldsymbol{x}^{(i)}) \big)\, \boldsymbol{\Phi}(\boldsymbol{x}^{(i)}) = \boldsymbol{0}

This results in a system of p linear equations (see the sketch below):

w_0 \sum_{i=1}^{m} 1 \cdot \phi_j(\boldsymbol{x}^{(i)}) + w_1 \sum_{i=1}^{m} \phi_1(\boldsymbol{x}^{(i)}) \phi_j(\boldsymbol{x}^{(i)}) + \cdots + w_k \sum_{i=1}^{m} \phi_k(\boldsymbol{x}^{(i)}) \phi_j(\boldsymbol{x}^{(i)}) + \cdots + w_p \sum_{i=1}^{m} \phi_p(\boldsymbol{x}^{(i)}) \phi_j(\boldsymbol{x}^{(i)}) = \sum_{i=1}^{m} y^{(i)} \phi_j(\boldsymbol{x}^{(i)}), \qquad \forall j = 1, \cdots, p

Page 20

EXAMPLE OF SGD WITH FEATURE FUNCTIONS

One-dimensional feature vectors and a high-order polynomial: x = (x), φ_i(x) = x^i

h(x; \boldsymbol{w}) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \cdots + w_p \phi_p(x) = w_0 + \sum_{i=1}^{p} w_i x^i

On-line, single-sample (x^(i), y^(i)) gradient update (a negative-gradient step on ℓ, with constant factors absorbed into α), ∀j = 1, …, p (see the sketch below):

w_j \leftarrow w_j + \alpha \big( y^{(i)} - h(\boldsymbol{x}^{(i)}) \big)\, \phi_j(\boldsymbol{x}^{(i)})

Same form as in the linear regression model, with x_j^{(i)} → φ_j(x^{(i)})

Page 21

ELECTRICITY EXAMPLE

New data: it doesn't look linear anymore

Page 22

NEW HYPOTHESIS

The complexity of the model grows: one parameter for each feature transformed according to a polynomial of order 2 (at least 3 parameters vs. the 2 of the original hypothesis)

Page 23

NEW HYPOTHESIS

At least 5 parameters (if we had multiple predicting features, all their order-d products would have to be considered, resulting in a number of additional parameters)

Page 24

NEW HYPOTHESIS

The number of parameters is now larger than the number of data points, so the polynomial can fit the data almost exactly → Overfitting

Page 25

SELECTING MODEL COMPLEXITY

Dataset with 10 points, 1D features: which hypothesis class should we use?

Linear regression:  y = h(x; w) = w₀ + w₁x

Polynomial regression, cubic:  y = h(x; w) = w₀ + w₁x + w₂x² + w₃x³

MSE as the loss function

Which model would give the smaller error in terms of MSE / least-squares fit?

Page 26

SELECTING MODEL COMPLEXITY

Cubic regression provides a better fit to the data, and a smaller MSE

Should we stick with the hypothesis h(x; w) = w₀ + w₁x + w₂x² + w₃x³?

Since a higher-order polynomial seems to provide a better fit, why don't we use a polynomial of order higher than 3?

What is the highest order that makes sense for the given problem?

Page 27

SELECTING MODEL COMPLEXITY

For 10 data points, a degree 9 polynomial gives a perfect fit (Lagrange

interpolation). Error is zero.

Is it always good to minimize (even reduce to zero) the training error?

Related (and more important) question: how will we perform on new, unseen data?
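A small sketch (synthetic data, illustrative choices) that fits degree-1, 3, and 9 polynomials to 10 training points and compares the training MSE with the MSE on fresh samples from the same source:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(3 * x)
x_train = np.linspace(-1, 1, 10)
y_train = true_f(x_train) + 0.1 * rng.normal(size=10)
x_new = rng.uniform(-1, 1, 100)                     # unseen data from the same source
y_new = true_f(x_new) + 0.1 * rng.normal(size=100)

for degree in (1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(w, x_train)) ** 2)
    new_mse = np.mean((y_new - np.polyval(w, x_new)) ** 2)
    print(degree, round(train_mse, 4), round(new_mse, 4))
# degree 9: training MSE ~ 0 (it interpolates the 10 points),
# but the error on unseen data is typically much larger than for degree 3
```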

Page 28

OVERFITTING

The degree-9 polynomial model totally fails the prediction for the new point!

Overfitting: Situation when the training error is low and the generalization error

is high. Causes of the phenomenon:

Highly complex hypothesis model, with a large number of parameters

(degrees of freedom)

Small data size (as compared to the complexity of the model)

The learned function has enough degrees of freedom to (over)fit all data perfectly

Page 29

OVERFITTING

Empirical loss vs. Generalization loss

Page 30

TRAINING AND VALIDATION LOSS

Page 31

SPLITTING DATASET IN TWO

Page 32

PERFORMANCE ON VALIDATION SET

Page 33

PERFORMANCE ON VALIDATION SET

Page 34

INCREASING MODEL COMPLEXITY

In this case, the small size of the dataset makes it easy to overfit by increasing the degree of the polynomial (i.e., the hypothesis complexity). For a large multi-dimensional dataset this effect is less strong / evident

Page 35

TRAINING VS. VALIDATION LOSS

Page 36

MODEL SELECTION AND EVALUATION PROCESS

1. Break all available data into training and testing sets (e.g., 70% / 30%)

2. Break training set into training and validation sets (e.g., 70% / 30%)

3. Loop:

i. Set a hyperparameter value (e.g., degree of the polynomial → model complexity)

ii. Train the model using training sets

iii. Validate the model using validation sets

iv. Exit loop if (validation errors keep growing && training errors go to zero)

4. Choose hyperparameters using validation set results: hyperparameter values

corresponding to lowest validation errors

5. (Optional) With the selected hyperparameters, retrain the model using all training

data sets

6. Evaluate (generalization) performance on the testing sets

(more on this next time)
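A hedged sketch of steps 1–6 for the polynomial-degree hyperparameter, in plain NumPy; the split ratios, the degree grid, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=200)

# 1. Split all data into training and testing sets (70% / 30%)
idx = rng.permutation(200)
train, test = idx[:140], idx[140:]
# 2. Split the training set into internal training and validation sets (70% / 30%)
inner, val = train[:98], train[98:]

def mse(w, xs, ys):
    return np.mean((ys - np.polyval(w, xs)) ** 2)

# 3.-4. Loop over the hyperparameter (polynomial degree), keep the lowest validation error
results = {}
for degree in range(1, 10):
    w = np.polyfit(x[inner], y[inner], degree)     # train on the internal training set
    results[degree] = mse(w, x[val], y[val])       # validate on the validation set
best_degree = min(results, key=results.get)

# 5. Retrain with the selected hyperparameter on all training data
w_best = np.polyfit(x[train], y[train], best_degree)

# 6. Evaluate generalization performance on the testing set
print(best_degree, mse(w_best, x[test], y[test]))
```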

Page 37

MODEL SELECTION AND EVALUATION PROCESS

[Diagram: the Dataset is split into a Training set and a Testing set; the Training set is further split into an Internal training set and a Validation set. Model 1, Model 2, …, Model n are each learned on the internal training set ("Learn i") and assessed on the validation set ("Validate i"); the best model is selected (Model *) and learned again ("Learn *").]