
Simple Linear Regression (single variable)
Introduction to Machine Learning

Marek Petrik

January 31, 2017

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

Last Class

1. Basic machine learning framework

Y = f(X)

2. Prediction vs inference: predict Y vs understand f

3. Parametric vs non-parametric: linear regression vs k-NN

4. Classification vs regression: k-NN vs linear regression

5. Why we need to have a test set: overfitting

What is Machine Learning
I Discover unknown function f:

Y = f(X)

I X = set of features, or inputs
I Y = target, or response

[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]

Sales = f(TV, Radio, Newspaper)

Errors in Machine Learning: World is Noisy

I World is too complex to model precisely
I Many features are not captured in data sets
I Need to allow for errors ε in f:

Y = f(X) + ε

How Good are Predictions?

I Learned function f̂
I Test data: (x1, y1), (x2, y2), . . .

I Mean Squared Error (MSE):

MSE = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{f}(x_i)\bigr)^2

I This is the estimate of:

MSE = E\bigl[(Y - \hat{f}(X))^2\bigr] = \frac{1}{|\Omega|} \sum_{\omega \in \Omega} \bigl(Y(\omega) - \hat{f}(X(\omega))\bigr)^2

I Important: Samples xi are i.i.d.
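A minimal R sketch of the test MSE above, assuming a fitted model `fit` and a held-out data frame `test` with columns `x` and `y` (the names are illustrative, not from the slides):

```r
# Test MSE: average squared prediction error on held-out data.
# Assumes `fit` was trained on separate training data.
test_mse <- function(fit, test) {
  y_hat <- predict(fit, newdata = test)   # f_hat(x_i) for each test point
  mean((test$y - y_hat)^2)                # (1/n) * sum of squared errors
}
```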

Do We Need Test Data?
I Why not just test on the training data?

[Figure: left, data with fitted curves of increasing flexibility; right, Mean Squared Error as a function of flexibility]

I Flexibility is the degree of the polynomial being fit
I Gray line: training error, red line: testing error
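A small R illustration of this point on simulated data (everything here, the true function and noise level included, is made up): training MSE keeps falling as the polynomial degree grows, while test MSE eventually rises.

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 100)
y <- 5 + 0.1 * x - 0.001 * x^2 + rnorm(n, sd = 1)   # smooth truth + noise
train <- sample(n, n / 2)                            # random train/test split

for (degree in c(1L, 2L, 5L, 10L, 20L)) {
  fit <- lm(y ~ poly(x, degree), data = data.frame(x, y), subset = train)
  pred <- predict(fit, newdata = data.frame(x = x))
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", degree,
              mean((y[train] - pred[train])^2),      # error on training half
              mean((y[-train] - pred[-train])^2)))   # error on held-out half
}
```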

Types of Function f

Regression: continuous target

f : X → R

[Figure: Income as a smooth surface over Years of Education and Seniority]

Classification: discrete target

f : X → {1, 2, 3, . . . , k}

[Figure: two-class scatter plot in the (X1, X2) plane]

Bias-Variance Decomposition

Y = f(X) + ε

Mean Squared Error can be decomposed as:

MSE = E\bigl[(Y - \hat{f}(X))^2\bigr] = \underbrace{\mathrm{Var}(\hat{f}(X))}_{\text{Variance}} + \underbrace{\bigl(E[\hat{f}(X)] - f(X)\bigr)^2}_{\text{Bias}^2} + \mathrm{Var}(\varepsilon)

I Bias: how well the method would work with infinite data
I Variance: how much the output changes with different data sets
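A rough R simulation of this decomposition (entirely illustrative: the true function, noise level, and model are made up): refit the same model on many fresh training sets and examine its prediction at a single point x0.

```r
set.seed(1)
f_true <- function(x) sin(2 * x)   # made-up "true" regression function
x0    <- 1                         # point at which we inspect the estimator
sigma <- 0.3                       # noise standard deviation

# Refit a deliberately too-simple linear model on many fresh training sets
# and record its prediction at x0.
preds <- replicate(2000, {
  x <- runif(50, 0, 3)
  y <- f_true(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ x), newdata = data.frame(x = x0))
})

c(bias_sq  = (mean(preds) - f_true(x0))^2,  # squared bias at x0
  variance = var(preds),                    # variance of the fitted value
  noise    = sigma^2)                       # irreducible error Var(eps)
```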

Today

I Basics of linear regression
I Why linear regression
I How to compute it
I Why compute it

Simple Linear Regression
I We have only one feature:

Y ≈ β0 + β1X Y = β0 + β1X + ε

I Example:

[Figure: scatter plot of Sales against TV budget with a fitted line]

Sales ≈ β0 + β1 × TV
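In R this fit is a single call to `lm`. The sketch below assumes the ISL Advertising data has been saved locally as `Advertising.csv` with columns named `TV` and `Sales` (the path and column names are assumptions; adjust them to your copy of the file):

```r
ads <- read.csv("Advertising.csv")   # assumed local copy of the ISL data
fit <- lm(Sales ~ TV, data = ads)    # Sales ~ beta0 + beta1 * TV
coef(fit)                            # estimated beta0 (intercept) and beta1
```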

How To Estimate Coefficients
I No line will pass through all data points (xi, yi) without error
I Prediction:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

I Errors (yi are the true values):

e_i = y_i - \hat{y}_i

[Figure: TV vs. Sales with the least-squares line and vertical residual segments]

Residual Sum of Squares

I Residual Sum of Squares

RSS = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2

I Equivalently:

RSS = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
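As a quick sketch, the RSS of a candidate line is a one-liner in R (x and y denote the feature and response vectors; the function name is illustrative):

```r
# Residual sum of squares for a candidate line y = b0 + b1 * x
rss <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)
```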

Minimizing Residual Sum of Squares

\min_{\beta_0, \beta_1} RSS = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

[Figure: RSS surface as a function of β0 and β1]

[Figure: contour plot of RSS as a function of β0 and β1]

Solving for Minimal RSS

\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

I RSS is a convex function of β0, β1

I Minimum achieved when (recall the chain rule):

\frac{\partial RSS}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0

\frac{\partial RSS}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0

Linear Regression Coefficients

\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Solution:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}

where

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
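A quick sanity check of these formulas in R on simulated data (illustrative only); `lm` should agree with the closed form up to rounding:

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)

beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)

c(beta0 = beta0, beta1 = beta1)   # closed-form least-squares solution
coef(lm(y ~ x))                   # should match
```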

Why Minimize RSS

1. Maximizes the likelihood when Y = β0 + β1X + ε with ε ∼ N(0, σ²)

2. Best Linear Unbiased Estimator (BLUE): Gauss-Markov Theorem (ESL 3.2.2)

3. It is convenient: can be solved in closed form


Bias in Estimation

I Assume a true value µ*

I Estimate µ̂ is unbiased when E[µ̂] = µ*

I Standard mean estimate is unbiased (e.g. X ∼ N (0, 1)):

E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = 0

I Standard variance estimate is biased (e.g. X ∼ N (0, 1)):

E\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right] \neq 1
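A small R simulation of this bias (illustrative; note that R's built-in `var` divides by n − 1 and is unbiased, while the estimator below divides by n):

```r
set.seed(1)
n <- 10
est <- replicate(10000, {
  x <- rnorm(n)                # X ~ N(0, 1), true variance 1
  sum((x - mean(x))^2) / n     # 1/n variance estimate (biased)
})
mean(est)                      # approximately (n - 1)/n = 0.9, not 1
```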

Linear Regression is Unbiased

[Figure: least-squares fits from many random samples scatter around the true population regression line]

Gauss-Markov Theorem (ESL 3.2.2)

Solution of Linear Regression

[Figure: least-squares fit of Sales on TV]

How Good is the Fit

I How well is linear regression predicting the training data?
I Can we be sure that TV advertising really influences the sales?
I What is the probability that we just got lucky?

R2 Statistic

R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

I RSS - residual sum of squares, TSS - total sum of squares
I R² measures the goodness of the fit as a proportion
I Proportion of data variance explained by the model
I Extreme values:
  0: Model does not explain data
  1: Model explains data perfectly
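In R, R² can be computed directly from this definition or read off `summary(fit)`. The sketch below reuses the hypothetical `ads` data frame and `Sales ~ TV` fit from the earlier example (column names assumed):

```r
fit <- lm(Sales ~ TV, data = ads)
rss <- sum(residuals(fit)^2)                  # sum((y_i - y_hat_i)^2)
tss <- sum((ads$Sales - mean(ads$Sales))^2)   # sum((y_i - y_bar)^2)
1 - rss / tss                                 # R^2 by the definition above
summary(fit)$r.squared                        # should match
```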

Example: TV Impact on Sales

[Figure: Sales vs. TV with the least-squares fit]

R² = 0.61

Example: Radio Impact on Sales

[Figure: Sales vs. Radio with the least-squares fit]

R² = 0.33

Example: Newspaper Impact on Sales

[Figure: Sales vs. Newspaper with the least-squares fit]

R² = 0.05

Correlation Coefficient

I Measures dependence between two random variables X and Y

r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)} \sqrt{\mathrm{Var}(Y)}}

I Correlation coefficient r is in [−1, 1]:
  0: Variables are not related
  1: Variables are perfectly related (same)
  −1: Variables are negatively related (different)

I R² = r²
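For simple linear regression with a single feature, this identity is easy to check in R (again using the hypothetical `ads` data frame with assumed column names):

```r
cor(ads$TV, ads$Sales)^2                         # squared correlation r^2
summary(lm(Sales ~ TV, data = ads))$r.squared    # equals the R^2 of the fit
```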


Hypothesis Testing

I Null hypothesis H0:

There is no relationship between X and Y

β1 = 0

I Alternative hypothesis H1:

There is some relationship between X and Y

β1 ≠ 0

I Seek to reject hypothesis H0 with a small “probability” (p-value) of making a mistake

I Important topic, but beyond the scope of the course
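R reports this test for each coefficient: `summary` of an `lm` fit gives the standard error, t-statistic and p-value for β1 (same hypothetical `ads` data as above):

```r
fit <- lm(Sales ~ TV, data = ads)
summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
# A very small Pr(>|t|) in the TV row is evidence against H0: beta1 = 0
```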

Multiple Linear Regression

I Usually more than one feature is available

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

I In general:

Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j

Multiple Linear Regression

[Figure: regression plane fit to Y as a function of X1 and X2]

Estimating Coefficients

I Prediction:

\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij}

I Errors (yi are the true values):

e_i = y_i - \hat{y}_i

I Residual Sum of Squares

RSS = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2

I How to minimize RSS? Linear algebra!
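A sketch of that linear-algebra solution: stack the features into a design matrix X with a leading column of ones; the least-squares coefficients then solve the normal equations (XᵀX)β = Xᵀy. The data below is made up purely for illustration:

```r
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X    <- cbind(1, x1, x2)                # design matrix with intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations
beta
coef(lm(y ~ x1 + x2))                   # should agree
```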

Inference from Linear Regression

1. Are predictors X1, X2, . . . , Xp really predicting Y ?

2. Is only a subset of predictors useful?

3. How well does linear model fit data?

4. What Y should we predict and how accurate is it?

Inference 1

“Are predictors X1, X2, . . . , Xp really predicting Y ?”

I Null hypothesis H0:

There is no relationship between X and Y

β1 = β2 = · · · = βp = 0

I Alternative hypothesis H1:

There is some relationship between X and Y

at least one βj ≠ 0

I Seek to reject hypothesis H0 with a small “probability” (p-value) of making a mistake

I See ISL 3.2.2 on how to compute F-statistic and reject H0

Inference 2

“Is only a subset of predictors useful?”

I Compare prediction accuracy with only a subset of features

I RSS always decreases with more features!
I Other measures control for the number of variables:

1. Mallows Cp
2. Akaike information criterion
3. Bayesian information criterion
4. Adjusted R²

I Testing all subsets of features is impractical: 2^p options!
I More on how to do this later
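Several of these criteria are available in base R for an `lm` fit, e.g. comparing the single-feature and three-feature models on the hypothetical `ads` data (column names assumed):

```r
fit1 <- lm(Sales ~ TV, data = ads)
fit3 <- lm(Sales ~ TV + Radio + Newspaper, data = ads)

AIC(fit1); AIC(fit3)            # Akaike information criterion
BIC(fit1); BIC(fit3)            # Bayesian information criterion
summary(fit1)$adj.r.squared     # adjusted R^2 for each model
summary(fit3)$adj.r.squared
```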


Inference 3

“How well does linear model fit data?”

I R² also always increases with more features (like RSS)
I Is the model linear? Plot it:

[Figure: Sales plotted against TV and Radio with the fitted regression surface]

I More on this later

Inference 4

“What Y should we predict and how accurate is it?”

I The linear model is used to make predictions:

y_predicted = β̂0 + β̂1 x_new

I Can also predict a confidence interval (based on an estimate of ε):

I Example: spent $100 000 on TV and $20 000 on Radio advertising

I Confidence interval: predict f(X) (the average response):

f(x) ∈ [10.985, 11.528]

I Prediction interval: predict f(X) + ε (response + possible noise):

f(x) ∈ [7.930, 14.580]
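In R, both intervals come from `predict` on an `lm` fit. The sketch below again assumes the hypothetical `ads` data frame with columns `TV`, `Radio`, and `Sales`, with budgets measured in thousands of dollars as in the ISL Advertising data:

```r
fit <- lm(Sales ~ TV + Radio, data = ads)
new <- data.frame(TV = 100, Radio = 20)   # $100,000 TV, $20,000 radio

predict(fit, newdata = new, interval = "confidence")   # interval for f(x)
predict(fit, newdata = new, interval = "prediction")   # interval for f(x) + eps
```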


R notebook