Simple Linear Regression (single variable)
Introduction to Machine Learning
Marek Petrik
January 31, 2017
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Last Class
1. Basic machine learning framework:
Y = f(X)
2. Prediction vs. inference: predict Y vs. understand f
3. Parametric vs. non-parametric: linear regression vs. k-NN
4. Classification vs. regression: k-NN vs. linear regression
5. Why we need a test set: overfitting
What is Machine Learning
- Discover an unknown function f:
Y = f(X)
- X = set of features, or inputs
- Y = target, or response
[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]
Sales = f(TV,Radio,Newspaper)
Errors in Machine Learning: World is Noisy
- World is too complex to model precisely
- Many features are not captured in data sets
- Need to allow for errors ε in f:
Y = f(X) + ε
How Good are Predictions?
- Learned function f̂
- Test data: (x1, y1), (x2, y2), . . .
- Mean Squared Error (MSE):

MSE = (1/n) Σ_{i=1}^n (yi − f̂(xi))²

- This is an estimate of:

MSE = E[(Y − f̂(X))²] = (1/|Ω|) Σ_{ω∈Ω} (Y(ω) − f̂(X(ω)))²

- Important: samples xi are i.i.d.
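The MSE estimate above is a one-liner in code. A minimal sketch in NumPy (Python is used here for illustration; the course materials use R, and the function name is my own):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum_i (y_i - f_hat(x_i))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```

With perfect predictions the MSE is 0; constant errors of size 1 give an MSE of 1.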
Do We Need Test Data?
- Why not just test on the training data?

[Figure: left, data with polynomial fits of increasing flexibility; right, Mean Squared Error as a function of flexibility]

- Flexibility is the degree of the polynomial being fit
- Gray line: training error; red line: test error
Types of Function f

Regression: continuous target
f : X → R

[Figure: Income as a function of Years of Education and Seniority]

Classification: discrete target
f : X → {1, 2, 3, . . . , k}
[Figure: two-class sample points plotted over features X1 and X2]
Bias-Variance Decomposition

Y = f(X) + ε

Mean Squared Error can be decomposed as:

MSE = E[(Y − f̂(X))²] = Var(f̂(X)) [variance] + (E[f̂(X)] − f(X))² [bias²] + Var(ε)

- Bias: how well the method would work with infinite data
- Variance: how much the output changes with different data sets
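The decomposition can be seen in a small simulation (my own illustration, not from the slides): fit a rigid model (degree-0, i.e. the mean) and a flexible model (degree-5 polynomial) on many fresh data sets drawn from Y = sin(X) + ε, then estimate bias² and variance of the predictions at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                         # true regression function (illustrative choice)
sigma, n, trials, x0 = 0.3, 30, 2000, 3.0

preds_rigid, preds_flex = [], []
for _ in range(trials):
    x = rng.uniform(0.0, 3.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    preds_rigid.append(np.mean(y))                          # degree-0 fit
    preds_flex.append(np.polyval(np.polyfit(x, y, 5), x0))  # degree-5 fit

def bias_sq(preds):
    # squared bias of the prediction at x0, averaged over data sets
    return (np.mean(preds) - f(x0)) ** 2

var_rigid, var_flex = np.var(preds_rigid), np.var(preds_flex)
```

The rigid model has high bias and low variance; the flexible model the reverse, matching the two terms of the decomposition.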
Today
- Basics of linear regression
- Why linear regression
- How to compute it
- Why compute it
Simple Linear Regression
- We have only one feature:

Y ≈ β0 + β1X    Y = β0 + β1X + ε

- Example:

[Figure: Sales vs. TV advertising with a fitted regression line]
Sales ≈ β0 + β1 × TV
How To Estimate Coefficients
- No line will have zero error on all data points xi
- Prediction:
ŷi = β0 + β1xi
- Errors (yi are the true values):
ei = yi − ŷi

[Figure: Sales vs. TV with the fitted line and vertical residuals]
Residual Sum of Squares
- Residual Sum of Squares:

RSS = e1² + e2² + e3² + · · · + en² = Σ_{i=1}^n ei²

- Equivalently:

RSS = Σ_{i=1}^n (yi − β0 − β1xi)²
Minimizing Residual Sum of Squares

min_{β0,β1} RSS = min_{β0,β1} Σ_{i=1}^n ei² = min_{β0,β1} Σ_{i=1}^n (yi − β0 − β1xi)²

[Figure: RSS as a surface over the coefficients β0 and β1]
Solving for Minimal RSS

min_{β0,β1} Σ_{i=1}^n (yi − β0 − β1xi)²

- RSS is a convex function of β0, β1
- The minimum is achieved when (recall the chain rule):

∂RSS/∂β0 = −2 Σ_{i=1}^n (yi − β0 − β1xi) = 0
∂RSS/∂β1 = −2 Σ_{i=1}^n xi(yi − β0 − β1xi) = 0
Linear Regression Coefficients

min_{β0,β1} Σ_{i=1}^n (yi − β0 − β1xi)²

Solution:

β0 = ȳ − β1x̄
β1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n xi(yi − ȳ) / Σ_{i=1}^n xi(xi − x̄)

where

x̄ = (1/n) Σ_{i=1}^n xi,  ȳ = (1/n) Σ_{i=1}^n yi
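The closed-form solution above translates directly into code. A sketch in NumPy (Python used for illustration; the function name is my own):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares fit for y ≈ b0 + b1 * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    # slope: sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # intercept: ybar - b1 * xbar
    b0 = ybar - b1 * xbar
    return b0, b1
```

On exactly linear data, e.g. y = 1 + 2x, the fit recovers β0 = 1 and β1 = 2.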
Why Minimize RSS
1. Maximizes likelihood when Y = β0 + β1X + ε with ε ∼ N(0, σ²)
2. Best Linear Unbiased Estimator (BLUE): Gauss-Markov Theorem (ESL 3.2.2)
3. It is convenient: can be solved in closed form
Bias in Estimation
- Assume a true value µ*
- An estimate µ̂ is unbiased when E[µ̂] = µ*
- The standard mean estimate is unbiased (e.g. X ∼ N(0, 1)):

E[(1/n) Σ_{i=1}^n Xi] = 0

- The standard variance estimate is biased (e.g. X ∼ N(0, 1)):

E[(1/n) Σ_{i=1}^n (Xi − X̄)²] ≠ 1
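The bias of the 1/n variance estimate is easy to check by simulation (an illustration, not from the slides); dividing by n underestimates the variance by a factor of (n − 1)/n, while dividing by n − 1 is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
samples = rng.normal(0.0, 1.0, size=(trials, n))  # X ~ N(0, 1), true variance = 1

mean_biased = np.mean(np.var(samples, axis=1, ddof=0))    # divides by n
mean_unbiased = np.mean(np.var(samples, axis=1, ddof=1))  # divides by n - 1
# E[biased estimate] = (n - 1)/n = 0.8 here, while E[unbiased estimate] = 1
```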
Linear Regression is Unbiased
[Figure: least-squares fits on two different sampled data sets scatter around the true line]
Gauss-Markov Theorem (ESL 3.2.2)
Solution of Linear Regression
[Figure: least-squares regression line fit to Sales vs. TV]
How Good is the Fit
- How well is linear regression predicting the training data?
- Can we be sure that TV advertising really influences the sales?
- What is the probability that we just got lucky?
R² Statistic

R² = 1 − RSS/TSS = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)²

- RSS: residual sum of squares, TSS: total sum of squares
- R² measures the goodness of the fit as a proportion
- Proportion of data variance explained by the model
- Extreme values:
  0: model does not explain the data
  1: model explains the data perfectly
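The R² formula above can be sketched directly (Python for illustration; the function name is my own):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rss = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
    return 1.0 - rss / tss
```

A perfect fit gives R² = 1; always predicting the mean of y gives R² = 0, matching the two extremes above.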
Example: TV Impact on Sales

[Figure: Sales vs. TV with fitted regression line]

R² = 0.61
Example: Radio Impact on Sales

[Figure: Sales vs. Radio with fitted regression line]

R² = 0.33
Example: Newspaper Impact on Sales

[Figure: Sales vs. Newspaper with fitted regression line]

R² = 0.05
Correlation Coefficient
- Measures dependence between two random variables X and Y:

r = Cov(X, Y) / (√Var(X) √Var(Y))

- The correlation coefficient r lies in [−1, 1]:
  0: variables are not related
  1: variables are perfectly positively related (same)
  −1: variables are perfectly negatively related (opposite)
- R² = r² (in simple linear regression)
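The identity R² = r² for simple linear regression can be verified numerically (an illustration on synthetic data, Python for convenience):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)   # noisy linear relationship

r = np.corrcoef(x, y)[0, 1]                  # sample correlation coefficient
b1, b0 = np.polyfit(x, y, 1)                 # polyfit returns highest degree first
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Up to floating-point error, `r2` equals `r ** 2`.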
Hypothesis Testing
- Null hypothesis H0: there is no relationship between X and Y
β1 = 0
- Alternative hypothesis H1: there is some relationship between X and Y
β1 ≠ 0
- Seek to reject hypothesis H0 with a small "probability" (p-value) of making a mistake
- Important topic, but beyond the scope of the course
Multiple Linear Regression
- Usually more than one feature is available:

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

- In general:

Y = β0 + Σ_{j=1}^p βjXj + ε
Multiple Linear Regression
[Figure: regression plane for Y over features X1 and X2]
Estimating Coefficients
- Prediction:

ŷi = β0 + Σ_{j=1}^p βjxij

- Errors (yi are the true values):

ei = yi − ŷi

- Residual Sum of Squares:

RSS = e1² + e2² + e3² + · · · + en² = Σ_{i=1}^n ei²

- How to minimize RSS? Linear algebra!
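The linear-algebra solution can be sketched with a standard least-squares solver: stack an all-ones column (for the intercept) next to the features and solve for the coefficients. An illustration on synthetic data (Python for convenience):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(0, 0.1, n)   # intercept 4.0 plus small noise

A = np.column_stack([np.ones(n), X])              # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # minimizes ||A @ coef - y||^2
# coef[0] estimates the intercept, coef[1:] the slopes
```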
Inference from Linear Regression
1. Are predictors X1, X2, . . . , Xp really predicting Y?
2. Is only a subset of the predictors useful?
3. How well does the linear model fit the data?
4. What Y should we predict, and how accurate is the prediction?
Inference 1
“Are predictors X1, X2, . . . , Xp really predicting Y ?”
- Null hypothesis H0: there is no relationship between X and Y
β1 = 0
- Alternative hypothesis H1: there is some relationship between X and Y
β1 ≠ 0
- Seek to reject hypothesis H0 with a small "probability" (p-value) of making a mistake
- See ISL 3.2.2 on how to compute the F-statistic and reject H0
Inference 2
“Is only a subset of predictors useful?”
- Compare prediction accuracy with only a subset of the features
- RSS always decreases with more features!
- Other measures control for the number of variables:
  1. Mallows Cp
  2. Akaike information criterion
  3. Bayesian information criterion
  4. Adjusted R²
- Testing all subsets of features is impractical: 2^p options!
- More on how to do this later
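One of the listed measures, adjusted R², has a simple closed form that penalizes extra variables: R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1). A sketch (the function name is my own):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n: number of samples, p: number of predictors.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

Unlike plain R², adding a predictor that does not improve R² makes adjusted R² go down, so it can be compared across models with different numbers of features.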
Inference 3
“How well does linear model fit data?”
- R² also always increases with more features (like RSS)
- Is the model linear? Plot it:

[Figure: Sales plotted over Radio and TV]

- More on this later
Inference 4

"What Y should we predict, and how accurate is it?"

- The linear model is used to make predictions:

ypredicted = β0 + β1xnew

- We can also predict a confidence interval (based on the estimate of ε)
- Example: spend $100,000 on TV and $20,000 on Radio advertising
- Confidence interval: predict f(X) (the average response):

f(x) ∈ [10.985, 11.528]

- Prediction interval: predict f(X) + ε (response plus possible noise):

f(x) ∈ [7.930, 14.580]
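The gap between the two intervals comes from their standard errors. For simple regression, standard OLS theory gives SE for the mean response sqrt(s²(1/n + (x0 − x̄)²/Sxx)) and for a new observation sqrt(s²(1 + 1/n + (x0 − x̄)²/Sxx)). A sketch on synthetic data (Python for illustration; the 1.96 normal quantile is a large-n approximation to the exact t quantile):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)            # noise variance estimate
sxx = np.sum((x - x.mean()) ** 2)

x0 = 5.0
se_conf = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / sxx))      # for E[Y | x0]
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))  # for a new Y
ci = (b0 + b1 * x0 - 1.96 * se_conf, b0 + b1 * x0 + 1.96 * se_conf)
pi = (b0 + b1 * x0 - 1.96 * se_pred, b0 + b1 * x0 + 1.96 * se_pred)
```

The prediction interval always contains the confidence interval, since it adds the noise variance term.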
R notebook