Simple Linear Regression (single variable)
Introduction to Machine Learning
Marek Petrik
January 31, 2017
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Last Class
1. Basic machine learning framework:
Y = f(X)
2. Prediction vs. inference: predict Y vs. understand f
3. Parametric vs. non-parametric: linear regression vs. k-NN
4. Classification vs. regression: k-NN vs. linear regression
5. Why we need a test set: overfitting
What is Machine Learning
- Discover an unknown function f:
Y = f(X)
- X = set of features, or inputs
- Y = target, or response
[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]
Sales = f(TV,Radio,Newspaper)
Errors in Machine Learning: World is Noisy
- World is too complex to model precisely
- Many features are not captured in data sets
- Need to allow for errors ε in f:
Y = f(X) + ε
How Good are Predictions?
- Learned function f̂
- Test data: (x1, y1), (x2, y2), . . .
- Mean Squared Error (MSE):

MSE = (1/n) Σ_{i=1}^n (yi − f̂(xi))²

- This is an estimate of:

MSE = E[(Y − f̂(X))²] = (1/|Ω|) Σ_{ω∈Ω} (Y(ω) − f̂(X(ω)))²

- Important: samples xi are i.i.d.
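The MSE estimate above is a one-liner in code. A minimal sketch in NumPy (Python is used here for illustration; the course materials use R, and the function name is my own):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum_i (y_i - f_hat(x_i))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```

With perfect predictions the MSE is 0; constant errors of size 1 give an MSE of 1.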
Do We Need Test Data?
- Why not just test on the training data?

[Figure: left, data with polynomial fits of increasing flexibility; right, Mean Squared Error as a function of flexibility]

- Flexibility is the degree of the polynomial being fit
- Gray line: training error; red line: test error
Types of Function f

Regression: continuous target
f : X → R

[Figure: Income as a function of Years of Education and Seniority]

Classification: discrete target
f : X → {1, 2, 3, . . . , k}
[Figure: two-class sample points plotted over features X1 and X2]
Bias-Variance Decomposition

Y = f(X) + ε

Mean Squared Error can be decomposed as:

MSE = E[(Y − f̂(X))²] = Var(f̂(X)) [variance] + (E[f̂(X)] − f(X))² [bias²] + Var(ε)

- Bias: how well the method would work with infinite data
- Variance: how much the output changes with different data sets
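The decomposition can be seen in a small simulation (my own illustration, not from the slides): fit a rigid model (degree-0, i.e. the mean) and a flexible model (degree-5 polynomial) on many fresh data sets drawn from Y = sin(X) + ε, then estimate bias² and variance of the predictions at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                         # true regression function (illustrative choice)
sigma, n, trials, x0 = 0.3, 30, 2000, 3.0

preds_rigid, preds_flex = [], []
for _ in range(trials):
    x = rng.uniform(0.0, 3.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    preds_rigid.append(np.mean(y))                          # degree-0 fit
    preds_flex.append(np.polyval(np.polyfit(x, y, 5), x0))  # degree-5 fit

def bias_sq(preds):
    # squared bias of the prediction at x0, averaged over data sets
    return (np.mean(preds) - f(x0)) ** 2

var_rigid, var_flex = np.var(preds_rigid), np.var(preds_flex)
```

The rigid model has high bias and low variance; the flexible model the reverse, matching the two terms of the decomposition.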
Today
- Basics of linear regression
- Why linear regression
- How to compute it
- Why compute it
Simple Linear Regression
- We have only one feature:

Y ≈ β0 + β1X    Y = β0 + β1X + ε

- Example:

[Figure: Sales vs. TV advertising with a fitted regression line]
Sales ≈ β0 + β1 × TV
How To Estimate Coefficients
- No line will have zero error on all data points xi
- Prediction:
ŷi = β0 + β1xi
- Errors (yi are the true values):
ei = yi − ŷi

[Figure: Sales vs. TV with the fitted line and vertical residuals]
Residual Sum of Squares
- Residual Sum of Squares:

RSS = e1² + e2² + e3² + · · · + en² = Σ_{i=1}^n ei²

- Equivalently:

RSS = Σ_{i=1}^n (yi − β0 − β1xi)²
Minimizing Residual Sum of Squares

min_{β0,β1} RSS = min_{β0,β1} Σ_{i=1}^n ei² = min_{β0,β1} Σ_{i=1}^n (yi − β0 − β1xi)²

[Figure: RSS as a surface over the coefficients β0 and β1]
Solving for Minimal RSS

min_{β0,β1} Σ_{i=1}^n (yi − β0 − β1xi)²

- RSS is a convex function of β0, β1
- The minimum is achieved when (recall the chain rule):

∂RSS/∂β0 = −2 Σ_{i=1}^n (yi − β0 − β1xi) = 0
∂RSS/∂β1 = −2 Σ_{i=1}^n xi(yi − β0 − β1xi) = 0
Linear Regression Coefficients

min_{β0,β1} Σ_{i=1}^n (yi − β0 − β1xi)²

Solution:

β0 = ȳ − β1x̄
β1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n xi(yi − ȳ) / Σ_{i=1}^n xi(xi − x̄)

where

x̄ = (1/n) Σ_{i=1}^n xi,  ȳ = (1/n) Σ_{i=1}^n yi
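The closed-form solution above translates directly into code. A sketch in NumPy (Python used for illustration; the function name is my own):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares fit for y ≈ b0 + b1 * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    # slope: sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # intercept: ybar - b1 * xbar
    b0 = ybar - b1 * xbar
    return b0, b1
```

On exactly linear data, e.g. y = 1 + 2x, the fit recovers β0 = 1 and β1 = 2.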
Why Minimize RSS
1. Maximizes likelihood when Y = β0 + β1X + ε with ε ∼ N(0, σ²)
2. Best Linear Unbiased Estimator (BLUE): Gauss-Markov Theorem (ESL 3.2.2)
3. It is convenient: can be solved in closed form
Bias in Estimation
- Assume a true value µ*
- An estimate µ̂ is unbiased when E[µ̂] = µ*
- The standard mean estimate is unbiased (e.g. X ∼ N(0, 1)):

E[(1/n) Σ_{i=1}^n Xi] = 0

- The standard variance estimate is biased (e.g. X ∼ N(0, 1)):

E[(1/n) Σ_{i=1}^n (Xi − X̄)²] ≠ 1
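The bias of the 1/n variance estimate is easy to check by simulation (an illustration, not from the slides); dividing by n underestimates the variance by a factor of (n − 1)/n, while dividing by n − 1 is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
samples = rng.normal(0.0, 1.0, size=(trials, n))  # X ~ N(0, 1), true variance = 1

mean_biased = np.mean(np.var(samples, axis=1, ddof=0))    # divides by n
mean_unbiased = np.mean(np.var(samples, axis=1, ddof=1))  # divides by n - 1
# E[biased estimate] = (n - 1)/n = 0.8 here, while E[unbiased estimate] = 1
```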
Linear Regression is Unbiased
[Figure: least-squares fits on two different sampled data sets scatter around the true line]
Gauss-Markov Theorem (ESL 3.2.2)
Solution of Linear Regression
[Figure: least-squares regression line fit to Sales vs. TV]
How Good is the Fit
- How well is linear regression predicting the training data?
- Can we be sure that TV advertising really influences the sales?
- What is the probability that we just got lucky?
R² Statistic

R² = 1 − RSS/TSS = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)²

- RSS: residual sum of squares, TSS: total sum of squares
- R² measures the goodness of the fit as a proportion
- Proportion of data variance explained by the model
- Extreme values:
  0: model does not explain the data
  1: model explains the data perfectly
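The R² formula above can be sketched directly (Python for illustration; the function name is my own):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rss = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
    return 1.0 - rss / tss
```

A perfect fit gives R² = 1; always predicting the mean of y gives R² = 0, matching the two extremes above.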
Example: TV Impact on Sales

[Figure: Sales vs. TV with fitted regression line]

R² = 0.61
Example: Radio Impact on Sales

[Figure: Sales vs. Radio with fitted regression line]

R² = 0.33
Example: Newspaper Impact on Sales

[Figure: Sales vs. Newspaper with fitted regression line]

R² = 0.05
Correlation Coefficient
- Measures dependence between two random variables X and Y:

r = Cov(X, Y) / (√Var(X) √Var(Y))

- The correlation coefficient r lies in [−1, 1]:
  0: variables are not related
  1: variables are perfectly positively related (same)
  −1: variables are perfectly negatively related (opposite)
- R² = r² (in simple linear regression)
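The identity R² = r² for simple linear regression can be verified numerically (an illustration on synthetic data, Python for convenience):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)   # noisy linear relationship

r = np.corrcoef(x, y)[0, 1]                  # sample correlation coefficient
b1, b0 = np.polyfit(x, y, 1)                 # polyfit returns highest degree first
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Up to floating-point error, `r2` equals `r ** 2`.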
Hypothesis Testing
- Null hypothesis H0: there is no relationship between X and Y
β1 = 0
- Alternative hypothesis H1: there is some relationship between X and Y
β1 ≠ 0
- Seek to reject hypothesis H0 with a small "probability" (p-value) of making a mistake
- Important topic, but beyond the scope of the course
Multiple Linear Regression
- Usually more than one feature is available:

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

- In general:

Y = β0 + Σ_{j=1}^p βjXj + ε
Multiple Linear Regression
[Figure: regression plane for Y over features X1 and X2]
Estimating Coefficients
- Prediction:

ŷi = β0 + Σ_{j=1}^p βjxij

- Errors (yi are the true values):

ei = yi − ŷi

- Residual Sum of Squares:

RSS = e1² + e2² + e3² + · · · + en² = Σ_{i=1}^n ei²

- How to minimize RSS? Linear algebra!
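The linear-algebra solution can be sketched with a standard least-squares solver: stack an all-ones column (for the intercept) next to the features and solve for the coefficients. An illustration on synthetic data (Python for convenience):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(0, 0.1, n)   # intercept 4.0 plus small noise

A = np.column_stack([np.ones(n), X])              # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # minimizes ||A @ coef - y||^2
# coef[0] estimates the intercept, coef[1:] the slopes
```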
Inference from Linear Regression
1. Are predictors X1, X2, . . . , Xp really predicting Y?
2. Is only a subset of the predictors useful?
3. How well does the linear model fit the data?
4. What Y should we predict, and how accurate is the prediction?
Inference 1
“Are predictors X1, X2, . . . , Xp really predicting Y ?”
- Null hypothesis H0: there is no relationship between X and Y
β1 = 0
- Alternative hypothesis H1: there is some relationship between X and Y
β1 ≠ 0
- Seek to reject hypothesis H0 with a small "probability" (p-value) of making a mistake
- See ISL 3.2.2 on how to compute the F-statistic and reject H0
Inference 2
“Is only a subset of predictors useful?”
- Compare prediction accuracy with only a subset of the features
- RSS always decreases with more features!
- Other measures control for the number of variables:
  1. Mallows Cp
  2. Akaike information criterion
  3. Bayesian information criterion
  4. Adjusted R²
- Testing all subsets of features is impractical: 2^p options!
- More on how to do this later
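One of the listed measures, adjusted R², has a simple closed form that penalizes extra variables: R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1). A sketch (the function name is my own):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n: number of samples, p: number of predictors.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

Unlike plain R², adding a predictor that does not improve R² makes adjusted R² go down, so it can be compared across models with different numbers of features.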
Inference 3
“How well does linear model fit data?”
- R² also always increases with more features (like RSS)
- Is the model linear? Plot it:

[Figure: Sales plotted over Radio and TV]

- More on this later
Inference 4

"What Y should we predict, and how accurate is it?"

- The linear model is used to make predictions:

ypredicted = β0 + β1xnew

- We can also predict a confidence interval (based on the estimate of ε)
- Example: spend $100,000 on TV and $20,000 on Radio advertising
- Confidence interval: predict f(X) (the average response):

f(x) ∈ [10.985, 11.528]

- Prediction interval: predict f(X) + ε (response plus possible noise):

f(x) ∈ [7.930, 14.580]
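The gap between the two intervals comes from their standard errors. For simple regression, standard OLS theory gives SE for the mean response sqrt(s²(1/n + (x0 − x̄)²/Sxx)) and for a new observation sqrt(s²(1 + 1/n + (x0 − x̄)²/Sxx)). A sketch on synthetic data (Python for illustration; the 1.96 normal quantile is a large-n approximation to the exact t quantile):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)            # noise variance estimate
sxx = np.sum((x - x.mean()) ** 2)

x0 = 5.0
se_conf = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / sxx))      # for E[Y | x0]
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))  # for a new Y
ci = (b0 + b1 * x0 - 1.96 * se_conf, b0 + b1 * x0 + 1.96 * se_conf)
pi = (b0 + b1 * x0 - 1.96 * se_pred, b0 + b1 * x0 + 1.96 * se_pred)
```

The prediction interval always contains the confidence interval, since it adds the noise variance term.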
R notebook