Linear Regression - Indian Institute of Technology Ropar

Transcript of "Linear Regression", CSL465/603 - Fall 2016 (cse.iitrpr.ac.in/ckn/courses/f2016/csl603/w5.pdf)

Page 1:

Linear Regression
CSL465/603 - Fall 2016
Narayanan C Krishnan
[email protected]

Page 2:

Outline
• Univariate regression
• Multivariate regression
• Probabilistic view of regression
• Loss functions
• Bias-Variance analysis
• Regularization

Page 3:

Example - Green Chilies Entertainment Company

[Scatter plot: cost of making the film (in crores of Rs) vs. earnings from the film (in crores of Rs)]

Page 4:

Notations
• Training dataset
• Number of examples - $N$
• Input variable - $\mathbf{x}_i$
• Target variable - $y_i$
• Goal: learn a function that predicts $y$ for a new input $\mathbf{x}$

Cost of Film (Crores of Rs) - x    Profit/Loss (Crores of Rs) - y
98.28                              199.69
40.22                              93.69
62.07                              100.33
…                                  …

Page 5:

Linear Regression
• Simplest form:
$f(x) = w_0 + w_1 x$

[Plot: cost of making the film (in crores of Rs) vs. earnings from the film (in crores of Rs)]

Page 6:

Least Mean Squares - Cost Function
• Choose parameters $w_0$ and $w_1$ (or $\mathbf{w}$) so that $f(x)$ is as close as possible to $y$

[Plot: cost of making the film (in crores of Rs) vs. earnings from the film (in crores of Rs)]

Page 7:

Least Mean Squares - Cost Function - Parameter Space (1)
• Let
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$
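As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this cost, assuming 1-D arrays x and y that hold the training examples:

```python
# A minimal sketch of the LMS cost J(w0, w1) = (1/2N) * sum_i (f(x_i) - y_i)^2
import numpy as np

def lms_cost(w0, w1, x, y):
    """Least-mean-squares cost for the univariate model f(x) = w0 + w1 * x."""
    residuals = (w0 + w1 * x) - y          # f(x_i) - y_i for every example
    return np.mean(residuals ** 2) / 2.0   # (1/2N) * sum of squared residuals

# Hypothetical film data matching the earlier table
x = np.array([98.28, 40.22, 62.07])
y = np.array([199.69, 93.69, 100.33])
print(lms_cost(50.0, 1.5, x, y))
```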

Page 8:

Least Mean Squares - Cost Function - Parameter Space (2)
• Let
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$

Page 9:

Least Mean Squares - Cost Function - Parameter Space (3)
• Let
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$

Page 10:

Plot of the Error Surface

Page 11:

Contour Plot of Error Surface

Page 12:

Estimating Optimal Parameters

Page 13:

Gradient Descent – Basic Principle

• Minimize $J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$

• Start with an initial estimate for $\mathbf{w}$
• Keep changing $\mathbf{w}$ so that $J(\mathbf{w})$ is progressively reduced
• Stop when there is no change or the minimum has been reached

Page 14:

Gradient Descent - Intuition

Page 15:

Effect of Learning Parameter
• Too small a value – slow convergence
• Too large a value – oscillates widely and may not converge

Page 16:

Gradient Descent – Local Minima
• Depending on the function $J(\mathbf{w})$, gradient descent can get stuck at local minima

Page 17:

Gradient Descent for Regression
• Convex error function
$J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$
• Geometrically, the error surface is bowl shaped
• Only a global minimum (no local minima)

Exercise – Prove that the sum of squared errors is a convex function

Page 18:

Parameter Update (1)
• Minimize
$J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$

Page 19:

Parameter Update (2)
• Repeat till convergence:
$w_0 = w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)$
$w_1 = w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_i$
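A minimal sketch (not from the slides) of these batch updates for the univariate model, assuming NumPy arrays, a fixed learning rate alpha, and a fixed iteration count:

```python
# Batch gradient descent for f(x) = w0 + w1 * x using the update rules above.
import numpy as np

def gradient_descent(x, y, alpha=1e-4, num_iters=1000):
    w0, w1 = 0.0, 0.0                          # initial estimate for w
    for _ in range(num_iters):
        residuals = (w0 + w1 * x) - y          # f(x_i) - y_i, computed once per pass
        w0 = w0 - alpha * residuals.mean()             # (1/N) * sum of residuals
        w1 = w1 - alpha * (residuals * x).mean()       # (1/N) * sum of residuals * x_i
    return w0, w1
```

Both parameters are updated from the same residual vector, so the update is simultaneous, as the slide's "repeat till convergence" loop requires.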

Page 20:

Example – Iteration 0

[Plots: regression function and error function]

Page 21:

Example – Iteration 1

[Plots: regression function and error function]

Page 22:

Example – Iteration 2

[Plots: regression function and error function]

Page 23:

Example – Iteration 4

[Plots: regression function and error function]

Page 24:

Example – Iteration 7

[Plots: regression function and error function]

Page 25:

Example – Iteration 9

[Plots: regression function and error function]

Page 26:

Gradient Descent
• Batch mode
• Update includes contributions from all data points
$w_0 = w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)$
$w_1 = w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_i$
• Will discuss stochastic gradient descent later (neural networks)

Page 27:

Multivariate Linear Regression

• Dimension of the input data - 𝐷

Cost of Film (Crores of Rs)    Celebrity status of the protagonist    # of theatres at release    Age of the protagonist    Earnings (Crores of Rs) - y
75.72                          7.57                                   32                          52                        157.39
18.74                          1.87                                   16                          68                        81.93
50.96                          5.09                                   27                          35                        131.95
…                              …                                      …                           …                         …

Page 28:

Multivariate Linear Regression - Formulation
• Simplest model:
$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D$
• Parameters to learn: $w_0, w_1, \ldots, w_D = \mathbf{w}$
• Cost function: $J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$
• Update equation: $w_j = w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_{ij}$
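A minimal vectorized sketch of one such update step (assuming X already carries a leading column of 1s, so its shape is N x (D+1) and w has length D+1):

```python
# One batch gradient-descent step for multivariate linear regression.
import numpy as np

def multivariate_gd_step(w, X, y, alpha):
    residuals = X @ w - y                 # f(x_i) - y_i for all examples, shape (N,)
    gradient = X.T @ residuals / len(y)   # (1/N) * sum_i (f(x_i) - y_i) * x_ij, shape (D+1,)
    return w - alpha * gradient
```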

Page 29:

Gradient Descent
• Parameter update equation
$w_j = w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_{ij}$

Page 30:

Feature Scaling for Multivariate Linear Regression (1)
• Transform features to be on the same scale

Cost of Film (Crores of Rs)    Celebrity status of the protagonist    # of theatres at release    Age of the protagonist    Profit/Loss (Crores of Rs) - y
75.72                          7.57                                   32                          52                        157.39
18.74                          1.87                                   16                          68                        81.93
50.96                          5.09                                   27                          35                        131.95
…                              …                                      …                           …                         …

Page 31:

Feature Scaling for Multivariate Linear Regression (2)
• Normalization - $-1 \le x_j \le 1$ or $0 \le x_j \le 1$

• Standardization – mean 0 and standard deviation 1
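A minimal sketch of both options, applied column-wise to a NumPy feature matrix X whose rows are examples (not part of the original slides):

```python
# Feature scaling helpers for multivariate linear regression.
import numpy as np

def normalize(X):
    """Rescale each feature (column) to the range [0, 1]."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def standardize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```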

Page 32:

Multivariate Linear Regression - Analytical Solution
• Design matrix and target vector

Cost of Film (Crores of Rs)    Celebrity status of the protagonist    # of theatres at release    Age of the protagonist    Profit/Loss (Crores of Rs) - y
75.72                          7.57                                   32                          52                        157.39
18.74                          1.87                                   16                          68                        81.93
50.96                          5.09                                   27                          35                        131.95
…                              …                                      …                           …                         …

$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Page 33:

Least Squares Method

$f(X) = Xw = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$

$J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$

Page 34:

Normal Equations

$\min_{W} J(W) = \min_{W} \frac{1}{2} (XW - Y)^\top (XW - Y)$

• Find the gradient with respect to $W$ and equate it to 0
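A short sketch of that calculation (the standard derivation, not spelled out on the slide):

$$\nabla_W J(W) = \nabla_W \tfrac{1}{2} (XW - Y)^\top (XW - Y) = X^\top (XW - Y) = 0 \;\Rightarrow\; X^\top X W = X^\top Y \;\Rightarrow\; W = (X^\top X)^{-1} X^\top Y$$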

Page 35:

Analytical Solution
• Advantage
  • No need for the learning parameter $\alpha$!
  • No need for iterative updates
• Disadvantage
  • Need to perform matrix inversion
  • Pseudo-inverse of the matrix: $(X^\top X)^{-1} X^\top$
  • Sometimes we deal with non-invertible matrices (redundant features)
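A minimal sketch of this analytical solution in NumPy; np.linalg.pinv computes the pseudo-inverse and so also copes with the non-invertible (redundant-feature) case mentioned above:

```python
# Closed-form (normal-equation) fit: w = pinv(X) @ y, with a bias column prepended.
import numpy as np

def fit_linear_regression(X, y):
    """X: N x D feature matrix, y: length-N targets; returns w of length D+1."""
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the column of 1s
    return np.linalg.pinv(X_design) @ y                   # (X^T X)^{-1} X^T y via pseudo-inverse
```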

Page 36:

Probabilistic View of Linear Regression (1)
• Let $y = f(x) + \epsilon$
• $\epsilon$ is the error term that captures unmodeled effects or random noise
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution

Page 37:

Probabilistic View of Linear Regression (2)
• Let $y = f(x) + \epsilon$
• $\epsilon$ is the error term that captures unmodeled effects or random noise
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution - why?
• $\mathcal{N}(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$
• 3-$\sigma$ rule:

Page 38:

[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 1.5 (Decision Theory), p. 47]

Figure 1.28: The regression function y(x), which minimizes the expected squared loss, is given by the mean of the conditional distribution p(t|x).

…which is the conditional average of t conditioned on x and is known as the regression function. This result is illustrated in Figure 1.28. It can readily be extended to multiple target variables represented by the vector t, in which case the optimal solution is the conditional average y(x) = E_t[t|x]. (Exercise 1.25)

We can also derive this result in a slightly different way, which will also shed light on the nature of the regression problem. Armed with the knowledge that the optimal solution is the conditional expectation, we can expand the square term as follows

$\{y(\mathbf{x}) - t\}^2 = \{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}] + \mathrm{E}[t|\mathbf{x}] - t\}^2 = \{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}]\}^2 + 2\{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}]\}\{\mathrm{E}[t|\mathbf{x}] - t\} + \{\mathrm{E}[t|\mathbf{x}] - t\}^2$

where, to keep the notation uncluttered, we use E[t|x] to denote E_t[t|x]. Substituting into the loss function and performing the integral over t, we see that the cross-term vanishes and we obtain an expression for the loss function in the form

$\mathrm{E}[L] = \int \{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}]\}^2\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} + \int \{\mathrm{E}[t|\mathbf{x}] - t\}^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t. \quad (1.90)$

The function y(x) we seek to determine enters only in the first term, which will be minimized when y(x) is equal to E[t|x], in which case this term will vanish. This is simply the result that we derived previously and that shows that the optimal least-squares predictor is given by the conditional mean. The second term is the variance of the distribution of t, averaged over x. It represents the intrinsic variability of the target data and can be regarded as noise. Because it is independent of y(x), it represents the irreducible minimum value of the loss function.

As with the classification problem, we can either determine the appropriate probabilities and then use these to make optimal decisions, or we can build models that make decisions directly. Indeed, we can identify three distinct approaches to solving regression problems given, in order of decreasing complexity, by:

(a) First solve the inference problem of determining the joint density p(x, t). Then normalize to find the conditional density p(t|x), and finally marginalize to find the conditional mean given by (1.89).

Probabilistic View of Linear Regression (3)
• Let $y = f(\mathbf{x}) + \epsilon$
• $\epsilon$ is the error term that captures unmodeled effects or random noise
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution
• Then $P(\epsilon) = $
• And $P(y \mid \mathbf{x}) = $
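The two blanks above are left to be filled in during the lecture; under the stated Gaussian noise assumption, the standard densities they correspond to are:

$$P(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\epsilon^2}{2\sigma^2} \right), \qquad P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left( y - f(\mathbf{x}) \right)^2}{2\sigma^2} \right)$$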

Page 39:

Probabilistic View of Linear Regression (4)

$P(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N) = $

$P(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N; W) = $

Page 40:

Maximizing the Likelihood
• Maximize $L(W) = \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i; W)$
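A sketch of where this leads, assuming the i.i.d. Gaussian noise model $y_i = f(\mathbf{x}_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ from the previous slides:

$$\log L(W) = \sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i; W) = -N \log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right)^2$$

so maximizing the likelihood over $W$ is equivalent to minimizing the sum-of-squared-errors cost used earlier.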

Page 41:

Loss Functions
• Squared loss: $\left( f(x) - y \right)^2$
• Absolute loss: $\left| f(x) - y \right|$
• Dead-band loss: $\max\left( 0, \left| f(x) - y \right| - \epsilon \right)$, $\epsilon \in \mathbb{R}^{+}$

Page 42:

Loss Functions
• Problem with squared loss

Page 43:

Linear Regression with Absolute Loss Function
• Objective
$\min_{W} \sum_{i=1}^{N} \left| \mathbf{x}_i W - y_i \right|$
• Non-differentiable, so we cannot take the gradient descent approach
• Solution: frame it as a constrained optimization problem
• Introduce new variables $v \in \mathbb{R}^N$, $v_i \ge \left| \mathbf{x}_i W - y_i \right|$
$\min_{W, v} \sum_{i=1}^{N} v_i, \quad \text{subject to} \quad -v_i \le \mathbf{x}_i W - y_i \le v_i$
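A minimal sketch (not the course's reference implementation) of solving this linear program with scipy.optimize.linprog; the decision vector stacks the weights W (D+1 entries, assuming a bias column is prepended to X) and the slack variables v (N entries):

```python
# L1 (absolute-loss) regression via the LP formulation above.
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    n, d = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])                # design matrix with bias column
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])   # objective: minimize sum_i v_i
    # Encode -v_i <= x_i W - y_i <= v_i as two blocks of "<=" constraints.
    A_ub = np.block([[ Xd, -np.eye(n)],
                     [-Xd, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n  # W free, v >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d + 1]                                 # the fitted weights
```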

Page 44:

Linear Regression with Absolute Loss Function - Example
[Plots: LMS output and LP output]

Page 45:

Some Additional Notations
• Underlying response function (target concept) - $C$
• Actual observed response - $y = C(\mathbf{x}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so $\mathrm{E}[y \mid \mathbf{x}] = C(\mathbf{x})$
• Predicted response based on the model learned from dataset $A$ - $f(\mathbf{x}; A)$
• Expected response averaged over all datasets - $\bar{f}(\mathbf{x}) = \mathrm{E}_A\left[ f(\mathbf{x}; A) \right]$
• Expected $L_2$ error on a new test instance $\mathbf{x}^*$ - $E_{\text{err}} = \mathrm{E}_A\left[ \left( f(\mathbf{x}^*; A) - y \right)^2 \right]$
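The following slides develop this error term step by step; under the notation above, the standard decomposition they arrive at is:

$$\mathrm{E}_A\!\left[ \left( f(\mathbf{x}^*; A) - y \right)^2 \right] = \underbrace{\left( \mathrm{E}_A\left[ f(\mathbf{x}^*; A) \right] - C(\mathbf{x}^*) \right)^2}_{\text{bias}^2} + \underbrace{\mathrm{E}_A\!\left[ \left( f(\mathbf{x}^*; A) - \mathrm{E}_A\left[ f(\mathbf{x}^*; A) \right] \right)^2 \right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$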

Page 46:

Bias-Variance Analysis (1)

Page 47:

Bias-Variance Analysis (2)

Page 48:

Bias-Variance Analysis (3)
• Root Mean Square Error

Page 49:

Bias-Variance Analysis (4)
• 9th-degree polynomial fit with more sample data

Page 50:

Bias-Variance Analysis (5)
• Expected square loss:
$\mathrm{E}[L] = \int \left( f(\mathbf{x}) - y \right)^2 P(\mathbf{x}, y) \, d\mathbf{x} \, dy$

Page 51:

Bias-Variance Analysis (6)
• Expected square loss:
$\mathrm{E}[L] = \int \left( f(\mathbf{x}) - y \right)^2 P(\mathbf{x}, y) \, d\mathbf{x} \, dy$

Page 52:

Bias-Variance Analysis (7)
• Relevant part of the loss:
$\int \left( f(\mathbf{x}) - C(\mathbf{x}) \right)^2 P(\mathbf{x}) \, d\mathbf{x}$

Page 53:

Bias-Variance Analysis (8)
• Relevant part of the loss:
$\mathrm{E}_A\left[ \left( f(\mathbf{x}; A) - C(\mathbf{x}) \right)^2 \right]$

Page 54:

Bias-Variance Analysis (9)
[Plots: polynomial fits of degree 1 and degree 4]

Page 55:

Bias-Variance Analysis (10)
• Bias term of the error: $\left( \mathrm{E}_A\left[ f(\mathbf{x}; A) \right] - C(\mathbf{x}) \right)^2$
• Measures how well our approximation architecture can fit the data
• Weak approximators will have high bias
  • Example: low-degree polynomials
• Strong approximators will have low bias
  • Example: high-degree polynomials

Page 56:

Bias-Variance Analysis (11)
• Variance term of the error: $\mathrm{E}_A\left[ \left( f(\mathbf{x}; A) - \mathrm{E}_A\left[ f(\mathbf{x}; A) \right] \right)^2 \right]$
• No direct dependence on the target value
• For a fixed-size dataset $A$
  • Strong approximators tend to have more variance
    • Small changes in the dataset can result in wide changes in the predictors
  • Weak approximators tend to have less variance
    • Small changes in the dataset result in similar predictors
• Variance disappears as $A \to \infty$

Page 57:

Bias-Variance Analysis (12)
• Measuring bias and variance in practice
  • Bootstrap from the given dataset
• Start with a complex approximator, and reduce the complexity through regularization
  • Setting more coefficients/parameters to 0
• Do feature selection
  • Reduces variance, but can increase bias
• Hopefully just sufficient to model the given data

Page 58:

Regularization
• Central idea: penalize over-complicated solutions
• Linear regression minimizes
$\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2$
• Regularized regression minimizes
$\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2 + \lambda \left\| w \right\|$

Page 59:

Modified Solution
• Solution for ordinary linear regression
$\min_{w} J(w) \equiv \min_{w} \frac{1}{2} (Xw - Y)^\top (Xw - Y)$
$w = (X^\top X)^{-1} X^\top Y$
• Now for the regularized version, which uses the $L_2$ norm - Ridge Regression
$\min_{w} J(w) \equiv \min_{w} \frac{1}{2} (Xw - Y)^\top (Xw - Y) + \lambda \left\| w \right\|^2$
$w = (X^\top X + \lambda I)^{-1} X^\top Y$


Exercise: derive the closed-form solution for ridge regression with the L2 regularizer
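A minimal sketch of the ridge closed form on this slide. Note that, like the slide's formula, it regularizes every coefficient including the bias term; in practice the bias column is often left unpenalized:

```python
# Ridge regression: w = (X^T X + lambda * I)^{-1} X^T y
import numpy as np

def fit_ridge(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```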

Page 60:

How to choose $\lambda$?
• Tradeoff between complexity vs. goodness of fit
• Solution 1: if we have lots of data
  • Generate multiple models
  • Use lots of test data to discard the bad models
• Solution 2: with limited data
  • Use k-fold cross validation
  • Will discuss later

Page 61:

General Form of Regularizer Term

$\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2 + \lambda \sum_{j=1}^{D} \left| w_j \right|^q$

• Quadratic/$L_2$ regularizer - $q = 2$
• Contours for the regularization term


[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 3.1 (Linear Basis Function Models), p. 145]

Figure 3.3: Contours of the regularization term in (3.29) for various values of the parameter q (q = 0.5, q = 1, q = 2, q = 4).

…zero. It has the advantage that the error function remains a quadratic function of w, and so its exact minimizer can be found in closed form. Specifically, setting the gradient of (3.27) with respect to w to zero, and solving for w as before, we obtain

$\mathbf{w} = \left( \lambda I + \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{t}. \quad (3.28)$

This represents a simple extension of the least-squares solution (3.15). A more general regularizer is sometimes used, for which the regularized error takes the form

$\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^\top \phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q \quad (3.29)$

where q = 2 corresponds to the quadratic regularizer (3.27). Figure 3.3 shows contours of the regularization function for different values of q.

The case of q = 1 is known as the lasso in the statistics literature (Tibshirani, 1996). It has the property that if λ is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model in which the corresponding basis functions play no role. To see this, we first note that minimizing (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (Exercise 3.5)

$\sum_{j=1}^{M} |w_j|^q \le \eta \quad (3.30)$

for an appropriate value of the parameter η, where the two approaches can be related using Lagrange multipliers. The origin of the sparsity can be seen from Figure 3.4 (Appendix E), which shows the minimum of the error function, subject to the constraint (3.30). As λ is increased, an increasing number of parameters are driven to zero.

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ. We shall return to the issue of model complexity later in this chapter.

Page 62:

Special scenario $q = 1$ - LASSO
• Least Absolute Shrinkage and Selection Operator
• Error function: $\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2 + \lambda \sum_{j=1}^{D} \left| w_j \right|$
• For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution
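A minimal sketch of this sparsity effect (assuming scikit-learn is available; the slides point to a MATLAB glmnet package instead):

```python
# Fit a LASSO model on synthetic data where only two features are relevant.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of lambda
print(model.coef_)                   # most coefficients are driven to exactly 0
```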


[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 3.1, p. 146]

Figure 3.4: Plot of the contours of the unregularized error function (blue) along with the constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso regularizer q = 1 on the right, in which the optimum value for the parameter vector w is denoted by w⋆. The lasso gives a sparse solution in which w⋆₁ = 0.

For the remainder of this chapter we shall focus on the quadratic regularizer (3.27) both for its practical importance and its analytical tractability.

3.1.5 Multiple outputs

So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t. This could be done by introducing a different set of basis functions for each component of t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that

$\mathbf{y}(\mathbf{x}, \mathbf{w}) = W^\top \phi(\mathbf{x}) \quad (3.31)$

where y is a K-dimensional column vector, W is an M × K matrix of parameters, and φ(x) is an M-dimensional column vector with elements φ_j(x), with φ_0(x) = 1 as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form

$p(\mathbf{t} \mid \mathbf{x}, W, \beta) = \mathcal{N}\!\left( \mathbf{t} \mid W^\top \phi(\mathbf{x}), \beta^{-1} I \right). \quad (3.32)$

If we have a set of observations t_1, …, t_N, we can combine these into a matrix T of size N × K such that the nth row is given by t_n^T. Similarly, we can combine the input vectors x_1, …, x_N into a matrix X. The log likelihood function is then given by

$\ln p(T \mid X, W, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\!\left( \mathbf{t}_n \mid W^\top \phi(\mathbf{x}_n), \beta^{-1} I \right) = \frac{NK}{2} \ln\!\left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - W^\top \phi(\mathbf{x}_n) \right\|^2. \quad (3.33)$

Page 63:

LASSO
• Quadratic programming to solve the optimization problem
• Least Angle Regression solution - refer to ESL
• http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB packages for LASSO

Page 64:

Linear Regression with Non-Linear Basis Functions
• Linear combination of fixed non-linear functions of the input variables
$f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$
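A minimal sketch of this idea with polynomial basis functions $\phi_j(x) = x^j$ for a 1-D input, reusing the same least-squares machinery as before (not part of the original slides):

```python
# Linear regression with fixed polynomial basis functions.
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map a 1-D input array to columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, y, degree):
    Phi = polynomial_design_matrix(x, degree)
    return np.linalg.pinv(Phi) @ y        # w = (Phi^T Phi)^{-1} Phi^T y
```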


[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 3.1, p. 140]

Figure 3.1: Examples of basis functions, showing polynomials on the left, Gaussians of the form (3.4) in the centre, and sigmoidal of the form (3.5) on the right.

…on a regular lattice, such as the successive time points in a temporal sequence, or the pixels in an image. Useful texts on wavelets include Ogden (1997), Mallat (1999), and Vidakovic (1999).

Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions, except for the purposes of numerical illustration. Indeed, much of our discussion will be equally applicable to the situation in which the vector φ(x) of basis functions is simply the identity φ(x) = x. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable t. However, in Section 3.1.5, we consider briefly the modifications needed to deal with multiple target variables.

3.1.1 Maximum likelihood and least squares

In Chapter 1, we fitted polynomial functions to data sets by minimizing a sum-of-squares error function. We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Let us return to this discussion and consider the least squares approach, and its relation to maximum likelihood, in more detail.

As before, we assume that the target variable t is given by a deterministic function y(x, w) with additive Gaussian noise so that

$t = y(\mathbf{x}, \mathbf{w}) + \epsilon \quad (3.7)$

where ε is a zero mean Gaussian random variable with precision (inverse variance) β. Thus we can write

$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left( t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1} \right). \quad (3.8)$

Recall that, if we assume a squared loss function, then the optimal prediction, for a new value of x, will be given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution of the form (3.8), the conditional mean… (Section 1.5.5)

Page 65:

Linear Regression with Basis Functions
• Solution

$f(X) = \begin{bmatrix} 1 & \phi_1(\mathbf{x}_1) & \cdots & \phi_D(\mathbf{x}_1) \\ 1 & \phi_1(\mathbf{x}_2) & \cdots & \phi_D(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ 1 & \phi_1(\mathbf{x}_N) & \cdots & \phi_D(\mathbf{x}_N) \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$

$w = \left( \phi(X)^\top \phi(X) \right)^{-1} \phi(X)^\top Y$

Page 66:

Linear Regression with Multiple Outputs
• Multiple outputs

$Y = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix}$

$f(X) = XW = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_{10} & \cdots & w_{K0} \\ \vdots & & \vdots \\ w_{1D} & \cdots & w_{KD} \end{bmatrix} = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix} = Y$

$W = (X^\top X)^{-1} X^\top Y$

Page 67:

Summary
• Linear Regression (aka curve fitting)
• Gradient descent approach for finding the solution
• Analytical solution
• Loss functions
• Probabilistic view of linear regression
• Bias-Variance analysis
• Regularization
  • Ridge Regression
• Regression with basis functions
• Locally weighted regression (refer ML - 8.3)