Linear Regression - Indian Institute of Technology Ropar

Transcript of "Linear Regression", CSL465/603 - Fall 2016 (cse.iitrpr.ac.in/ckn/courses/f2016/csl603/w5.pdf)

Page 1:

Linear Regression
CSL465/603 - Fall 2016
Narayanan C Krishnan
[email protected]

Page 2:

Outline
• Univariate regression
• Multivariate regression
• Probabilistic view of regression
• Loss functions
• Bias-Variance analysis
• Regularization

Page 3:

Example - Green Chilies Entertainment Company

[Scatter plot: cost of making the film (in crores of Rs) vs. earnings from the film (in crores of Rs)]

Page 4:

Notations
• Training dataset
• Number of examples - $N$
• Input variable - $\mathbf{x}_i$
• Target variable - $y_i$
• Goal: learn a function that predicts $y$ for a new input $\mathbf{x}$

Cost of Film (Crores of Rs) - x    Profit/Loss (Crores of Rs) - y
98.28                              199.69
40.22                              93.69
62.07                              100.33
…                                  …

Page 5:

Linear Regression
• Simplest form:
$f(x) = w_0 + w_1 x$

[Plot: cost of making the film (in crores of Rs) vs. earnings from the film (in crores of Rs)]

Page 6:

Least Mean Squares - Cost Function
• Choose parameters $w_0$ and $w_1$ (or $\mathbf{w}$) so that $f(x)$ is as close as possible to $y$

[Plot: cost of making the film (in crores of Rs) vs. earnings from the film (in crores of Rs)]

Page 7:

Least Mean Squares - Cost Function - Parameter Space (1)
• Let
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$
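As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this cost, assuming 1-D arrays x and y that hold the training examples:

```python
# A minimal sketch of the LMS cost J(w0, w1) = (1/2N) * sum_i (f(x_i) - y_i)^2
import numpy as np

def lms_cost(w0, w1, x, y):
    """Least-mean-squares cost for the univariate model f(x) = w0 + w1 * x."""
    residuals = (w0 + w1 * x) - y          # f(x_i) - y_i for every example
    return np.mean(residuals ** 2) / 2.0   # (1/2N) * sum of squared residuals

# Hypothetical film data matching the earlier table
x = np.array([98.28, 40.22, 62.07])
y = np.array([199.69, 93.69, 100.33])
print(lms_cost(50.0, 1.5, x, y))
```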

Page 8:

Least Mean Squares - Cost Function - Parameter Space (2)
• Let
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$

Page 9:

Least Mean Squares - Cost Function - Parameter Space (3)
• Let
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$

Page 10:

Plot of the Error Surface

Page 11:

Contour Plot of Error Surface

Page 12:

Estimating Optimal Parameters

Page 13:

Gradient Descent – Basic Principle

• Minimize $J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$

• Start with an initial estimate for $\mathbf{w}$
• Keep changing $\mathbf{w}$ so that $J(\mathbf{w})$ is progressively reduced
• Stop when there is no change or the minimum has been reached

Page 14:

Gradient Descent - Intuition

Page 15:

Effect of Learning Parameter
• Too small a value – slow convergence
• Too large a value – oscillates widely and may not converge

Page 16:

Gradient Descent – Local Minima
• Depending on the function $J(\mathbf{w})$, gradient descent can get stuck at local minima

Page 17:

Gradient Descent for Regression
• Convex error function
$J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$
• Geometrically, the error surface is bowl shaped
• Only a global minimum (no local minima)

Exercise – Prove that the sum of squared errors is a convex function

Page 18:

Parameter Update (1)
• Minimize
$J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$

Page 19:

Parameter Update (2)
• Repeat till convergence:
$w_0 = w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)$
$w_1 = w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_i$
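A minimal sketch (not from the slides) of these batch updates for the univariate model, assuming NumPy arrays, a fixed learning rate alpha, and a fixed iteration count:

```python
# Batch gradient descent for f(x) = w0 + w1 * x using the update rules above.
import numpy as np

def gradient_descent(x, y, alpha=1e-4, num_iters=1000):
    w0, w1 = 0.0, 0.0                          # initial estimate for w
    for _ in range(num_iters):
        residuals = (w0 + w1 * x) - y          # f(x_i) - y_i, computed once per pass
        w0 = w0 - alpha * residuals.mean()             # (1/N) * sum of residuals
        w1 = w1 - alpha * (residuals * x).mean()       # (1/N) * sum of residuals * x_i
    return w0, w1
```

Both parameters are updated from the same residual vector, so the update is simultaneous, as the slide's "repeat till convergence" loop requires.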

Page 20:

Example – Iteration 0

[Plots: regression function and error function]

Page 21:

Example – Iteration 1

[Plots: regression function and error function]

Page 22:

Example – Iteration 2

[Plots: regression function and error function]

Page 23:

Example – Iteration 4

[Plots: regression function and error function]

Page 24:

Example – Iteration 7

[Plots: regression function and error function]

Page 25:

Example – Iteration 9

[Plots: regression function and error function]

Page 26:

Gradient Descent
• Batch mode
• Update includes contributions from all data points
$w_0 = w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)$
$w_1 = w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_i$
• Will discuss stochastic gradient descent later (neural networks)

Page 27:

Multivariate Linear Regression

• Dimension of the input data - 𝐷

Cost of Film (Crores of Rs)    Celebrity status of the protagonist    # of theatres at release    Age of the protagonist    Earnings (Crores of Rs) - y
75.72                          7.57                                   32                          52                        157.39
18.74                          1.87                                   16                          68                        81.93
50.96                          5.09                                   27                          35                        131.95
…                              …                                      …                           …                         …

Page 28:

Multivariate Linear Regression - Formulation
• Simplest model:
$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D$
• Parameters to learn: $w_0, w_1, \ldots, w_D = \mathbf{w}$
• Cost function: $J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$
• Update equation: $w_j = w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_{ij}$
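A minimal vectorized sketch of one such update step (assuming X already carries a leading column of 1s, so its shape is N x (D+1) and w has length D+1):

```python
# One batch gradient-descent step for multivariate linear regression.
import numpy as np

def multivariate_gd_step(w, X, y, alpha):
    residuals = X @ w - y                 # f(x_i) - y_i for all examples, shape (N,)
    gradient = X.T @ residuals / len(y)   # (1/N) * sum_i (f(x_i) - y_i) * x_ij, shape (D+1,)
    return w - alpha * gradient
```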

Page 29:

Gradient Descent
• Parameter update equation
$w_j = w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right) x_{ij}$

Page 30:

Feature Scaling for Multivariate Linear Regression (1)
• Transform features to be on the same scale

Cost of Film (Crores of Rs)    Celebrity status of the protagonist    # of theatres at release    Age of the protagonist    Profit/Loss (Crores of Rs) - y
75.72                          7.57                                   32                          52                        157.39
18.74                          1.87                                   16                          68                        81.93
50.96                          5.09                                   27                          35                        131.95
…                              …                                      …                           …                         …

Page 31:

Feature Scaling for Multivariate Linear Regression (2)
• Normalization - $-1 \le x_j \le 1$ or $0 \le x_j \le 1$

• Standardization – mean 0 and standard deviation 1
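A minimal sketch of both options, applied column-wise to a NumPy feature matrix X whose rows are examples (not part of the original slides):

```python
# Feature scaling helpers for multivariate linear regression.
import numpy as np

def normalize(X):
    """Rescale each feature (column) to the range [0, 1]."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def standardize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```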

Page 32:

Multivariate Linear Regression - Analytical Solution
• Design matrix and target vector

Cost of Film (Crores of Rs)    Celebrity status of the protagonist    # of theatres at release    Age of the protagonist    Profit/Loss (Crores of Rs) - y
75.72                          7.57                                   32                          52                        157.39
18.74                          1.87                                   16                          68                        81.93
50.96                          5.09                                   27                          35                        131.95
…                              …                                      …                           …                         …

$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Page 33:

Least Squares Method

$f(X) = Xw = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$

$J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2$

Page 34:

Normal Equations

$\min_{W} J(W) = \min_{W} \frac{1}{2} (XW - Y)^\top (XW - Y)$

• Find the gradient with respect to $W$ and equate it to 0
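A short sketch of that calculation (the standard derivation, not spelled out on the slide):

$$\nabla_W J(W) = \nabla_W \tfrac{1}{2} (XW - Y)^\top (XW - Y) = X^\top (XW - Y) = 0 \;\Rightarrow\; X^\top X W = X^\top Y \;\Rightarrow\; W = (X^\top X)^{-1} X^\top Y$$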

Page 35:

Analytical Solution
• Advantage
  • No need for the learning parameter $\alpha$!
  • No need for iterative updates
• Disadvantage
  • Need to perform matrix inversion
  • Pseudo-inverse of the matrix: $(X^\top X)^{-1} X^\top$
  • Sometimes we deal with non-invertible matrices (redundant features)
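A minimal sketch of this analytical solution in NumPy; np.linalg.pinv computes the pseudo-inverse and so also copes with the non-invertible (redundant-feature) case mentioned above:

```python
# Closed-form (normal-equation) fit: w = pinv(X) @ y, with a bias column prepended.
import numpy as np

def fit_linear_regression(X, y):
    """X: N x D feature matrix, y: length-N targets; returns w of length D+1."""
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the column of 1s
    return np.linalg.pinv(X_design) @ y                   # (X^T X)^{-1} X^T y via pseudo-inverse
```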

Page 36:

Probabilistic View of Linear Regression (1)
• Let $y = f(x) + \epsilon$
• $\epsilon$ is the error term that captures unmodeled effects or random noise
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution

Page 37:

Probabilistic View of Linear Regression (2)
• Let $y = f(x) + \epsilon$
• $\epsilon$ is the error term that captures unmodeled effects or random noise
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution - why?
• $\mathcal{N}(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$
• 3-$\sigma$ rule:

Page 38:

[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 1.5 (Decision Theory), p. 47]

Figure 1.28: The regression function y(x), which minimizes the expected squared loss, is given by the mean of the conditional distribution p(t|x).

…which is the conditional average of t conditioned on x and is known as the regression function. This result is illustrated in Figure 1.28. It can readily be extended to multiple target variables represented by the vector t, in which case the optimal solution is the conditional average y(x) = E_t[t|x]. (Exercise 1.25)

We can also derive this result in a slightly different way, which will also shed light on the nature of the regression problem. Armed with the knowledge that the optimal solution is the conditional expectation, we can expand the square term as follows

$\{y(\mathbf{x}) - t\}^2 = \{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}] + \mathrm{E}[t|\mathbf{x}] - t\}^2 = \{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}]\}^2 + 2\{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}]\}\{\mathrm{E}[t|\mathbf{x}] - t\} + \{\mathrm{E}[t|\mathbf{x}] - t\}^2$

where, to keep the notation uncluttered, we use E[t|x] to denote E_t[t|x]. Substituting into the loss function and performing the integral over t, we see that the cross-term vanishes and we obtain an expression for the loss function in the form

$\mathrm{E}[L] = \int \{y(\mathbf{x}) - \mathrm{E}[t|\mathbf{x}]\}^2\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} + \int \{\mathrm{E}[t|\mathbf{x}] - t\}^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t. \quad (1.90)$

The function y(x) we seek to determine enters only in the first term, which will be minimized when y(x) is equal to E[t|x], in which case this term will vanish. This is simply the result that we derived previously and that shows that the optimal least-squares predictor is given by the conditional mean. The second term is the variance of the distribution of t, averaged over x. It represents the intrinsic variability of the target data and can be regarded as noise. Because it is independent of y(x), it represents the irreducible minimum value of the loss function.

As with the classification problem, we can either determine the appropriate probabilities and then use these to make optimal decisions, or we can build models that make decisions directly. Indeed, we can identify three distinct approaches to solving regression problems given, in order of decreasing complexity, by:

(a) First solve the inference problem of determining the joint density p(x, t). Then normalize to find the conditional density p(t|x), and finally marginalize to find the conditional mean given by (1.89).

Probabilistic View of Linear Regression (3)
• Let $y = f(\mathbf{x}) + \epsilon$
• $\epsilon$ is the error term that captures unmodeled effects or random noise
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution
• Then $P(\epsilon) = $
• And $P(y \mid \mathbf{x}) = $
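The two blanks above are left to be filled in during the lecture; under the stated Gaussian noise assumption, the standard densities they correspond to are:

$$P(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\epsilon^2}{2\sigma^2} \right), \qquad P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left( y - f(\mathbf{x}) \right)^2}{2\sigma^2} \right)$$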

Page 39:

Probabilistic View of Linear Regression (4)

$P(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N) = $

$P(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N; W) = $

Page 40:

Maximizing the Likelihood
• Maximize $L(W) = \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i; W)$
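A sketch of where this leads, assuming the i.i.d. Gaussian noise model $y_i = f(\mathbf{x}_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ from the previous slides:

$$\log L(W) = \sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i; W) = -N \log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right)^2$$

so maximizing the likelihood over $W$ is equivalent to minimizing the sum-of-squared-errors cost used earlier.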

Page 41:

Loss Functions
• Squared loss: $\left( f(x) - y \right)^2$
• Absolute loss: $\left| f(x) - y \right|$
• Dead-band loss: $\max\left( 0, \left| f(x) - y \right| - \epsilon \right)$, $\epsilon \in \mathbb{R}^{+}$

Page 42:

Loss Functions
• Problem with squared loss

Page 43:

Linear Regression with Absolute Loss Function
• Objective
$\min_{W} \sum_{i=1}^{N} \left| \mathbf{x}_i W - y_i \right|$
• Non-differentiable, so we cannot take the gradient descent approach
• Solution: frame it as a constrained optimization problem
• Introduce new variables $v \in \mathbb{R}^N$, $v_i \ge \left| \mathbf{x}_i W - y_i \right|$
$\min_{W, v} \sum_{i=1}^{N} v_i, \quad \text{subject to} \quad -v_i \le \mathbf{x}_i W - y_i \le v_i$
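A minimal sketch (not the course's reference implementation) of solving this linear program with scipy.optimize.linprog; the decision vector stacks the weights W (D+1 entries, assuming a bias column is prepended to X) and the slack variables v (N entries):

```python
# L1 (absolute-loss) regression via the LP formulation above.
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    n, d = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])                # design matrix with bias column
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])   # objective: minimize sum_i v_i
    # Encode -v_i <= x_i W - y_i <= v_i as two blocks of "<=" constraints.
    A_ub = np.block([[ Xd, -np.eye(n)],
                     [-Xd, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n  # W free, v >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d + 1]                                 # the fitted weights
```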

Page 44:

Linear Regression with Absolute Loss Function - Example
[Plots: LMS output and LP output]

Page 45:

Some Additional Notations
• Underlying response function (target concept) - $C$
• Actual observed response - $y = C(\mathbf{x}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so $\mathrm{E}[y \mid \mathbf{x}] = C(\mathbf{x})$
• Predicted response based on the model learned from dataset $A$ - $f(\mathbf{x}; A)$
• Expected response averaged over all datasets - $\bar{f}(\mathbf{x}) = \mathrm{E}_A\left[ f(\mathbf{x}; A) \right]$
• Expected $L_2$ error on a new test instance $\mathbf{x}^*$ - $E_{\text{err}} = \mathrm{E}_A\left[ \left( f(\mathbf{x}^*; A) - y \right)^2 \right]$
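The following slides develop this error term step by step; under the notation above, the standard decomposition they arrive at is:

$$\mathrm{E}_A\!\left[ \left( f(\mathbf{x}^*; A) - y \right)^2 \right] = \underbrace{\left( \mathrm{E}_A\left[ f(\mathbf{x}^*; A) \right] - C(\mathbf{x}^*) \right)^2}_{\text{bias}^2} + \underbrace{\mathrm{E}_A\!\left[ \left( f(\mathbf{x}^*; A) - \mathrm{E}_A\left[ f(\mathbf{x}^*; A) \right] \right)^2 \right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$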

Page 46:

Bias-Variance Analysis (1)

Page 47:

Bias-Variance Analysis (2)

Page 48:

Bias-Variance Analysis (3)
• Root Mean Square Error

Page 49:

Bias-Variance Analysis (4)
• 9th-degree polynomial fit with more sample data

Page 50:

Bias-Variance Analysis (5)
• Expected square loss:
$\mathrm{E}[L] = \int \left( f(\mathbf{x}) - y \right)^2 P(\mathbf{x}, y) \, d\mathbf{x} \, dy$

Page 51:

Bias-Variance Analysis (6)
• Expected square loss:
$\mathrm{E}[L] = \int \left( f(\mathbf{x}) - y \right)^2 P(\mathbf{x}, y) \, d\mathbf{x} \, dy$

Page 52:

Bias-Variance Analysis (7)
• Relevant part of the loss:
$\int \left( f(\mathbf{x}) - C(\mathbf{x}) \right)^2 P(\mathbf{x}) \, d\mathbf{x}$

Page 53:

Bias-Variance Analysis (8)
• Relevant part of the loss:
$\mathrm{E}_A\left[ \left( f(\mathbf{x}; A) - C(\mathbf{x}) \right)^2 \right]$

Page 54:

Bias-Variance Analysis (9)
[Plots: polynomial fits of degree 1 and degree 4]

Page 55:

Bias-Variance Analysis (10)
• Bias term of the error: $\left( \mathrm{E}_A\left[ f(\mathbf{x}; A) \right] - C(\mathbf{x}) \right)^2$
• Measures how well our approximation architecture can fit the data
• Weak approximators will have high bias
  • Example: low-degree polynomials
• Strong approximators will have low bias
  • Example: high-degree polynomials

Page 56:

Bias-Variance Analysis (11)
• Variance term of the error: $\mathrm{E}_A\left[ \left( f(\mathbf{x}; A) - \mathrm{E}_A\left[ f(\mathbf{x}; A) \right] \right)^2 \right]$
• No direct dependence on the target value
• For a fixed-size dataset $A$
  • Strong approximators tend to have more variance
    • Small changes in the dataset can result in wide changes in the predictors
  • Weak approximators tend to have less variance
    • Small changes in the dataset result in similar predictors
• Variance disappears as $A \to \infty$

Page 57:

Bias-Variance Analysis (12)
• Measuring bias and variance in practice
  • Bootstrap from the given dataset
• Start with a complex approximator, and reduce the complexity through regularization
  • Setting more coefficients/parameters to 0
• Do feature selection
  • Reduces variance, but can increase bias
• Hopefully just sufficient to model the given data

Page 58:

Regularization
• Central idea: penalize over-complicated solutions
• Linear regression minimizes
$\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2$
• Regularized regression minimizes
$\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2 + \lambda \left\| w \right\|$

Page 59:

Modified Solution
• Solution for ordinary linear regression
$\min_{w} J(w) \equiv \min_{w} \frac{1}{2} (Xw - Y)^\top (Xw - Y)$
$w = (X^\top X)^{-1} X^\top Y$
• Now for the regularized version, which uses the $L_2$ norm - Ridge Regression
$\min_{w} J(w) \equiv \min_{w} \frac{1}{2} (Xw - Y)^\top (Xw - Y) + \lambda \left\| w \right\|^2$
$w = (X^\top X + \lambda I)^{-1} X^\top Y$


Exercise: derive the closed-form solution for ridge regression with the L2 regularizer
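A minimal sketch of the ridge closed form on this slide. Note that, like the slide's formula, it regularizes every coefficient including the bias term; in practice the bias column is often left unpenalized:

```python
# Ridge regression: w = (X^T X + lambda * I)^{-1} X^T y
import numpy as np

def fit_ridge(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```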

Page 60:

How to choose $\lambda$?
• Tradeoff between complexity vs. goodness of fit
• Solution 1: if we have lots of data
  • Generate multiple models
  • Use lots of test data to discard the bad models
• Solution 2: with limited data
  • Use k-fold cross validation
  • Will discuss later

Page 61:

General Form of Regularizer Term

$\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2 + \lambda \sum_{j=1}^{D} \left| w_j \right|^q$

• Quadratic/$L_2$ regularizer - $q = 2$
• Contours for the regularization term


[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 3.1 (Linear Basis Function Models), p. 145]

Figure 3.3: Contours of the regularization term in (3.29) for various values of the parameter q (q = 0.5, q = 1, q = 2, q = 4).

…zero. It has the advantage that the error function remains a quadratic function of w, and so its exact minimizer can be found in closed form. Specifically, setting the gradient of (3.27) with respect to w to zero, and solving for w as before, we obtain

$\mathbf{w} = \left( \lambda I + \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{t}. \quad (3.28)$

This represents a simple extension of the least-squares solution (3.15). A more general regularizer is sometimes used, for which the regularized error takes the form

$\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^\top \phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q \quad (3.29)$

where q = 2 corresponds to the quadratic regularizer (3.27). Figure 3.3 shows contours of the regularization function for different values of q.

The case of q = 1 is known as the lasso in the statistics literature (Tibshirani, 1996). It has the property that if λ is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model in which the corresponding basis functions play no role. To see this, we first note that minimizing (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (Exercise 3.5)

$\sum_{j=1}^{M} |w_j|^q \le \eta \quad (3.30)$

for an appropriate value of the parameter η, where the two approaches can be related using Lagrange multipliers. The origin of the sparsity can be seen from Figure 3.4 (Appendix E), which shows the minimum of the error function, subject to the constraint (3.30). As λ is increased, an increasing number of parameters are driven to zero.

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ. We shall return to the issue of model complexity later in this chapter.

Page 62:

Special scenario $q = 1$ - LASSO
• Least Absolute Shrinkage and Selection Operator
• Error function: $\sum_{i=1}^{N} \left( \mathbf{x}_i w - y_i \right)^2 + \lambda \sum_{j=1}^{D} \left| w_j \right|$
• For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution
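A minimal sketch of this sparsity effect (assuming scikit-learn is available; the slides point to a MATLAB glmnet package instead):

```python
# Fit a LASSO model on synthetic data where only two features are relevant.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of lambda
print(model.coef_)                   # most coefficients are driven to exactly 0
```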


[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 3.1, p. 146]

Figure 3.4: Plot of the contours of the unregularized error function (blue) along with the constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso regularizer q = 1 on the right, in which the optimum value for the parameter vector w is denoted by w⋆. The lasso gives a sparse solution in which w⋆₁ = 0.

For the remainder of this chapter we shall focus on the quadratic regularizer (3.27) both for its practical importance and its analytical tractability.

3.1.5 Multiple outputs

So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t. This could be done by introducing a different set of basis functions for each component of t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that

$\mathbf{y}(\mathbf{x}, \mathbf{w}) = W^\top \phi(\mathbf{x}) \quad (3.31)$

where y is a K-dimensional column vector, W is an M × K matrix of parameters, and φ(x) is an M-dimensional column vector with elements φ_j(x), with φ_0(x) = 1 as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form

$p(\mathbf{t} \mid \mathbf{x}, W, \beta) = \mathcal{N}\!\left( \mathbf{t} \mid W^\top \phi(\mathbf{x}), \beta^{-1} I \right). \quad (3.32)$

If we have a set of observations t_1, …, t_N, we can combine these into a matrix T of size N × K such that the nth row is given by t_n^T. Similarly, we can combine the input vectors x_1, …, x_N into a matrix X. The log likelihood function is then given by

$\ln p(T \mid X, W, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\!\left( \mathbf{t}_n \mid W^\top \phi(\mathbf{x}_n), \beta^{-1} I \right) = \frac{NK}{2} \ln\!\left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - W^\top \phi(\mathbf{x}_n) \right\|^2. \quad (3.33)$

Page 63:

LASSO
• Quadratic programming to solve the optimization problem
• Least Angle Regression solution - refer to ESL
• http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB packages for LASSO

Page 64:

Linear Regression with Non-Linear Basis Functions
• Linear combination of fixed non-linear functions of the input variables
$f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$
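A minimal sketch of this idea with polynomial basis functions $\phi_j(x) = x^j$ for a 1-D input, reusing the same least-squares machinery as before (not part of the original slides):

```python
# Linear regression with fixed polynomial basis functions.
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map a 1-D input array to columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, y, degree):
    Phi = polynomial_design_matrix(x, degree)
    return np.linalg.pinv(Phi) @ y        # w = (Phi^T Phi)^{-1} Phi^T y
```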


[Excerpt from Bishop, Pattern Recognition and Machine Learning, Section 3.1, p. 140]

Figure 3.1: Examples of basis functions, showing polynomials on the left, Gaussians of the form (3.4) in the centre, and sigmoidal of the form (3.5) on the right.

…on a regular lattice, such as the successive time points in a temporal sequence, or the pixels in an image. Useful texts on wavelets include Ogden (1997), Mallat (1999), and Vidakovic (1999).

Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions, except for the purposes of numerical illustration. Indeed, much of our discussion will be equally applicable to the situation in which the vector φ(x) of basis functions is simply the identity φ(x) = x. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable t. However, in Section 3.1.5, we consider briefly the modifications needed to deal with multiple target variables.

3.1.1 Maximum likelihood and least squares

In Chapter 1, we fitted polynomial functions to data sets by minimizing a sum-of-squares error function. We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Let us return to this discussion and consider the least squares approach, and its relation to maximum likelihood, in more detail.

As before, we assume that the target variable t is given by a deterministic function y(x, w) with additive Gaussian noise so that

$t = y(\mathbf{x}, \mathbf{w}) + \epsilon \quad (3.7)$

where ε is a zero mean Gaussian random variable with precision (inverse variance) β. Thus we can write

$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left( t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1} \right). \quad (3.8)$

Recall that, if we assume a squared loss function, then the optimal prediction, for a new value of x, will be given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution of the form (3.8), the conditional mean… (Section 1.5.5)

Page 65:

Linear Regression with Basis Functions
• Solution

$f(X) = \begin{bmatrix} 1 & \phi_1(\mathbf{x}_1) & \cdots & \phi_D(\mathbf{x}_1) \\ 1 & \phi_1(\mathbf{x}_2) & \cdots & \phi_D(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ 1 & \phi_1(\mathbf{x}_N) & \cdots & \phi_D(\mathbf{x}_N) \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$

$w = \left( \phi(X)^\top \phi(X) \right)^{-1} \phi(X)^\top Y$

Page 66:

Linear Regression with Multiple Outputs
• Multiple outputs

$Y = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix}$

$f(X) = XW = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_{10} & \cdots & w_{K0} \\ \vdots & & \vdots \\ w_{1D} & \cdots & w_{KD} \end{bmatrix} = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix} = Y$

$W = (X^\top X)^{-1} X^\top Y$

Page 67:

Summary
• Linear Regression (aka curve fitting)
• Gradient descent approach for finding the solution
• Analytical solution
• Loss functions
• Probabilistic view of linear regression
• Bias-Variance analysis
• Regularization
  • Ridge Regression
• Regression with basis functions
• Locally weighted regression (refer ML - 8.3)