Linear Models for Regression using Basis Functions


Slide 1: Linear Models for Regression using Basis Functions

Sargur Srihari
[email protected]

Slide 2: Overview

• Supervised learning, starting with regression
• Goal: predict the value of one or more target variables t given the value of a D-dimensional vector x of input variables
• When t is continuous-valued the task is called regression; when t takes values from a discrete set of labels it is called classification
• Polynomial curve fitting, seen earlier, is a specific example of a linear regression model

Slide 3: Types of Linear Regression Models

• Simplest form of linear regression model: a linear function of the input variables
• A more useful class of functions: linear combinations of nonlinear functions of the input variables, called basis functions
• Such models are linear functions of the parameters (which gives them simple analytical properties), yet nonlinear with respect to the input variables

Slide 4: Task at Hand

• Predict the value of a continuous target variable t given the value of a D-dimensional input variable x
  – t can also be a set of variables
• Seen earlier: polynomial curve fitting with a scalar input variable
• This discussion: linear functions of adjustable parameters, specifically linear combinations of nonlinear functions of the input variable

[Figure: training data plotted with the input variable on the horizontal axis and the target variable on the vertical axis]

Slide 5: Probabilistic Formulation

• Given:
  – A data set of N observations {x_n}, n = 1,..,N
  – Corresponding target values {t_n}
• Goal: predict the value of t for a new value of x
• Simplest approach: construct a function y(x) whose values are the predictions
• Probabilistic approach: model the predictive distribution p(t|x)
  – It expresses the uncertainty about the value of t for each x
  – From this conditional distribution, predict t for any value of x so as to minimize a loss function
  – The typical loss is squared loss, for which the solution is the conditional expectation of t

Slide 6: Linear Regression Model

• Simplest linear model for regression with D input variables:

  y(x, w) = w_0 + w_1 x_1 + .. + w_D x_D

  where x = (x_1,..,x_D)^T are the input variables
• Called linear regression since it is a linear function of
  – the parameters w_0,..,w_D
  – the input variables
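As a concrete illustration, here is a minimal NumPy sketch of this model; the data and weight values below are made up for the example, not taken from the lecture:

```python
import numpy as np

# y(x, w) = w_0 + w_1*x_1 + .. + w_D*x_D for a batch of inputs.
# X has shape (N, D); w0 is the bias, w holds the remaining D weights.
def linear_model(X, w0, w):
    return w0 + X @ w

# Hypothetical example with D = 3 inputs and N = 2 points.
X = np.array([[1.0, 2.0, 3.0],
              [0.5, 1.5, 2.5]])
w0, w = 0.1, np.array([0.4, -0.2, 0.3])
print(linear_model(X, w0, w))  # one prediction per row of X
```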

Slide 7: Linear Basis Function Models

• Extended by considering nonlinear functions of the input variables:

  y(x, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x)

  – where the φ_j(x) are called basis functions
  – There are now M parameters instead of D parameters
• Defining a dummy basis function φ_0(x) = 1, this can be written compactly as

  y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

Slide 8: Some Basis Functions

• Linear basis function model: y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
• Polynomial: in the polynomial regression seen earlier there is a single variable x and φ_j(x) = x^j, with powers up to x^{M-1}
  – Disadvantage: the basis functions are global, so a change in x in one region of input space affects the fit in all other regions
• Gaussian:

  φ_j(x) = exp( -(x - μ_j)² / (2s²) )

• Sigmoid:

  φ_j(x) = σ( (x - μ_j) / s ), where σ(a) = 1 / (1 + exp(-a)) is the logistic sigmoid

• Tanh: related to the logistic sigmoid by tanh(a) = 2σ(2a) - 1
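A sketch of these basis functions in NumPy, as one might implement them; the centres μ_j and scale s in the usage example are arbitrary illustrative choices:

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1; column j = 0 is the dummy basis 1.
    return np.column_stack([x**j for j in range(M)])

def gaussian_basis(x, mus, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a bias column of ones.
    phi = np.exp(-(x[:, None] - mus[None, :])**2 / (2 * s**2))
    return np.column_stack([np.ones_like(x), phi])

def sigmoid_basis(x, mus, s):
    # phi_j(x) = sigma((x - mu_j)/s) with sigma(a) = 1/(1 + exp(-a)).
    a = (x[:, None] - mus[None, :]) / s
    return np.column_stack([np.ones_like(x), 1.0 / (1.0 + np.exp(-a))])

def predict(Phi, w):
    # y(x, w) = w^T phi(x), evaluated for all inputs at once.
    return Phi @ w

# Illustrative use: 9 Gaussian basis functions (plus bias) on [0, 1].
x = np.linspace(0, 1, 50)
Phi = gaussian_basis(x, mus=np.linspace(0, 1, 9), s=0.1)
w = np.zeros(Phi.shape[1])   # placeholder weights
y = predict(Phi, w)
```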

Slide 9: Other Basis Functions

• Fourier: an expansion in sinusoidal functions, each with infinite spatial extent
• Signal processing: functions localized in both time and frequency, called wavelets
  – Useful for lattices such as images and time series
• The discussion that follows is independent of the choice of basis

Slide 10: Maximum Likelihood Formulation

• The target variable t is given by a deterministic function y(x, w) with additive Gaussian noise:

  t = y(x, w) + ε

  where ε is zero-mean Gaussian with precision β
• Thus the distribution of t is normal:

  p(t | x, w, β) = N(t | y(x, w), β⁻¹)

  with mean y(x, w) and variance β⁻¹

Slide 11: Likelihood Function

• Data set X = {x_1,..,x_N} with target values t = {t_1,..,t_N}
• Likelihood:

  p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β⁻¹)

• Log-likelihood:

  ln p(t | X, w, β) = (N/2) ln β - (N/2) ln 2π - β E_D(w)

  – where

  E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }²

  – called the sum-of-squares error function
• Maximizing the likelihood under Gaussian noise is therefore equivalent to minimizing E_D(w)

Slide 12: Maximum Likelihood for the Weight Parameter w

• Gradient of the log-likelihood with respect to w:

  ∇_w ln p(t | X, w, β) = β Σ_{n=1}^{N} { t_n - w^T φ(x_n) } φ(x_n)^T

• Setting this to zero and solving for w:

  w_ML = Φ⁺ t, where Φ⁺ = (Φ^T Φ)⁻¹ Φ^T

  is the Moore-Penrose pseudo-inverse of the N x M design matrix

  Φ = [ φ_0(x_1)  φ_1(x_1)  ...  φ_{M-1}(x_1) ]
      [ φ_0(x_2)  φ_1(x_2)  ...  φ_{M-1}(x_2) ]
      [    ...       ...             ...      ]
      [ φ_0(x_N)  φ_1(x_N)  ...  φ_{M-1}(x_N) ]
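A sketch of this maximum-likelihood fit in NumPy, on synthetic data and reusing the illustrative gaussian_basis helper sketched under slide 8; the last two lines also compute the noise precision of slide 13:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: t = sin(2*pi*x) + Gaussian noise (illustrative choice).
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# N x M design matrix with rows phi(x_n)^T.
Phi = gaussian_basis(x, mus=np.linspace(0, 1, 9), s=0.1)

# w_ML = pinv(Phi) @ t; pinv computes the Moore-Penrose pseudo-inverse
# via SVD, which also copes with a near-singular Phi^T Phi.
w_ml = np.linalg.pinv(Phi) @ t

# 1/beta_ML = (1/N) sum_n { t_n - w_ML^T phi(x_n) }^2  (slide 13).
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)
```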

Slide 13: Maximum Likelihood for the Precision β

• Similarly, setting the gradient with respect to β to zero gives

  1/β_ML = (1/N) Σ_{n=1}^{N} { t_n - w_ML^T φ(x_n) }²

  – the residual variance of the target values around the regression function

Slide 14: Geometry of the Least Squares Solution

• Consider an N-dimensional space whose axes are given by the t_n, the target values of the N data points
• Each basis function φ_j(x_n), corresponding to the jth column of Φ, is represented as a vector in this space
• If M < N, the M vectors φ_j(x_n) span a subspace S of dimensionality M
• The least-squares solution is the choice of y that lies in the subspace S closest to t
  – This corresponds to the orthogonal projection of t onto S
• When two or more basis functions are collinear, Φ^T Φ is singular
  – Singular value decomposition is then used

[Figure: the N x M design matrix Φ, with the M basis functions as columns and the N data points as rows]

Slide 15: Sequential Learning

• Batch techniques for maximum likelihood estimation can be computationally expensive for large data sets
• If the error function is a sum over data points, E = Σ_n E_n, then update the parameter vector w one point at a time:

  w^(τ+1) = w^(τ) - η ∇E_n

• For the sum-of-squares error function this gives

  w^(τ+1) = w^(τ) + η ( t_n - w^(τ)T φ(x_n) ) φ(x_n)

  where η is the learning rate
• Known as the least-mean-squares (LMS) algorithm
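A sketch of the LMS update loop under the same assumptions as before (the synthetic Phi and t from the slide 12 sketch, and a hand-picked learning rate):

```python
import numpy as np

def lms_fit(Phi, t, eta=0.1, n_epochs=50, seed=0):
    """Least-mean-squares: one stochastic gradient step per data point."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in rng.permutation(N):      # visit points in random order
            error = t[n] - w @ Phi[n]     # t_n - w^T phi(x_n)
            w += eta * error * Phi[n]     # w <- w + eta * error * phi(x_n)
    return w

# For small enough eta, w_lms approaches the batch solution w_ml.
w_lms = lms_fit(Phi, t, eta=0.1)
```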

Slide 16: Regularized Least Squares

• Adding a regularization term to the error function controls over-fitting:

  E_D(w) + λ E_W(w)

  – where λ is the regularization coefficient that controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w)
• A simple form of regularizer is

  E_W(w) = (1/2) w^T w

• The total error function then becomes

  (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }² + (λ/2) w^T w

• Called weight decay because in sequential learning the weight values decay towards zero unless supported by the data
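Because the quadratic regularizer keeps the total error quadratic in w, the minimizer still has a closed form, w = (λI + Φ^T Φ)⁻¹ Φ^T t; the slide does not state this formula, but it is the standard solution for this error function. A short sketch:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # Minimizer of (1/2) sum { t_n - w^T phi(x_n) }^2 + (lam/2) w^T w.
    M = Phi.shape[1]
    # solve() is preferred to an explicit inverse for numerical stability.
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

w_ridge = ridge_fit(Phi, t, lam=1e-3)  # lam = 0 recovers w_ml
```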

Slide 17: A More General Regularizer

• Regularized error:

  (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }² + (λ/2) Σ_{j=1}^{M} |w_j|^q

• q = 2 corresponds to the quadratic regularizer
• q = 1 is known as the lasso
• Regularization allows complex models to be trained on small data sets without severe over-fitting

[Figure: contours of the regularization term |w_j|^q for different values of q]

Slide 20: Height of the Emperor of China

The true height is 200 (measured in cm, about 6'6"). Poll a random American and ask "How tall is the emperor?" We want to determine how wrong the answers are, on average.

• Scenario 1
  – Every American believes the height is 180
  – The answer is always 180, so the error is always -20
  – Average error is -20; average squared error is 400

• Scenario 2
  – Americans have normally distributed beliefs with mean 180 and standard deviation 10
  – Poll two Americans: one says 190, the other 170
  – Errors are -10 and -30; average error is -20
  – Squared errors are 100 and 900; average squared error is 500
  – 500 = 400 + 100

• Scenario 3
  – Americans have normally distributed beliefs with mean 180 and standard deviation 20
  – Poll two: one says 200, the other 160
  – Errors are 0 and -40; average error is -20
  – Squared errors are 0 and 1600; average squared error is 800
  – 800 = 400 + 400

In Scenario 2, the average squared error (500) is the square of the bias ((-20)² = 400) plus the variance (100). In general: total squared error = (bias)² + variance.
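A quick simulation of Scenario 2 confirms the decomposition; this is a sketch, with the population parameters matching the slide and an arbitrary sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
true_height = 200.0
beliefs = rng.normal(180.0, 10.0, size=1_000_000)  # Scenario 2 population

errors = beliefs - true_height
bias = errors.mean()                  # close to -20
variance = errors.var()               # close to 100
mean_sq_error = (errors**2).mean()    # close to 500

print(bias**2 + variance, mean_sq_error)  # both ~500: bias^2 + variance
```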

Slide 21: Bias and Variance Formulation

• y(x): estimate of the value of t for input x
• h(x): the optimal prediction, given by the conditional expectation

  h(x) = E[t | x] = ∫ t p(t|x) dt

• If we assume the squared loss L(t, y(x)) = { y(x) - t }², the expected loss E[L] can be written as

  expected loss = (bias)² + variance + noise

• where

  (bias)² = ∫ { E_D[y(x; D)] - h(x) }² p(x) dx

  variance = ∫ E_D[ { y(x; D) - E_D[y(x; D)] }² ] p(x) dx

  noise = ∫∫ { h(x) - t }² p(x, t) dx dt

Slide 22: Dependence of Bias and Variance on Model Complexity

• h(x) = sin(2πx)
• Model complexity is controlled by the regularization parameter λ
• L = 100 data sets, each with N = 25 data points
• 24 Gaussian basis functions, so the number of parameters is M = 25 including the bias

[Figure: 20 fits of 25 data points each, from high λ (low variance, high bias) to low λ (high variance, low bias); red shows the average of the fits, green the sinusoid from which the data were generated]

• Averaging multiple solutions of a complex model gives a good fit
• Weighted averaging of multiple solutions is at the heart of the Bayesian approach: not with respect to multiple data sets, but with respect to the posterior distribution of the parameters
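A sketch of this experiment, reusing the illustrative gaussian_basis and ridge_fit helpers from earlier; the specific λ values and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 100, 25
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                  # true function h(x)
mus = np.linspace(0, 1, 24)                     # 24 Gaussian basis centres
Phi_test = gaussian_basis(x_test, mus, s=0.1)

for lam in [1e1, 1e-2, 1e-6]:                   # high to low regularization
    fits = []
    for _ in range(L):                          # L independent data sets
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
        Phi = gaussian_basis(x, mus, s=0.1)
        fits.append(Phi_test @ ridge_fit(Phi, t, lam))
    fits = np.array(fits)
    y_bar = fits.mean(axis=0)                   # average fit E_D[y(x;D)]
    bias2 = np.mean((y_bar - h)**2)
    variance = np.mean(fits.var(axis=0))
    print(f"lambda={lam:g}  bias^2={bias2:.4f}  variance={variance:.4f}")
```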

Slide 24: Bayesian Linear Regression

• Introduce a prior probability distribution over the model parameters w
• Assume the precision β is known
• Since the likelihood function p(t|w) with Gaussian noise is the exponential of a quadratic function of w, the conjugate prior is Gaussian:

  p(w) = N(w | m_0, S_0)

  with mean m_0 and covariance S_0

Slide 25: Posterior Distribution of the Parameters

• Given by the product of the likelihood function and the prior:

  p(w|D) = p(D|w) p(w) / p(D)

• Due to the choice of a conjugate Gaussian prior, the posterior is also Gaussian
• The posterior can be written directly in the form

  p(w|t) = N(w | m_N, S_N), where

  m_N = S_N ( S_0⁻¹ m_0 + β Φ^T t )
  S_N⁻¹ = S_0⁻¹ + β Φ^T Φ
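A sketch of this posterior update in NumPy; the isotropic prior N(w | 0, (1/α)I) used below is one common choice of m_0 and S_0, not something this slide prescribes:

```python
import numpy as np

def posterior(Phi, t, beta, alpha=2.0):
    """Gaussian posterior N(w | m_N, S_N) for the prior N(w | 0, (1/alpha) I)."""
    M = Phi.shape[1]
    S0_inv = alpha * np.eye(M)              # S_0^{-1}, with m_0 = 0
    SN_inv = S0_inv + beta * Phi.T @ Phi    # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (beta * Phi.T @ t)            # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
    return mN, SN
```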

Slide 26: Bayesian Linear Regression Example

• Straight-line fitting: a single input variable x and a single target variable t
• Linear model y(x, w) = w_0 + w_1 x
  – t and y are used interchangeably here
• Since there are only two parameters, we can plot the prior and posterior distributions in parameter space

[Figure: a straight line y(x) fitted through data points in the (x, y) plane]

Slide 27: Sequential Bayesian Learning

• Synthetic data generated from f(x, a) = a_0 + a_1 x with parameter values a_0 = -0.3 and a_1 = 0.5
• Each point is obtained by first choosing x_n from U(x | -1, 1), then evaluating f(x_n, a), then adding Gaussian noise with standard deviation 0.2 to obtain the target value t_n
• Goal is to recover the values of a_0 and a_1

[Figure: rows show the state with no data point, after the first data point (x, t), after two data points, and after twenty data points. Columns show the likelihood p(t|x, w) of the latest point alone, as a function of w = (w_0, w_1); the prior/posterior p(w), with the true parameter value marked by a white cross; and six sample regression functions y(x, w) with w drawn from the posterior. With infinitely many points the posterior would become a delta function centred at the true parameters.]
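A sketch of this sequential updating: each new point is absorbed using the slide 25 formulas, with the previous posterior serving as the prior for the next point (α and the random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5
beta = 1.0 / 0.2**2              # noise precision for std dev 0.2
alpha = 2.0                      # precision of an assumed isotropic prior

# Prior N(w | 0, (1/alpha) I) over w = (w0, w1).
mN, SN = np.zeros(2), np.eye(2) / alpha

for _ in range(20):              # observe 20 points one at a time
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(0, 0.2)
    phi = np.array([1.0, x])     # phi(x) = (1, x) for y = w0 + w1*x
    # The previous posterior acts as the prior (slide 25 update, one point).
    S_prev_inv = np.linalg.inv(SN)
    SN = np.linalg.inv(S_prev_inv + beta * np.outer(phi, phi))
    mN = SN @ (S_prev_inv @ mN + beta * t * phi)

print(mN)                        # approaches (a0, a1) = (-0.3, 0.5)
```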

Slide 28: Predictive Distribution

• We are usually not interested in the value of w itself, but in predicting t for new values of x
• We therefore evaluate the predictive distribution

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N²(x) )

  where

  σ_N²(x) = 1/β + φ(x)^T S_N φ(x)

  – The first term represents the noise in the data; the second, the uncertainty associated with the parameters w
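A sketch of the predictive mean and variance, reusing mN, SN, and beta from the sequential example above:

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Mean and variance of p(t | x) for a single feature vector phi(x)."""
    mean = mN @ phi_x                        # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ SN @ phi_x    # 1/beta + phi(x)^T S_N phi(x)
    return mean, var

# Illustrative use with the straight-line model of slide 27:
mean, var = predictive(np.array([1.0, 0.5]), mN, SN, beta)
print(mean, np.sqrt(var))                    # prediction +/- one std dev at x = 0.5
```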
