
Linear Models for Regression using Basis Functions

Sargur Srihari
srihari@cedar.buffalo.edu


Overview
• Supervised learning, starting with regression
• Goal: predict the value of one or more target variables t given the value of a D-dimensional vector x of input variables
• When t is continuous-valued the task is called regression; when t takes values from a discrete set of labels it is called classification
• The polynomial curve fitting seen earlier is a specific example of a broad class of models called linear regression models


Types of Linear Regression Models

• Simplest form of linear regression models:
  – linear functions of the input variables
• A more useful class of functions:
  – linear combinations of non-linear functions of the input variables, called basis functions
  – such models are linear functions of the parameters (which gives them simple analytical properties), yet are non-linear with respect to the input variables


Task at Hand
• Predict the value of a continuous target variable t given the value of a D-dimensional input variable x
  – t can also be a set of variables
• Seen earlier
  – Polynomial curve fitting
  – Input variable was a scalar
• This discussion
  – Linear functions of adjustable parameters
  – Specifically, linear combinations of non-linear functions of the input variables
[Figure: target variable plotted against input variable]


Probabilistic Formulation
• Given:
  – A data set of N observations {x_n}, n = 1,.., N
  – Corresponding target values {t_n}
• Goal:
  – Predict the value of t for a new value of x
• Simplest approach:
  – Construct a function y(x) whose values are the predictions
• Probabilistic approach:
  – Model the predictive distribution p(t|x)
    • It expresses our uncertainty about the value of t for each x
  – From this conditional distribution, predict t for any value of x so as to minimize a loss function
    • The typical loss is the squared loss, for which the solution is the conditional expectation of t


Linear Regression Model
• Simplest linear model for regression with D input variables:

  y(x, w) = w_0 + w_1 x_1 + .. + w_D x_D

  where x = (x_1,.., x_D)^T are the input variables
• Called linear regression since it is a linear function of
  – the parameters w_0,.., w_D
  – the input variables


Linear Basis Function Models
• Extended by considering non-linear functions of the input variables:

  y(x, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x)

  – where the φ_j(x) are called basis functions
  – There are now M parameters instead of D parameters
• Defining an additional dummy basis function φ_0(x) = 1, this can be written as

  y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
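As a concrete illustration (not from the slides), a minimal numpy sketch of evaluating y(x, w) = w^T φ(x) with a polynomial basis; the weight values are arbitrary:

```python
import numpy as np

def polynomial_basis(x, M):
    """Basis vector phi(x) = (1, x, x^2, ..., x^(M-1)) for a scalar input x."""
    return np.array([x ** j for j in range(M)])

def y(x, w):
    """Linear basis function model: y(x, w) = w^T phi(x)."""
    return w @ polynomial_basis(x, len(w))

w = np.array([0.5, -1.0, 2.0])   # M = 3 parameters (w0, w1, w2), chosen arbitrarily
print(y(1.5, w))                 # 0.5 - 1.0*1.5 + 2.0*1.5**2 = 3.5
```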


Some Basis Functions
• Linear basis function model:

  y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

• Polynomial
  – In the polynomial regression seen earlier there is a single variable x and φ_j(x) = x^j, giving a degree-M polynomial
  – Disadvantage: the basis functions are global, so a change in one region of input space affects all regions
• Gaussian

  φ_j(x) = exp( -(x - μ_j)^2 / (2 s^2) )

• Sigmoidal (logistic sigmoid)

  φ_j(x) = σ( (x - μ_j) / s ),  where σ(a) = 1 / (1 + exp(-a))

• Tanh
  – Related by tanh(a) = 2σ(2a) - 1, so a linear combination of logistic sigmoids is equivalent to a linear combination of tanh functions
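These basis families are straightforward to code; a sketch for scalar inputs, where the centres μ_j and the width s are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp( -(x - mu_j)^2 / (2 s^2) )"""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))"""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

mu = np.linspace(-1, 1, 5)        # 5 basis function centres
s = 0.4                           # common width parameter
print(gaussian_basis(0.3, mu, s)) # values of the 5 Gaussian basis functions at x = 0.3
print(sigmoid_basis(0.3, mu, s))  # values of the 5 sigmoidal basis functions at x = 0.3
print(np.allclose(2 / (1 + np.exp(-2 * 0.7)) - 1, np.tanh(0.7)))  # checks tanh(a) = 2*sigma(2a) - 1
```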


Other Basis Functions

• Fourier
  – Expansion in sinusoidal functions
  – Each basis function has infinite spatial extent
• Signal processing
  – Functions localized in both time and frequency, called wavelets
  – Useful for lattices such as images and time series
• The further discussion is independent of the choice of basis


Maximum Likelihood Formulation

• The target variable t is given by a deterministic function y(x, w) with additive Gaussian noise:

  t = y(x, w) + ε

  where ε is zero-mean Gaussian noise with precision (inverse variance) β
• Thus the distribution of t is normal, with mean y(x, w) and variance β^{-1}:

  p(t | x, w, β) = N( t | y(x, w), β^{-1} )


Likelihood Function
• Data set X = {x_1,.., x_N} with target values t = {t_1,.., t_N}
• Likelihood:

  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )

• Log-likelihood:

  ln p(t | X, w, β) = (N/2) ln β - (N/2) ln(2π) - β E_D(w)

  – where

  E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2

  – is called the sum-of-squares error function
• Maximizing the likelihood under Gaussian noise is therefore equivalent to minimizing E_D(w)
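A small sketch that evaluates E_D(w) and the log-likelihood given a design matrix; the toy data, basis choice, and β value are invented for illustration:

```python
import numpy as np

def sum_of_squares_error(Phi, t, w):
    """E_D(w) = 1/2 * sum_n ( t_n - w^T phi(x_n) )^2"""
    return 0.5 * np.sum((t - Phi @ w) ** 2)

def log_likelihood(Phi, t, w, beta):
    """ln p(t | X, w, beta) = N/2 ln(beta) - N/2 ln(2*pi) - beta * E_D(w)"""
    N = len(t)
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi) \
           - beta * sum_of_squares_error(Phi, t, w)

x = np.array([0.0, 0.3, 0.6, 0.9])                  # toy inputs
t = np.array([0.1, 0.8, 1.0, 0.4])                  # toy targets
Phi = np.column_stack([x ** j for j in range(3)])   # N x M design matrix, polynomial basis
print(log_likelihood(Phi, t, w=np.zeros(3), beta=4.0))
```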


Maximum Likelihood for the weight parameter w

• Gradient of the log-likelihood with respect to w:

  ∇ ln p(t | X, w, β) = β Σ_{n=1}^{N} { t_n - w^T φ(x_n) } φ(x_n)^T

• Setting the gradient to zero and solving for w:

  w_ML = Φ^+ t,  where Φ^+ = (Φ^T Φ)^{-1} Φ^T

  – Φ^+ is the Moore-Penrose pseudo-inverse of the N x M design matrix

  Φ = [ φ_0(x_1)  φ_1(x_1)  ...  φ_{M-1}(x_1)
        φ_0(x_2)  φ_1(x_2)  ...  φ_{M-1}(x_2)
          ...
        φ_0(x_N)  φ_1(x_N)  ...  φ_{M-1}(x_N) ]


Maximum Likelihood for the precision β

• Similarly, setting the gradient with respect to β to zero gives

  1/β_ML = (1/N) Σ_{n=1}^{N} { t_n - w_ML^T φ(x_n) }^2

  – i.e., the residual variance of the target values around the regression function
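Putting the last two slides together, a minimal sketch: w_ML via the pseudo-inverse of the design matrix, then 1/β_ML from the residuals. The synthetic data and polynomial basis are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 6
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)    # synthetic targets with Gaussian noise

Phi = np.column_stack([x ** j for j in range(M)])    # N x M design matrix, phi_j(x) = x^j

# w_ML = (Phi^T Phi)^{-1} Phi^T t  (Moore-Penrose pseudo-inverse applied to t)
w_ml = np.linalg.pinv(Phi) @ t

# 1/beta_ML = (1/N) sum_n ( t_n - w_ML^T phi(x_n) )^2
inv_beta_ml = np.mean((t - Phi @ w_ml) ** 2)

print("w_ML =", w_ml)
print("1/beta_ML =", inv_beta_ml)    # should be close to the true noise variance 0.2^2 = 0.04
```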


Geometry of the Least Squares Solution

• Consider an N-dimensional space whose axes are given by the t_n (the target values of the N data points)
• Each basis function φ_j(x_n), corresponding to the j-th column of the design matrix Φ (N rows of data, M columns of basis functions), is also represented as a vector in this space
• If M < N, the vectors φ_j(x_n) span a subspace S of dimensionality M
• The least squares solution y is the choice of y that lies in the subspace S and is closest to t
  – It corresponds to the orthogonal projection of t onto S
• When two or more basis functions are collinear (or nearly so), Φ^T Φ is singular (or nearly singular)
  – Singular Value Decomposition is then used to compute the pseudo-inverse
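The projection interpretation is easy to check numerically: the fitted values Φ w_ML coincide with the orthogonal projection Φ(Φ^TΦ)^{-1}Φ^T t of t onto the column space of Φ, and the residual is orthogonal to that subspace. A sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 4                        # N data points, M basis functions, M < N
Phi = rng.normal(size=(N, M))       # stand-in design matrix
t = rng.normal(size=N)              # stand-in target vector

w_ml = np.linalg.pinv(Phi) @ t      # least squares solution
y_fit = Phi @ w_ml                  # fitted values (the vector y in the slide)

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T    # orthogonal projection onto the subspace S
print(np.allclose(y_fit, P @ t))                # True: y is the projection of t onto S
print(np.allclose(Phi.T @ (t - y_fit), 0))      # True: the residual is orthogonal to S
```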


Sequential Learning

• Batch techniques for maximum likelihood estimation can be computationally expensive for large data sets
• If the error function is a sum over data points, E = Σ_n E_n, then after presenting data point n we update the parameter vector w using stochastic gradient descent:

  w^(τ+1) = w^(τ) - η ∇E_n

• For the sum-of-squares error function this becomes

  w^(τ+1) = w^(τ) + η ( t_n - w^(τ)T φ(x_n) ) φ(x_n)

  where η is the learning rate
• Known as the Least Mean Squares (LMS) algorithm
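A minimal LMS sketch; the straight-line basis, learning rate, and synthetic data are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, 200)     # synthetic data from a line plus noise

def phi(x):
    return np.array([1.0, x])                    # basis functions (1, x)

w = np.zeros(2)                                  # initial parameter vector
eta = 0.1                                        # learning rate

# One pass over the data: w <- w + eta * ( t_n - w^T phi(x_n) ) phi(x_n)
for xn, tn in zip(x, t):
    w = w + eta * (tn - w @ phi(xn)) * phi(xn)

print(w)    # should end up close to the generating parameters (1.0, 2.0)
```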


Regularized Least Squares
• Adding a regularization term to the error function controls over-fitting:

  E_D(w) + λ E_W(w)

  – where λ is the regularization coefficient that controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w)
• A simple form of regularizer is

  E_W(w) = (1/2) w^T w

• The total error function then becomes

  (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w

• Called weight decay because in sequential learning the weight values decay towards zero unless supported by the data
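Setting the gradient of this total error to zero gives the standard closed-form solution w = (λI + Φ^TΦ)^{-1} Φ^T t; a minimal sketch with an arbitrary λ and polynomial basis:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)

M = 9
Phi = np.column_stack([x ** j for j in range(M)])        # polynomial design matrix

lam = 1e-3                                               # regularization coefficient lambda
# Regularized least squares: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_reg)
```

Increasing λ shrinks the weights towards zero, which is the weight-decay behaviour described above.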


More General Regularizer
• Regularized error:

  (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q

• q = 2 corresponds to the quadratic regularizer above
• q = 1 is known as the lasso; it tends to drive some weights exactly to zero, giving sparse solutions
• Regularization allows complex models to be trained on small data sets without severe over-fitting
• The contours of the regularization term Σ_j |w_j|^q have a different shape for each q
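A sketch that merely evaluates the regularized error for a general exponent q (its minimization for q ≠ 2 has no closed form and requires iterative methods); the data and parameter values are invented:

```python
import numpy as np

def regularized_error(Phi, t, w, lam, q):
    """(1/2) sum_n ( t_n - w^T phi(x_n) )^2 + (lambda/2) sum_j |w_j|^q"""
    data_term = 0.5 * np.sum((t - Phi @ w) ** 2)
    reg_term = 0.5 * lam * np.sum(np.abs(w) ** q)
    return data_term + reg_term

Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])   # toy design matrix
t = np.array([0.0, 0.5, 1.0])                          # toy targets
w = np.array([0.1, 0.9])
for q in (1, 2):                                       # lasso penalty vs quadratic penalty
    print("q =", q, "error =", regularized_error(Phi, t, w, lam=0.5, q=q))
```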


Height of the Emperor of China
The true height is 200 (measured in cm, about 6'6"). Poll a random American and ask "How tall is the emperor?" We want to determine how wrong the answers are, on average.

• Scenario 1
  – Every American believes the height is 180
  – The answer is always 180, so the error is always -20
  – Average error (bias) is -20; average squared error is 400
• Scenario 2
  – Americans have normally distributed beliefs with mean 180 and standard deviation 10
  – Poll two Americans: one says 190, the other 170
  – Errors are -10 and -30; average (bias) error is -20
  – Squared errors are 100 and 900; average squared error is 500
  – 500 = 400 + 100
• Scenario 3
  – Americans have normally distributed beliefs with mean 180 and standard deviation 20
  – Poll two Americans: one says 200, the other 160
  – Errors are 0 and -40; average error is -20
  – Squared errors are 0 and 1600; average squared error is 800
  – 800 = 400 + 400

Average squared error (500) = square of the bias (-20)^2 = 400, plus the variance (100).
In general: total squared error = square of the bias error + variance.
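The arithmetic in the three scenarios is just this identity applied to a sample of answers; a quick check in code:

```python
import numpy as np

true_height = 200
for answers in ([180, 180], [190, 170], [200, 160]):    # scenarios 1, 2, 3
    errors = np.array(answers) - true_height
    bias = errors.mean()                # average error
    variance = errors.var()             # spread of the answers around their own mean
    mse = (errors ** 2).mean()          # average squared error
    print(answers, "bias:", bias, "variance:", variance,
          "mse:", mse, "= bias^2 + variance:", bias ** 2 + variance)
```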


Bias and Variance Formulation
• y(x; D): estimate of the value of t for input x, learned from a data set D
• h(x): the optimal prediction, given by the conditional expectation

  h(x) = E[ t | x ] = ∫ t p(t | x) dt

• Assume the squared loss function L(t, y(x)) = { y(x) - t }^2
• The expected loss E[L] can then be written as

  expected loss = (bias)^2 + variance + noise

• where

  (bias)^2 = ∫ { E_D[ y(x; D) ] - h(x) }^2 p(x) dx
  variance = ∫ E_D[ { y(x; D) - E_D[ y(x; D) ] }^2 ] p(x) dx
  noise = ∫∫ { h(x) - t }^2 p(x, t) dx dt


Dependence of Bias-Variance on Model Complexity
• Data generated from h(x) = sin(2πx)
• The regularization parameter λ is varied
• L = 100 data sets, each with N = 25 data points
• 24 Gaussian basis functions, so the number of parameters is M = 25
• [Figure: 20 of the fits to 25 data points each. High λ: low variance, high bias. Low λ: high variance, low bias. Red: average of the fits. Green: the sinusoid from which the data was generated.]
• Averaging multiple solutions of the complex model gives a good fit
• Weighted averaging of multiple solutions is at the heart of the Bayesian approach: the averaging is not with respect to multiple data sets but with respect to the posterior distribution of the parameters
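The decomposition on the previous slide can be estimated by simulation. A sketch roughly following this slide's setup, with h(x) = sin(2πx), 100 data sets of 25 points, a bias term plus 24 Gaussian basis functions, and regularized least squares fits; the noise level, basis width, and λ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
h = lambda x: np.sin(2 * np.pi * x)              # function generating the data
L, N, M = 100, 25, 25                            # data sets, points per set, parameters
mu = np.linspace(0, 1, M - 1)                    # centres of the 24 Gaussian basis functions
s, lam = 0.1, 1.0                                # basis width; high lambda -> high bias, low variance

def design(x):
    """Design matrix: a column of ones plus 24 Gaussian basis functions."""
    return np.column_stack([np.ones_like(x)] +
                           [np.exp(-(x - m) ** 2 / (2 * s ** 2)) for m in mu])

x_test = np.linspace(0, 1, 200)
preds = np.empty((L, x_test.size))
for l in range(L):
    x = rng.uniform(0, 1, N)
    t = h(x) + rng.normal(0, 0.3, N)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)   # regularized fit
    preds[l] = design(x_test) @ w

avg_pred = preds.mean(axis=0)                    # estimate of E_D[ y(x; D) ]
bias2 = np.mean((avg_pred - h(x_test)) ** 2)     # (bias)^2 averaged over x
variance = np.mean(preds.var(axis=0))            # variance averaged over x
print("bias^2 =", bias2, "variance =", variance)
```

Rerunning with a smaller λ should lower the bias term and raise the variance term, reproducing the trade-off shown in the figure.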


Bayesian Linear Regression

• Introduce a prior probability distribution over the model parameters w
• Assume the noise precision β is known
• Since the likelihood function p(t|w) with Gaussian noise has an exponential (of a quadratic) form, the conjugate prior is a Gaussian:

  p(w) = N( w | m_0, S_0 )

  with mean m_0 and covariance S_0


Posterior Distribution of the Parameters

• Given by the product of the likelihood function and the prior:

  p(w | D) = p(D | w) p(w) / p(D)

• Due to the choice of a conjugate Gaussian prior, the posterior is also Gaussian
• The posterior can be written directly in the form

  p(w | t) = N( w | m_N, S_N )

  where

  m_N = S_N ( S_0^{-1} m_0 + β Φ^T t ),   S_N^{-1} = S_0^{-1} + β Φ^T Φ
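A minimal sketch of this update, assuming the common zero-mean isotropic prior m_0 = 0, S_0 = α^{-1} I (the α and β values and the toy data are arbitrary):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """m_N, S_N for the prior p(w) = N(w | 0, alpha^{-1} I)."""
    M = Phi.shape[1]
    SN_inv = alpha * np.eye(M) + beta * Phi.T @ Phi     # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (beta * Phi.T @ t)                        # m_N = S_N ( S_0^{-1} m_0 + beta Phi^T t ), with m_0 = 0
    return mN, SN

x = np.array([-0.5, 0.0, 0.6])                          # toy inputs
t = np.array([-0.55, -0.28, 0.02])                      # toy targets
Phi = np.column_stack([np.ones_like(x), x])             # straight-line basis (1, x)
mN, SN = posterior(Phi, t, alpha=2.0, beta=25.0)
print(mN)    # posterior mean of (w0, w1)
print(SN)    # posterior covariance
```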


Bayesian Linear Regression Example

• Straight-line fitting
• Single input variable x, single target variable t
• Linear model y(x, w) = w_0 + w_1 x
  – t and y are used interchangeably here
• Since there are only two parameters, we can plot the prior and posterior distributions directly in parameter space
[Figure: a straight line y(x, w) in the (x, y) plane]


Sequential Bayesian Learning
• Synthetic data generated from f(x, a) = a_0 + a_1 x with parameter values a_0 = -0.3 and a_1 = 0.5
• Each x_n is first drawn from U(x | -1, 1); then f(x_n, a) is evaluated and Gaussian noise with standard deviation 0.2 is added to obtain the target value t_n
• Goal is to recover the values of a_0 and a_1
• [Figure: rows correspond to the state before any data point is observed and after one, two, and twenty data points (x, t). Left column: the likelihood p(t | x, w) for the most recent point alone, as a function of w = (w_0, w_1). Middle column: the prior/posterior p(w), with a white cross marking the true parameter values. Right column: six samples (regression functions) y(x, w) with w drawn from the current posterior, together with the observed data points.]
• With infinitely many points the posterior becomes a delta function centered at the true parameters
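A sketch of that experiment: draw points from U(-1, 1), generate targets from f(x, a) = -0.3 + 0.5x plus noise with standard deviation 0.2, and update the Gaussian posterior one point at a time. The prior precision α = 2.0 is an illustrative choice not given on the slide:

```python
import numpy as np

rng = np.random.default_rng(5)
a0, a1 = -0.3, 0.5                        # true parameters to be recovered
alpha = 2.0                               # prior precision (assumed)
beta = (1.0 / 0.2) ** 2                   # noise precision = 1 / 0.2^2 = 25

mN = np.zeros(2)                          # prior mean: p(w) = N(w | 0, alpha^{-1} I)
SN = np.eye(2) / alpha                    # prior covariance

for n in range(20):
    xn = rng.uniform(-1, 1)                               # x_n ~ U(-1, 1)
    tn = a0 + a1 * xn + rng.normal(0, 0.2)                # target with Gaussian noise
    phi = np.array([1.0, xn])                             # basis (1, x)
    S_prev_inv = np.linalg.inv(SN)                        # current posterior acts as the prior
    SN = np.linalg.inv(S_prev_inv + beta * np.outer(phi, phi))
    mN = SN @ (S_prev_inv @ mN + beta * tn * phi)

print(mN)    # after 20 points, roughly (-0.3, 0.5)
```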


Predictive Distribution

• We are usually not interested in the value of w itself, but in predicting t for new values of x
• We therefore evaluate the predictive distribution

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) )

  where

  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

  – The first term represents the noise in the data; the second the uncertainty associated with the parameters w
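A sketch of the predictive mean and variance for a new input, continuing the zero-mean isotropic prior assumption used in the posterior sketch above (α, β, and the toy data are arbitrary):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """m_N, S_N for the prior p(w) = N(w | 0, alpha^{-1} I)."""
    M = Phi.shape[1]
    SN = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return beta * SN @ Phi.T @ t, SN

def predictive(phi_x, mN, SN, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    return mN @ phi_x, 1.0 / beta + phi_x @ SN @ phi_x

x = np.array([-0.8, -0.1, 0.4, 0.9])                         # toy inputs
t = -0.3 + 0.5 * x + np.array([0.02, -0.05, 0.01, -0.03])    # toy targets (line plus small offsets)
Phi = np.column_stack([np.ones_like(x), x])                  # straight-line basis (1, x)

mN, SN = posterior(Phi, t, alpha=2.0, beta=25.0)
mean, var = predictive(np.array([1.0, 0.5]), mN, SN, beta=25.0)   # prediction at x = 0.5
print("predictive mean:", mean, "predictive variance:", var)
```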
